Outdated Version

You are viewing an older version of this section.

Working with Azure Blob Pipelines


Alert

Azure Blob Pipelines require MemSQL 5.8.5 or above.

Azure Pipeline Syntax Examples

The following syntax demonstrates how to create a new Azure Pipeline. For complete syntax documentation, see CREATE PIPELINE.

Example 1 – Read all objects in a container using account name and account key credentials for CSV files:

CREATE PIPELINE library
AS LOAD DATA AZURE 'my-container-name'
CONFIG '{"disable_gunzip": "true", "suffixes": ["csv"]}'
CREDENTIALS '{"account_name": "your_account_name", "account_key": "your_key"}'
INTO TABLE `classic_books`
FIELDS TERMINATED BY ',';

Authentication

An Azure pipeline must authenticate with Azure before it can begin reading blobs from a container. The pipeline requires you to provide an account name and account access key.

The CREDENTIALS clause should be a JSON object with two fields:

account_name: the storage account under which your blob container resides. This is usually a human-readable name chosen by the person who created the account.

account_key: the access key linked to a storage account. This is usually an 88-character string, which is the Base64 encoding of a 512-bit key.

See the Azure documentation about viewing and managing Azure Access Keys to learn more.
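
As a rough illustration of the two CREDENTIALS fields, the Python sketch below builds the JSON text and checks that the key decodes to 512 bits (64 bytes, which Base64 encodes to 88 characters). The helper name build_credentials and the all-zero sample key are hypothetical and used only for illustration. The length check also treats the "usual" key size as mandatory, which is an assumption you may need to relax.

```python
import base64
import json


def build_credentials(account_name: str, account_key: str) -> str:
    """Return JSON text for a CREDENTIALS clause (hypothetical helper)."""
    # A 512-bit key is 64 raw bytes, which Base64 encodes to 88 characters.
    raw = base64.b64decode(account_key, validate=True)
    if len(raw) != 64:
        raise ValueError(f"expected a 64-byte key, got {len(raw)} bytes")
    return json.dumps({"account_name": account_name, "account_key": account_key})


# Placeholder key for illustration only: 64 zero bytes, not a real secret.
sample_key = base64.b64encode(bytes(64)).decode("ascii")
credentials = build_credentials("your_account_name", sample_key)
print(len(sample_key))
print(credentials)
```

The resulting string can be pasted between the single quotes of the CREDENTIALS clause. A key that is truncated or contains stray characters fails here rather than when the pipeline first tries to authenticate.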

Permissions and Policies

An Azure pipeline can be created to read from either a container or a blob. Both of these resource types may be configured with access policies or permissions. Before creating an Azure pipeline, it’s important to consider both the provided user credentials and any existing policies or permissions on the desired resource.

For example, if you provide credentials that implicitly grant access to all blobs in a container, you may not need to add or modify any access policies or permissions on the desired resource. However, if you provide credentials that do not have permission to access the resource, an administrator will need to allow access for your credentials.

Consider the following scenarios:

  • Read all objects in a container: An Azure pipeline configured for a container will automatically read all blobs it has access to. Changes to a container-level policy or permissions may be required.

  • Read all blobs with a given prefix: An Azure pipeline configured to read all objects in a container with a given prefix may require changes to a container-level policy with a prefix in order to allow access to the desired blobs.

  • Read a specific object in a container: An Azure pipeline configured for a specific blob may require changes to the policies and permissions for both the blob and container.

Azure Pipeline Batches

When the master aggregator reads a container’s contents, it schedules a subset of the objects for ingest across all database partitions. After each partition across the cluster has finished extracting, transforming, and loading its object, a batch has been completed. Therefore, an Azure pipeline batch is defined as a cluster-level operation where each partition processes at most a single object from the source container.

Consider the following example: There are 4 objects in a source container. If your cluster has 2 leaf nodes that have 2 partitions each (4 partitions total), all of the container’s objects can be ingested in 1 batch. In the same cluster, if there are 40 objects in the source container, it will take 10 batches to fully ingest the data.
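
The arithmetic in this example can be sketched in Python. The function name batches_needed is hypothetical, and the calculation assumes every batch uses all partitions and that no new objects arrive during ingest, so treat the result as an estimate and not a guarantee of how the cluster schedules work.

```python
import math


def batches_needed(num_objects: int, num_partitions: int) -> int:
    """Estimate batches when each partition loads at most one object per batch."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # Round up: a final, partially filled batch still counts as a batch.
    return math.ceil(num_objects / num_partitions)


partitions = 2 * 2  # 2 leaf nodes with 2 partitions each
print(batches_needed(4, partitions))   # 1
print(batches_needed(40, partitions))  # 10
```

Rounding up matters when the object count is not a multiple of the partition count: 41 objects on the same cluster would need 11 batches, with the last batch loading a single object.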

Information about recent batches can be found in information_schema.PIPELINES_BATCHES_SUMMARY.

Information about files to be loaded can be found in information_schema.PIPELINES_FILES.