Azure Blob Pipelines Overview

Alert

Azure Blob Pipelines requires MemSQL 5.8.5 or above.

Azure Pipeline Syntax Examples

The following example demonstrates how to create a new Azure pipeline. For complete syntax documentation, see CREATE PIPELINE.

Example 1 – Read all CSV objects in a container using account name and account key credentials:

CREATE PIPELINE library
AS LOAD DATA AZURE 'my-container-name'
CONFIG '{"disable_gunzip": true, "suffixes": ["csv"]}'
CREDENTIALS '{"account_name": "your_account_name", "account_key": "your_key"}'
INTO TABLE `classic_books`
FIELDS TERMINATED BY ',';
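
Here the CONFIG clause restricts the pipeline to blobs with a csv suffix and disables automatic decompression of .gz objects; the CONFIG clause is optional and can be omitted entirely.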

Permissions and Policies

An Azure pipeline can be created to read from either a container or a blob. Both of these resource types may be configured with access policies or permissions. Before creating an Azure pipeline, it’s important to consider both the provided user credentials and any existing policies or permissions on the desired resource.

For example, if you provide credentials that implicitly grant access to all blobs in a container, you may not need to add or modify any access policies or permissions on the desired resource. However, if you provide credentials that do not have permission to access the resource, an administrator will need to allow access for your credentials.

Consider the following scenarios:

  • Read all objects in a container: An Azure pipeline configured for a container will automatically read all blobs it has access to. Changes to a container-level policy or permissions may be required.

  • Read all blobs with a given prefix: An Azure pipeline configured to read all objects in a container with a given prefix may require a container-level policy that covers that prefix in order to allow access to the desired blobs (see the sketch after this list).

  • Read a specific object in a container: An Azure pipeline configured for a specific blob may require changes to the policies and permissions for both the blob and container.
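
For example, the following is a minimal sketch of a prefix-scoped pipeline. The container, prefix, and table names are placeholders, and the CREDENTIALS clause follows Example 1 above; to target a single blob instead, supply the blob's full name in place of the prefix. See CREATE PIPELINE for the authoritative syntax.

CREATE PIPELINE invoices
AS LOAD DATA AZURE 'my-container-name/invoices/2017/'
CREDENTIALS '{"account_name": "your_account_name", "account_key": "your_key"}'
INTO TABLE `invoices`
FIELDS TERMINATED BY ',';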

Azure Pipeline Batches

When the master aggregator reads a container’s contents, it schedules a subset of the objects for ingest across all database partitions. After each partition across the cluster has finished extracting, transforming, and loading its object, a batch has been completed. Therefore, an Azure pipeline batch is defined as a cluster-level operation where each partition processes at most a single object from the source container.

Consider the following example: There are 4 objects in a source container. If your cluster has 2 leaf nodes that have 2 partitions each (4 partitions total), all of the container’s objects can be ingested in 1 batch. In the same cluster, if there are 40 objects in the source container, it will take 10 batches to fully ingest the data.
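
In general, the number of batches required to ingest a container is the number of objects divided by the total number of partitions, rounded up:

batches_required = ceil(total_objects / total_partitions)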

Information about recent batches can be found in information_schema.PIPELINES_BATCHES_SUMMARY.

Information about files to be loaded can be found in information_schema.PIPELINES_FILES.
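
For example, the following queries (a minimal sketch; SELECT * is used for brevity) show recent batch activity and file state for the library pipeline created in Example 1:

SELECT * FROM information_schema.PIPELINES_BATCHES_SUMMARY
WHERE PIPELINE_NAME = 'library';

SELECT * FROM information_schema.PIPELINES_FILES
WHERE PIPELINE_NAME = 'library';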