
Azure Blob Pipelines Overview

Alert

Azure Blob Pipelines require MemSQL 5.8.5 or above.

Azure Pipeline Syntax Examples

The following syntax demonstrates how to create a new Azure Pipeline. For complete syntax documentation, see CREATE PIPELINE.

Example 1 – Read all CSV objects in a container, using account name and account key credentials:

CREATE PIPELINE library
AS LOAD DATA AZURE 'my-container-name'
CONFIG '{"disable_gunzip": "true", "suffixes: blah}'
CREDENTIALS '{"account_name": "your_account_name", "account_key":
"your_key"}'
INTO TABLE `classic_books`
FIELDS TERMINATED BY ',';
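A newly created pipeline does not begin ingesting until it is started. As a minimal follow-up, assuming the pipeline name from the example above:

START PIPELINE library;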

Permissions and Policies

An Azure pipeline can be created to read from either a container or a blob. Both of these resource types may be configured with access policies or permissions. Before creating an Azure pipeline, it’s important to consider both the provided user credentials and any existing policies or permissions on the desired resource.

For example, if you provide credentials that implicitly grant access to all blobs in a container, you may not need to add or modify any access policies or permissions on the desired resource. However, if you provide credentials that do not have permission to access the resource, an administrator will need to allow access for your credentials.

Consider the following scenarios:

  • Read all objects in a container: An Azure pipeline configured for a container will automatically read all blobs it has access to. Changes to a container-level policy or permissions may be required.

  • Read all blobs with a given prefix: An Azure pipeline can be configured to read all objects in a container that share a given prefix (see the example after this list). Changes to a container-level policy may be required to allow access to the desired blobs.

  • Read a specific object in a container: An Azure pipeline configured for a specific blob may require changes to the policies and permissions for both the blob and container.
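For the prefix scenario, the prefix can be appended to the container name in the LOAD DATA AZURE clause. A sketch, assuming a hypothetical prefix my-prefix and the same table and credentials as Example 1:

CREATE PIPELINE library_prefixed
AS LOAD DATA AZURE 'my-container-name/my-prefix'
CREDENTIALS '{"account_name": "your_account_name", "account_key":
"your_key"}'
INTO TABLE `classic_books`
FIELDS TERMINATED BY ',';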

Azure Pipeline Batches and Offsets

Batch

When the master aggregator reads a container’s contents, it divides the objects among the partitions across all leaf nodes. After each leaf partition across the cluster has finished extracting, transforming, and loading its object, a batch has been completed. Therefore, an Azure pipeline batch is defined as a cluster-level operation where each partition processes at most a single object from the source container.

Consider the following example: There are 4 objects in a source container. If your cluster has 2 leaf nodes that have 2 partitions each (4 partitions total), all of the container’s objects can be ingested in 1 batch. In the same cluster, if there are 40 objects in the source container, it will take 10 batches to fully ingest the data.
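To see how many partitions your cluster has, and therefore how many objects a single batch can ingest, you can list the partitions of your database. A minimal example, assuming a hypothetical database name:

USE your_database;
SHOW PARTITIONS;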

Offset

For Azure Pipelines, an offset simply represents the start and end of a single object with the following integer values:

  • 0, which represents the start of the object
  • 1, which represents the end of the object

If you query the information_schema.PIPELINES_BATCHES table, every successfully loaded batch will show the following values for its earliest and latest offsets:

BATCH_EARLIEST_OFFSET: 0
BATCH_LATEST_OFFSET: 1
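For example, a query like the following returns these offsets per batch. It assumes the pipeline name library from the earlier example, and that the PIPELINE_NAME and BATCH_ID columns are available alongside the offset columns named above:

SELECT PIPELINE_NAME, BATCH_ID, BATCH_EARLIEST_OFFSET, BATCH_LATEST_OFFSET
FROM information_schema.PIPELINES_BATCHES
WHERE PIPELINE_NAME = 'library';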