Extractors
An extractor pulls data from an external system into a MemSQL database. Extractors are a built-in component of Pipelines; they are not installed independently. Currently, you can only use the extractors that MemSQL provides.
When you create a new pipeline, the extractor is specified in the LOAD DATA statement. For example, the following statement creates a Kafka pipeline:
CREATE PIPELINE mypipeline AS
LOAD DATA KAFKA '192.168.1.100:9092/my-topic'
INTO TABLE t;
The following statement creates an S3 pipeline:
CREATE PIPELINE library AS
LOAD DATA S3 'my-bucket-name'
CREDENTIALS '{"aws_access_key_id": "your_access_key_id", "aws_secret_access_key": "your_secret_access_key"}'
INTO TABLE t;
Supported Extractor Data Sources
Data Source | Data Source Version | MemSQL Version |
---|---|---|
Apache Kafka | 0.8.2.2 or newer | 5.5.0 or newer |
Amazon S3 | N/A | 5.7.0 or newer |
Parallelized Data Loading
Data is extracted from a source in parallel to ensure high performance. The specific details of parallelization depend on the source’s partitioning architecture, but there are a few general rules:
- A pipeline pairs n source partitions with p MemSQL leaf node partitions.
- Each leaf node partition runs its own extraction process independently of other leaf nodes and their partitions.
- Extracted data is stored on the leaf node where a partition resides until it can be written to the destination table. Depending on the way your table is sharded, the extracted data may only temporarily be stored on this leaf node.
Data Loading for Kafka Pipelines
For Kafka pipelines, there should be a 1:1 relationship between the number of leaf node partitions and the number of Kafka partitions. For example, if your database has two leaves with eight partitions each, your Kafka cluster should have 16 partitions. If the number of leaf node partitions differs from the number of Kafka partitions, leaf nodes will either sit idle or process uneven amounts of data. However, even when leaf nodes process uneven amounts of data, ingestion using Pipelines will generally perform better than parallel loading through aggregator nodes.
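The effect of a partition-count mismatch can be illustrated with a small model. The sketch below assumes a simple round-robin pairing of Kafka partitions to leaf node partitions. That pairing rule and the function name `assign_round_robin` are hypothetical, chosen only for illustration; they do not describe MemSQL's actual scheduling.

```python
from typing import List


def assign_round_robin(source_partitions: int, leaf_partitions: int) -> List[int]:
    """Return how many source partitions each leaf node partition would read.

    Assumption: source partitions are dealt out round-robin. This is an
    illustrative model, not MemSQL's documented assignment algorithm.
    """
    if source_partitions < 0 or leaf_partitions <= 0:
        raise ValueError("partition counts must be non-negative and leaf_partitions > 0")
    counts = [0] * leaf_partitions
    for i in range(source_partitions):
        counts[i % leaf_partitions] += 1
    return counts


# Two leaves with eight partitions each gives 16 leaf node partitions.
balanced = assign_round_robin(16, 16)  # 1:1, every leaf partition reads one Kafka partition
idle = assign_round_robin(12, 16)      # fewer Kafka partitions: some leaf partitions read nothing
uneven = assign_round_robin(20, 16)    # more Kafka partitions: some leaf partitions read two
```

Under this model, only the 16:16 case keeps every leaf node partition equally busy; the other two cases correspond to the idle and uneven scenarios described above.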
Data Loading for S3 Pipelines
For S3 pipelines, each leaf node partition will process a single object from the source bucket in a batch. For example, if your cluster has 16 partitions, and the source bucket contains 16 objects, the 1:1 relationship between objects and partitions means that every object in the bucket will be ingested at the same time. Once each partition has finished processing its object, the batch will be complete. For more information on batches in S3, see S3 Pipeline Batches and Offsets.
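As a rough sketch of the arithmetic, the number of batches needed to drain a bucket can be estimated from the object count and the partition count. This assumes that every batch takes at most one object per leaf node partition and that no new objects arrive in the meantime; `batches_needed` is a hypothetical helper, not part of MemSQL.

```python
import math


def batches_needed(object_count: int, partition_count: int) -> int:
    """Estimate how many batches are needed to ingest object_count objects.

    Assumption: each batch processes at most one object per leaf node partition.
    """
    if object_count < 0 or partition_count <= 0:
        raise ValueError("object_count must be >= 0 and partition_count must be > 0")
    return math.ceil(object_count / partition_count)


one_batch = batches_needed(16, 16)    # the 16-object, 16-partition example above
two_batches = batches_needed(17, 16)  # one extra object spills into a second batch
```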
If the source bucket contains objects that greatly differ in size, it’s important to understand how an S3 pipeline’s performance may be affected. Consider two partitions on a leaf: partition1 is processing an object that is 1 KB in size, while partition2 is processing an object that is 10 MB in size. partition1 will finish processing its object sooner than partition2. In this case, partition1 will sit idle and will not extract the next object from the bucket until partition2 finishes processing its 10 MB object. New objects will be processed only when partition1 and partition2 have both finished processing their respective objects.
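The cost of uneven object sizes can be estimated with a simple model: a batch lasts as long as its slowest object, and every other partition idles for the difference. The sketch below assumes processing time is proportional to object size at a fixed throughput. The 1 MB/s figure and the function name `batch_idle_times` are hypothetical and used only to make the numbers concrete.

```python
from typing import List, Tuple

KB = 1024
MB = 1024 * KB


def batch_idle_times(object_sizes_bytes: List[int],
                     bytes_per_second: float) -> Tuple[float, List[float]]:
    """Return (batch duration, idle seconds per partition) for one batch.

    Assumption: each partition processes its object at the same fixed
    throughput, so duration is proportional to object size.
    """
    durations = [size / bytes_per_second for size in object_sizes_bytes]
    batch_duration = max(durations)  # the batch ends when the slowest partition finishes
    idle = [batch_duration - d for d in durations]
    return batch_duration, idle


# partition1 reads a 1 KB object, partition2 reads a 10 MB object.
duration, idle = batch_idle_times([1 * KB, 10 * MB], bytes_per_second=1 * MB)
```

With these assumed numbers, partition1 spends almost the entire batch idle, which is why buckets with similarly sized objects tend to use leaf node partitions more evenly.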