Extractors
An extractor pulls data from an external system into MemSQL. Extractors are a built-in component of Pipelines and are not installed independently. Currently, you can only use the extractors that MemSQL provides.
When you create a new pipeline, you specify the extractor in the LOAD DATA clause of the CREATE PIPELINE statement. For example, the following statement creates a Kafka pipeline:
CREATE PIPELINE mypipeline AS
LOAD DATA KAFKA '192.168.1.100:9092/my-topic'
INTO TABLE t
Supported Extractor Data Sources
| Data Source | Data Source Version | MemSQL Version |
|---|---|---|
| Apache Kafka | 0.8.2.2 or newer | 5.5.0 or newer |
Parallelized Data Loading
Data is extracted from a source in parallel to ensure high performance. The specific details of parallelization depend on the source’s partitioning architecture, but there are a few general rules:
- A pipeline pairs the n partitions of the data source with the p partitions on the MemSQL leaf nodes.
- Each leaf node partition runs its own extraction process independently of other leaf nodes and their partitions.
- Extracted data is stored on the leaf node where a partition resides until it can be written to the destination table. Depending on how your database is sharded, the extracted data may be stored on this leaf node only temporarily.
As a best practice, there should be a 1:1 relationship between the number of leaf node partitions and the number of partitions in the data source. For example, if your database has two leaves with eight partitions each, your data source should have 16 partitions. If the two partition counts differ, some leaf node partitions will either sit idle or process uneven amounts of data. However, even when leaf nodes process uneven amounts of data, ingestion will generally perform much better than parallel loading through aggregator nodes.
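The effect of mismatched partition counts can be illustrated with a small model. This is a minimal sketch, not MemSQL's actual scheduling logic: it assumes source partitions are handed to leaf node partitions in simple round-robin order, and the function names `assign_round_robin` and `summarize` are hypothetical.

```python
def assign_round_robin(source_partitions: int, leaf_partitions: int) -> list[int]:
    """Return how many source partitions each leaf partition reads.

    Assumption for illustration only: source partition i is read by
    leaf partition i % leaf_partitions. The real assignment may differ.
    """
    load = [0] * leaf_partitions
    for src in range(source_partitions):
        load[src % leaf_partitions] += 1
    return load


def summarize(source_partitions: int, leaf_partitions: int) -> dict[str, int]:
    """Summarize idle leaf partitions and the spread of work across them."""
    load = assign_round_robin(source_partitions, leaf_partitions)
    return {
        "idle": sum(1 for count in load if count == 0),
        "min": min(load),
        "max": max(load),
    }


if __name__ == "__main__":
    # Two leaves with eight partitions each gives 16 leaf partitions.
    for n in (16, 8, 24):
        print(n, "source partitions:", summarize(n, 16))
```

Under this assumption, 16 source partitions keep every leaf partition equally busy, 8 source partitions leave half of the leaf partitions idle, and 24 source partitions give some leaf partitions twice the work of others.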