CREATE PIPELINE
The CREATE PIPELINE statement creates a new pipeline in a MemSQL database. The complete syntax with all possible clauses is shown below:
CREATE PIPELINE [IF NOT EXISTS] pipeline_name AS
LOAD DATA { kafka_configuration | s3_configuration }
[BATCH_INTERVAL milliseconds]
[WITH TRANSFORM ('uri', 'executable', 'arguments [...]') ]
[REPLACE | IGNORE | SKIP { ALL | CONSTRAINT | DUPLICATE KEY } ERRORS]
INTO TABLE table_name
[FIELDS | COLUMNS]
[TERMINATED BY 'string'
[[OPTIONALLY] ENCLOSED BY 'char']
[ESCAPED BY 'char']
]
[LINES
[STARTING BY 'string']
[TERMINATED BY 'string']
]
[IGNORE number LINES]
[ (column_name, ... ) ]
[ON DUPLICATE KEY UPDATE column_name = expression, [...]]
kafka_configuration:
KAFKA 'kafka_topic_endpoint'
s3_configuration:
S3 { 'bucket-name' | 'bucket-name/object-name' | 'bucket-name/prefix/object-name' }
CREDENTIALS 'credentials_json'
[CONFIG 'configuration_json']
Kafka Pipeline Syntax
The following example statement demonstrates how to create a Kafka pipeline using the minimum required syntax:
Minimum Required Kafka Pipeline Syntax:
CREATE PIPELINE mypipeline AS
LOAD DATA KAFKA '127.0.0.1/my-topic'
INTO TABLE `my_table`;
START PIPELINE mypipeline;
This statement creates a new pipeline named mypipeline, uses a Kafka cluster as the data source, points to the location of the my-topic topic at the Kafka cluster’s endpoint, and will start ingesting data into my_table. For more information about Kafka Pipelines, see Kafka Pipelines Overview. For more information on how pipelines can be started and run, see START PIPELINE.
S3 Pipeline Syntax
The following example statement demonstrates how to create an S3 pipeline using the minimum required syntax:
Minimum Required S3 Pipeline Syntax:
CREATE PIPELINE mypipeline AS
LOAD DATA S3 'my-bucket-name'
CREDENTIALS '{"aws_access_key_id": "your_access_key_id", "aws_secret_access_key": "your_secret_access_key"}'
INTO TABLE `my_table`;
START PIPELINE mypipeline;
This statement creates a new pipeline named mypipeline, uses an S3 bucket named my-bucket-name as the data source, and will start ingesting the bucket’s objects into my_table. For more information about S3 Pipelines, see S3 Pipelines Overview. For more information on how pipelines can be started and run, see START PIPELINE.
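As the s3_configuration syntax above shows, the S3 source can also point at a single object, optionally nested under a prefix. The following is a minimal sketch that assumes hypothetical bucket, prefix, and object names; substitute your own values:
S3 Pipeline Using a Specific Object:
CREATE PIPELINE mypipeline AS
LOAD DATA S3 'my-bucket-name/my-prefix/my-object-name'  -- hypothetical bucket/prefix/object path
CREDENTIALS '{"aws_access_key_id": "your_access_key_id", "aws_secret_access_key": "your_secret_access_key"}'
INTO TABLE `my_table`;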
Note that no CONFIG clause is required to create an S3 pipeline. This clause is used to specify the Amazon S3 region where the source bucket is located. If no CONFIG clause is specified, MemSQL will automatically use the us-east-1 region, also known as US Standard in the Amazon S3 console. To specify a different region, such as us-west-1, include a CONFIG clause as shown:
S3 Pipeline Using Specified Region:
CREATE PIPELINE mypipeline AS
LOAD DATA S3 'my-bucket-name'
CREDENTIALS '{"aws_access_key_id": "your_access_key_id", "aws_secret_access_key": "your_secret_access_key"}'
CONFIG '{"region": "us-west-1"}'
INTO TABLE `my_table`;
Each of the clauses in a CREATE PIPELINE statement is described below.
LOAD DATA
The CREATE PIPELINE statement uses standard LOAD DATA syntax options, including its error handling options. For more information, see the LOAD DATA page. An example that combines several of these options appears after the list below.
AS LOAD DATA: You can load data by specifying the data source as a Kafka cluster and topic, an S3 bucket or object (with optional prefix), or a file.
- AS LOAD DATA KAFKA 'topic_endpoint': To use Kafka as a data source, you must specify the endpoint for the Kafka cluster and the path to a specific topic within the cluster. For example: LOAD DATA KAFKA '127.0.0.1/my-topic'
- AS LOAD DATA S3 'bucket-name': To use S3 as a data source, you must specify a bucket name or a bucket name and object name. For example: LOAD DATA S3 'my-bucket-name'
- [BATCH_INTERVAL milliseconds]: You can specify a batch interval in milliseconds, which is the time duration between the end of a batch operation and the start of the next one. If a batch interval is not specified, the default value is 0, which results in the extractor batching data as quickly as possible. For example: LOAD DATA KAFKA '127.0.0.1/my-topic' BATCH_INTERVAL 500
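The following sketch combines several of the LOAD DATA options described above: a batch interval, an error handling clause, and field and line options from the syntax at the top of this page. The topic name, table name, delimiters, and header count are placeholder values, so adjust them for your own data:
Kafka Pipeline Using LOAD DATA Options:
CREATE PIPELINE mypipeline AS
LOAD DATA KAFKA '127.0.0.1/my-topic'
BATCH_INTERVAL 500
SKIP DUPLICATE KEY ERRORS       -- LOAD DATA error handling option
INTO TABLE `my_table`
FIELDS TERMINATED BY ','        -- assumes comma-separated values
LINES TERMINATED BY '\n'
IGNORE 1 LINES;                 -- assumes a single header line to skip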
WITH TRANSFORM
Transforms are currently a preview feature of Pipelines. Their syntax and functionality may change in future MemSQL versions.
Pipeline source data can be transformed by specifying an executable program. The data is transformed after the extraction process and before it is loaded into the database. For more information, see Transforms.
WITH TRANSFORM ('uri', 'executable', 'arguments [...]'): Each of the transform’s parameters is required, and they are described below:
- uri: The transform’s URI is the location from where the executable program can be downloaded, which is specified as either an http:// or file:// endpoint. If the URI points to a tarball with a .tar.gz or .tgz extension, its contents will be automatically extracted. Additionally, the executable parameter must be specified if the uri is a tarball. If the URI specifies an executable file itself, the executable and arguments parameters can be empty.
- executable: The filename of the transform executable to run. This parameter is required if a tarball was specified as the endpoint for the transform’s uri. If the uri itself specifies an executable, this parameter can be empty.
- arguments: A series of arguments that are passed to the transform executable at runtime. Each argument must be delimited by a space.
The following examples show valid WITH TRANSFORM clauses:
WITH TRANSFORM('http://memsql.com/my-transform.py','','')
WITH TRANSFORM('http://memsql.com/my-transform-tarball.tar.gz', 'my-executable.py','')
WITH TRANSFORM('http://memsql.com/my-transform-tarball.tar.gz', 'my-executable.py', '-arg1 -arg2')
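Putting these pieces together, a complete CREATE PIPELINE statement that uses a transform might look like the following sketch, which reuses the hypothetical tarball and executable names from the examples above:
Kafka Pipeline Using a Transform:
CREATE PIPELINE mypipeline AS
LOAD DATA KAFKA '127.0.0.1/my-topic'
WITH TRANSFORM ('http://memsql.com/my-transform-tarball.tar.gz', 'my-executable.py', '')  -- placeholder tarball and executable names
INTO TABLE `my_table`;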