You are viewing an older version of this section. View current production version.
Filesystem Pipelines Overview
MemSQL Pipelines can extract files from a Linux filesystem directory, optionally transform them, and insert them into a destination table. To get started quickly using a filesystem pipeline, see Filesystem Pipelines Quickstart.
Example Filesystem Pipeline Scenario
Imagine you have a software service producing log files that you wish to load into MemSQL. The log files are dropped to a file server on another machine that is mounted as an NFS drive on your MemSQL cluster. The drive contains a large number of files in CSV or TSV format. New files are being generated every minute, and you need a persistent, reliable, and automatic method of loading their contents into a database.
A Filesystem Pipeline is the ideal solution to this scenario. Consider a MemSQL cluster with one master aggregator node, one child aggregator node, and two leaf nodes. Given this topology, the following steps occur when you start a Filesystem Pipeline:
- The MemSQL cluster’s master aggregator accesses the specified directory and authenticates if necessary.
- The master aggregator reads the contents of the directory, divides the number of files in the directory by the number of leaf partitions in the cluster, and assigns a portion of the files to each leaf.
- Each leaf node begins streaming their assigned files from the directory, and authenticating if necessary. If a transform executable was specified when the pipeline was created, the leaf node will transform the file’s data as it is streamed from the filesystem.
- The file’s records are streamed into the destination table immediately after extraction and optional transformation.
As new files are added to the source directory, the master aggregator constantly monitors it to ensure that the files will be processed automatically.
Filesystem Pipeline Syntax Examples
The following syntax demonstrates how to create a new Filesystem Pipeline. For complete syntax documentation, see CREATE PIPELINE.
Example 1 – Read all files in a directory for CSV files:
CREATE PIPELINE library AS LOAD DATA FS '/path/to/files/*' INTO TABLE `classic_books` FIELDS TERMINATED BY ',';
Filesystem Paths and Permissions
The paths used by the Unix filesystem extractor address files which must be accessible from every node in the cluster. The most common way to use this extractor is with a distributed filesystem, such as NFS. The path provided to a Unix filesystem extractor must be absolute, and will usually include a Unix glob pattern which matches many files, as seen in the example above. In order for the Unix filesystem extractor to load data, the files must grant read permissions to the MemSQL Linux user.
Filesystem Pipeline Batches and Offsets
Metadata and configuration information for every pipeline is written to tables in the
information_schema database. These tables use the terms batch and offset, which are standard terms across all MemSQL pipelines, but have the following definitions in the context of Filesystem pipelines:
When the master aggregator reads a directory’s contents, it schedules each file on a single MemSQL partition. After each leaf partition across the cluster has finished extracting, transforming, and loading its file, a batch has been completed. Therefore, a Filesystem pipeline batch is defined as a cluster-level operation where each node processes a single file from the source directory.
Consider the following example: There are 4 files in a directory. If your cluster has 2 leaf nodes that have 2 partitions each (4 partitions total), all of the files can be ingested in one batch. In the same cluster, if there are 40 files in the source directory, it will take 10 batches to fully ingest the data.
For Filesystem Pipelines, an offset simply represents the start and end of a single file with the following integer values:
0, which represents the start of the file
1, which represents the end of the file
If you query the
information_schema.PIPELINES_BATCHES table, all successfully loaded batches will simply state the following values for the earliest and latest batch offsets:
BATCH_EARLIEST_OFFSET: 0 BATCH_LATEST_OFFSET: 1