Outdated Version

You are viewing an older version of this section. View current production version.

Filesystem Pipelines Overview

Alert
Filesystem Pipelines Requires MemSQL 5.8.5 or above.

MemSQL Pipelines can extract files from a Linux filesystem directory, optionally transform them, and insert them into a destination table. To get started quickly using a filesystem pipeline, see Filesystem Pipelines Quickstart.

Example Filesystem Pipeline Scenario

Imagine you have a software service producing log files that you wish to load into MemSQL. The log files are dropped to a file server on another machine that is mounted as an NFS drive on your MemSQL cluster. The drive contains a large number of files in CSV or TSV format, either as flat files are zipped (the filesystem extractor will automatically unzip files with a .gz extension). New files are being generated every minute, and you need a persistent, reliable, and automatic method of loading their contents into a database.

A Filesystem Pipeline is the ideal solution to this scenario. Consider a MemSQL cluster with one master aggregator node, one child aggregator node, and two leaf nodes. Given this topology, the following steps occur when you start a Filesystem Pipeline:

  1. The MemSQL cluster’s master aggregator accesses the specified directory and authenticates if necessary.
  2. The master aggregator reads the contents of the directory, divides the number of files in the directory by the number of leaf partitions in the cluster, and assigns a portion of the files to each leaf.
  3. Each leaf node begins streaming their assigned files from the directory, and authenticating if necessary. If a transform executable was specified when the pipeline was created, the leaf node will transform the file’s data as it is streamed from the filesystem.
  4. The file’s records are streamed into the destination table immediately after extraction and optional transformation.

As new files are added to the source directory, the master aggregator constantly monitors it to ensure that the files will be processed automatically.

Filesystem Pipeline Syntax Examples

The following syntax demonstrates how to create a new Filesystem Pipeline. For complete syntax documentation, see CREATE PIPELINE.

Example 1 – Read all files in a directory for CSV files:

CREATE PIPELINE library
AS LOAD DATA FS '/path/to/files/*'
INTO TABLE `classic_books`
FIELDS TERMINATED BY ',';

Filesystem Paths and Permissions

The paths used by the Unix filesystem extractor address files which must be accessible from every node in the cluster. The most common way to use this extractor is with a distributed filesystem, such as NFS. The path provided to a Unix filesystem extractor must be absolute, and will usually include a Unix glob pattern which matches many files, as seen in the example above. In order for the Unix filesystem extractor to load data, the files must grant read permissions to the MemSQL Linux user.

Filesystem Pipeline Batches and Offsets

Metadata and configuration information for every pipeline is written to tables in the information_schema database. These tables use the terms batch and offset, which are standard terms across all MemSQL pipelines, but have the following definitions in the context of Filesystem pipelines:

Batch

When the master aggregator reads a directory’s contents, it schedules each file on a single MemSQL partition. After each leaf partition across the cluster has finished extracting, transforming, and loading its file, a batch has been completed. Therefore, a Filesystem pipeline batch is defined as a cluster-level operation where each node processes a single file from the source directory.

Consider the following example: There are 4 files in a directory. If your cluster has 2 leaf nodes that have 2 partitions each (4 partitions total), all of the files can be ingested in one batch. In the same cluster, if there are 40 files in the source directory, it will take 10 batches to fully ingest the data.

Offset

For Filesystem Pipelines, an offset simply represents the start and end of a single file with the following integer values:

  • 0, which represents the start of the file
  • 1, which represents the end of the file

If you query the information_schema.PIPELINES_BATCHES table, all successfully loaded batches will simply state the following values for the earliest and latest batch offsets:

BATCH_EARLIEST_OFFSET: 0
BATCH_LATEST_OFFSET: 1