Filesystem Pipelines Overview
The filesystem pipelines feature requires MemSQL 5.8.5 or above.
The filesystem pipelines feature is not supported by the MemSQL Operator or the MemSQL database engine in Kubernetes environments.
MemSQL Pipelines can extract files from a Linux filesystem directory, optionally transform them, and insert them into a destination table. To get started quickly using a filesystem pipeline, see Filesystem Pipelines Quickstart.
Example Filesystem Pipeline Scenario
Imagine you have a software service producing log files that you wish to load into MemSQL. The log files are dropped onto a file server on another machine that is mounted as an NFS drive on your MemSQL cluster. The drive contains a large number of files in CSV or TSV format, either as flat files or gzipped (the filesystem extractor will automatically unzip files with a .gz extension). New files are being generated every minute, and you need a persistent, reliable, and automatic method of loading their contents into a database.
A Filesystem Pipeline is the ideal solution to this scenario. Consider a MemSQL cluster with one master aggregator node, one child aggregator node, and two leaf nodes. Given this topology, the following steps occur when you start a Filesystem Pipeline:
- The MemSQL cluster’s master aggregator accesses the specified directory and authenticates if necessary.
- The master aggregator reads the contents of the directory, divides the number of files in the directory by the number of leaf partitions in the cluster, and assigns a portion of the files to each leaf.
- Each leaf node begins streaming its assigned files from the directory, authenticating if necessary. If a transform executable was specified when the pipeline was created, the leaf node transforms the file’s data as it is streamed from the filesystem (a sketch appears after this list).
- The file’s records are streamed into the destination table immediately after extraction and optional transformation.
The master aggregator constantly monitors the source directory, so new files are processed automatically as they are added.
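For example, a transform executable can be attached when the pipeline is created. The following is a minimal sketch; the script path, pipeline name, and table name are hypothetical, and the full WITH TRANSFORM syntax is documented in CREATE PIPELINE:
CREATE PIPELINE logs_transformed
AS LOAD DATA FS '/path/to/files/*'
WITH TRANSFORM ('file://localhost/path/to/transform.py', '', '')
INTO TABLE `logs`
FIELDS TERMINATED BY ',';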
Filesystem Pipeline Syntax Examples
The following syntax demonstrates how to create a new Filesystem Pipeline. For complete syntax documentation, see CREATE PIPELINE.
Example 1 – Read all files in a directory as CSV files:
CREATE PIPELINE library
AS LOAD DATA FS '/path/to/files/*'
INTO TABLE `classic_books`
FIELDS TERMINATED BY ',';
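Example 2 – Read gzipped TSV log files, as in the scenario above. This is a sketch; the NFS mount path, pipeline name, and table name are hypothetical:
CREATE PIPELINE web_logs
AS LOAD DATA FS '/mnt/nfs/logs/*.gz'
INTO TABLE `web_logs`
FIELDS TERMINATED BY '\t';
A newly created pipeline does not begin loading data until it is started:
START PIPELINE library;

-- Verify that the pipeline is now in the Running state
SHOW PIPELINES;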
Filesystem Paths and Permissions
The paths used by the Unix filesystem extractor must resolve to files that are accessible from every node in the cluster. The most common way to use this extractor is with a distributed filesystem, such as NFS. The path provided to a Unix filesystem extractor must be absolute, and will usually include a Unix glob pattern that matches many files, as seen in the example above. In order for the Unix filesystem extractor to load data, the files must grant read permissions to the MemSQL Linux user.
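For instance, the following path specification is valid because it is absolute (the mount point shown is hypothetical); a relative path such as 'incoming/*.csv' would not be:
CREATE PIPELINE incoming_csv
AS LOAD DATA FS '/data/incoming/*.csv'
INTO TABLE `incoming`
FIELDS TERMINATED BY ',';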
Filesystem Pipeline Batches
Metadata and configuration information for every pipeline can be found in tables in the information_schema database. These tables use the term batch, which is a standard term across all MemSQL pipelines, but which has the following definition in the context of Filesystem pipelines.
When the master aggregator reads the directory’s contents, it schedules a subset of the files for ingest across all database partitions. After each partition across the cluster has finished extracting, transforming, and loading its file, a batch has been completed. Therefore, a Filesystem pipeline batch is defined as a cluster-level operation where each partition processes at most a single file from the source directory.
Consider the following example: there are 4 files in a source directory. If your cluster has 2 leaf nodes with 2 partitions each (4 partitions total), all of the directory’s files can be ingested in 1 batch. In the same cluster, if there are 40 files in the source directory, it will take 10 batches to fully ingest the data.
Information about recent batches can be found in information_schema.PIPELINES_BATCHES_SUMMARY.
Information about files to be loaded can be found in information_schema.PIPELINES_FILES.
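Both tables can be queried directly to monitor a pipeline. A minimal sketch, using the library pipeline from Example 1:
-- Recent batch activity for the pipeline
SELECT * FROM information_schema.PIPELINES_BATCHES_SUMMARY
WHERE PIPELINE_NAME = 'library';

-- Per-file load state for the pipeline
SELECT FILE_NAME, FILE_STATE FROM information_schema.PIPELINES_FILES
WHERE PIPELINE_NAME = 'library';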