Outdated Version

You are viewing an older version of this section. View current production version.

Filesystem Pipelines Quickstart min read


Alert

The filesystem pipelines feature requires MemSQL 5.8.5 or above.

Info

This topic does not apply to SingleStore Managed Service.

Info

The filesystem pipelines feature is not supported by the SingleStore (MemSQL) Operator or the SingleStore DB (MemSQL) database engine in Kubernetes environments.

Filesystem Pipeline Quickstart

To create and interact with a Filesystem Pipeline quickly, follow the instructions in this section.

Prerequisites

To complete this Quickstart, your environment must meet the following prerequisites:

  • SingleStore DB installation: You will connect to the database and create a pipeline to pull data from a file.

Part 1: Creating a SingleStore DB Database and Filesystem Pipeline

Once you have a directory that contains a file, you can use SingleStore DB to create a new pipeline and ingest the messages in the file. In this part of the Quickstart, you will create a new Filesystem pipeline and a sample text file, and then ingest the messages from the text file via that pipeline.

Create the test_directory directory and create a data file with the data to load into the database.

mkdir test_directory /* Create directory called 'test_directory' */

cd test_directory /* Navigate to new directory */

cat > books.txt /* Create a new file called 'books.txt' */

/* Copy the following text to the terminal: */

The Catcher in the Rye, J.D. Salinger, 1945
Pride and Prejudice, Jane Austen, 1813
Of Mice and Men, John Steinbeck, 1937
Frankenstein, Mary Shelley, 1818

Ctrl-D

For the purposes of this example, we have created the books.txt file in the test_directory in the root of the filesystem. In a typical real-world scenario, however, the filesystem extractor would be used with an NFS-mounted drive, with the source directory located on a different server.

For the next step you will create the database and table to receive the data.

memsql

CREATE DATABASE books;

USE books;

CREATE TABLE classic_books
(
title VARCHAR(255),
author VARCHAR(255),
date VARCHAR(255)
);

These statements create a new database named books and a new table named classic_books, which has three columns: title, author, and date.

Now that the destination database and table have been created, you can create a Filesystem pipeline. To create the pipeline, you will need the following information:

  • The name of the directory, such as: test_directory
CREATE PIPELINE library
AS LOAD DATA FS '/absolute/path/to/test_directory'
INTO TABLE `classic_books`
FIELDS TERMINATED BY ',';

You can see what files the pipeline wants to load by running the following:

SELECT * FROM information_schema.PIPELINES_FILES;

If everything is properly configured, you should see one row in the Unloaded state, corresponding to books.txt. The CREATE PIPELINE statement creates a new pipeline named library, but the pipeline has not yet been started, and no data has been loaded. A SingleStore pipeline can run either in the background or be triggered by a foreground query. Start it in the foreground first.

START PIPELINE library FOREGROUND;

When this command returns success, all files from your directory will be loaded. If you check information_schema.PIPELINES_FILES again, you should see all files in the Loaded state. Now query the classic_books table to make sure the data has actually loaded.

memsql> SELECT * FROM classic_books;
+------------------------+-----------------+-------+
| title                  | author          | date  |
+------------------------+-----------------+-------+
| The Catcher in the Rye |  J.D. Salinger  |  1945 |
| Pride and Prejudice    |  Jane Austen    |  1813 |
| Of Mice and Men        |  John Steinbeck |  1937 |
| Frankenstein           |  Mary Shelley   |  1818 |
+------------------------+-----------------+-------+

You can also have SingleStore DB run your pipeline in background. In such a configuration, SingleStore DB will periodically poll the directory for new files and continuously them as they are added to the directory. Before running your pipeline in the background, you must reset the state of the pipeline and the table.

DELETE FROM classic_books;
ALTER PIPELINE library SET OFFSETS EARLIEST;

The first command deletes all rows from the target table. The second causes the pipeline to start from the beginning, in this case, “forgetting” it already loaded books.txt so you can load it again. You can also drop and recreate the pipeline, if you prefer.
To start a pipeline in the background, run

START PIPELINE library;

This statement starts the pipeline. To see whether the pipeline is running, run SHOW PIPELINES.

SHOW PIPELINES;
+----------------------+---------+
| Pipelines_in_books   | State   |
+----------------------+---------+
| library              | Running |
+----------------------+---------+

At this point, the pipeline is running and the contents of the books.txt file should once again be present in the classic_books table.

Info

Foreground pipelines and background pipelines have different intended uses and behave differently. For more information, see the START PIPELINE topic.

Next Steps

Now that you have a running pipeline, any new files you add to your directory will be automatically ingested. To understand how a Filesystem pipeline ingests large amounts of files in a directory, see the Parallelized Data Loading section in the Extractors topic. You can also learn more about how to transform the ingested data by reading the Transforms topic.