GCS Pipelines Quickstart

To create and interact with a GCS pipeline quickly, follow the instructions in this section.

Prerequisites

To complete this Quickstart, your environment must meet the following prerequisites:

  • Operating System: Mac OS X or Linux
  • Docker: Version 1.12 or newer. These instructions are written for Docker for Mac. Docker Toolbox is compatible as well, but no instructions are provided. While Docker is required for this Quickstart, Pipelines and MemSQL itself have no dependency on Docker.
  • GCS Account: This Quickstart uses Google Cloud Storage and requires a Google access_id and secret_key.
  • MemSQL Version: Must be 7.0.14 or newer.

Part 1: Creating a GCS Bucket and Adding a File

The first part of this Quickstart involves creating a new bucket in your GCS account and then adding a simple file to the bucket. You can create a new bucket in several ways, but the following steps use the browser-based Google Cloud Console.

Note: The following steps assume that you have previous experience with Google Cloud Storage. If you are unfamiliar with this service, see the Google Cloud Storage Documentation.

  1. On your local machine, create a text file with the following CSV contents and name it books.txt:
The Catcher in the Rye, J.D. Salinger, 1945
Pride and Prejudice, Jane Austen, 1813
Of Mice and Men, John Steinbeck, 1937
Frankenstein, Mary Shelley, 1818
  2. Open the Cloud Storage browser in the Google Cloud Console.
  3. Click Create Bucket to open the bucket creation form.
  4. Enter your bucket information and click Continue to complete each step:
     a. Specify a Name, subject to the bucket name requirements.
     b. Select a Default storage class for the bucket. This class is assigned by default to all objects uploaded to the bucket.
     c. Select a Location where the bucket data will be permanently stored.
     d. Select an Access control model to determine how you control access to the bucket’s objects. Create an HMAC key for authentication (MemSQL pipelines support only HMAC keys).
     e. Click Done.
  5. To upload a file, click the name of the bucket that you want to upload an object to.
  6. In the Objects tab for the bucket, either:
     a. Drag and drop the desired files from your desktop or file manager to the main pane in the Cloud Console.
     b. Click the Upload Files button, select the files you want to upload in the dialog that appears, and click Open.
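
As an alternative to creating the file by hand in the first step, the sample file can be generated and sanity-checked from a terminal. The following is a minimal sketch, assuming a POSIX shell with awk available. The check that every line has exactly three comma-separated fields is an illustrative addition, not part of the original steps.

```shell
# Write the four sample rows to books.txt in the current directory.
cat > books.txt <<'EOF'
The Catcher in the Rye, J.D. Salinger, 1945
Pride and Prejudice, Jane Austen, 1813
Of Mice and Men, John Steinbeck, 1937
Frankenstein, Mary Shelley, 1818
EOF

# Count lines that do not have exactly three comma-separated fields.
bad_lines=$(awk -F',' 'NF != 3 { n++ } END { print n + 0 }' books.txt)
echo "malformed lines: $bad_lines"
```

If the gsutil tool happens to be installed and authenticated, `gsutil cp books.txt gs://my-bucket-name/` is one way to upload the file instead of using the console. Here my-bucket-name is the same placeholder bucket name used later in this Quickstart.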

Part 2: Creating a MemSQL Database and GCS Pipeline in Docker

Now that you have a GCS bucket that contains an object (file), you can use MemSQL to create a new pipeline and ingest the file's contents. In this part of the Quickstart, you will create a Docker container to run MemSQL and then create a new GCS pipeline.

In a new terminal window, execute the following command:

docker run --name memsql -p 3306:3306 -p 9000:9000 memsql/quickstart

This command automatically downloads the memsql/quickstart Docker image from Docker Hub, creates a new container using the image, assigns the container a user-friendly name (memsql), and finally starts the container.

You will see a number of lines printed to the terminal as the container initializes and MemSQL starts. Once the initialization process is complete, open a new terminal window and execute the following command:

docker exec -it memsql memsql

This command opens the MemSQL client within the Docker container. Now create a new database and a table that adheres to the schema of the books.txt file. At the MemSQL prompt, execute the following statements:

CREATE DATABASE books;
USE books;
CREATE TABLE classic_books
(
  title VARCHAR(255),
  author VARCHAR(255),
  date VARCHAR(255)
);

These statements create a new database named books and a new table named classic_books, which has three columns: title, author, and date.

Now that the destination database and table have been created, you can create a GCS pipeline. In Part 1 of this Quickstart, you uploaded the books.txt file to your bucket. To create the pipeline, you will need the following information:

  • The name of the bucket, such as: my-bucket-name
  • Your Google account’s HMAC keys, such as:
      Access Key ID: your_access_key_id
      Secret Access Key: your_secret_access_key

Using these identifiers and keys, execute the following statement, replacing the placeholder values with your own:

CREATE PIPELINE library
AS LOAD DATA GCS 'my-bucket-name/books.txt'
CREDENTIALS '{"access_id": "your_access_key_id", "secret_key": "your_secret_access_key"}'
INTO TABLE `classic_books`
FIELDS TERMINATED BY ',';
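
To avoid editing the keys into the statement by hand, the same statement can be generated from environment variables. This is a sketch only: the variable names GCS_BUCKET, GCS_ACCESS_ID, and GCS_SECRET_KEY and the output file name create_pipeline.sql are assumptions made for illustration, and the defaults are the placeholder values from this Quickstart.

```shell
# Hypothetical variable names; the defaults are placeholders, not real keys.
GCS_BUCKET="${GCS_BUCKET:-my-bucket-name}"
GCS_ACCESS_ID="${GCS_ACCESS_ID:-your_access_key_id}"
GCS_SECRET_KEY="${GCS_SECRET_KEY:-your_secret_access_key}"

# Write the CREATE PIPELINE statement with the values substituted.
# The backticks are escaped so the shell does not treat them as a command.
cat > create_pipeline.sql <<EOF
CREATE PIPELINE library
AS LOAD DATA GCS '${GCS_BUCKET}/books.txt'
CREDENTIALS '{"access_id": "${GCS_ACCESS_ID}", "secret_key": "${GCS_SECRET_KEY}"}'
INTO TABLE \`classic_books\`
FIELDS TERMINATED BY ',';
EOF

# Show the result so it can be reviewed before use.
cat create_pipeline.sql
```

The generated statement can then be pasted at the MemSQL prompt. Keys that contain quote characters would need escaping, which this sketch does not handle.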

You can see what files the pipeline wants to load by running the following:

SELECT * FROM information_schema.PIPELINES_FILES;

If everything is properly configured, you should see one row in the Unloaded state, corresponding to books.txt. The CREATE PIPELINE statement creates a new pipeline named library, but the pipeline has not yet been started, and no data has been loaded. A MemSQL pipeline can run either in the background or be triggered by a foreground query. Start it in the foreground first.

START PIPELINE library FOREGROUND;

When this command returns success, all files from your bucket will be loaded. If you check information_schema.PIPELINES_FILES again, you should see all files in the Loaded state. Now query the classic_books table to make sure the data has actually loaded.

SELECT * FROM classic_books;
+------------------------+-----------------+-------+
| title                  | author          | date  |
+------------------------+-----------------+-------+
| The Catcher in the Rye |  J.D. Salinger  |  1945 |
| Pride and Prejudice    |  Jane Austen    |  1813 |
| Of Mice and Men        |  John Steinbeck |  1937 |
| Frankenstein           |  Mary Shelley   |  1818 |
+------------------------+-----------------+-------+
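
The extra space before each author and date in the output above is consistent with how the sample file is split: FIELDS TERMINATED BY ',' splits on the comma alone, so the space that follows each comma stays in the next field. The sketch below reproduces that split with awk on one sample line and shows one possible cleanup with sed. Both run outside MemSQL as illustrations and assume that no field itself contains a comma followed by a space.

```shell
line='Pride and Prejudice, Jane Austen, 1813'

# Split on "," only, mirroring the pipeline's field terminator.
author=$(printf '%s\n' "$line" | awk -F',' '{ print $2 }')
echo "[$author]"    # prints "[ Jane Austen]" with the leading space kept

# One possible cleanup: drop the space that follows each comma.
cleaned=$(printf '%s\n' "$line" | sed 's/, /,/g')
echo "$cleaned"     # prints "Pride and Prejudice,Jane Austen,1813"
```

Applying the same sed expression to books.txt before uploading it would be one way to load the values without the leading spaces.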

You can also have MemSQL run your pipeline in the background. In this configuration, MemSQL periodically polls GCS for new files and continuously loads them as they are added to the bucket. Before running your pipeline in the background, you must reset the state of the pipeline and the table.

DELETE FROM classic_books;
ALTER PIPELINE library SET OFFSETS EARLIEST;

The first command deletes all rows from the target table. The second causes the pipeline to start from the beginning, in this case “forgetting” that it already loaded books.txt so you can load it again. You can also drop and recreate the pipeline, if you prefer.

To start a pipeline in the background, run:

START PIPELINE library;

This statement starts the pipeline. To see whether the pipeline is running, run SHOW PIPELINES.

SHOW PIPELINES;
+----------------------+---------+
| Pipelines_in_books   | State   |
+----------------------+---------+
| library              | Running |
+----------------------+---------+

At this point, the pipeline is running and the contents of the books.txt file should once again be present in the classic_books table.

Info

Foreground pipelines and background pipelines have different intended uses and behave differently. For more information, see the START PIPELINE topic.

Next Steps

Now that you have a running pipeline, any new files you add to your bucket will be automatically ingested. To understand how a GCS pipeline ingests large numbers of objects in a bucket, see the Parallelized Data Loading section in the Extractors topic. You can also learn more about how to transform the ingested data by reading the Transforms topic.