Kafka Pipelines Quickstart

To create and interact with a Kafka pipeline quickly, follow the instructions in this section. There are three parts to this Quickstart:

  1. Part 1: Running a Kafka Cluster in Docker
  2. Part 2: Sending Messages to Kafka
  3. Part 3: Creating a Kafka Pipeline in MemSQL

Prerequisites

To complete this Quickstart, your environment must meet the following prerequisites:

  • Operating System: Mac OS X or Linux
  • Docker: Version 1.12 or newer. If using Mac OS X, these instructions are written for Docker for Mac. Docker Toolbox is compatible as well, but no instructions are provided.

Part 1: Running a Kafka Cluster in Docker

Many different Docker images for Kafka are available on Docker Hub, but for testing purposes, one of the best is memsql/kafka. This image is ideal because it comes preconfigured with both Kafka and Zookeeper out of the box.

In a terminal window, execute the following command:

docker run --name kafka memsql/kafka

This command automatically downloads the memsql/kafka Docker image from Docker Hub, creates a new container using the image, assigns the container a user-friendly name (kafka), and finally starts the container.
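As a rough illustration of what that single command does, the sketch below splits it into separate pull, create, and start steps. The function name run_stepwise and the DOCKER variable are hypothetical helpers introduced here, not part of the Quickstart; docker pull, docker create, and docker start -a are standard Docker CLI subcommands. One difference to keep in mind: docker run downloads the image only when it is missing locally, whereas docker pull always checks the registry.

```shell
# Hypothetical helper: approximates `docker run --name NAME IMAGE` in three steps.
# DOCKER can be overridden (for example with `echo`) to preview the commands.
DOCKER=${DOCKER:-docker}

run_stepwise() {
  name=$1
  image=$2
  "$DOCKER" pull "$image" &&                    # download the image from Docker Hub
  "$DOCKER" create --name "$name" "$image" &&   # create a named container from the image
  "$DOCKER" start -a "$name"                    # start it and attach to its output
}

# Usage (similar in effect to the docker run command above):
#   run_stepwise kafka memsql/kafka
```

Setting DOCKER=echo before calling run_stepwise prints the three commands instead of running them, which is a cheap way to see the sequence.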

You will see a number of lines printed to the terminal as the container initializes. The most relevant lines are the last two, which appear only if the container started successfully:

INFO success: zookeeper entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
INFO success: kafka entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

If you see these success messages, Kafka is up and running. Leave this terminal window open and proceed to the next steps.
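If the terminal scrollback is lost, the same check can be scripted. The following is a minimal sketch, assuming the container's output is also available through docker logs; kafka_ready is a hypothetical helper name, and the matched strings are taken from the two success lines shown above.

```shell
# Hypothetical helper: reads log text on stdin and succeeds only if both
# the zookeeper and kafka success lines are present.
kafka_ready() {
  log=$(cat)
  printf '%s\n' "$log" | grep -q 'success: zookeeper entered RUNNING state' &&
    printf '%s\n' "$log" | grep -q 'success: kafka entered RUNNING state'
}

# Usage (assumes the container is named kafka, as in the command above):
#   docker logs kafka 2>&1 | kafka_ready && echo "Kafka is up"
```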

Now that you have a Kafka cluster running in a Docker container, you can create a topic and start sending messages to it.

Part 2: Sending Messages to Kafka

In the following steps, you will connect to the new Docker container and start interacting with Kafka. In a new terminal window, execute the following command:

docker exec -it kafka /bin/bash

The docker exec command allows you to execute commands inside a currently-running container. You’ll see a bash prompt if the command was successful:

root@780b09721ea1:/#

Kafka comes with some helper scripts that make it easy to create a new topic and start posting messages. Navigate to the scripts folder:

cd /opt/kafka*/bin

From the /opt/kafka*/bin folder, execute the following command:

./kafka-topics.sh --topic test --zookeeper 127.0.0.1:2181 --create --partitions 8 --replication-factor 1

This command uses the kafka-topics.sh script to create and configure a new topic named test. Now that you have a topic, you can create a producer that can be used to send messages to the topic:

./kafka-console-producer.sh --topic test --broker-list 127.0.0.1:9092

This command uses the kafka-console-producer.sh script to create and configure a producer that’s associated with the test topic. The script then reads text from standard input and sends each line you enter to the topic as a Kafka message. Enter a few messages to try it out:

the quick
brown fox
jumped over
the lazy dog

Keep this terminal window open so that you can create more messages in the future.
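Because the console producer reads from standard input, messages can also be piped in rather than typed. The sketch below is one way to do that, assuming it is run from the /opt/kafka*/bin folder inside the container; send_lines and the PRODUCER variable are hypothetical names introduced for illustration. A producer fed through a pipe exits when its input ends, so this complements the interactive session rather than replacing it.

```shell
# Hypothetical helper: sends each argument after the topic as one message.
# PRODUCER can be overridden to point at the script's full path.
PRODUCER=${PRODUCER:-./kafka-console-producer.sh}

send_lines() {
  topic=$1
  shift
  # printf emits one argument per line; the producer reads them from stdin.
  printf '%s\n' "$@" |
    "$PRODUCER" --topic "$topic" --broker-list 127.0.0.1:9092
}

# Usage:
#   send_lines test 'the quick' 'brown fox'
```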

At this point, your Kafka cluster has a topic named test that contains a few messages. You should have two terminal windows open: one for entering messages into Kafka, and one for the Kafka container itself. In Part 3, you will create a pipeline in MemSQL to ingest these messages.

Part 3: Creating a Kafka Pipeline in MemSQL

Now that Kafka contains a topic and messages, you can use MemSQL to create a new pipeline and ingest the messages. Since this Quickstart uses Docker, you can use the memsql/quickstart Docker image to run MemSQL.

In a new terminal window, execute the following command:

docker run --name memsql -p 3306:3306 -p 9000:9000 memsql/quickstart

This command automatically downloads the memsql/quickstart Docker image from Docker Hub, creates a new container using the image, assigns the container a user-friendly name (memsql), and finally starts the container.

You will see a number of lines printed to the terminal as the container initializes and MemSQL starts. Once the initialization process is complete, open a new terminal window and execute the following command:

docker exec -it memsql memsql

In Part 2, you used docker exec to access a bash shell within the Kafka container. This time, you will use it to access the MemSQL interpreter inside your new container. At the MemSQL prompt, execute the following statements:

CREATE DATABASE quickstart_kafka;
USE quickstart_kafka;
CREATE TABLE messages (id text);

These statements create a new database and table that will be used for the Kafka pipeline. Before you can create the pipeline itself, you need the IP address of the Kafka container inside Docker. In a new terminal window, execute the following command:

docker inspect -f '{{ .NetworkSettings.IPAddress }}' kafka

This command outputs the Kafka container’s IP address, such as 172.17.0.2. Copy it and go back to the MemSQL terminal window. Now that both Kafka and MemSQL are running in Docker, you can create your first pipeline. Execute the following statement, replacing <kafka-container-ip> with your own:

CREATE PIPELINE `quickstart_kafka` AS LOAD DATA KAFKA '<kafka-container-ip>/test' INTO TABLE `messages`;
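To avoid copying the address by hand, the statement can be generated from the docker inspect output. This is a small sketch under a few assumptions: pipeline_sql is a hypothetical helper, and its check only confirms that the value has the shape of an IPv4 address, not that each octet is in range. The check is still useful because docker inspect can print an empty string for this field when the container is not attached to Docker's default bridge network.

```shell
# Hypothetical helper: prints the CREATE PIPELINE statement for a Kafka
# host and topic, or fails if the host does not look like an IPv4 address.
pipeline_sql() {
  kafka_ip=$1
  topic=$2
  if ! printf '%s\n' "$kafka_ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
    echo "unexpected Kafka address: '$kafka_ip'" >&2
    return 1
  fi
  printf 'CREATE PIPELINE `quickstart_kafka` AS LOAD DATA KAFKA '\''%s/%s'\'' INTO TABLE `messages`;\n' \
    "$kafka_ip" "$topic"
}

# Usage: print the statement, then paste it at the MemSQL prompt.
#   pipeline_sql "$(docker inspect -f '{{ .NetworkSettings.IPAddress }}' kafka)" test
```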

The CREATE PIPELINE statement creates a new Kafka pipeline named quickstart_kafka, which reads messages from the test topic and writes them into the messages table. If the statement was successful, you can test your pipeline. While you can start a pipeline immediately after creating it, it’s best to test it first using a small set of data:

TEST PIPELINE quickstart_kafka LIMIT 1;

If this test was successful and no errors are present, then you are ready to try ingesting data. The following command will run one batch and commit the data to the MemSQL table messages.

START PIPELINE quickstart_kafka FOREGROUND LIMIT 1 BATCHES;

To verify that the data exists in the messages table as expected, execute the following statement. If it was successful, you should see a non-empty result set.

SELECT * FROM messages;
+--------------+
| id           |
+--------------+
| the quick    |
| brown fox    |
| jumped over  |
| the lazy dog |
+--------------+

Now you are ready to start your pipeline as a background process. MemSQL will automatically ingest new messages as they are put into Kafka.

START PIPELINE quickstart_kafka;

Now that the pipeline is up and running, send a few more messages to the Kafka topic. Go back to the terminal window from Part 2 where you created your messages. Enter the following lines and press Enter:

Lorem ipsum
dolor sit amet

In the MemSQL terminal window, run the SELECT * FROM messages; statement again. Now you will see the following output:

SELECT * FROM messages;
+----------------+
| id             |
+----------------+
| Lorem ipsum    |
| dolor sit amet |
| the quick      |
| brown fox      |
| jumped over    |
| the lazy dog   |
+----------------+

Now that your pipeline is running, you can check its status and history at any time by querying the information_schema.PIPELINES_BATCHES_SUMMARY view.

SELECT * FROM information_schema.PIPELINES_BATCHES_SUMMARY;

This system view returns one row for every recent batch the pipeline has run, along with at-a-glance performance and volume metrics. This information is valuable for monitoring and understanding a production pipeline system.

Info

Foreground pipelines and background pipelines have different intended uses and behave differently. For more information, see the START PIPELINE topic.

Quickstart Summary

In this Kafka Quickstart, you created two Docker containers: one for Kafka, and one for MemSQL. You sent multiple messages to a Kafka topic, and then created a Kafka pipeline in MemSQL to ingest the messages. This Quickstart only demonstrated the most basic functionality of a Kafka pipeline, but you can apply the same concepts to a real-world scenario.

Now that you’re familiar with using MemSQL and Kafka in Docker, you can also try the MemSQL Pipelines Twitter Demo. This demo ingests live Twitter data into MemSQL, and shows you how to perform queries that analyze user trends and sentiments.