MemSQL Streamliner will be deprecated in MemSQL 6.0. For current Streamliner users, we recommend migrating to MemSQL Pipelines instead. MemSQL Pipelines provides increased stability, improved ingest performance, and exactly-once semantics. For more information about Streamliner deprecation, see the 5.8 Release Notes. For more information about Pipelines, see the MemSQL Pipelines documentation.
MemSQL 4.1 introduced Streamliner, an integrated MemSQL and Spark solution that lets users set up real-time data pipelines. Streamliner extracts and transforms data with Apache Spark, then loads it into MemSQL.
MemSQL Streamliner provides a simple interface for creating and managing real-time data pipelines. It offers a versatile set of tools with applications ranging from development and testing to managing real-time pipelines in production. Streamliner is built on top of MemSQL, MemSQL Ops, and Apache Spark. You can use Streamliner both through the Spark tab in the MemSQL Ops web interface and through the MemSQL Ops CLI.
Ready to get started with Streamliner? Learn more about it below or go straight to the Streamliner Quick Start guide to build your first real-time data pipeline in under 10 minutes!
Streamliner Benefits
In addition to saving time by automating much of the work of building and maintaining data pipelines, Streamliner offers several technical advantages over a home-grown solution built on Spark:
- Streamliner provides a single unified interface for managing many pipelines, and allows you to start and stop individual pipelines without affecting other pipelines running concurrently.
- Streamliner offers built-in developer tools that dramatically simplify developing, testing, and debugging data pipelines. For instance, Streamliner allows the user to trace individual batches all the way through a pipeline and observe the input and output of every stage.
- Streamliner handles the challenging aspects of distributed real-time data processing, allowing developers to focus on data processing logic rather than low-level technical considerations. Under the hood, Streamliner leverages MemSQL and Apache Spark to provide fault tolerance and transactional semantics without sacrificing performance.
- Streamliner's modular design, which separates pipelines into Extract, Transform, and Load phases, facilitates code reuse. With thoughtful design, you can mix, match, and reuse Extractors and Transformers.
- Out of the box, Streamliner comes with built-in Extractors, such as the Kafka Extractor, and Transformers, such as a CSV parser and JSON emitter. Even if you find you need to develop custom components, the built-in pipelines make it easy to start testing without writing much (or any) code up front.
Streamliner Components
Streamliner is an end-to-end solution composed of the following:
- MemSQL Ops as the UI for creating and managing pipelines
- Apache Spark as the execution runtime for the pipelines
- MemSQL Spark Connector as the “software glue” between MemSQL and Spark
Spark Components
Streamliner does not require users to have detailed knowledge of the inner workings of Spark. However, a high-level understanding of the Spark architecture is helpful.
Four main concepts underpin the Spark architecture:
- the Master node
- Worker nodes
- the Driver process
- Executor processes
When you install Spark using MemSQL Ops, the cluster is configured to run in “Standalone” mode. In this mode, the Master node (where node refers to a physical server, a virtual machine, or a container) houses a resource manager that tracks resource consumption and schedules job execution. In particular, the Master node manages the cluster’s Worker nodes. The resource manager on the Master node knows the state of each Worker node including, for instance, how much CPU and RAM is in use and how much is currently available.
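For illustration only, here is a minimal sketch of how a Spark application connects to a standalone Master. The host name is a placeholder; 7077 is Spark's default standalone Master port, and MemSQL Ops handles this configuration for you.
import org.apache.spark.{SparkConf, SparkContext}
// Point the application at the standalone Master's resource manager.
// "master-host" is a placeholder host name.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")
  .setAppName("StandaloneModeExample")
val sc = new SparkContext(conf)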
The main Spark application process, the one that creates the SparkContext, is called the Driver process. When you install Spark using MemSQL Ops, the Driver process runs on the Master node by default (this is also the node on which the MemSQL Master Aggregator runs). The Driver “drives” job execution: it breaks the job into smaller units of work and asks the scheduler for resources. The scheduler, running on the Master node, provisions resources, creating Executor processes on the Worker nodes. The Driver gives each Executor a task; the Executor completes the task, then sends its results back to the Driver.
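As a rough sketch of this division of labor (reusing the sc from the example above), consider a simple job: the Driver defines the computation and collects the final result, while the per-partition work runs as tasks on Executors.
// Runs in the Driver: define a dataset split into 8 partitions.
val numbers = sc.parallelize(1 to 1000, 8)
// Each partition's map and partial sum run as tasks on Executor
// processes; the partial results are combined back in the Driver.
val total = numbers.map(_ * 2).reduce(_ + _)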
MemSQL Spark Connector
MemSQL Streamliner runs a Spark application called the MemSQL Spark Interface. The Interface is a long-running Driver program that accepts commands from, and sends information back to, MemSQL Ops. Generally, when using Spark Streaming without MemSQL Streamliner, a logical “pipeline” is implemented as a Spark application, and you start and stop the pipeline by starting and stopping the application. In contrast, Streamliner is a pipeline manager: starting, stopping, and changing a pipeline does not require stopping the entire Spark application. Instead, Streamliner lets the user start, stop, and change pipelines through MemSQL Ops, which communicates with the MemSQL Spark Interface. This architecture is what allows Streamliner users to manage several pipelines independently.
Running Spark SQL Queries inside MemSQL
Starting in MemSQL Connector version 1.2.1, most Spark SQL queries written as part of a Spark application that uses the MemSQL Spark Connector will be rewritten into raw SQL queries and run directly against the MemSQL engine. In most cases this results in massive performance gains and memory reduction. This feature is called SQL Pushdown because the query is pushed from Spark down into the underlying MemSQL Engine.
Out of the box, queries running through the MemSQLContext’s .sql method and DataFrame logical operations will automatically be pushed down during execution. For example, all of the following operations will be rewritten into single MemSQL queries, resulting in massive performance gains:
val msc = new MemSQLContext(sparkContext)
import msc.implicits._ // enables the $"column" syntax used below
// Dataframe operations
msc.table("foo").count
msc.table("foo").groupBy("bar").sum("baz").collect
msc.table("foo").select(($"bar" + 5).as("bigBar"))
.filter($"baz" > 10 || $"bigBar" === 10)
.collect
// Spark SQL Queries
msc.sql("select count(*) from foo").collect
msc.sql("select sum(baz) from foo group by bar").collect
msc.sql("select (bar + 5) as bigBar from foo where baz > 10 or bigBar === 10").collect
See frequently asked questions about the SQL Pushdown feature in the MemSQL FAQ section.
Streamliner Pipeline Phases
Streamliner has three distinct phases: Extract, Transform, and Load. These phases share their names with those of a typical batch ETL process, but in Streamliner they are all processed in real time. The Extract, Transform, and Load phases occur sequentially in a real-time data pipeline, with each phase feeding its output into the next.
The Extract phase consumes the data from a real-time data source. The most common use case is consumption from Apache Kafka queues, but users can also consume from other pre-built and user-defined data sources.
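The built-in Kafka Extractor handles this consumption for you. Purely as an illustration of what an extractor does under the hood, the sketch below uses the receiver-based API from Spark 1.x’s spark-streaming-kafka package; the ZooKeeper address, consumer group, and topic name are placeholders, and sc is an existing SparkContext.
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
// One-second batches, mirroring a real-time pipeline's batch interval.
val ssc = new StreamingContext(sc, Seconds(1))
// Placeholders: ZooKeeper quorum, consumer group, and topic -> receiver thread count.
val stream = KafkaUtils.createStream(ssc, "zk-host:2181", "example-group", Map("example-topic" -> 1))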
The Transform phase converts the extracted data into the form that will be stored in MemSQL. In this phase, you can enrich the data while leveraging the full set of Spark functionality.
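As a minimal sketch of the kind of enrichment a Transform phase might perform, the following uses plain Spark DataFrame operations; df and the column names user and amount are hypothetical.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
// Hypothetical enrichment: derive a new column and drop incomplete rows.
def enrich(df: DataFrame): DataFrame =
  df.withColumn("amount_with_tax", col("amount") * 1.08)
    .filter(col("user").isNotNull)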
The Load phase stores data into a MemSQL table. If the target database table does not exist, it is automatically created.
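Streamliner performs the load automatically. For reference, the underlying mechanism resembles the saveToMemSQL helper that the MemSQL Spark Connector adds to DataFrames; this is a hedged sketch: the database and table names are placeholders, df is a hypothetical DataFrame, and the exact signature may vary across connector versions.
import com.memsql.spark.connector._
// Write the DataFrame's rows into the MemSQL table db.events.
// "db" and "events" are placeholder names.
df.saveToMemSQL("db", "events")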
Learn more about all the Streamliner phases in the Streamliner Pipelines section.