You are viewing an older version of this section. View current production version.
HDFS Pipelines Overview
MemSQL Pipelines can extract data from HDFS. To understand HDFS core concepts and how it works, please read the Apache Hadoop HDFS documentation. This topic assumes that you understand the basic concepts and terminology of HDFS.
HDFS Pipeline Syntax Example
The following syntax demonstrates how to create a new HDFS Pipeline. For complete syntax documentation, see CREATE PIPELINE.
CREATE PIPELINE `hadoop-cluster` AS LOAD DATA HDFS 'hdfs://<namenode DNS | IP address>:<port>/<directory>' INTO TABLE `my_table` FIELDS TERMINATED BY ','
The directory location must be provided to indicate the directory from which the HDFS extractor will begin pulling data. This could be any directory, such as input or output directory. If you have subdirectories in your target directory, the extractor will recursively walk your directory tree and import additional data, if present.
Output directories have a
_SUCCESS file in them, so the example shown below uses an Hadoop client to search all directories for the
hadoop user and list those with a
$ hadoop fs -find hdfs://hadoop-namenode:8020/user/hadoop/ -name _SUCCESS -print hdfs://hadoop-namenode:8020/user/hadoop/output/_SUCCESS
Connection strings cannot start or end with whitespaces and cannot contain wildcard characters (e.g.
When extracting output files from HDFS, they will have the following naming conventions if they are the output of Hadoop, Pig, or Spark jobs:
Xis a postfix ID)
These files are written to an output directory specified during the job execution statement. The
part files can also be optionally archived with Gzip. The HDFS extractor will import (and optionally unzip) output files into the table specified in your
CREATE PIPELINE statement, but skip any
_SUCCESS files present.
HDFS Pipeline Batches
Metadata and configuration information for every pipeline can be found in tables in the
information_schema database. These tables use the term batch, which is a standard term across all MemSQL pipelines, but have the following definitions in the context of HDFS pipelines.
When the master aggregator reads a container’s contents, it schedules a subset of the objects for ingest across all database partitions. After each each partition across the cluster has finished extracting, transforming, and loading its object, a batch has been completed. Therefore, an HDFS pipeline batch is defined as a cluster-level operation where each partition processes at most a single object from the source container.
Consider the following example: There are 4 objects in a source container. If your cluster has 2 leaf nodes that have 2 partitions each (4 partitions total), all of the container’s objects can be ingested in 1 batch. In the same cluster, if there are 40 objects in the source container, it will take 10 batches to fully ingest the data.
Information about recent batches can be found in
Information about files to be loaded can be found in
The following example creates an HDFS pipeline that extracts and loads data from a simple Hadoop wordcount job and pushes it into a table named
CREATE PIPELINE `hdfs_test` AS LOAD DATA HDFS 'hdfs://hadoop-namenode:8020/user/hadoop/output' INTO TABLE `messages` FIELDS TERMINATED BY '\t';