
Pipelines Information Schema Tables

The information_schema database contains several tables that are related to Pipelines. These tables contain exhaustive information about pipeline configuration and behavior. The information in these tables is especially useful for debugging errors that may occur during the extraction, transformation, or loading phases of a pipeline’s lifecycle.

information_schema.PIPELINES Table

The PIPELINES table stores high-level information about any pipelines that have been created in the cluster. Each row represents a single pipeline. The columns in this table are described below.

information_schema.PIPELINES Table Schema

| Column Name | Description |
| --- | --- |
| DATABASE_NAME | The name of the database associated with the pipeline. |
| PIPELINE_NAME | The name of the pipeline. |
| CONFIG_JSON | The pipeline’s configuration in JSON format. This JSON is read-only and is automatically generated when your pipeline configuration changes. The JSON schema for this column is described in the section below and is intended for use in a web-based application (such as MemSQL Ops). |
| STATE | The current state of the pipeline. Possible values are Running, Error, and Stopped. Running: the pipeline is currently running; if the pipelines_stop_on_error variable is set to ON, the pipeline has not encountered any errors during extraction, transformation, or loading. Error: the pipeline encountered an error and is currently stopped; a pipeline in the Error state must be manually started. If pipelines_stop_on_error is set to OFF, a pipeline cannot enter the Error state; it will remain in the Running state until it is manually stopped, and any errors that occur will be written to the information_schema.PIPELINES_ERRORS table. Stopped: the pipeline is currently stopped; a pipeline can only enter the Stopped state through manual intervention. |
| SKIPPED_BATCHES | The total number of batches that have been skipped by the pipeline. Batches may be skipped if the maximum number of batch retries was reached, which is set using the pipelines_max_retries_per_batch_partition variable. |
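
For example, a quick health check is to look for pipelines that have stopped because of an error. This is an illustrative sketch, not the only approach:

-- List every pipeline currently stopped in the Error state,
-- along with how many of its batches were skipped.
SELECT DATABASE_NAME, PIPELINE_NAME, SKIPPED_BATCHES
FROM information_schema.PIPELINES
WHERE STATE = 'Error';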

CONFIG_JSON Schema

The CONFIG_JSON column in the information_schema.PIPELINES table contains a fixed set of read-only JSON key/value pairs. Some of these JSON values can also be seen by executing the SHOW CREATE PIPELINE <pipeline-name> statement. Each key/value pair is described below.

Example CONFIG_JSON for Kafka Pipelines

{
   "name":"mypipeline",
   "source_type":"KAFKA",
   "credentials":"172.17.0.2\/test-topic",
   "batch_interval":0,
   "transform":["http:\/\/127.0.0.1:8000\/transform.py","",""],
   "dup_key_policy":null,
   "table":"messages",
   "fields_terminated_by":"\t",
   "fields_enclosed_by":"",
   "fields_escaped_by":"\\",
   "lines_terminated_by":"\n",
   "lines_starting_by":"",
   "extended_null":false,
   "column_list":null,
   "on_duplicate_key_update":null,
   "running":false
}

Example CONFIG_JSON for S3 Pipelines

{
	"name": "my-s3-pipeline",
	"source_type": "S3",
	"connection_string": "my-s3-bucket-name",
	"config": "{\"region\": \"us-west-1\"}",
	"credentials": "<CREDENTIALS REDACTED>",
	"batch_interval": 2500,
	"max_partitions_per_batch": -1,
	"transform": null,
	"load_error_policy": null,
	"dup_key_policy": null,
	"table": "my_table_name",
	"fields_terminated_by": ",",
	"fields_enclosed_by": "",
	"fields_escaped_by": "\\",
	"lines_terminated_by": "\n",
	"lines_starting_by": "",
	"extended_null": false,
	"column_list": null,
	"on_duplicate_key_update": null
}

CONFIG_JSON Schema Definition

| Key Name | Value Description |
| --- | --- |
| name | The name of the pipeline. |
| source_type | The data source type for the pipeline. |
| connection_string | The name of the S3 bucket or the bucket’s object, with an optional prefix. |
| config | The configuration information provided when creating an S3 pipeline, namely the region where the source bucket is hosted. |
| credentials | Either the Kafka topic URL for the pipeline or <CREDENTIALS REDACTED> for an S3 pipeline. |
| batch_interval | The time duration in milliseconds between batch extraction operations. |
| transform | The transform’s URI, executable entry point, and arguments. |
| load_error_policy | The load error policy for the pipeline. For example, if IGNORE or SKIP ... ERRORS was specified during pipeline creation, it will appear as a JSON key/value pair like so: {"load_error_policy": "skip_all_errors"} |
| dup_key_policy | The duplicate key policy that indicates how a row should be inserted if it contains a duplicate key value. |
| table | The name of the table in which to insert data. |
| fields_terminated_by | The character that terminates a field. |
| fields_enclosed_by | The character that encloses a field. |
| fields_escaped_by | The character that escapes a field. |
| lines_terminated_by | The character that terminates a line. |
| lines_starting_by | The string prefix for a line. |
| extended_null | Specifies whether the non-quoted and case-insensitive string null will be loaded as a null type. |
| column_list | The column list to load data into. |
| on_duplicate_key_update | Specifies whether duplicate keys will be updated or ignored. |
| running | Specifies whether the pipeline is currently running. This corresponds to the pipeline’s current state, which is either Running, Error, or Stopped. |
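
Because CONFIG_JSON is a JSON value, individual keys can be extracted directly in a query. A minimal sketch, assuming MemSQL’s JSON_EXTRACT_STRING builtin is available:

-- Pull the source type and destination table out of each
-- pipeline's read-only configuration JSON.
SELECT PIPELINE_NAME,
       JSON_EXTRACT_STRING(CONFIG_JSON, 'source_type') AS source_type,
       JSON_EXTRACT_STRING(CONFIG_JSON, 'table') AS target_table
FROM information_schema.PIPELINES;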

information_schema.PIPELINES_BATCHES_SUMMARY Table

The PIPELINES_BATCHES_SUMMARY table contains high-level information about individual batches as they are loaded into the database. It contains one row per batch. The columns in this table are described below.

information_schema.PIPELINES_BATCHES_SUMMARY Table Schema

| Name | Description |
| --- | --- |
| DATABASE_NAME | The name of the database associated with the batch. |
| PIPELINE_NAME | The name of the pipeline associated with the batch. |
| BATCH_ID | The internal unique identifier for the batch. Values of BATCH_ID are unique in this table. |
| BATCH_STATE | Specifies whether the batch transaction was successful. There are five possible values: Succeeded, In Progress, Failed, Queued, or No Data. Succeeded: the batch was successfully loaded into the destination table. In Progress: the batch is currently being loaded into the destination table. Failed: the batch failed to be written; batch failures are written as errors in the information_schema.PIPELINES_ERRORS table, and you can use the BATCH_ID to correlate a specific batch failure with more detailed error information. Queued: the batch is waiting for enough resources in the cluster to free up so it can run. No Data: the batch contained no data; batches that don’t contain data will be deleted as new batches are written. |
| START_TIME | The time that the batch transaction started, in DATETIME format. |
| BATCH_TIME | The number of seconds the batch took to run or, if the batch is In Progress, the number of seconds it has been running for. |
| ROWS_PER_SEC | Equal to ROWS_STREAMED / BATCH_TIME. |
| ROWS_STREAMED | The number of rows extracted into MemSQL. It can be higher or lower than the number of rows written if, for instance, rows are filtered out by a WHERE or SKIP clause, or if rows are modified by an upsert clause. |
| NUM_PARTITIONS | The total number of batch partitions for this batch. |
| NUM_PARTITIONS_FINISHED | The total number of finished batch partitions for this batch. A batch cannot finish until all of its batch partitions are finished. |
| MB_STREAMED | The number of megabytes extracted into MemSQL. |
| MB_PER_SEC | Equal to MB_STREAMED / BATCH_TIME. |
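
For example, to spot-check recent load throughput for one pipeline, you can sort this table by start time. A minimal sketch; mypipeline is a placeholder name:

-- Show the ten most recent batches and their throughput.
SELECT BATCH_ID, BATCH_STATE, BATCH_TIME, ROWS_STREAMED, ROWS_PER_SEC, MB_PER_SEC
FROM information_schema.PIPELINES_BATCHES_SUMMARY
WHERE PIPELINE_NAME = 'mypipeline'
ORDER BY START_TIME DESC
LIMIT 10;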

information_schema.PIPELINES_BATCHES Table

The PIPELINES_BATCHES table contains detailed, low-level information about individual batches as they’re loaded into the database. It contains one row per batch partition and is useful for understanding partition-level details of a pipeline’s behavior. The columns in this table are described below.

information_schema.PIPELINES_BATCHES Table Schema

| Name | Description |
| --- | --- |
| DATABASE_NAME | The name of the database associated with the batch. |
| PIPELINE_NAME | The name of the pipeline associated with the batch. |
| BATCH_ID | The internal unique identifier for the batch. While every row in the table contains unique information about a batch, multiple rows with the same BATCH_ID may be present. |
| BATCH_STATE | Specifies whether the batch transaction was successful. There are five possible values: Succeeded, In Progress, Failed, Queued, or No Data. Succeeded: the batch was successfully loaded into the destination table. In Progress: the batch is currently being loaded into the destination table. Failed: the batch failed to be written; batch failures are written as errors in the information_schema.PIPELINES_ERRORS table, and you can use the BATCH_ID to correlate a specific batch failure with more detailed error information. Queued: the batch is waiting for enough resources in the cluster to free up so it can run. No Data: the batch contained no data; batches that don’t contain data will be deleted as new batches are written. |
| BATCH_ROWS_WRITTEN | The number of rows that were affected in the destination table as a result of the batch transaction. When batches are written to the destination table, they consist of insert, replace, upsert, or delete operations. The BATCH_ROWS_WRITTEN value is the sum total of all the operations that occurred when the batch was written. This value may be NULL if the BATCH_STATE is any value other than Succeeded. |
| BATCH_TIME | The elapsed time in seconds from the start of the batch transaction to its success or failure. This value reflects the complete transaction time from the perspective of the master aggregator node. Therefore, the elapsed time may be longer than a BATCH_PARTITION_TIME value for a given batch, as BATCH_PARTITION_TIME reflects the perspective of the leaf node. |
| BATCH_START_UNIX_TIMESTAMP | The time that the batch transaction started, in Unix timestamp format. |
| BATCH_PARTITION_STATE | Specifies whether the batch transaction was successful for a given partition. There are five possible values: Succeeded, In Progress, Failed, No Data, or NULL. Succeeded: the batch was successfully loaded into the destination table. In Progress: the batch is currently being loaded into the destination table. Failed: the batch failed to be written; batch failures are written as errors in the information_schema.PIPELINES_ERRORS table, and you can use the BATCH_ID and PARTITION values to correlate a specific batch failure with more detailed error information. No Data: the batch contained no data; batches that don’t contain data will be deleted as new batches are written. NULL: the batch either contains no data, or the batch transaction hasn’t yet been executed on the partition itself. |
| BATCH_PARTITION_PARSED_ROWS | The number of rows that were parsed during the batch transaction for a given partition. This value can be NULL if the batch transaction is still in progress or if the batch failed. |
| BATCH_SOURCE_PARTITION_ID | The identifier for the data source partition from which batch partition data is extracted. This value may be NULL if the batch’s BATCH_PARTITION_STATE is also NULL. For an S3 pipeline, this represents the name of an object in the source bucket. |
| BATCH_EARLIEST_OFFSET | Specifies the earliest offset for the batch partition. This value indicates the start of the offset range for a batch, while BATCH_LATEST_OFFSET indicates the end of the offset range. This value may be NULL if the batch’s BATCH_PARTITION_STATE is also NULL. For an S3 pipeline, this value will always be 0 when a bucket’s object has been successfully processed. For more information, see the S3 Pipelines Overview. |
| BATCH_LATEST_OFFSET | Specifies the latest offset for the batch. This value indicates the end of the offset range for a batch, while BATCH_EARLIEST_OFFSET indicates the start of the offset range. This value may be NULL if the batch’s BATCH_PARTITION_STATE is also NULL. For an S3 pipeline, this value will always be 1 when a bucket’s object has been successfully processed. For more information, see the S3 Pipelines Overview. |
| BATCH_PARTITION_TIME | The elapsed time in seconds from the start of the batch transaction to its success or failure. This value reflects the batch transaction time from the perspective of the leaf node. Therefore, the elapsed time may be shorter than a BATCH_TIME value for a given batch, as BATCH_TIME reflects the perspective of the master aggregator node. |
| BATCH_PARTITION_EXTRACTED_BYTES | The number of bytes that were extracted from the data source for the batch. This value may be NULL if no data was extracted for the batch. |
| BATCH_PARTITION_TRANSFORMED_BYTES | The number of bytes that were transformed for the batch. This value may be NULL if a transform was not specified for the pipeline, or if no data was successfully transformed. |
| BATCH_PARTITION_EXTRACTOR_WAIT_TIME | The amount of time, in seconds, that the pipeline spent waiting on the extractor to complete its operation. |
| BATCH_PARTITION_TRANSFORM_WAIT_TIME | The amount of time, in seconds, that the pipeline spent waiting on the transform to complete its operation. |
| HOST | The hostname or host IP address of the leaf node that processed the batch. This value may be NULL if BATCH_SOURCE_PARTITION_ID or PARTITION is also NULL. The combination of a batch’s HOST, PORT, and PARTITION identifies the specific leaf node partition that attempted to load batch data. |
| PORT | The port number of the leaf node that processed the batch. This value may be NULL if BATCH_SOURCE_PARTITION_ID or PARTITION is also NULL. The combination of a batch’s HOST, PORT, and PARTITION identifies the specific leaf node partition that attempted to load batch data. |
| PARTITION | Specifies the partition ID on the leaf node that processed the batch. This value may be NULL if BATCH_SOURCE_PARTITION_ID is also NULL. The combination of a batch’s HOST, PORT, and PARTITION identifies the specific leaf node partition that attempted to load batch data. |
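
For example, once a failed batch has been identified in PIPELINES_BATCHES_SUMMARY, this table shows which leaf partitions were involved. A minimal sketch; the BATCH_ID value 42 is a placeholder:

-- Show the per-partition outcome of one batch. PARTITION is
-- backquoted in case it is treated as a reserved word.
SELECT HOST, PORT, `PARTITION`, BATCH_PARTITION_STATE, BATCH_SOURCE_PARTITION_ID
FROM information_schema.PIPELINES_BATCHES
WHERE BATCH_ID = 42;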

information_schema.PIPELINES_ERRORS Table

The PIPELINES_ERRORS table contains detailed information about errors that occurred during extraction, transformation, or loading. Each row represents a single error event. The columns in this table are described below.

Info

This table contains exhaustive node-level error information that may originate at either a leaf node or an aggregator node. Error event information may be propagated from a leaf to an aggregator, or from one leaf to another, so some rows may appear to be duplicates. If a row appears to be a duplicate, check whether its ERROR_MESSAGE entry starts with LEAF ERROR (<host>:<port>): this prefix indicates the specific leaf node where the error originated.
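
For example, to see each error only once, at its node of origin, you can filter out the propagated copies. This is an illustrative sketch, not the only approach:

-- Hide rows that merely restate an error propagated from a leaf.
SELECT PIPELINE_NAME, ERROR_KIND, ERROR_CODE, ERROR_MESSAGE
FROM information_schema.PIPELINES_ERRORS
WHERE ERROR_MESSAGE NOT LIKE 'LEAF ERROR%';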

information_schema.PIPELINES_ERRORS Table Schema

| Column Name | Description |
| --- | --- |
| DATABASE_NAME | The name of the database associated with the error. |
| PIPELINE_NAME | The name of the pipeline associated with the error. |
| ERROR_UNIX_TIMESTAMP | The time of the error event, in Unix timestamp format. |
| ERROR_TYPE | Specifies what type of error occurred. Possible values are Error and Warning. Error: an error occurred, which may have stopped the pipeline if the pipelines_stop_on_error variable is set to ON. Warning: a warning occurred, which does not stop a pipeline. |
| ERROR_CODE | The error code for the error. |
| ERROR_MESSAGE | The message associated with the error. This value contains contextual information about the error that can be used for debugging purposes, including a stack trace if the error was caused by a transform failure. |
| ERROR_KIND | Specifies whether the error event was caused by internal or external factors. Possible values are Internal, Extract, Transform, or Load. Internal: an internal error occurred within the Pipelines feature itself. Extract: an error occurred during extraction from the data source; extraction errors are typically caused by network availability or problems with the data source partitions. Transform: an error occurred during data transformation; the transform executable is typically the cause of transformation errors. Load: an error occurred when attempting to load data into the destination table; load errors are typically caused by malformed CSV data or by attempting to write invalid schema types into a column, such as a NULL value into a non-nullable column. |
| STD_ERROR | The text that may have been output during the data transformation. The origin of this text is the transform executable itself. This value can be empty or NULL if the ERROR_KIND value is not Transform or if no standard error text was output by the transform during failure. |
| LOAD_DATA_LINE | The text of a LOAD DATA statement that caused a parsing error while attempting to load data into the destination table. This value contains the invalid line of CSV text that failed to parse. Load errors are typically caused by malformed CSV data or by attempting to write invalid schema types into a column, such as a NULL value into a non-nullable column. This value can be empty or NULL if the error didn’t occur during the loading phase of pipeline execution. |
| LOAD_DATA_LINE_NUMBER | The line number of a LOAD DATA statement that caused a parsing error while attempting to load data into the destination table. A LOAD DATA statement may consist of many lines, and this value indicates the specific invalid line. This line number can be correlated with the value of LOAD_DATA_LINE, which contains the invalid line’s text. This value may be empty or NULL if a LOAD DATA statement wasn’t associated with the error event. |
| BATCH_ID | The internal unique identifier for the batch that experienced an error event. |
| ERROR_ID | The internal unique identifier for the error event. |
| BATCH_SOURCE_PARTITION_ID | The data source’s partition ID for the batch. This value may be NULL if the error occurred on the master aggregator. |
| BATCH_EARLIEST_OFFSET | Specifies the earliest offset for the batch. This value indicates the start of the offset range for a batch, while BATCH_LATEST_OFFSET indicates the end of the offset range. This value may be NULL if the error occurred on the master aggregator. |
| BATCH_LATEST_OFFSET | Specifies the latest offset for the batch. This value indicates the end of the offset range for a batch, while BATCH_EARLIEST_OFFSET indicates the start of the offset range. This value may be NULL if the error occurred on the master aggregator. |
| HOST | The hostname or host IP address of the leaf node that processed the batch. This value may be NULL if BATCH_SOURCE_PARTITION_ID or PARTITION is also NULL. The combination of a batch’s HOST, PORT, and PARTITION identifies the specific leaf node partition that attempted to load batch data. |
| PORT | The port number of the leaf node that processed the batch. This value may be NULL if BATCH_SOURCE_PARTITION_ID or PARTITION is also NULL. The combination of a batch’s HOST, PORT, and PARTITION identifies the specific leaf node partition that attempted to load batch data. |
| PARTITION | Specifies the partition ID on the leaf node that processed the batch. This value may be NULL if BATCH_SOURCE_PARTITION_ID is also NULL. The combination of a batch’s HOST, PORT, and PARTITION identifies the specific leaf node partition that attempted to load batch data. |
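
For example, to correlate a failed batch with its detailed error information, filter on the batch’s BATCH_ID. A minimal sketch; 42 is a placeholder value:

-- Retrieve every error event recorded for one batch, oldest first.
SELECT ERROR_UNIX_TIMESTAMP, ERROR_KIND, ERROR_CODE, ERROR_MESSAGE, STD_ERROR
FROM information_schema.PIPELINES_ERRORS
WHERE BATCH_ID = 42
ORDER BY ERROR_UNIX_TIMESTAMP;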

information_schema.PIPELINES_CURSORS Table

The PIPELINES_CURSORS table contains information about a pipeline’s offset ranges. Each row represents a single data source partition. The columns in this table are described below.

information_schema.PIPELINES_CURSORS Table Schema

| Column Name | Description |
| --- | --- |
| DATABASE_NAME | The name of the database associated with the offsets. |
| PIPELINE_NAME | The name of the pipeline associated with the offsets. |
| SOURCE_TYPE | The name of the pipeline’s data source type, such as KAFKA or S3. |
| SOURCE_PARTITION_ID | The identifier for the data source partition from which offsets are extracted. For an S3 pipeline, this value represents the name of a bucket’s object. |
| EARLIEST_OFFSET | The earliest offset for an offset range. This value indicates the starting offset, while LATEST_OFFSET indicates the ending offset, in an offset range. This value may be NULL if data extraction is still in progress or has failed. |
| LATEST_OFFSET | The bounding offset for an offset range, exclusive. This value indicates the ending offset, while EARLIEST_OFFSET indicates the starting offset, in an offset range. This value may be NULL if data extraction is still in progress or has failed. |
| CURSOR_OFFSET | The next offset at which the pipeline will start loading in this partition. This number represents the end of the range that has already been loaded by this pipeline, exclusive; all offsets considered in the future will be beyond this. |
| SUCCESSFUL_CURSOR_OFFSET | The offset cursor position in this source partition up to which data has been successfully loaded. If this is lower than CURSOR_OFFSET, it implies that the intermediate offsets were skipped. Note: offsets will only be skipped if pipelines_stop_on_error is set to OFF. |
| UPDATED_UNIX_TIMESTAMP | The last updated time for an offset range, in Unix timestamp format. |
| EXTRA_FIELDS | Some extractors output additional information when extracting data from the source. For S3 pipelines, you may encounter a value of {"size":<bytes>}, where <bytes> represents the size of an object in the source bucket. |
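
Because the high offsets in this table are exclusive (see the Warning below), the difference between LATEST_OFFSET and CURSOR_OFFSET approximates how many offsets remain to be loaded in each source partition. A minimal sketch; mypipeline is a placeholder name:

-- Estimate per-partition lag for one pipeline.
SELECT SOURCE_PARTITION_ID,
       LATEST_OFFSET - CURSOR_OFFSET AS offsets_remaining
FROM information_schema.PIPELINES_CURSORS
WHERE PIPELINE_NAME = 'mypipeline';
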
Warning

Previously, MemSQL had a similar (now deprecated) table called PIPELINES_OFFSETS, which has been replaced by PIPELINES_CURSORS. These tables contain the same information, but with slightly different semantics. In particular, the “high offsets” in the PIPELINES_CURSORS table (LATEST_OFFSET, CURSOR_OFFSET, and SUCCESSFUL_CURSOR_OFFSET) are exclusive instead of inclusive. Here, exclusive means that the highest offset value in a given partition has not yet been loaded; it is the next offset that will be considered. If these columns were inclusive, the highest offset value would instead be the last offset that has already been loaded.

information_schema.PIPELINES_FILES Table

The PIPELINES_FILES table stores information about files that have been extracted from a filesystem-like data source, such as Amazon S3. Each row represents a single file. The columns in this table are described below.

information_schema.PIPELINES_FILES Table Schema

| Column Name | Description |
| --- | --- |
| DATABASE_NAME | The name of the database associated with the file. |
| PIPELINE_NAME | The name of the pipeline associated with the file. |
| SOURCE_TYPE | The name of the pipeline’s data source type, such as S3. |
| FILE_NAME | The name of the file. |
| FILE_SIZE | The size of the file in bytes. |
| FILE_STATE | Shows the state of the file, which is either Loaded, Skipped, or Unloaded. |
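
For example, you can list the files an S3 pipeline has discovered but not yet loaded. A minimal sketch; mypipeline is a placeholder name:

-- Show pending files, largest first.
SELECT FILE_NAME, FILE_SIZE
FROM information_schema.PIPELINES_FILES
WHERE PIPELINE_NAME = 'mypipeline'
  AND FILE_STATE = 'Unloaded'
ORDER BY FILE_SIZE DESC;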