Pipelines Information Schema Tables
There are six tables in the information_schema database that are related to Pipelines. These tables contain exhaustive information about pipeline configuration and behavior. The information in these tables is especially useful for debugging errors that may occur during the extraction, transformation, or loading phases of a pipeline's lifecycle.
information_schema.PIPELINES Table
The PIPELINES table stores high-level information about any pipelines that have been created in the cluster. Each row represents a single pipeline. The columns in this table are described below.
information_schema.PIPELINES Table Schema
| Column Name | Description |
| --- | --- |
| DATABASE_NAME | The name of the database associated with the pipeline. |
| PIPELINE_NAME | The name of the pipeline. |
| CONFIG_JSON | The pipeline's configuration in JSON format. This JSON is read-only and is automatically generated when your pipeline configuration changes. The JSON schema for this column is described in the section below, and is intended for use in a web-based application (such as MemSQL Ops). |
| STATE | The current state of the pipeline. Possible values are Running, Error, and Stopped. Running: the pipeline is currently running; if the pipelines_stop_on_error variable is set to ON, the pipeline has not encountered any errors during extraction, transformation, or loading. Error: the pipeline encountered an error and is currently stopped; a pipeline in the Error state must be manually started. If the pipelines_stop_on_error variable is set to OFF, a pipeline cannot enter the Error state; it will remain in the Running state until it's manually stopped, and any errors that occur will be written to the information_schema.PIPELINES_ERRORS table. Stopped: the pipeline is currently stopped; a pipeline can only enter the Stopped state through manual intervention. |
| SKIPPED_BATCHES | The total number of batches that have been skipped by the pipeline. Batches may be skipped when the maximum number of batch retries is reached, which is set using the pipelines_max_retries_per_batch_partition variable. |
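For example, a quick way to spot pipelines that have stopped because of an error is to filter on the STATE column. A minimal sketch:

```sql
-- List any pipelines currently in the Error state,
-- along with how many batches each has skipped so far.
SELECT DATABASE_NAME, PIPELINE_NAME, STATE, SKIPPED_BATCHES
FROM information_schema.PIPELINES
WHERE STATE = 'Error';
```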
CONFIG_JSON Schema
The CONFIG_JSON column in the information_schema.PIPELINES table contains a fixed set of read-only JSON key/value pairs. Some of these JSON values can also be seen by executing the SHOW CREATE PIPELINE <pipeline-name> statement. Each key/value pair is described below.
Example CONFIG_JSON for Kafka Pipelines

```json
{
  "name": "mypipeline",
  "source_type": "KAFKA",
  "credentials": "172.17.0.2\/test-topic",
  "batch_interval": 0,
  "transform": ["http:\/\/127.0.0.1:8000\/transform.py", "", ""],
  "dup_key_policy": null,
  "table": "messages",
  "fields_terminated_by": "\t",
  "fields_enclosed_by": "",
  "fields_escaped_by": "\\",
  "lines_terminated_by": "\n",
  "lines_starting_by": "",
  "extended_null": false,
  "column_list": null,
  "on_duplicate_key_update": null,
  "running": false
}
```
Example CONFIG_JSON for S3 Pipelines

```json
{
  "name": "my-s3-pipeline",
  "source_type": "S3",
  "connection_string": "my-s3-bucket-name",
  "config": "{\"region\": \"us-west-1\"}",
  "credentials": "<CREDENTIALS REDACTED>",
  "batch_interval": 2500,
  "max_partitions_per_batch": -1,
  "transform": null,
  "load_error_policy": null,
  "dup_key_policy": null,
  "table": "my_table_name",
  "fields_terminated_by": ",",
  "fields_enclosed_by": "",
  "fields_escaped_by": "\\",
  "lines_terminated_by": "\n",
  "lines_starting_by": "",
  "extended_null": false,
  "column_list": null,
  "on_duplicate_key_update": null
}
```
CONFIG_JSON Schema Definition
| Key Name | Value Description |
| --- | --- |
| name | The name of the pipeline. |
| source_type | The data source type for the pipeline. |
| connection_string | The name of the S3 bucket or the bucket's object, with an optional prefix. |
| config | The configuration information provided when creating an S3 pipeline, namely the region where the source bucket is hosted. |
| credentials | Either the Kafka topic URL for the pipeline or <CREDENTIALS REDACTED> for an S3 pipeline. |
| batch_interval | The time duration in milliseconds between batch extraction operations. |
| transform | The transform's URI, executable entry point, and arguments. |
| load_error_policy | The load error policy for the pipeline. For example, if IGNORE or SKIP ... ERRORS was specified during pipeline creation, it will appear as a JSON key/value pair like so: {"load_error_policy": "skip_all_errors"} |
| dup_key_policy | The duplicate key policy that indicates how a row should be inserted if it contains a duplicate key value. |
| table | The name of the table in which to insert data. |
| fields_terminated_by | The character that terminates a field. |
| fields_enclosed_by | The character that encloses a field. |
| fields_escaped_by | The character that escapes a field. |
| lines_terminated_by | The character that terminates a line. |
| lines_starting_by | The string prefix for a line. |
| extended_null | Specifies whether the non-quoted and case-insensitive string null will be loaded as a null type. |
| column_list | The column list to load data into. |
| on_duplicate_key_update | Specifies whether duplicate keys will be updated or ignored. |
| running | Specifies whether the pipeline is currently running. |
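As a rough illustration, individual settings can be pulled out of the CONFIG_JSON blob with MemSQL's JSON built-ins; the sketch below assumes JSON_EXTRACT_STRING is appropriate for the string-valued keys it reads:

```sql
-- Show each pipeline's source type and destination table,
-- extracted from the read-only CONFIG_JSON column.
SELECT PIPELINE_NAME,
       JSON_EXTRACT_STRING(CONFIG_JSON, 'source_type') AS source_type,
       JSON_EXTRACT_STRING(CONFIG_JSON, 'table')       AS destination_table
FROM information_schema.PIPELINES;
```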
information_schema.PIPELINES_BATCHES_SUMMARY Table
The PIPELINES_BATCHES_SUMMARY table contains high-level information about individual batches as they are loaded into the database. It contains one row per batch. The columns in this table are described below.
information_schema.PIPELINES_BATCHES_SUMMARY Table Schema
| Name | Description |
| --- | --- |
| DATABASE_NAME | The name of the database associated with the batch. |
| PIPELINE_NAME | The name of the pipeline associated with the batch. |
| BATCH_ID | The internal unique identifier for the batch. Values of BATCH_ID are unique in this table. |
| BATCH_STATE | Specifies whether the batch transaction was successful. There are five possible values: Succeeded, In Progress, Failed, Queued, or No Data. Succeeded: the batch was successfully loaded into the destination table. In Progress: the batch is currently being loaded into the destination table. Failed: the batch failed to be written; batch failures are written as errors in the information_schema.PIPELINES_ERRORS table, and you can use the BATCH_ID to correlate a specific batch failure with more detailed error information. Queued: the batch is waiting for enough resources in the cluster to free up so it can run. No Data: the batch contained no data; batches that don't contain data will be deleted as new batches are written. |
| START_TIME | The time that the batch transaction started, in DATETIME format. |
| BATCH_TIME | The number of seconds the batch took to run or, if the batch is In Progress, the number of seconds it has been running so far. |
| ROWS_PER_SEC | Equal to ROWS_STREAMED / BATCH_TIME. |
| ROWS_STREAMED | The number of rows extracted into MemSQL. It can be higher or lower than the number of rows written if, for instance, rows are filtered out by a WHERE or SKIP clause, or rows are modified by an upsert clause. |
| NUM_PARTITIONS | The total number of batch partitions for this batch. |
| NUM_PARTITIONS_FINISHED | The total number of finished batch partitions for this batch. A batch cannot finish until all of its batch partitions are finished. |
| MB_STREAMED | The number of megabytes extracted into MemSQL. |
| MB_PER_SEC | Equal to MB_STREAMED / BATCH_TIME. |
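A quick way to gauge a pipeline's recent throughput is to look at its latest batches in this table. A minimal sketch, assuming a hypothetical pipeline named mypipeline:

```sql
-- The ten most recent batches for one pipeline, newest first,
-- with their row counts and throughput.
SELECT BATCH_ID, BATCH_STATE, START_TIME, BATCH_TIME,
       ROWS_STREAMED, ROWS_PER_SEC
FROM information_schema.PIPELINES_BATCHES_SUMMARY
WHERE PIPELINE_NAME = 'mypipeline'
ORDER BY START_TIME DESC
LIMIT 10;
```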
information_schema.PIPELINES_BATCHES Table
The PIPELINES_BATCHES table contains detailed, low-level information about individual batches as they're loaded into the database. It contains one row per batch partition and is useful for understanding partition-level details of a pipeline's behavior. The columns in this table are described below.
information_schema.PIPELINES_BATCHES Table Schema
| Name | Description |
| --- | --- |
| DATABASE_NAME | The name of the database associated with the batch. |
| PIPELINE_NAME | The name of the pipeline associated with the batch. |
| BATCH_ID | The internal unique identifier for the batch. While every row in the table contains unique information about a batch, multiple rows with the same BATCH_ID may be present. |
| BATCH_STATE | Specifies whether the batch transaction was successful. There are five possible values: Succeeded, In Progress, Failed, Queued, or No Data. Succeeded: the batch was successfully loaded into the destination table. In Progress: the batch is currently being loaded into the destination table. Failed: the batch failed to be written; batch failures are written as errors in the information_schema.PIPELINES_ERRORS table, and you can use the BATCH_ID to correlate a specific batch failure with more detailed error information. Queued: the batch is waiting for enough resources in the cluster to free up so it can run. No Data: the batch contained no data; batches that don't contain data will be deleted as new batches are written. |
| BATCH_ROWS_WRITTEN | The number of rows that were affected in the destination table as a result of the batch transaction. When batches are written to the destination table, they consist of insert, replace, upsert, or delete operations; the BATCH_ROWS_WRITTEN value is the sum of all the operations that occurred when the batch was written. This value may be NULL if the BATCH_STATE is any value other than Succeeded. |
| BATCH_TIME | The elapsed time in seconds from the start of the batch transaction to its success or failure. This value reflects the complete transaction time from the perspective of the master aggregator node; therefore, the elapsed time may be longer than a BATCH_PARTITION_TIME value for a given batch, as BATCH_PARTITION_TIME reflects the perspective of the leaf node. |
| BATCH_START_UNIX_TIMESTAMP | The time that the batch transaction started, in Unix timestamp format. |
| BATCH_PARTITION_STATE | Specifies whether the batch transaction was successful for a given partition. There are five possible values: Succeeded, In Progress, Failed, No Data, or NULL. Succeeded: the batch was successfully loaded into the destination table. In Progress: the batch is currently being loaded into the destination table. Failed: the batch failed to be written; batch failures are written as errors in the information_schema.PIPELINES_ERRORS table, and you can use the BATCH_ID and PARTITION values to correlate a specific batch failure with more detailed error information. No Data: the batch contained no data; batches that don't contain data will be deleted as new batches are written. NULL: the batch either contains no data, or the batch transaction hasn't yet been executed on the partition itself. |
| BATCH_PARTITION_PARSED_ROWS | The number of rows that were parsed during the batch transaction for a given partition. This value can be NULL if the batch transaction is still in progress or if the batch failed. |
| BATCH_SOURCE_PARTITION_ID | The identifier for the data source partition from which batch partition data is extracted. This value may be NULL if the batch's BATCH_PARTITION_STATE is also NULL. For an S3 pipeline, this represents the name of an object in the source bucket. |
| BATCH_EARLIEST_OFFSET | Specifies the earliest offset for the batch partition. This value indicates the start of the offset range for a batch, while BATCH_LATEST_OFFSET indicates the end of the offset range. This value may be NULL if the batch's BATCH_PARTITION_STATE is also NULL. For an S3 pipeline, this value will always be 0 when a bucket's object has been successfully processed. For more information, see S3 Pipelines Overview. |
| BATCH_LATEST_OFFSET | Specifies the latest offset for the batch. This value indicates the end of the offset range for a batch, while BATCH_EARLIEST_OFFSET indicates the start of the offset range. This value may be NULL if the batch's BATCH_PARTITION_STATE is also NULL. For an S3 pipeline, this value will always be 1 when a bucket's object has been successfully processed. For more information, see S3 Pipelines Overview. |
| BATCH_PARTITION_TIME | The elapsed time in seconds from the start of the batch transaction to its success or failure. This value reflects the batch transaction time from the perspective of the leaf node; therefore, the elapsed time may be shorter than a BATCH_TIME value for a given batch, as BATCH_TIME reflects the perspective of the master aggregator node. |
| BATCH_PARTITION_EXTRACTED_BYTES | The number of bytes that were extracted from the data source for the batch. This value may be NULL if no data was extracted for the batch. |
| BATCH_PARTITION_TRANSFORMED_BYTES | The number of bytes that were transformed for the batch. This value may be NULL if a transform was not specified for the pipeline, or if no data was successfully transformed. |
| BATCH_PARTITION_EXTRACTOR_WAIT_TIME | The amount of time, in seconds, that the pipeline spent waiting on the extractor to complete its operation. |
| BATCH_PARTITION_TRANSFORM_WAIT_TIME | The amount of time, in seconds, that the pipeline spent waiting on the transform to complete its operation. |
| HOST | The hostname or host IP address of the leaf node that processed the batch. This value may be NULL if BATCH_SOURCE_PARTITION_ID or PARTITION is also NULL. The combination of a batch's HOST, PORT, and PARTITION identifies the specific leaf node partition that attempted to load batch data. |
| PORT | The port number of the leaf node that processed the batch. This value may be NULL if BATCH_SOURCE_PARTITION_ID or PARTITION is also NULL. The combination of a batch's HOST, PORT, and PARTITION identifies the specific leaf node partition that attempted to load batch data. |
| PARTITION | Specifies the partition ID on the leaf node that processed the batch. This value may be NULL if BATCH_SOURCE_PARTITION_ID is also NULL. The combination of a batch's HOST, PORT, and PARTITION identifies the specific leaf node partition that attempted to load batch data. |
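When a batch fails, this table can narrow the failure down to the specific leaf node partition. A minimal sketch, again assuming a hypothetical pipeline named mypipeline:

```sql
-- Identify which leaf node partitions failed for each batch,
-- to correlate with rows in information_schema.PIPELINES_ERRORS.
SELECT BATCH_ID, HOST, PORT, PARTITION,
       BATCH_SOURCE_PARTITION_ID, BATCH_PARTITION_TIME
FROM information_schema.PIPELINES_BATCHES
WHERE PIPELINE_NAME = 'mypipeline'
  AND BATCH_PARTITION_STATE = 'Failed';
```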
information_schema.PIPELINES_ERRORS Table
The PIPELINES_ERRORS table contains detailed information about errors that occurred during extraction, transformation, or loading. Each row represents a single error event. The columns in this table are described below.

This table contains exhaustive node-level error information that may originate at either a leaf node or an aggregator node. Error event information may be propagated from a leaf to an aggregator or from one leaf to another, so a single error can appear in multiple rows. If a row appears to be a duplicate, check whether its ERROR_MESSAGE entry starts with LEAF ERROR (<host>:<port>); this text indicates the specific leaf node where the error originated.
information_schema.PIPELINES_ERRORS Table Schema
| Column Name | Description |
| --- | --- |
| DATABASE_NAME | The name of the database associated with the error. |
| PIPELINE_NAME | The name of the pipeline associated with the error. |
| ERROR_UNIX_TIMESTAMP | The time of the error event, in Unix timestamp format. |
| ERROR_TYPE | Specifies what type of error occurred. Possible values are Error or Warning. Error: an error occurred, which may have stopped the pipeline if the pipelines_stop_on_error variable is set to ON. Warning: a warning occurred, which does not stop a pipeline. |
| ERROR_CODE | The error code for the error. |
| ERROR_MESSAGE | The message associated with the error. This value contains contextual information about the error that can be used for debugging purposes, including a stack trace if the error was caused by a transform failure. |
| ERROR_KIND | Specifies whether the error event was caused by internal or external factors. Possible values are Internal, Extract, Transform, or Load. Internal: an internal error occurred within the Pipelines feature itself. Extract: an error occurred during extraction from the data source; extraction errors are typically caused by network availability or problems with the data source partitions. Transform: an error occurred during data transformation; the transform executable is typically the cause of transformation errors. Load: an error occurred when attempting to load data into the destination table; load errors are typically caused by malformed CSV data or by attempting to write invalid schema types into a column, such as a NULL value into a non-nullable column. |
| STD_ERROR | The text that may have been output to standard error during the data transformation. The origin of this text is the transform executable itself. This value can be empty or NULL if the ERROR_KIND value is not Transform or if no standard error text was output by the transform during failure. |
| LOAD_DATA_LINE | The text of a LOAD DATA statement line that caused a parsing error while attempting to load data into the destination table. This value contains the invalid line of CSV text that failed to parse. Load errors are typically caused by malformed CSV data or by attempting to write invalid schema types into a column, such as a NULL value into a non-nullable column. This value can be empty or NULL if the error didn't occur during the loading phase of pipeline execution. |
| LOAD_DATA_LINE_NUMBER | The line number of a LOAD DATA statement that caused a parsing error while attempting to load data into the destination table. A LOAD DATA statement may consist of many lines, and this value indicates the specific invalid line. This line number can be correlated with the value of LOAD_DATA_LINE, which contains the invalid line's text. This value may be empty or NULL if a LOAD DATA statement wasn't associated with the error event. |
| BATCH_ID | The internal unique identifier for the batch that experienced an error event. |
| ERROR_ID | The internal unique identifier for the error event. |
| BATCH_SOURCE_PARTITION_ID | The data source's partition ID for the batch. This value may be NULL if the error occurred on the master aggregator. |
| BATCH_EARLIEST_OFFSET | Specifies the earliest offset for the batch. This value indicates the start of the offset range for a batch, while BATCH_LATEST_OFFSET indicates the end of the offset range. This value may be NULL if the error occurred on the master aggregator. |
| BATCH_LATEST_OFFSET | Specifies the latest offset for the batch. This value indicates the end of the offset range for a batch, while BATCH_EARLIEST_OFFSET indicates the start of the offset range. This value may be NULL if the error occurred on the master aggregator. |
| HOST | The hostname or host IP address of the leaf node that processed the batch. This value may be NULL if BATCH_SOURCE_PARTITION_ID or PARTITION is also NULL. The combination of a batch's HOST, PORT, and PARTITION identifies the specific leaf node partition that attempted to load batch data. |
| PORT | The port number of the leaf node that processed the batch. This value may be NULL if BATCH_SOURCE_PARTITION_ID or PARTITION is also NULL. The combination of a batch's HOST, PORT, and PARTITION identifies the specific leaf node partition that attempted to load batch data. |
| PARTITION | Specifies the partition ID on the leaf node that processed the batch. This value may be NULL if BATCH_SOURCE_PARTITION_ID is also NULL. The combination of a batch's HOST, PORT, and PARTITION identifies the specific leaf node partition that attempted to load batch data. |
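For debugging, a common starting point is to pull a pipeline's most recent error events. A minimal sketch (pipeline name hypothetical) that converts the Unix timestamp to a readable time with FROM_UNIXTIME:

```sql
-- The twenty most recent error events for one pipeline,
-- with enough context to start diagnosing the failure.
SELECT FROM_UNIXTIME(ERROR_UNIX_TIMESTAMP) AS error_time,
       ERROR_KIND, ERROR_TYPE, ERROR_CODE,
       BATCH_ID, ERROR_MESSAGE, LOAD_DATA_LINE
FROM information_schema.PIPELINES_ERRORS
WHERE PIPELINE_NAME = 'mypipeline'
ORDER BY ERROR_UNIX_TIMESTAMP DESC
LIMIT 20;
```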
information_schema.PIPELINES_CURSORS Table
The PIPELINES_CURSORS table contains information about a pipeline's offset ranges. Each row represents a single data source partition. The columns in this table are described below.
information_schema.PIPELINES_CURSORS Table Schema
| Column Name | Description |
| --- | --- |
| DATABASE_NAME | The name of the database associated with the offsets. |
| PIPELINE_NAME | The name of the pipeline associated with the offsets. |
| SOURCE_TYPE | The name of the pipeline's data source, such as KAFKA or S3. |
| SOURCE_PARTITION_ID | The identifier for the data source partition from which offsets are extracted. For an S3 pipeline, this value represents the name of a bucket's object. |
| EARLIEST_OFFSET | The earliest offset for an offset range. This value indicates the starting offset, and LATEST_OFFSET indicates the ending offset, in an offset range. This value may be NULL if data extraction is still in progress or has failed. |
| LATEST_OFFSET | The bounding offset for an offset range, exclusive. This value indicates the ending offset, and EARLIEST_OFFSET indicates the starting offset, in an offset range. This value may be NULL if data extraction is still in progress or has failed. |
| CURSOR_OFFSET | The next offset at which the pipeline will start loading in this partition. This number represents the end of the range that has already been loaded by this pipeline, exclusive; all offsets considered in the future will be beyond this. |
| SUCCESSFUL_CURSOR_OFFSET | The offset cursor position in this source partition that has been successfully loaded. If this is lower than CURSOR_OFFSET, it implies that the intermediate offsets were skipped. Note that offsets will only be skipped if pipelines_stop_on_error is false. |
| UPDATED_UNIX_TIMESTAMP | The last updated time for an offset range, in Unix timestamp format. |
| EXTRA_FIELDS | Some extractors output additional information when extracting data from the source. For S3 pipelines, you may encounter a value of {"size":<bytes>}, where <bytes> represents the size of an object in the source bucket. |
Prior to MemSQL 6.5, MemSQL had a similar (now deprecated) table called PIPELINES_OFFSETS. The two tables contained the same information, but with slightly different semantics; in particular, the "high offsets" in PIPELINES_CURSORS are exclusive instead of inclusive.
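Because LATEST_OFFSET is exclusive and CURSOR_OFFSET marks the end of the already-loaded range, the difference between them gives a rough per-partition backlog. A sketch, assuming a hypothetical Kafka pipeline named mypipeline:

```sql
-- Approximate number of offsets discovered upstream but not yet
-- loaded, per source partition (NULL while extraction is in progress).
SELECT SOURCE_PARTITION_ID,
       LATEST_OFFSET - CURSOR_OFFSET AS offsets_remaining
FROM information_schema.PIPELINES_CURSORS
WHERE PIPELINE_NAME = 'mypipeline'
ORDER BY offsets_remaining DESC;
```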
information_schema.PIPELINES_FILES Table
The PIPELINES_FILES table stores information about files that have been extracted from a filesystem-like data source, such as Amazon S3. Each row represents a single file. The columns in this table are described below.
information_schema.PIPELINES_FILES Table Schema
| Column Name | Description |
| --- | --- |
| DATABASE_NAME | The name of the database associated with the file. |
| PIPELINE_NAME | The name of the pipeline associated with the file. |
| SOURCE_TYPE | The name of the pipeline's data source, such as S3. |
| FILE_NAME | The name of the file. |
| FILE_SIZE | The size of the file in bytes. |
| FILE_STATE | The state of the file, which is either Loaded, Skipped, or Unloaded. |
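To see which files an S3 pipeline has discovered but not yet loaded, a query like the following sketch can help (pipeline name hypothetical):

```sql
-- Files known to the pipeline that are still waiting to be loaded.
SELECT FILE_NAME, FILE_SIZE, FILE_STATE
FROM information_schema.PIPELINES_FILES
WHERE PIPELINE_NAME = 'mypipeline'
  AND FILE_STATE = 'Unloaded';
```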