Pipelines Information Schema Tables
There are four tables in the information_schema database that are related to Pipelines: PIPELINES, PIPELINES_BATCHES, PIPELINES_ERRORS, and PIPELINES_OFFSETS.
These tables contain exhaustive information about pipeline configuration and behavior. The information in these tables is especially useful for debugging errors that may occur during the extraction, transformation, or loading phases of a pipeline’s lifecycle.
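You can confirm which of these tables your cluster exposes with an ordinary query against information_schema. A minimal sketch (the exact set of tables returned depends on your MemSQL version):

```sql
-- List the Pipelines-related tables exposed by information_schema.
SELECT TABLE_NAME
FROM information_schema.TABLES
WHERE TABLE_NAME LIKE 'PIPELINES%';
```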
information_schema.PIPELINES Table
The PIPELINES table stores high-level information about any pipelines that have been created in the cluster. Each row represents a single pipeline. The columns in this table are described below.
information_schema.PIPELINES Table Schema
Column Name | Description |
---|---|
DATABASE_NAME | The name of the database associated with the pipeline. |
PIPELINE_NAME | The name of the pipeline. |
CONFIG_JSON | The pipeline’s configuration in JSON format. This JSON is read-only, and it’s automatically generated when your pipeline configuration changes. The JSON schema for this column is described in the section below, and is intended for use in a web-based application (such as MemSQL Ops). |
STATE | The current state of the pipeline. Possible values are Running, Error, and Stopped. Running: the pipeline is currently running; if the pipelines_stop_on_error variable is set to ON, the pipeline has not encountered any errors during extraction, transformation, or loading. Error: the pipeline encountered an error and is currently stopped; when a pipeline is in the Error state, it must be manually started. If the pipelines_stop_on_error variable is set to OFF, a pipeline cannot enter the Error state; it will remain in the Running state until it’s manually stopped, and any errors that occur will be written to the information_schema.PIPELINES_ERRORS table. Stopped: the pipeline is currently stopped; a pipeline can only enter the Stopped state through manual intervention. |
SKIPPED_BATCHES | The total number of batches that have been skipped in the pipeline. Batches may be skipped if the maximum number of batch retries was reached, which is set using the pipelines_max_retries_per_batch_partition variable. |
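As a sketch, a query like the following surfaces pipelines that have stopped due to errors (the pipeline name and database are placeholders; your cluster’s contents will differ):

```sql
-- Find pipelines stopped in the Error state, with their skipped-batch counts.
SELECT DATABASE_NAME, PIPELINE_NAME, STATE, SKIPPED_BATCHES
FROM information_schema.PIPELINES
WHERE STATE = 'Error';
```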
CONFIG_JSON Schema
The CONFIG_JSON column in the information_schema.PIPELINES table contains a fixed set of read-only JSON key/value pairs. Some of these JSON values can also be seen by executing the SHOW CREATE PIPELINE <pipeline-name> statement. Each key/value pair is described below.
{
"name":"mypipeline",
"source_type":"KAFKA",
"credentials":"172.17.0.2\/test-topic",
"batch_interval":0,
"transform":["http:\/\/127.0.0.1:8000\/transform.py","",""],
"dup_key_policy":null,
"table":"messages",
"fields_terminated_by":"\t",
"fields_enclosed_by":"",
"fields_escaped_by":"\\",
"lines_terminated_by":"\n",
"lines_starting_by":"",
"extended_null":false,
"column_list":null,
"on_duplicate_key_update":null,
"running":false
}
CONFIG_JSON Schema Definition
Key Name | Value Description |
---|---|
name | The name of the pipeline. |
source_type | The data source type for the pipeline. |
credentials | The Kafka topic URL for the pipeline. |
batch_interval | The time duration in milliseconds between batch extraction operations. |
transform | The transform’s URI, executable entrypoint, and arguments. |
dup_key_policy | The duplicate key policy that indicates how a row should be inserted if it contains a duplicate key value. |
table | The name of the table in which to insert data. |
fields_terminated_by | The character that terminates a field. |
fields_enclosed_by | The character that encloses a field. |
fields_escaped_by | The character that escapes a field. |
lines_terminated_by | The character that terminates a line. |
lines_starting_by | The string prefix for a line. |
extended_null | Specifies whether the non-quoted and case-insensitive string null will be loaded as a null type. |
column_list | The column list to load data into. |
on_duplicate_key_update | Specifies whether duplicate keys will be updated or ignored. |
running | Specifies whether the pipeline is currently running. |
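Because CONFIG_JSON is ordinary JSON, individual keys can be read in SQL with MemSQL’s JSON functions. The sketch below assumes the JSON_EXTRACT_STRING built-in and a pipeline named mypipeline; adjust both for your environment:

```sql
-- Pull two configuration values out of the read-only CONFIG_JSON column.
SELECT PIPELINE_NAME,
       JSON_EXTRACT_STRING(CONFIG_JSON, 'source_type') AS source_type,
       JSON_EXTRACT_STRING(CONFIG_JSON, 'table') AS target_table
FROM information_schema.PIPELINES
WHERE PIPELINE_NAME = 'mypipeline';
```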
information_schema.PIPELINES_BATCHES Table
The PIPELINES_BATCHES table stores detailed information about individual batches as they’re loaded into the database. The columns in this table are described below.
information_schema.PIPELINES_BATCHES Table Schema
Name | Description |
---|---|
DATABASE_NAME | The name of the database associated with the batch. |
PIPELINE_NAME | The name of the pipeline associated with the batch. |
BATCH_ID | The internal unique identifier for the batch. While every row in the table contains unique information about a batch, multiple rows with the same BATCH_ID may be present. |
BATCH_STATE | Specifies whether the batch transaction was successful. There are four possible values: Succeeded, In Progress, Failed, or No Data. Succeeded: the batch was successfully loaded into the destination table. In Progress: the batch is currently being loaded into the destination table. Failed: the batch failed to be written; batch failures are written as errors in the information_schema.PIPELINES_ERRORS table, and you can use the BATCH_ID to correlate a specific batch failure with more detailed error information. No Data: the batch contained no data; batches that don’t contain data will be deleted as new batches are written. |
BATCH_ROWS_WRITTEN | The number of rows that were affected in the destination table as a result of the batch transaction. When batches are written to the destination table, they consist of insert, replace, upsert, or delete operations. The BATCH_ROWS_WRITTEN value is the sum total of all the operations that occurred when the batch was written. This value may be NULL if the BATCH_STATE is any value other than Succeeded. |
BATCH_TIME | The elapsed time in seconds from the start of the batch transaction to its success or failure. This value reflects the complete transaction time from the perspective of the master aggregator node. Therefore, the elapsed time may be longer than a BATCH_PARTITION_TIME value for a given batch, as BATCH_PARTITION_TIME reflects the perspective of the leaf node. |
BATCH_START_UNIX_TIMESTAMP | The time that the batch transaction started in Unix timestamp format. |
BATCH_PARTITION_STATE | Specifies whether the batch transaction was successful for a given partition. There are five possible values: Succeeded, In Progress, Failed, No Data, or NULL. Succeeded: the batch was successfully loaded into the destination table. In Progress: the batch is currently being loaded into the destination table. Failed: the batch failed to be written; batch failures are written as errors in the information_schema.PIPELINES_ERRORS table, and you can use the BATCH_ID and PARTITION values to correlate a specific batch failure with more detailed error information. No Data: the batch contained no data; batches that don’t contain data will be deleted as new batches are written. NULL: the batch either contains no data, or the batch transaction hasn’t yet been executed on the partition itself. |
BATCH_PARTITION_PARSED_ROWS | The number of rows that were parsed during the batch transaction for a given partition. This value can be NULL if the batch transaction is still in progress or if the batch failed. |
BATCH_SOURCE_PARTITION_ID | The identifier for a data source’s partition from which batch partition data is extracted. This value may be NULL if the batch’s BATCH_PARTITION_STATE is also NULL. |
BATCH_EARLIEST_OFFSET | Specifies the earliest offset for the batch partition. This value indicates the start of the offset range for a batch, while BATCH_LATEST_OFFSET indicates the end of the offset range. This value may be NULL if the batch’s BATCH_PARTITION_STATE is also NULL. |
BATCH_LATEST_OFFSET | Specifies the latest offset for the batch. This value indicates the end of the offset range for a batch, while BATCH_EARLIEST_OFFSET indicates the start of the offset range. This value may be NULL if the batch’s BATCH_PARTITION_STATE is also NULL. |
BATCH_PARTITION_TIME | The elapsed time in seconds from the start of the batch transaction to its success or failure. This value reflects the batch transaction time from the perspective of the leaf node. Therefore, the elapsed time may be shorter than a BATCH_TIME value for a given batch, as BATCH_TIME reflects the perspective of the master aggregator node. |
HOST | The hostname or host IP address for the leaf node that processed the batch. This value may be NULL if BATCH_SOURCE_PARTITION_ID or PARTITION is also NULL. The combination of a batch’s HOST, PORT, and PARTITION identifies the specific leaf node partition that attempted to load batch data. |
PORT | The port number for the leaf node that processed the batch. This value may be NULL if BATCH_SOURCE_PARTITION_ID or PARTITION is also NULL. The combination of a batch’s HOST, PORT, and PARTITION identifies the specific leaf node partition that attempted to load batch data. |
PARTITION | Specifies the partition ID on a leaf node that processed the batch. This value may be NULL if BATCH_SOURCE_PARTITION_ID is also NULL. The combination of a batch’s HOST, PORT, and PARTITION identifies the specific leaf node partition that attempted to load batch data. |
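These columns can be combined to locate failing work. A hedged example, assuming a pipeline named mypipeline:

```sql
-- Identify failed batches and the leaf node partitions that processed them.
-- PARTITION is backquoted because it is a reserved word.
SELECT BATCH_ID, HOST, PORT, `PARTITION`, BATCH_TIME
FROM information_schema.PIPELINES_BATCHES
WHERE PIPELINE_NAME = 'mypipeline'
  AND BATCH_STATE = 'Failed';
```

The BATCH_ID values returned can then be looked up in information_schema.PIPELINES_ERRORS for detailed error information.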
information_schema.PIPELINES_ERRORS Table
The PIPELINES_ERRORS table stores detailed information about errors that occurred during extraction, transformation, or loading. Each row represents a single error event. The columns in this table are described below.
This table contains exhaustive node-level error information that may originate at either a leaf node or an aggregator node. Because error event information may be propagated from a leaf to an aggregator or from one leaf to another, some rows can appear to be duplicates. If a row appears to be a duplicate, first check whether its ERROR_MESSAGE entry starts with LEAF ERROR (<host>:<port>): this text indicates the specific leaf node origin for the error.
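One way to scan for originating errors without the propagated copies is to filter on that message prefix. A sketch, assuming a pipeline named mypipeline:

```sql
-- Show error rows whose message was not propagated from a leaf node,
-- i.e. rows recorded where the error was first observed.
SELECT ERROR_UNIX_TIMESTAMP, ERROR_KIND, ERROR_MESSAGE
FROM information_schema.PIPELINES_ERRORS
WHERE PIPELINE_NAME = 'mypipeline'
  AND ERROR_MESSAGE NOT LIKE 'LEAF ERROR%';
```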
information_schema.PIPELINES_ERRORS Table Schema
Column Name | Description |
---|---|
DATABASE_NAME | The name of the database associated with the error. |
PIPELINE_NAME | The name of the pipeline associated with the error. |
ERROR_UNIX_TIMESTAMP | The time of the error event in Unix timestamp format. |
ERROR_TYPE | Specifies what type of error occurred. Possible values are Error or Warning. Error: an error occurred, which may have stopped the pipeline if the pipelines_stop_on_error variable is set to ON. Warning: a warning occurred, which does not stop a pipeline. |
ERROR_NUMBER | The error code for the error. |
ERROR_MESSAGE | The message associated with the error. This value contains contextual information about the error that can be used for debugging purposes, including a stack trace if the error was caused by a transform failure. |
ERROR_KIND | Specifies whether the error event was caused by internal or external factors. Possible values are Internal, Extract, Transform, or Load. Internal: an internal error occurred within the Pipelines feature itself. Extract: an error occurred during extraction from the data source; extraction errors are typically caused by network availability or problems with the data source partitions. Transform: an error occurred during data transformation; the transform executable is typically the cause of transformation errors. Load: an error occurred when attempting to load data into the destination table; load errors are typically caused by malformed CSV data or attempting to write invalid schema types into a column, such as a NULL value into a non-nullable column. |
STD_ERROR | The text that may have been output during the data transformation. The origin of this text is the transform executable itself. This value can be empty or NULL if the ERROR_KIND value is not Transform or if no standard error text was output by the transform during failure. |
LOAD_DATA_LINE | The text of a LOAD DATA statement that caused a parsing error while attempting to load data into the destination table. This value contains the invalid line of CSV text that failed to parse. Load errors are typically caused by malformed CSV data or attempting to write invalid schema types into a column, such as a NULL value into a non-nullable column. This value can be empty or NULL if the error didn’t occur during the loading phase of pipeline execution. |
LOAD_DATA_LINE_NUMBER | The line number of a LOAD DATA statement that caused a parsing error while attempting to load data into the destination table. A LOAD DATA statement may consist of many lines, and this value indicates the specific invalid line. This line number can be correlated with the value of LOAD_DATA_LINE, which contains the invalid line’s text. This value may be empty or NULL if a LOAD DATA statement wasn’t associated with the error event. |
BATCH_ID | The internal unique identifier for the batch that experienced an error event. |
ERROR_ID | The internal unique identifier for the error event. |
BATCH_SOURCE_PARTITION_ID | The data source’s partition ID for the batch. This value may be NULL if the error occurred on the master aggregator. |
BATCH_EARLIEST_OFFSET | Specifies the earliest offset for the batch. This value indicates the start of the offset range for a batch, while BATCH_LATEST_OFFSET indicates the end of the offset range. This value may be NULL if the error occurred on the master aggregator. |
BATCH_LATEST_OFFSET | Specifies the latest offset for the batch. This value indicates the end of the offset range for a batch, while BATCH_EARLIEST_OFFSET indicates the start of the offset range. This value may be NULL if the error occurred on the master aggregator. |
HOST | The hostname or host IP address for the leaf node that processed the batch. This value may be NULL if BATCH_SOURCE_PARTITION_ID or PARTITION is also NULL. The combination of a batch’s HOST, PORT, and PARTITION identifies the specific leaf node partition that attempted to load batch data. |
PORT | The port number for the leaf node that processed the batch. This value may be NULL if BATCH_SOURCE_PARTITION_ID or PARTITION is also NULL. The combination of a batch’s HOST, PORT, and PARTITION identifies the specific leaf node partition that attempted to load batch data. |
PARTITION | Specifies the partition ID on a leaf node that processed the batch. This value may be NULL if BATCH_SOURCE_PARTITION_ID is also NULL. The combination of a batch’s HOST, PORT, and PARTITION identifies the specific leaf node partition that attempted to load batch data. |
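Because BATCH_ID appears in both PIPELINES_ERRORS and PIPELINES_BATCHES, the two tables can be joined to correlate an error event with the batch that produced it. A sketch, assuming a pipeline named mypipeline:

```sql
-- Correlate error events with the state of the batch they belong to.
SELECT e.ERROR_UNIX_TIMESTAMP, e.ERROR_KIND, e.ERROR_NUMBER, b.BATCH_STATE
FROM information_schema.PIPELINES_ERRORS e
JOIN information_schema.PIPELINES_BATCHES b
  ON e.BATCH_ID = b.BATCH_ID
WHERE e.PIPELINE_NAME = 'mypipeline'
ORDER BY e.ERROR_UNIX_TIMESTAMP DESC;
```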
information_schema.PIPELINES_OFFSETS Table
The PIPELINES_OFFSETS table stores information about a pipeline’s offset ranges. Each row represents a single data source partition. The columns in this table are described below.
information_schema.PIPELINES_OFFSETS Table Schema
Column Name | Description |
---|---|
DATABASE_NAME | The name of the database associated with the offsets. |
PIPELINE_NAME | The name of the pipeline associated with the offsets. |
SOURCE_TYPE | The name of the pipeline’s data source, such as KAFKA. |
SOURCE_PARTITION_ID | The identifier for a data source’s partition from which offsets are extracted. |
EARLIEST_OFFSET | The earliest offset in an offset range. This value indicates the starting offset, and LATEST_OFFSET indicates the ending offset, of an offset range. This value may be NULL if data extraction is still in progress or failed. |
LATEST_OFFSET | The latest offset in an offset range. This value indicates the ending offset, and EARLIEST_OFFSET indicates the starting offset, of an offset range. This value may be NULL if data extraction is still in progress or failed. |
LATEST_LOADED_OFFSET | The latest offset in the offset range that has been successfully extracted and loaded. If this value lags behind LATEST_OFFSET while extraction is still in progress for a given offset range, that does not necessarily indicate a failure to load offsets. |
UPDATED_UNIX_TIMESTAMP | The last updated time for an offset range in Unix timestamp format. |
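One practical use of these columns is estimating how far loading lags behind extraction on each source partition. A sketch, assuming numeric Kafka offsets and a pipeline named mypipeline:

```sql
-- Approximate per-partition backlog: offsets seen but not yet loaded.
SELECT SOURCE_PARTITION_ID,
       LATEST_OFFSET - LATEST_LOADED_OFFSET AS offsets_pending
FROM information_schema.PIPELINES_OFFSETS
WHERE PIPELINE_NAME = 'mypipeline'
ORDER BY offsets_pending DESC;
```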