
Pipelines Information Schema Tables

The information_schema database contains several tables that are related to Pipelines. These tables contain exhaustive information about pipeline configuration and behavior. The information in these tables is especially useful for debugging errors that may occur during the extraction, transformation, or loading phases of a pipeline’s lifecycle.

information_schema.PIPELINES Table

The PIPELINES table stores high-level information about any pipelines that have been created in the cluster. Each row represents a single pipeline. The columns in this table are described below.

information_schema.PIPELINES Table Schema

| Column Name | Description |
| --- | --- |
| DATABASE_NAME | The name of the database associated with the pipeline. |
| PIPELINE_NAME | The name of the pipeline. |
| CONFIG_JSON | The pipeline’s configuration in JSON format. This JSON is read-only and is automatically generated when your pipeline configuration changes. The JSON schema for this column is described in the section below and is intended for use in a web-based application (such as MemSQL Ops). |
| STATE | The current state of the pipeline. Possible values are Running, Error, and Stopped. Running: the pipeline is currently running; if the pipelines_stop_on_error variable is set to ON, the pipeline has not encountered any errors during extraction, transformation, or loading. Error: the pipeline encountered an error and is currently stopped; a pipeline in the Error state must be manually started. If pipelines_stop_on_error is set to OFF, a pipeline cannot enter the Error state; it will remain in the Running state until it is manually stopped, and any errors that occur will be written to the information_schema.PIPELINES_ERRORS table. Stopped: the pipeline is currently stopped; a pipeline can only enter the Stopped state through manual intervention. |
| SKIPPED_BATCHES | The total number of batches that have been skipped by the pipeline. Batches may be skipped if the maximum number of batch retries was reached, which is set using the pipelines_max_retries_per_batch_partition variable. |
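
For example, a quick health check is to look for pipelines that have stopped because of an error. This is an illustrative sketch, not the only approach:

-- List every pipeline currently stopped in the Error state,
-- along with how many of its batches were skipped.
SELECT DATABASE_NAME, PIPELINE_NAME, SKIPPED_BATCHES
FROM information_schema.PIPELINES
WHERE STATE = 'Error';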

CONFIG_JSON Schema

The CONFIG_JSON column in the information_schema.PIPELINES table contains a fixed set of read-only JSON key/value pairs. Some of these JSON values can also be seen by executing the SHOW CREATE PIPELINE <pipeline-name> statement. Each key/value pair is described below.

Example CONFIG_JSON for Kafka Pipelines

{
   "name":"mypipeline",
   "source_type":"KAFKA",
   "credentials":"172.17.0.2\/test-topic",
   "batch_interval":0,
   "transform":["http:\/\/127.0.0.1:8000\/transform.py","",""],
   "dup_key_policy":null,
   "table":"messages",
   "fields_terminated_by":"\t",
   "fields_enclosed_by":"",
   "fields_escaped_by":"\\",
   "lines_terminated_by":"\n",
   "lines_starting_by":"",
   "extended_null":false,
   "column_list":null,
   "on_duplicate_key_update":null,
   "running":false
}

Example CONFIG_JSON for S3 Pipelines

{
	"name": "my-s3-pipeline",
	"source_type": "S3",
	"connection_string": "my-s3-bucket-name",
	"config": "{\"region\": \"us-west-1\"}",
	"credentials": "<CREDENTIALS REDACTED>",
	"batch_interval": 2500,
	"max_partitions_per_batch": -1,
	"transform": null,
	"load_error_policy": null,
	"dup_key_policy": null,
	"table": "my_table_name",
	"fields_terminated_by": ",",
	"fields_enclosed_by": "",
	"fields_escaped_by": "\\",
	"lines_terminated_by": "\n",
	"lines_starting_by": "",
	"extended_null": false,
	"column_list": null,
	"on_duplicate_key_update": null
}

CONFIG_JSON Schema Definition

| Key Name | Value Description |
| --- | --- |
| name | The name of the pipeline. |
| source_type | The data source type for the pipeline. |
| connection_string | The name of the S3 bucket or the bucket’s object, with an optional prefix. |
| config | The configuration information provided when creating an S3 pipeline, namely the region where the source bucket is hosted. |
| credentials | Either the Kafka topic URL for the pipeline or <CREDENTIALS REDACTED> for an S3 pipeline. |
| batch_interval | The time duration in milliseconds between batch extraction operations. |
| transform | The transform’s URI, executable entry point, and arguments. |
| load_error_policy | The load error policy for the pipeline. For example, if IGNORE or SKIP ... ERRORS was specified during pipeline creation, it will appear as a JSON key/value pair like so: {"load_error_policy": "skip_all_errors"} |
| dup_key_policy | The duplicate key policy that indicates how a row should be inserted if it contains a duplicate key value. |
| table | The name of the table in which to insert data. |
| fields_terminated_by | The character that terminates a field. |
| fields_enclosed_by | The character that encloses a field. |
| fields_escaped_by | The character that escapes a field. |
| lines_terminated_by | The character that terminates a line. |
| lines_starting_by | The string prefix for a line. |
| extended_null | Specifies whether the non-quoted and case-insensitive string null will be loaded as a null type. |
| column_list | The column list to load data into. |
| on_duplicate_key_update | Specifies whether duplicate keys will be updated or ignored. |
| running | Specifies whether the pipeline is currently running. This corresponds to the pipeline’s current state, which is either Running, Error, or Stopped. |
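
Because CONFIG_JSON is a JSON value, individual keys can be extracted directly in a query. A minimal sketch, assuming MemSQL’s JSON_EXTRACT_STRING builtin is available:

-- Pull the source type and destination table out of each
-- pipeline's read-only configuration JSON.
SELECT PIPELINE_NAME,
       JSON_EXTRACT_STRING(CONFIG_JSON, 'source_type') AS source_type,
       JSON_EXTRACT_STRING(CONFIG_JSON, 'table') AS target_table
FROM information_schema.PIPELINES;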

information_schema.PIPELINES_BATCHES_SUMMARY Table

The PIPELINES_BATCHES_SUMMARY table contains high-level information about individual batches as they are loaded into the database. It contains one row per batch. The columns in this table are described below.

information_schema.PIPELINES_BATCHES_SUMMARY Table Schema

| Name | Description |
| --- | --- |
| DATABASE_NAME | The name of the database associated with the batch. |
| PIPELINE_NAME | The name of the pipeline associated with the batch. |
| BATCH_ID | The internal unique identifier for the batch. Values of BATCH_ID are unique in this table. |
| BATCH_STATE | Specifies whether the batch transaction was successful. There are five possible values: Succeeded, In Progress, Failed, Queued, or No Data. Succeeded: the batch was successfully loaded into the destination table. In Progress: the batch is currently being loaded into the destination table. Failed: the batch failed to be written; batch failures are written as errors in the information_schema.PIPELINES_ERRORS table, and you can use the BATCH_ID to correlate a specific batch failure with more detailed error information. Queued: the batch is waiting for enough resources in the cluster to free up so it can run. No Data: the batch contained no data; batches that don’t contain data will be deleted as new batches are written. |
| START_TIME | The time that the batch transaction started, in DATETIME format. |
| BATCH_TIME | The number of seconds the batch took to run or, if the batch is In Progress, the number of seconds it has been running for. |
| ROWS_PER_SEC | Equal to ROWS_STREAMED / BATCH_TIME. |
| ROWS_STREAMED | The number of rows extracted into MemSQL. It can be higher or lower than the number of rows written if, for instance, rows are filtered out by a WHERE or SKIP clause, or if rows are modified by an upsert clause. |
| NUM_PARTITIONS | The total number of batch partitions for this batch. |
| NUM_PARTITIONS_FINISHED | The total number of finished batch partitions for this batch. A batch cannot finish until all of its batch partitions are finished. |
| MB_STREAMED | The number of megabytes extracted into MemSQL. |
| MB_PER_SEC | Equal to MB_STREAMED / BATCH_TIME. |
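
For example, to spot-check recent load throughput for one pipeline, you can sort this table by start time. A minimal sketch; mypipeline is a placeholder name:

-- Show the ten most recent batches and their throughput.
SELECT BATCH_ID, BATCH_STATE, BATCH_TIME, ROWS_STREAMED, ROWS_PER_SEC, MB_PER_SEC
FROM information_schema.PIPELINES_BATCHES_SUMMARY
WHERE PIPELINE_NAME = 'mypipeline'
ORDER BY START_TIME DESC
LIMIT 10;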

information_schema.PIPELINES_BATCHES Table

The PIPELINES_BATCHES table contains detailed, low-level information about individual batches as they’re loaded into the database. It contains one row per batch partition and is useful for understanding partition-level details of a pipeline’s behavior. The columns in this table are described below.

information_schema.PIPELINES_BATCHES Table Schema

| Name | Description |
| --- | --- |
| DATABASE_NAME | The name of the database associated with the batch. |
| PIPELINE_NAME | The name of the pipeline associated with the batch. |
| BATCH_ID | The internal unique identifier for the batch. While every row in the table contains unique information about a batch, multiple rows with the same BATCH_ID may be present. |
| BATCH_STATE | Specifies whether the batch transaction was successful. There are five possible values: Succeeded, In Progress, Failed, Queued, or No Data. Succeeded: the batch was successfully loaded into the destination table. In Progress: the batch is currently being loaded into the destination table. Failed: the batch failed to be written; batch failures are written as errors in the information_schema.PIPELINES_ERRORS table, and you can use the BATCH_ID to correlate a specific batch failure with more detailed error information. Queued: the batch is waiting for enough resources in the cluster to free up so it can run. No Data: the batch contained no data; batches that don’t contain data will be deleted as new batches are written. |
| BATCH_ROWS_WRITTEN | The number of rows that were affected in the destination table as a result of the batch transaction. When batches are written to the destination table, they consist of insert, replace, upsert, or delete operations. The BATCH_ROWS_WRITTEN value is the sum total of all the operations that occurred when the batch was written. This value may be NULL if the BATCH_STATE is any value other than Succeeded. |
| BATCH_TIME | The elapsed time in seconds from the start of the batch transaction to its success or failure. This value reflects the complete transaction time from the perspective of the master aggregator node. Therefore, the elapsed time may be longer than a BATCH_PARTITION_TIME value for a given batch, as BATCH_PARTITION_TIME reflects the perspective of the leaf node. |
| BATCH_START_UNIX_TIMESTAMP | The time that the batch transaction started, in Unix timestamp format. |
| BATCH_PARTITION_STATE | Specifies whether the batch transaction was successful for a given partition. There are five possible values: Succeeded, In Progress, Failed, No Data, or NULL. Succeeded: the batch was successfully loaded into the destination table. In Progress: the batch is currently being loaded into the destination table. Failed: the batch failed to be written; batch failures are written as errors in the information_schema.PIPELINES_ERRORS table, and you can use the BATCH_ID and PARTITION values to correlate a specific batch failure with more detailed error information. No Data: the batch contained no data; batches that don’t contain data will be deleted as new batches are written. NULL: the batch either contains no data, or the batch transaction hasn’t yet been executed on the partition itself. |
| BATCH_PARTITION_PARSED_ROWS | The number of rows that were parsed during the batch transaction for a given partition. This value can be NULL if the batch transaction is still in progress or if the batch failed. |
| BATCH_SOURCE_PARTITION_ID | The identifier for the data source partition from which batch partition data is extracted. This value may be NULL if the batch’s BATCH_PARTITION_STATE is also NULL. For an S3 pipeline, this represents the name of an object in the source bucket. |
| BATCH_EARLIEST_OFFSET | Specifies the earliest offset for the batch partition. This value indicates the start of the offset range for a batch, while BATCH_LATEST_OFFSET indicates the end of the offset range. This value may be NULL if the batch’s BATCH_PARTITION_STATE is also NULL. For an S3 pipeline, this value will always be 0 when a bucket’s object has been successfully processed. For more information, see the S3 Pipelines Overview. |
| BATCH_LATEST_OFFSET | Specifies the latest offset for the batch. This value indicates the end of the offset range for a batch, while BATCH_EARLIEST_OFFSET indicates the start of the offset range. This value may be NULL if the batch’s BATCH_PARTITION_STATE is also NULL. For an S3 pipeline, this value will always be 1 when a bucket’s object has been successfully processed. For more information, see the S3 Pipelines Overview. |
| BATCH_PARTITION_TIME | The elapsed time in seconds from the start of the batch transaction to its success or failure. This value reflects the batch transaction time from the perspective of the leaf node. Therefore, the elapsed time may be shorter than a BATCH_TIME value for a given batch, as BATCH_TIME reflects the perspective of the master aggregator node. |
| BATCH_PARTITION_EXTRACTED_BYTES | The number of bytes that were extracted from the data source for the batch. This value may be NULL if no data was extracted for the batch. |
| BATCH_PARTITION_TRANSFORMED_BYTES | The number of bytes that were transformed for the batch. This value may be NULL if a transform was not specified for the pipeline, or if no data was successfully transformed. |
| BATCH_PARTITION_EXTRACTOR_WAIT_TIME | The amount of time, in seconds, that the pipeline spent waiting on the extractor to complete its operation. |
| BATCH_PARTITION_TRANSFORM_WAIT_TIME | The amount of time, in seconds, that the pipeline spent waiting on the transform to complete its operation. |
| HOST | The hostname or host IP address of the leaf node that processed the batch. This value may be NULL if BATCH_SOURCE_PARTITION_ID or PARTITION is also NULL. The combination of a batch’s HOST, PORT, and PARTITION identifies the specific leaf node partition that attempted to load batch data. |
| PORT | The port number of the leaf node that processed the batch. This value may be NULL if BATCH_SOURCE_PARTITION_ID or PARTITION is also NULL. The combination of a batch’s HOST, PORT, and PARTITION identifies the specific leaf node partition that attempted to load batch data. |
| PARTITION | Specifies the partition ID on the leaf node that processed the batch. This value may be NULL if BATCH_SOURCE_PARTITION_ID is also NULL. The combination of a batch’s HOST, PORT, and PARTITION identifies the specific leaf node partition that attempted to load batch data. |
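
For example, once a failed batch has been identified in PIPELINES_BATCHES_SUMMARY, this table shows which leaf partitions were involved. A minimal sketch; the BATCH_ID value 42 is a placeholder:

-- Show the per-partition outcome of one batch. PARTITION is
-- backquoted in case it is treated as a reserved word.
SELECT HOST, PORT, `PARTITION`, BATCH_PARTITION_STATE, BATCH_SOURCE_PARTITION_ID
FROM information_schema.PIPELINES_BATCHES
WHERE BATCH_ID = 42;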

information_schema.PIPELINES_ERRORS Table

The PIPELINES_ERRORS table contains detailed information about errors that occurred during extraction, transformation, or loading. Each row represents a single error event. The columns in this table are described below.

Info

This table contains exhaustive node-level error information that may originate at either a leaf node or an aggregator node. Error event information may be propagated from a leaf to an aggregator, or from one leaf to another, so some rows may appear to be duplicates. If a row appears to be a duplicate, check whether its ERROR_MESSAGE entry starts with LEAF ERROR (<host>:<port>): this prefix indicates the specific leaf node where the error originated.
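
For example, to see each error only once, at its node of origin, you can filter out the propagated copies. This is an illustrative sketch, not the only approach:

-- Hide rows that merely restate an error propagated from a leaf.
SELECT PIPELINE_NAME, ERROR_KIND, ERROR_CODE, ERROR_MESSAGE
FROM information_schema.PIPELINES_ERRORS
WHERE ERROR_MESSAGE NOT LIKE 'LEAF ERROR%';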

information_schema.PIPELINES_ERRORS Table Schema

| Column Name | Description |
| --- | --- |
| DATABASE_NAME | The name of the database associated with the error. |
| PIPELINE_NAME | The name of the pipeline associated with the error. |
| ERROR_UNIX_TIMESTAMP | The time of the error event, in Unix timestamp format. |
| ERROR_TYPE | Specifies what type of error occurred. Possible values are Error and Warning. Error: an error occurred, which may have stopped the pipeline if the pipelines_stop_on_error variable is set to ON. Warning: a warning occurred, which does not stop a pipeline. |
| ERROR_CODE | The error code for the error. |
| ERROR_MESSAGE | The message associated with the error. This value contains contextual information about the error that can be used for debugging purposes, including a stack trace if the error was caused by a transform failure. |
| ERROR_KIND | Specifies whether the error event was caused by internal or external factors. Possible values are Internal, Extract, Transform, or Load. Internal: an internal error occurred within the Pipelines feature itself. Extract: an error occurred during extraction from the data source; extraction errors are typically caused by network availability or problems with the data source partitions. Transform: an error occurred during data transformation; the transform executable is typically the cause of transformation errors. Load: an error occurred when attempting to load data into the destination table; load errors are typically caused by malformed CSV data or by attempting to write invalid schema types into a column, such as a NULL value into a non-nullable column. |
| STD_ERROR | The text that may have been output during the data transformation. The origin of this text is the transform executable itself. This value can be empty or NULL if the ERROR_KIND value is not Transform or if no standard error text was output by the transform during failure. |
| LOAD_DATA_LINE | The text of a LOAD DATA statement that caused a parsing error while attempting to load data into the destination table. This value contains the invalid line of CSV text that failed to parse. Load errors are typically caused by malformed CSV data or by attempting to write invalid schema types into a column, such as a NULL value into a non-nullable column. This value can be empty or NULL if the error didn’t occur during the loading phase of pipeline execution. |
| LOAD_DATA_LINE_NUMBER | The line number of a LOAD DATA statement that caused a parsing error while attempting to load data into the destination table. A LOAD DATA statement may consist of many lines, and this value indicates the specific invalid line. This line number can be correlated with the value of LOAD_DATA_LINE, which contains the invalid line’s text. This value may be empty or NULL if a LOAD DATA statement wasn’t associated with the error event. |
| BATCH_ID | The internal unique identifier for the batch that experienced an error event. |
| ERROR_ID | The internal unique identifier for the error event. |
| BATCH_SOURCE_PARTITION_ID | The data source’s partition ID for the batch. This value may be NULL if the error occurred on the master aggregator. |
| BATCH_EARLIEST_OFFSET | Specifies the earliest offset for the batch. This value indicates the start of the offset range for a batch, while BATCH_LATEST_OFFSET indicates the end of the offset range. This value may be NULL if the error occurred on the master aggregator. |
| BATCH_LATEST_OFFSET | Specifies the latest offset for the batch. This value indicates the end of the offset range for a batch, while BATCH_EARLIEST_OFFSET indicates the start of the offset range. This value may be NULL if the error occurred on the master aggregator. |
| HOST | The hostname or host IP address of the leaf node that processed the batch. This value may be NULL if BATCH_SOURCE_PARTITION_ID or PARTITION is also NULL. The combination of a batch’s HOST, PORT, and PARTITION identifies the specific leaf node partition that attempted to load batch data. |
| PORT | The port number of the leaf node that processed the batch. This value may be NULL if BATCH_SOURCE_PARTITION_ID or PARTITION is also NULL. The combination of a batch’s HOST, PORT, and PARTITION identifies the specific leaf node partition that attempted to load batch data. |
| PARTITION | Specifies the partition ID on the leaf node that processed the batch. This value may be NULL if BATCH_SOURCE_PARTITION_ID is also NULL. The combination of a batch’s HOST, PORT, and PARTITION identifies the specific leaf node partition that attempted to load batch data. |
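
For example, to correlate a failed batch with its detailed error information, filter on the batch’s BATCH_ID. A minimal sketch; 42 is a placeholder value:

-- Retrieve every error event recorded for one batch, oldest first.
SELECT ERROR_UNIX_TIMESTAMP, ERROR_KIND, ERROR_CODE, ERROR_MESSAGE, STD_ERROR
FROM information_schema.PIPELINES_ERRORS
WHERE BATCH_ID = 42
ORDER BY ERROR_UNIX_TIMESTAMP;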

information_schema.PIPELINES_CURSORS Table

The PIPELINES_CURSORS table contains information about a pipeline’s offset ranges. Each row represents a single data source partition. The columns in this table are described below.

information_schema.PIPELINES_CURSORS Table Schema

| Column Name | Description |
| --- | --- |
| DATABASE_NAME | The name of the database associated with the offsets. |
| PIPELINE_NAME | The name of the pipeline associated with the offsets. |
| SOURCE_TYPE | The name of the pipeline’s data source type, such as KAFKA or S3. |
| SOURCE_PARTITION_ID | The identifier for the data source partition from which offsets are extracted. For an S3 pipeline, this value represents the name of a bucket’s object. |
| EARLIEST_OFFSET | The earliest offset for an offset range. This value indicates the starting offset, while LATEST_OFFSET indicates the ending offset, in an offset range. This value may be NULL if data extraction is still in progress or has failed. |
| LATEST_OFFSET | The bounding offset for an offset range, exclusive. This value indicates the ending offset, while EARLIEST_OFFSET indicates the starting offset, in an offset range. This value may be NULL if data extraction is still in progress or has failed. |
| CURSOR_OFFSET | The next offset at which the pipeline will start loading in this partition. This number represents the end of the range that has already been loaded by this pipeline, exclusive; all offsets considered in the future will be beyond this. |
| SUCCESSFUL_CURSOR_OFFSET | The offset cursor position in this source partition up to which data has been successfully loaded. If this is lower than CURSOR_OFFSET, it implies that the intermediate offsets were skipped. Note: offsets will only be skipped if pipelines_stop_on_error is set to OFF. |
| UPDATED_UNIX_TIMESTAMP | The last updated time for an offset range, in Unix timestamp format. |
| EXTRA_FIELDS | Some extractors output additional information when extracting data from the source. For S3 pipelines, you may encounter a value of {"size":<bytes>}, where <bytes> represents the size of an object in the source bucket. |
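
Because the high offsets in this table are exclusive (see the Warning below), the difference between LATEST_OFFSET and CURSOR_OFFSET approximates how many offsets remain to be loaded in each source partition. A minimal sketch; mypipeline is a placeholder name:

-- Estimate per-partition lag for one pipeline.
SELECT SOURCE_PARTITION_ID,
       LATEST_OFFSET - CURSOR_OFFSET AS offsets_remaining
FROM information_schema.PIPELINES_CURSORS
WHERE PIPELINE_NAME = 'mypipeline';
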
Warning

Previously, MemSQL had a similar (now deprecated) table called PIPELINES_OFFSETS, which has been replaced by PIPELINES_CURSORS. These tables contain the same information, but with slightly different semantics. In particular, the “high offsets” in the PIPELINES_CURSORS table (LATEST_OFFSET, CURSOR_OFFSET, and SUCCESSFUL_CURSOR_OFFSET) are exclusive instead of inclusive. Here, exclusive means that the highest offset value in a given partition has not yet been loaded; it is the next offset that will be considered. If these columns were inclusive, the highest offset value would instead be the last offset that has already been loaded.

information_schema.PIPELINES_FILES Table

The PIPELINES_FILES table stores information about files that have been extracted from a filesystem-like data source, such as Amazon S3. Each row represents a single file. The columns in this table are described below.

information_schema.PIPELINES_FILES Table Schema

| Column Name | Description |
| --- | --- |
| DATABASE_NAME | The name of the database associated with the file. |
| PIPELINE_NAME | The name of the pipeline associated with the file. |
| SOURCE_TYPE | The name of the pipeline’s data source type, such as S3. |
| FILE_NAME | The name of the file. |
| FILE_SIZE | The size of the file in bytes. |
| FILE_STATE | Shows the state of the file, which is either Loaded, Skipped, or Unloaded. |
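
For example, you can list the files an S3 pipeline has discovered but not yet loaded. A minimal sketch; mypipeline is a placeholder name:

-- Show pending files, largest first.
SELECT FILE_NAME, FILE_SIZE
FROM information_schema.PIPELINES_FILES
WHERE PIPELINE_NAME = 'mypipeline'
  AND FILE_STATE = 'Unloaded'
ORDER BY FILE_SIZE DESC;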