Outdated Version

You are viewing an older version of this section. View current production version.

Pipelines System Variables

MemSQL Pipelines uses a few system variables that are either specific to an extractor or generic for the feature itself. You can see these variables and their default setting by executing the SHOW VARIABLES LIKE 'pipeline%'; statement:

memsql> SHOW VARIABLES LIKE '%pipeline%';
+--------------------------------------------+---------+
| Variable_name                              | Value   |
+--------------------------------------------+---------+
| pipelines_batches_metadata_to_keep         | 1000    |
| pipelines_extractor_debug_logging          | OFF     |
| pipelines_extractor_get_offsets_timeout_ms | 10000   |
| pipelines_kafka_version                    | 0.8.2.2 |
| pipelines_max_errors_per_partition         | 1000    |
| pipelines_max_offsets_per_batch_partition  | 1000000 |
| pipelines_max_retries_per_batch_partition  | 4       |
| pipelines_stderr_bufsize                   | 65535   |
| pipelines_stop_on_error                    | ON      |
+--------------------------------------------+---------+
9 rows in set (0.00 sec)
Info

You cannot set a variable for a specific pipeline – each variable setting applies to all pipelines in the cluster.

The following table describes each of these variables, arranged by general configuration and extractor-specific configuration:

To learn how to set pipelines variables in your cluster, see the How to Set Pipelines System Variables section below.

General Pipelines System Variables

Variable Name Default Value Description
pipelines_batches_metadata_to_keep 1000 The number of batch metadata entries to persist before they are overwritten by incoming batches.
As data is extracted from a source, it’s written in batches to a destination table on a leaf node. Metadata about these batches is temporarily persisted in the master aggregator’s information_schema.PIPELINES_BATCHES table. As new batches are loaded into the database, the oldest batch metadata entries will be removed from the information_schema.PIPELINES_BATCHES table. See the information_schema.PIPELINES_BATCHES Table section for more information about this metadata.
pipelines_extractor_debug_logging OFF Specifies whether to enable extractor debugging for Kafka pipelines. This variable currently does not apply to S3 pipelines.
pipelines_extractor_get_offsets_timeout_ms 10000 The maximum time in milliseconds to wait for offset data to be returned from the data source before returning an error. Increase this value if you experience timeout errors, such as ERROR 1970 (HY000): Subprocess timed out.
pipelines_max_retries_per_batch_partition 4 The number of retry attempts for writing batch partition data to the destination table.
If pipelines_stop_on_error is set to OFF and the specified retry number is reached without success, the batch partition will be skipped and will not appear in the destination table. If a batch partition is skipped, data loss can occur.
If pipelines_stop_on_error is set to ON and the specified retry number is reached without success, the pipeline will stop. No batch partition data will be skipped.
This configuration variable applies to the entire batch transaction, which includes extraction from a data source, optional transformation, and loading of the data into the destination table. If the batch transaction fails at any point during extraction, transformation, or loading, it will be retried up to the specified number.
pipelines_stop_on_error ON Specifies whether or not each pipeline in the cluster should stop when an error occurs.
If set to OFF, batches will be retried up to the number specified in the pipelines_max_retries_per_batch_partition variable. After all retries have failed, the batch will be skipped. When a batch is skipped, data loss can occur.
If set to ON, the batch transaction that caused the error will be retried up to the number specified in the pipelines_max_retries_per_batch_partition variable. After all retries have failed, the pipeline will enter a Stopped state and must be manually started.
pipelines_max_errors_per_partition 1000 The maximum number of error event rows per leaf node partition to persist before they are deleted.
Once the specified number of rows in the information_schema.PIPELINES_ERRORS table is reached, the database will eventually remove the oldest rows from the table. The removal mechanism for older error data is based on heuristics. Old errors are guaranteed to exist up to the specified number, but they may not immediately be removed.
pipelines_stderr_bufsize 65535 The buffer size for standard error output in bytes. Error messages that exceed this size will be truncated when written to the information_schema.PIPELINES_ERRORS table. However, the complete standard error text can be viewed by using the BATCH_ID and querying the information_schema.PIPELINES_BATCHES table.
pipelines_max_offsets_per_batch_partition 1000000 The maximum number of data source partition offsets to extract in a single batch transaction. If the data source’s partition contains fewer than the specified number of offsets, all of the partition’s offsets will be batched into the destination table.

Kafka Extractor System Variables

Variable Name Default Value Description
pipelines_kafka_version 0.8.2.2 The Kafka version used for the Kafka extractor. While the default version is 0.8.2.2, newer versions can also be specified.

How to Set Pipelines System Variables

The system variables associated with pipelines must be set on all nodes in a cluster. Due to MemSQL’s distributed nature, users commonly expect that a modified system variable will be automatically applied to every node in the cluster. However, each node persists its own system variables, and therefore any modified system variable is scoped to a single node.

It’s also important to consider what system variables have been modified when increasing your cluster size. Any modified system variable settings will not be automatically applied to new nodes when they’re added to the cluster; for example, no system variable inheritance occurs from the master aggregator to leaf nodes. Therefore when scaling your cluster, each new node’s settings must be updated with the same values as other nodes.

The easiest way to set system variables is to use MemSQL Ops’ memsql-update-config command, which can apply changes to each node in the cluster. If your cluster is not administered by MemSQL Ops, you can also set the variables manually on each node by using the SET GLOBAL command or by updating their respective memsql.cnf file.

The following sections describe how to use these options:

  • Setting Pipelines System Variables Using MemSQL Ops
  • Setting Pipelines System Variables Using SET GLOBAL VARIABLE
  • Setting Pipelines System Variables Using the memsql.cnf File

Setting Pipelines System Variables Using MemSQL Ops

If your cluster is administered by MemSQL Ops, the best way to change system variables is to use the memsql-update-config command. This command automatically updates one or more nodes’ memsql.cnf file. See the MEMSQL-UPDATE-CONFIG topic and the memsql.cnf topic for more information.

  1. While connected to the master aggregator node, create a new terminal window.
  2. At the prompt, execute the following command, replacing the pipelines_max_offsets_per_batch_partition variable key with your desired variable:
memsql-ops memsql-update-config --set-global --key pipelines_max_offsets_per_batch_partition --value 500000 --all

MemSQL Ops will attempt to change each node’s global system variable to the desired setting. If the operation was successful, MemSQL Ops will output the following message for each node:

Successfully updated config for MemSQL node <node-ID>.
  1. You can also verify that the change was successful by connecting to the master aggregator node with a MySQL client and executing the following command:
memsql> SHOW GLOBAL VARIABLES LIKE '%pipelines_max_offsets_per_batch_partition%';
+-------------------------------------------+--------+
| Variable_name                             | Value  |
+-------------------------------------------+--------+
| pipelines_max_offsets_per_batch_partition | 500000 |
+-------------------------------------------+--------+
1 row in set (0.00 sec)

Setting Pipelines System Variables Using SET GLOBAL VARIABLE

The SET GLOBAL VARIABLE command can be used to change a pipelines global variable setting. Here the GLOBAL keyword implies that the variable setting is persistent across all current and future client connections on the server. If the GLOBAL keyword is not used and SET VARIABLE is executed instead, a variable setting applies only to your current client connection, not all other current and future client connections. Some system variables in MemSQL can only be set globally, including pipelines variables. Therefore, they must be set as GLOBAL variables because their values must be persisted across all client connections.

To apply system variable changes across all nodes in the cluster, the SET GLOBAL command must be executed on each node individually. If you add more nodes to your cluster after setting a variable, you must also apply the same variable settings to each additional node.

  1. Use a MySQL client to connect to the desired node. Note that every node in the cluster must be updated.
  2. At the prompt, execute the following command, replacing the pipelines_max_offsets_per_batch_partition variable name with your desired variable:
memsql> SET GLOBAL pipelines_max_offsets_per_batch_partition = 500000;
Query OK, 0 rows affected (0.00 sec)

You can also verify that the change was successful by executing the following command:

memsql> SHOW GLOBAL VARIABLES LIKE '%pipelines_max_offsets_per_batch_partition%';
+-------------------------------------------+--------+
| Variable_name                             | Value  |
+-------------------------------------------+--------+
| pipelines_max_offsets_per_batch_partition | 500000 |
+-------------------------------------------+--------+
1 row in set (0.00 sec)

Setting Pipelines System Variables Using the memsql.cnf Configuration File

Each node has its own memsql.cnf file that is used to load the node’s configuration when MemSQL starts. System variables can be specified in this file, and they will automatically be applied to the node’s configuration when the node starts.

To set system variables in the memsql.cnf file:

  1. Ensure that the cluster is in a stopped state before continuing.
  2. Open the node’s memsql.cnf file in a text editor. Note that every other node’s memsql.cnf file must be updated as well.
  3. Add the following line to the file anywhere below the [server] declaration, replacing the pipelines_max_offsets_per_batch_partition variable name with your desired variable:
[server]
...
pipelines_max_offsets_per_batch_partition = 500000
...
  1. Save the file.
  2. Insert the same line to each node’s memsql.cnf file.
  3. After the memsql.cnf file has been updated for each node in the cluster, start the cluster.
  4. Verify that the change was successful by executing the following command on one of the affected nodes:
memsql> SHOW GLOBAL VARIABLES LIKE '%pipelines_max_offsets_per_batch_partition%';
+-------------------------------------------+--------+
| Variable_name                             | Value  |
+-------------------------------------------+--------+
| pipelines_max_offsets_per_batch_partition | 500000 |
+-------------------------------------------+--------+
1 row in set (0.00 sec)