Outdated Version

You are viewing an older version of this section. View current production version.

Statistics and ANALYZE

MemSQL collects statistics on table data to facilitate accurate query optimization. The query optimizer uses these statistics to help choose the best execution plan for each query.

Some statistics are collected and updated automatically in the background, while other statistics are collected by running the ANALYZE command. The ANALYZE command also is used to trigger plan re-optimization when data statistics change.

MemSQL uses two types of statistics (see the Types of statistics section for more detail):

  • Column statistics, which includes information on the cardinality (number of distinct values) of a column.
  • Histograms (also known as range statistics), which provide information on the distribution of data in a column.

Collecting column statistics is strongly recommended for optimal query performance. Collecting histograms is recommended in most situations. See the ANALYZE command section for more details.

Automatic statistics (abbreviated autostats) are collected and updated automatically in the background. Non-automatic statistics are collected by running the ANALYZE command. In MemSQL 6.0 and later, column statistics on columnstore tables are automatic, while column statistics on rowstore tables and all histograms are currently non-automatic. In MemSQL 5.x and earlier, all statistics are non-automatic.

The ANALYZE command

Non-automatic statistics for a table are collected and updated by running the ANALYZE command:

ANALYZE TABLE table_name;

In MemSQL 6.0 and later, non-automatic statistics include column statistics on rowstore tables and all histograms (on both rowstore and columnstore tables). In versions prior to 6.0, all statistics are non-automatic.

ANALYZE also triggers plan invalidation and re-optimization when statistics change, as described in the Plan invalidation section (note that this applies to both automatic and non-automatic statistics).

ANALYZE collects column statistics on every column (unless the table is using automatic statistics), as well as histograms on chosen columns. You may designate which columns to collect histograms on with the ANALYZE TABLE table_name COLUMNS ... {ENABLE | DISABLE} commands, as described below.

Collecting column statistics is strongly recommended for optimal query performance. Collecting histograms is recommended in most situations. If histograms are unavailable, the optimizer will use dynamic sampling to compute estimates: it will sample the table data by scanning part of the table. Whether histogram-based estimates or sampling-based estimates are more accurate depends on the table and query, but histograms are much faster to use than sampling during query optimization, since the optimizer merely needs to read the previously collected statistics, while sampling requires scanning a large amount of table data.

Non-automatic statistics are not updated when the table is modified through DML statements, including INSERT, UPDATE, DELETE, LOAD DATA, and TRUNCATE. They are only collected and updated when the ANALYZE command is called, which completely overwrites any previous statistics collected on the table.

Warning
  • ANALYZE may invalidate plans, i.e. cause them to be reoptimized and recompiled the next time they are run, when data statistics change and a new plan could be faster. See the Plan invalidation section for more details.

  • Collecting non-automatic statistics with ANALYZE can be expensive for large tables, especially when collecting histograms. Because non-automatic statistics are collected only when you run ANALYZE, you can choose when to run it to best manage the performance impact.

  • ANALYZE does not block concurrent queries and can be safely used while running concurrent queries. Any query compilations will use the present statistics until the ANALYZE completes, at which point query compilations atomically switch to using the new statistics.

Notes

All forms of ANALYZE are subject to the following restrictions:

  • ANALYZE must be run on the master aggregator node.
  • ANALYZE requires ALTER and SELECT permissions on the target table.
  • As usual, you may specify USING db_name prior to ANALYZE, as in [USING db_name] ANALYZE ..., where db_name is a MemSQL database. The context database of the connection is used if the USING clause is omitted.
  • table_name must be the name of a table in the chosen database.
  • column_name must refer to a column in the table table_name.

The ANALYZE command has the following forms:

ANALYZE TABLE table_name;

Collect column statistics on all columns in table_name if it is not autostats-enabled, and collect histograms over all previously designated columns. Invalidate all plans on this table which were compiled with significantly different statistics than the newly observed statistics.

ANALYZE TABLE table_name COLUMNS column_name [, ...] ENABLE;

Designate the specified columns to have histograms collected, in addition to any previously designated columns. Then collects statistics and invalidates plans the same way as ANALYZE TABLE table_name.

ANALYZE TABLE table_name COLUMNS ALL ENABLE;

Designate all columns to have histograms collected, in addition to any previously designated columns. If the table contains columns which do not support histograms, a warning will be raised for each such column. Then collects statistics and invalidates plans the same way as ANALYZE TABLE table_name.

ANALYZE TABLE table_name COLUMNS column_name [, ...] DISABLE;

Delete histograms on specified columns if present, and designate the specified columns to not have histograms collected in the future. Does not collect any statistics.

ANALYZE TABLE table_name COLUMNS ALL DISABLE;

Delete histograms on all columns of the table if present, and designate all columns to not have histograms collected in the future. If the table contains columns which do not support histograms, a warning will be raised for each such column. Does not collect any statistics.

ANALYZE TABLE table_name DROP;

Delete all stored non-automatic statistics for a table, and disables histograms on any columns for which they have been enabled. Invalidates all plans on the table. Does not delete automatic statistics.

Exporting and importing statistics

It is also possible to export and import statistics. This can be useful for testing purposes - for example, you can copy statistics from one MemSQL cluster to another, in order to emulate the optimizer’s behavior on a test system that does not have the same data.

ANALYZE TABLE table_name INTO OUTFILE '/path/file.json';

Save column and histograms recorded on table_name into the file /path/file.json in a JSON representation. You can use absolute or relative paths; relative paths are rooted in the data directory of the MemSQL instance.

Note that this exports the current statistics; it does not collect new statistics. It uses whatever set of statistics would currently be used by for query optimization - for column statistics on autostats-enabled tables, this would be the automatic statistics if they are enabled and up-to-date, and otherwise the statistics last saved by ANALYZE.

ANALYZE TABLE table_name INFILE '/path/file.json';

Import range and column statistics for table_name from the file /path/file.json. You can use absolute or relative paths; relative paths are rooted in the data directory of the MemSQL instance.

This command is disallowed if autostats is enabled on the table. If you wish to use imported column statistics, you should disable autostats and then import the statistics. If you wish to use imported histograms only on an autostats-enabled table, you can temporarily disable autostats, run ANALYZE ... INFILE to import the statistics, and re-enable autostats.

Configuration parameters export and import

In addition, you can export and import configuration parameters used by the optimizer.

ANALYZE OPTIMIZER_STATE INTO OUTFILE 'file.json';
ANALYZE OPTIMIZER_STATE INFILE 'file.json';

Plan invalidation

When data statistics change, an old query plan optimized based on the old data statistics may no longer be the best for the new data statistics. It is often desirable to re-optimize the query plan after the data statistics change significantly, as a new plan may be significantly better than the old plan. Running ANALYZE triggers re-optimization of any plans based on statistics for that table that are significantly different than the current statistics, by invalidating each such plan and recompiling the corresponding query the next time it is run.

Plans are not invalidated unless you trigger invalidation by running ANALYZE. Even if automatic statistics change enough that a plan meets the threshold for plan invalidation eligibility, it will not be invalidated until you run ANALYZE. This is to prevent unexpected recompilation of plans when data changes - you must explicitly trigger recompilation by running ANALYZE.

MemSQL uses changes in the table rowcount as a heuristic for significant changes in statistics: a plan is eligible for invalidation if there is a table rowcount differing by a factor of 2 between the statistics it was compiled with and the current statistics, i.e. if an old rowcount (from the statistics at the time the query was compiled) is less than 50% or greater than 200% of the current rowcount. When you run ANALYZE, all plans eligible for invalidation (i.e. those where the current statistics and compile-time statistics differ in any table rowcount by at least 2x) are invalidated, i.e. they will be reoptimized and recompiled the next time they are run.

In addition, if autostats on a table were outdated when the plan was compiled, the plan will be eligible for invalidation after autostats become up-to-date (i.e. after autostats become up-to-date and you run ANALYZE, it will be invalidated).

Automatic statistics

MemSQL automatically collects and updates column statistics on columnstore tables, incrementally in the background. This greatly reduces the operational burden around maintaining these statistics with ANALYZE, ensures they are available without requiring user action, and keeps them up-to-date as table data changes with less performance impact.

Automatic statistics (abbreviated autostats) provides automatic maintenance of column statistics. Statistics are collected and incrementally updated in the background as the table is modified with DML statements such as INSERT, UPDATE, DELETE, LOAD DATA, and TRUNCATE, as well as ALTER TABLE.

In MemSQL 6.0 and later, column statistics on columnstore tables are automatic by default. Automatic statistics collection is not currently supported for rowstore tables or for histograms. In versions prior to 6.0, all statistics are non-automatic.

While statistics are updated automatically, to re-optimize query plans after statistics change, you must trigger plan invalidation by running ANALYZE - see the Plan invalidation section.

Running ANALYZE also saves a copy of the table’s current automatic statistics. If you disable automatic statistics or they become out-of-date (see below), the query optimizer will use the last set of statistics saved by ANALYZE instead.

Autostats support upon upgrading MemSQL

For MemSQL clusters upgraded from 5.8 and earlier to 6.0 and later, autostats will remain disabled by default on existing columnstore tables created prior to the upgrade, while autostats will be enabled by default on new columnstore tables created after the upgrade. To enable autostats on existing columnstore tables created prior to the upgrade, which we recommend in most cases, run ALTER TABLE table_name ENABLE AUTOSTATS on each table. As discussed in Disabling and enabling autostats, this will begin the process of collecting automatic statistics in the background.

Info

Below is a simple script you can use to enable autostats on all tables (written in bash to minimize dependencies).

Save this script in a file, which we will call enable-autostats.sh:

#!/bin/bash

# Enables autostats on all columnstore tables which currently have autostats disabled
#
# Run this bash script with the mysql client arguments you use to connect to the MemSQL master aggregator, like:
# bash enable-autostats.sh -h<host> -P<port> -u<user> -p<password>
#
# Use the --dry-run option to only print the list of tables which would be altered, without actually running the ALTER commands.

set -e

client_args=(mysql --batch --skip-column-names)
dry_run=false

while [[ $# -gt 0 ]]
do
    case "$1" in
        --dry-run)
            dry_run=true
            ;;
        *)
            client_args+=("$1")
            ;;
    esac
    shift
done

# List of table names (fully qualified and quoted)
tables=$(${client_args[@]} -e "select concat(table_schema, '.\`', replace(table_name, '\`', '\`\`'), '\`') from information_schema.tables where table_type = 'BASE TABLE'")

echo "$tables" | while read -r table
do
    show_create_table=$(${client_args[@]} -e "show create table $table")
    if echo "$show_create_table" | grep -q "AUTOSTATS_ENABLED=FALSE"
    then
        echo "ALTER TABLE $table ENABLE AUTOSTATS"
        if [[ "$dry_run" == false ]]
        then
            ${client_args[@]} -e "ALTER TABLE $table ENABLE AUTOSTATS"
        fi
    fi
done

Run this bash script with the mysql client arguments you use to connect to the MemSQL master aggregator, like:

bash enable-autostats.sh -h<host> -P<port> -u<user> -p<password>

Disabling and enabling autostats

You can disable or enable autostats on each table. If you disable autostats on a table, statistics collection for that table becomes non-automatic, i.e. column statistics will be collected with the ANALYZE command. If autostats is re-enabled, automatic statistics collection will resume (see the next section for details). We do not recommend disabling autostats in most situations.

You can disable autostats on an existing table by running

ALTER TABLE table_name DISABLE AUTOSTATS

and enable autostats by running

ALTER TABLE table_name ENABLE AUTOSTATS

You can choose whether autostats are enabled or disabled when creating a table by adding the option

AUTOSTATS_ENABLED=<TRUE|FALSE>

at the end of the CREATE TABLE statement, such as:

CREATE TABLE t (a int, b int, key (a) using clustered columnstore) AUTOSTATS_ENABLED=FALSE

Additionally, you can change whether autostats is enabled or disabled by default on future CREATE TABLE statements by setting the global variable default_autostats_enabled to TRUE or FALSE on the master aggregator. The default is TRUE.

You can see whether autostats is enabled or disabled on a table by running SHOW CREATE TABLE.

Possible reasons to disable autostats include:

  • Automatic statistics requires a small memory overhead per column to maintain the statistics. If you have a large number of columns across all the columnstore tables in your database, this memory overhead may be significant.
  • Updating automatic statistics requires a small performance cost on writes to a table. You may wish to consider disabling autostats if you find this write performance cost to be significant on your workload, can achieve better overall performance with non-automatic statistics, and are willing to handle the operational requirements of managing non-automatic statistics.
  • If you want to “freeze” the statistics, so that the query optimizer continues to use a fixed set of statistics regardless of how the data is changing, you can do so by disabling automatic statistics and using ANALYZE to manage statistics.

Keep in mind that if you disable autostats, there are several potential downsides:

  • Autostats collects statistics automatically by default, and incrementally updates statistics as the table data changes, but if you disable autostats, you will need to run ANALYZE to collect statistics, and you will have to update statistics as necessary by running ANALYZE.
  • If the statistics become stale, the query optimizer may choose bad query plans with much worse execution performance.
  • Collecting non-automatic statistics with ANALYZE can be very costly for large tables. ANALYZE re-collects all statistics from scratch, which can be a very expensive operation on large tables, and can impact other workloads on the database. With autostats, the cost of updating statistics is amortized across writes, and there is no unnecessary work to re-collect statistics on parts of the table that aren’t changing.
  • By maintaining statistics incrementally, autostats can keep statistics much more up-to-date compared to batch statistics collection with ANALYZE.

Statistics population after enabling autostats

Immediately after enabling autostats on a table where it was previously disabled, MemSQL will start collecting automatic statistics in the background until it catches up with the current table data. During this time, the query optimizer will use the previous non-automatic statistics if available, until the autostats are sufficiently up-to-date - see the next section for details.

This background statistics collection uses limited resources to avoid large impacts on the performance of other workloads. Therefore, collecting automatic statistics may take a long time for large datasets, but this only affects the first time statistics are populated after autostats is enabled on a table where it was previously disabled.

Out-of-date autostats

If there is too large a difference between the data which MemSQL has gathered statistics on and the actual table data, such as immediately after autostats is enabled, autostats will be considered out-of-date. As soon as the difference becomes small enough, autostats will be considered up-to-date. While autostats are out-of-date, the query optimizer uses the statistics saved the last time ANALYZE was run on the table, if any. EXPLAIN will show a warning that autostats is out of date.

After statistics have been updated, you may run ANALYZE TABLE to trigger plan invalidation and re-optimization of any query plans that were based on stale statistics - see the Plan invalidation section.

Types of statistics

Column statistics

Column statistics are collected on every column in a table when the table is processed by automatic statistics or the ANALYZE command. They include the number of distinct values in each column, the number of NULLs in each column, and the number of rows in the table.

You can view the column statistics last collected by ANALYZE by querying the information_schema.OPTIMIZER_STATISTICS table. Note that this table shows the statistics when ANALYZE was last run on this table - for tables with automatic statistics, this table will reflect the statistics at the last time you ran ANALYZE, which means this table may show older statistics than the current automatic statistics, and it will not show any statistics if you have never run ANALYZE.

memsql> DESCRIBE INFORMATION_SCHEMA.OPTIMIZER_STATISTICS;
+----------------------+--------------+------+------+---------+-------+
| Field                | Type         | Null | Key  | Default | Extra |
+----------------------+--------------+------+------+---------+-------+
| DATABASE_NAME        | varchar(512) | NO   |      | NULL    |       |
| TABLE_NAME           | varchar(512) | NO   |      | NULL    |       |
| COLUMN_NAME          | varchar(512) | NO   |      | NULL    |       |
| ROW_COUNT            | bigint(21)   | NO   |      | NULL    |       |
| MIN_VALUE_SERIALIZED | varchar(255) | YES  |      | NULL    |       |
| MAX_VALUE_SERIALIZED | varchar(255) | YES  |      | NULL    |       |
| NULL_COUNT           | bigint(21)   | NO   |      | NULL    |       |
| CARDINALITY          | bigint(21)   | NO   |      | NULL    |       |
| DENSITY              | double       | NO   |      | NULL    |       |
| ADVANCED_HISTOGRAMS  | bigint(21)   | NO   |      | NULL    |       |
| LEGACY_HISTOGRAMS    | bigint(21)   | NO   |      | NULL    |       |
| RANGE_STATS          | bigint(21)   | NO   |      | NULL    |       |
| SAMPLE_SIZE          | bigint(21)   | NO   |      | NULL    |       |
+----------------------+--------------+------+------+---------+-------+
13 rows in set (0.01 sec)

The RANGE_STATS column indicates whether any kind of histogram (advanced or legacy) is stored for this column. The LEGACY_HISTOGRAMS and ADVANCED_HISTOGRAMS columns indicate presence of legacy and new histograms, respectively. Some of the columns including MIN_VALUE_SERIALIZED, MAX_VALUE_SERIALIZED, and DENSITY are currently unused.

Histograms

The ANALYZE command can additionally collect histograms over designated columns. When histograms are enabled for a column, ANALYZE will sample the rows in that column and create a histogram from the samples, recording the endpoints of each bucket and statistical information about the distribution of values within each bucket.

When collected, histograms can be queried through the information_schema.ADVANCED_HISTOGRAMS table:

memsql> DESCRIBE INFORMATION_SCHEMA.ADVANCED_HISTOGRAMS;
+-------------------------+---------------+------+------+---------+-------+
| Field                   | Type          | Null | Key  | Default | Extra |
+-------------------------+---------------+------+------+---------+-------+
| DATABASE_NAME           | varchar(512)  | NO   |      | NULL    |       |
| TABLE_NAME              | varchar(512)  | NO   |      | NULL    |       |
| COLUMN_NAME             | varchar(512)  | NO   |      | NULL    |       |
| TYPE                    | varchar(255)  | NO   |      | NULL    |       |
| VERSION                 | bigint(21)    | NO   |      | NULL    |       |
| BUCKET_COUNT            | bigint(21)    | NO   |      | NULL    |       |
| BUCKET_INDEX            | bigint(21)    | NO   |      | NULL    |       |
| RANGE_MIN               | varchar(8192) | YES  |      | NULL    |       |
| RANGE_MAX               | varchar(8192) | YES  |      | NULL    |       |
| UNIQUE_COUNT            | double        | YES  |      | NULL    |       |
| CARDINALITY             | double        | YES  |      | NULL    |       |
| UNIQUE_COUNT_CUMULATIVE | double        | YES  |      | NULL    |       |
| CARDINALITY_CUMULATIVE  | double        | YES  |      | NULL    |       |
+-------------------------+---------------+------+------+---------+-------+
13 rows in set (0.00 sec)

The table will display, in human-readable form, the end points of the histogram buckets along with total and unique count for each bucket. Bucket index -1 corresponds to the number of nulls tracked by the histogram.

If you are using legacy histograms (using cardinality_estimation_level set to 6.0), information about the histograms is shown in information_schema.RANGE_STATISTICS.

Missing statistics warnings

When you run EXPLAIN on a query, MemSQL will show information about any missing statistics that the optimizer could take advantage of, and the ANALYZE commands that you can run to collect those statistics.

The missing statistics listed here are statistics that are not present that the query optimizer would be able to use if collected. As discussed above in the ANALYZE command section, collecting column statistics is strongly recommended, and collecting histograms is recommended in most situations.

For example, for the following query, the EXPLAIN output shows that the optimizer is missing column statistics on both tables t and t2 as well as histograms on t.x.

memsql> CREATE TABLE t (id BIGINT PRIMARY KEY, x DOUBLE);
Query OK, 0 rows affected (0.17 sec)

memsql> CREATE TABLE t2 (id BIGINT PRIMARY KEY, x DOUBLE);
Query OK, 0 rows affected (0.17 sec)

memsql> EXPLAIN SELECT * FROM t, t2 WHERE t.id = t2.id AND t.x > 1;
+---------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                     |
+---------------------------------------------------------------------------------------------+
| WARNING: Missing statistics on some tables, consider running the following commands:        |
|     ANALYZE TABLE db.`t2`;                                                                  |
|     ANALYZE TABLE db.`t` COLUMNS `x` ENABLE;                                                |
| See http://docs.memsql.com/v5.0/docs/analyze for more information on statistics collection. |
|                                                                                             |
| Project [t.id, t.id2, t.x, t2.id, t2.x]                                                     |
| Gather partitions:all                                                                       |
| Project [t.id, t.id2, t.x, t2.id, t2.x]                                                     |
| NestedLoopJoin                                                                              |
| |---IndexSeek db.t2, PRIMARY KEY (id) scan:[id = t.id]                                      |
| Filter [t.x > 1]                                                                            |
| TableScan db.t, PRIMARY KEY (id)                                                            |
+---------------------------------------------------------------------------------------------+
12 rows in set (0.14 sec)

Statistics lifetime

Statistics are tied to a base table, and are persisted through ALTER TABLE. Statistics are also persisted through server restarts. Statistics are deleted upon running DROP TABLE or ANALYZE TABLE t DROP.