Outdated Version

You are viewing an older version of this section. View current production version.

Statistics and ANALYZE

MemSQL collects statistics on table data to facilitate accurate query optimization. The query optimizer uses these statistics to help choose the best execution plan for each query.

Some statistics are collected and updated automatically in the background, while other statistics are collected by running the ANALYZE command. The ANALYZE command also is used to trigger plan re-optimization when data statistics change.

MemSQL uses two types of statistics (see the Types of Statistics section for more detail):

  • Column statistics, which includes information on the cardinality (number of distinct values) of a column.
  • Range statistics, i.e. histograms, which provide information on the distribution of data in a column.

Collecting column statistics is strongly recommended for optimal query performance. Collecting range statistics is recommended in most situations - see the Range Statistics vs. Sampling section.

Automatic statistics (abbreviated autostats) are collected and updated automatically in the background. Non-automatic statistics are collected by running the ANALYZE command. In MemSQL 6.0 and later, column statistics on columnstore tables are automatic, while column statistics on rowstore tables and all range statistics are currently non-automatic. In MemSQL 5.x and earlier, all statistics are non-automatic.

The ANALYZE command and non-automatic statistics

In MemSQL 6.0 and later, non-automatic statistics include column statistics on rowstore tables and all range statistics (on both rowstore and columnstore tables). In versions prior to 6.0, all statistics are non-automatic.

Non-automatic statistics for a table are collected and updated by running the ANALYZE command:

ANALYZE TABLE table_name;

This also triggers plan invalidation and re-optimization when statistics change, as described in the Plan Invalidation section (note that this applies to both automatic and non-automatic statistics).

ANALYZE collects column statistics on every column (unless the table is using automatic statistics), as well as range statistics on chosen columns. You may designate which columns to collect range statistics on with the ANALYZE TABLE table_name COLUMNS ... {ENABLE | DISABLE} commands, as described below.

Non-automatic statistics are not updated when the table is modified through DML statements, including INSERT, UPDATE, DELETE, LOAD DATA, and TRUNCATE. They are only collected and updated when the ANALYZE command is called, which completely overwrites any previous statistics collected on the table.

Warning
  • ANALYZE may invalidate plans, i.e. cause them to be reoptimized and recompiled the next time they are run, when data statistics change and a new plan could be faster. See the Plan Invalidation section for more details.

  • Collecting non-automatic statistics with ANALYZE can be expensive for large tables, especially when collecting range statistics. Because non-automatic statistics are collected only when you run ANALYZE, you can choose when to run it to best manage the performance impact.

  • ANALYZE does not block concurrent queries and can be safely used while running concurrent queries. Any query compilations will use the present statistics until the ANALYZE completes, at which point query compilations atomically switch to using the new statistics.

Notes

All forms of ANALYZE are subject to the following restrictions:

  • ANALYZE must be run on the master aggregator node.
  • ANALYZE requires ALTER and SELECT permissions on the target table.
  • As usual, you may specify USING db_name prior to ANALYZE, as in [USING db_name] ANALYZE ..., where db_name is a MemSQL database. The context database of the connection is used if the USING clause is omitted.
  • table_name must be the name of a table in the chosen database.
  • column_name must refer to a column in the table table_name.

The ANALYZE command has the following forms:

ANALYZE TABLE table_name;

Collect column statistics on all columns in table_name if it is not autostats-enabled, and collect range statistics over all previously designated columns. Invalidate all plans on this table which were compiled with significantly different statistics than the newly observed statistics.

ANALYZE TABLE table_name COLUMNS column_name [, ...] ENABLE;

Designate the specified columns to have range statistics collected, in addition to any previously designated columns. Then collects statistics and invalidates plans the same way as ANALYZE TABLE table_name.

ANALYZE TABLE table_name COLUMNS column_name [, ...] DISABLE;

Delete range statistics on specified columns if present, and designate the specified columns to not have range statistics collected in the future. Does not collect any statistics.

ANALYZE TABLE table_name DROP;

Delete all stored non-automatic statistics for a table, and disables histograms on any columns for which they have been enabled. Invalidates all plans on the table. Does not delete automatic statistics.

Exporting and importing statistics

It is also possible to export and import statistics. This can be useful for testing purposes - for example, you can copy statistics from one MemSQL cluster to another, in order to emulate the optimizer’s behavior on a test system that does not have the same data.

ANALYZE TABLE table_name INTO OUTFILE '/path/file.json';

Save column and range statistics recorded on table_name into the file /path/file.json in a JSON representation. You can use absolute or relative paths; relative paths are rooted in the data directory of the MemSQL instance.

Note that this exports the current statistics; it does not collect new statistics. It uses whatever set of statistics would currently be used by for query optimization - for column statistics on autostats-enabled tables, this would be the automatic statistics if they are enabled and up-to-date, and otherwise the statistics last saved by ANALYZE.

ANALYZE TABLE table_name INFILE '/path/file.json';

Import range and column statistics for table_name from the file /path/file.json. You can use absolute or relative paths; relative paths are rooted in the data directory of the MemSQL instance.

This command is disallowed if autostats is enabled on the table. If you wish to use imported column statistics, you should disable autostats and then import the statistics. If you wish to use imported range statistics only on an autostats-enabled table, you can temporarily disable autostats, run ANALYZE ... INFILE to import the statistics, and re-enable autostats.

Configuration parameters export and import

In addition, you can export and import configuration parameters used by the optimizer.

ANALYZE OPTIMIZER_STATE INTO OUTFILE 'file.json';
ANALYZE OPTIMIZER_STATE INFILE 'file.json';

Plan Invalidation

When data statistics change, an old query plan optimized based on the old data statistics may no longer be the best for the new data statistics. It is often desirable to re-optimize the query plan after the data statistics change significantly, as a new plan may be significantly better than the old plan. Running ANALYZE triggers re-optimization of any plans based on statistics for that table that are significantly different than the current statistics, by invalidating each such plan and recompiling the corresponding query the next time it is run.

Plans are not invalidated unless you trigger invalidation by running ANALYZE. Even if automatic statistics change enough that a plan meets the threshold for plan invalidation eligibility, it will not be invalidated until you run ANALYZE. This is to prevent unexpected recompilation of plans when data changes - you must explicitly trigger recompilation by running ANALYZE.

MemSQL uses changes in the table rowcount as a heuristic for significant changes in statistics: a plan is eligible for invalidation if there is a table rowcount differing by a factor of 2 between the statistics it was compiled with and the current statistics, i.e. if an old rowcount (from the statistics at the time the query was compiled) is less than 50% or greater than 200% of the current rowcount. When you run ANALYZE, all plans eligible for invalidation (i.e. those where the current statistics and compile-time statistics differ in any table rowcount by at least 2x) are invalidated, i.e. they will be reoptimized and recompiled the next time they are run.

In addition, if autostats on a table were outdated when the plan was compiled, the plan will be eligible for invalidation after autostats become up-to-date (i.e. after autostats become up-to-date and you run ANALYZE, it will be invalidated).

Automatic statistics

Automatic statistics (abbreviated autostats) provides automatic maintenance of column statistics. Statistics are collected and incrementally updated in the background as the table is modified with DML statements such as INSERT, UPDATE, DELETE, LOAD DATA, and TRUNCATE, as well as ALTER TABLE.

In MemSQL 6.0 and later, column statistics on columnstore tables are automatic by default. Automatic statistics collection is not currently supported for rowstore tables or for range statistics. In versions prior to 6.0, all statistics are non-automatic.

While statistics are updated automatically, to re-optimize query plans after statistics change, you must trigger plan invalidation by running ANALYZE - see the Plan Invalidation section.

Running ANALYZE also saves a copy of the table’s current automatic statistics. If you disable automatic statistics or they become out-of-date (see below), the query optimizer will use the last set of statistics saved by ANALYZE instead.

Autostats support upon upgrading MemSQL

For MemSQL clusters upgraded from 5.8 and earlier to 6.0 and later, autostats will remain disabled by default on existing columnstore tables created prior to the upgrade, while autostats will be enabled by default on new columnstore tables created after the upgrade. To enable autostats on existing columnstore tables created prior to the upgrade, which we recommend in most cases, run ALTER TABLE table_name ENABLE AUTOSTATS on each table. As discussed in Disabling and enabling autostats, this will begin the process of collecting automatic statistics in the background.

Info

Below is a simple script you can use to enable autostats on all tables (written in bash to minimize dependencies).

Save this script in a file, which we will call enable-autostats.sh:

#!/bin/bash

# Enables autostats on all columnstore tables which currently have autostats disabled
#
# Run this bash script with the mysql client arguments you use to connect to the MemSQL master aggregator, like:
# bash enable-autostats.sh -h<host> -P<port> -u<user> -p<password>
#
# Use the --dry-run option to only print the list of tables which would be altered, without actually running the ALTER commands.

set -e

client_args=(mysql --batch --skip-column-names)
dry_run=false

while [[ $# -gt 0 ]]
do
    case "$1" in
        --dry-run)
            dry_run=true
            ;;
        *)
            client_args+=("$1")
            ;;
    esac
    shift
done

# List of table names (fully qualified and quoted)
tables=$(${client_args[@]} -e "select concat(table_schema, '.\`', replace(table_name, '\`', '\`\`'), '\`') from information_schema.tables where table_type = 'BASE TABLE'")

echo "$tables" | while read -r table
do
    show_create_table=$(${client_args[@]} -e "show create table $table")
    if echo "$show_create_table" | grep -q "AUTOSTATS_ENABLED=FALSE"
    then
        echo "ALTER TABLE $table ENABLE AUTOSTATS"
        if [[ "$dry_run" == false ]]
        then
            ${client_args[@]} -e "ALTER TABLE $table ENABLE AUTOSTATS"
        fi
    fi
done

Run this bash script with the mysql client arguments you use to connect to the MemSQL master aggregator, like:

bash enable-autostats.sh -h<host> -P<port> -u<user> -p<password>

Disabling and enabling autostats

You can disable or enable autostats on each table. If you disable autostats on a table, statistics collection for that table becomes non-automatic, i.e. column statistics will be collected with the ANALYZE command. If autostats is re-enabled, automatic statistics collection will resume (see the next section for details). We do not recommend disabling autostats in most situations.

You can disable autostats on an existing table by running

ALTER TABLE table_name DISABLE AUTOSTATS

and enable autostats by running

ALTER TABLE table_name ENABLE AUTOSTATS

You can choose whether autostats are enabled or disabled when creating a table by adding the option

AUTOSTATS_ENABLED=<TRUE|FALSE>

at the end of the CREATE TABLE statement, such as:

CREATE TABLE t (a int, b int, key (a) using clustered columnstore) AUTOSTATS_ENABLED=FALSE

Additionally, you can change whether autostats is enabled or disabled by default on future CREATE TABLE statements by setting the global variable default_autostats_enabled to TRUE or FALSE on the master aggregator. The default is TRUE.

You can see whether autostats is enabled or disabled on a table by running SHOW CREATE TABLE.

Possible reasons to disable autostats include:

  • Automatic statistics requires a small memory overhead per column to maintain the statistics. If you have a large number of columns across all the columnstore tables in your database, this memory overhead may be significant.
  • Updating automatic statistics requires a small performance cost on writes to a table. You may wish to consider disabling autostats if you find this write performance cost to be significant on your workload, can achieve better overall performance with non-automatic statistics, and are willing to handle the operational requirements of managing non-automatic statistics.
  • If you want to “freeze” the statistics, so that the query optimizer continues to use a fixed set of statistics regardless of how the data is changing, you can do so by disabling automatic statistics and using ANALYZE to manage statistics.

Keep in mind that if you disable autostats, there are several potential downsides:

  • If the statistics become stale, the query optimizer may choose bad query plans with much worse execution performance.
  • You will have to update statistics as necessary by running ANALYZE.
  • Collecting non-automatic statistics with ANALYZE can be very costly for large tables.

Statistics population after enabling autostats

Immediately after enabling autostats on a table where it was previously disabled, MemSQL will start collecting automatic statistics in the background until it catches up with the current table data. During this time, the query optimizer will use the previous non-automatic statistics if available, until the autostats are sufficiently up-to-date - see the next section for details.

This background statistics collection uses limited resources to avoid large impacts on the performance of other workloads. Therefore, collecting automatic statistics may take a long time for large datasets, but this only affects the first time statistics are populated after autostats is enabled on a table where it was previously disabled.

Out-of-date autostats

If there is too large a difference between the data which MemSQL has gathered statistics on and the actual table data, such as immediately after autostats is enabled, autostats will be considered out-of-date. As soon as the difference becomes small enough, autostats will be considered up-to-date. While autostats are out-of-date, the query optimizer uses the statistics saved the last time ANALYZE was run on the table, if any. EXPLAIN will show a warning that autostats is out of date.

After statistics have been updated, you may run ANALYZE TABLE to trigger plan invalidation and re-optimization of any query plans that were based on stale statistics - see the Plan Invalidation section.

Types of Statistics

Column Statistics

Column statistics are collected on every column in a table when the table is processed by automatic statistics or the ANALYZE command. They include the number of distinct values in each column, the number of NULLs in each column, and the number of rows in the table.

You can view the column statistics last collected by ANALYZE by querying the INFORMATION_SCHEMA.OPTIMIZER_STATISTICS table. Note that this table shows the statistics when ANALYZE was last run on this table - for tables with automatic statistics, this table will reflect the statistics at the last time you ran ANALYZE, which means this table may show older statistics than the current automatic statistics, and it will not show any statistics if you have never run ANALYZE.

memsql> describe INFORMATION_SCHEMA.OPTIMIZER_STATISTICS;
+----------------------+--------------+------+------+---------+-------+
| Field                | Type         | Null | Key  | Default | Extra |
+----------------------+--------------+------+------+---------+-------+
| DATABASE_NAME        | varchar(512) | NO   |      | NULL    |       |
| TABLE_NAME           | varchar(512) | NO   |      | NULL    |       |
| COLUMN_NAME          | varchar(512) | NO   |      | NULL    |       |
| ROW_COUNT            | bigint(21)   | NO   |      | NULL    |       |
| MIN_VALUE_SERIALIZED | varchar(255) | YES  |      | NULL    |       |
| MAX_VALUE_SERIALIZED | varchar(255) | YES  |      | NULL    |       |
| NULL_COUNT           | bigint(21)   | NO   |      | NULL    |       |
| CARDINALITY          | bigint(21)   | NO   |      | NULL    |       |
| DENSITY              | double       | NO   |      | NULL    |       |
| RANGE_STATS          | bigint(21)   | NO   |      | NULL    |       |
| SAMPLE_SIZE          | bigint(21)   | NO   |      | NULL    |       |
+----------------------+--------------+------+------+---------+-------+
11 rows in set (0.00 sec)

The RANGE_STATS column indicates whether range statistics are stored for this column. Some of the columns including MIN_VALUE_SERIALIZED, MAX_VALUE_SERIALIZED, and DENSITY are currently unused.

Range Statistics

The ANALYZE command can additionally collect histograms over designated columns. When histograms are enabled for a column, ANALYZE will sample the rows in that column and create a histogram from the samples, recording the endpoints of each bucket and statistical information about the distribution of values within each bucket.

When collected, range histograms can be queried through the INFORMATION_SCHEMA.RANGE_STATISTICS table:

memsql> describe INFORMATION_SCHEMA.RANGE_STATISTICS
+-------------------+--------------+------+------+---------+-------+
| Field             | Type         | Null | Key  | Default | Extra |
+-------------------+--------------+------+------+---------+-------+
| DATABASE_NAME     | varchar(512) | NO   |      | NULL    |       |
| TABLE_NAME        | varchar(512) | NO   |      | NULL    |       |
| COLUMN_NAME       | varchar(512) | NO   |      | NULL    |       |
| BUCKET_ID         | bigint(21)   | NO   |      | NULL    |       |
| MIN_VALUE         | varchar(255) | NO   |      | NULL    |       |
| MAX_VALUE         | varchar(255) | YES  |      | NULL    |       |
| TO_LEFT           | bigint(21)   | NO   |      | NULL    |       |
| TO_RIGHT          | bigint(21)   | NO   |      | NULL    |       |
| DISTINCT_INTERIOR | bigint(21)   | NO   |      | NULL    |       |
| EXCLUSIVE_DENSITY | double       | NO   |      | NULL    |       |
+-------------------+--------------+------+------+---------+-------+
10 rows in set (0.00 sec)

The values for the endpoints are stored in a serialized form (an order-preserving mapping from their original type to lexicographically ordered bytestrings), and displayed in the columns MIN_VALUE and MAX_VALUE in a BASE64-encoded representation.

Missing Statistics Warnings

When you run EXPLAIN on a query, MemSQL will show information about any missing statistics that the optimizer could take advantage of, and the ANALYZE commands that you can run to collect those statistics.

This is especially useful to identify which columns MemSQL could use range statistics on, because identifying those manually could be difficult and error-prone.

For example, for the following query, the EXPLAIN output shows that the optimizer is missing column statistics on both tables t and t2 as well as range statistics on t.x.

memsql> create table t (id bigint primary key, x double);
Query OK, 0 rows affected (0.17 sec)

memsql> create table t2 (id bigint primary key, x double);
Query OK, 0 rows affected (0.17 sec)

memsql> explain select * from t, t2 where t.id = t2.id and t.x > 1;
+---------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                     |
+---------------------------------------------------------------------------------------------+
| WARNING: Missing statistics on some tables, consider running the following commands:        |
|     ANALYZE TABLE db.`t2`;                                                                  |
|     ANALYZE TABLE db.`t` COLUMNS `x` ENABLE;                                                |
| See http://docs.memsql.com/v5.0/docs/analyze for more information on statistics collection. |
|                                                                                             |
| Project [t.id, t.id2, t.x, t2.id, t2.x]                                                     |
| Gather partitions:all                                                                       |
| Project [t.id, t.id2, t.x, t2.id, t2.x]                                                     |
| NestedLoopJoin                                                                              |
| |---IndexSeek db.t2, PRIMARY KEY (id) scan:[id = t.id]                                      |
| Filter [t.x > 1]                                                                            |
| TableScan db.t, PRIMARY KEY (id)                                                            |
+---------------------------------------------------------------------------------------------+
12 rows in set (0.14 sec)

Range Statistics vs. Sampling

Column statistics are used to estimate group-by and join cardinalities and are almost always beneficial to collect. Range statistics (histograms) are used to estimate filter selectivity when available. If range statistics are unavailable, the optimizer will use dynamic sampling: it will sample the table data by scanning part of the table.

Range statistics can make query optimization much faster and more accurate:

  • Range statistics are more accurate for individual predicates (e.g. x > 1000), since they are collected over all partitions, whereas sampling sends queries to only one partition, potentially leading to very skewed estimates.
  • Range statistics are much faster to use than sampling during query optimization, since the optimizer merely needs to read the previously collected statistics. In contrast, sampling requires scanning a large amount of table data.

However, in some cases sampling can yield more accurate results:

  • Range statistics can only estimate predicates of the form <column> =/<>/>/</>=/<= <constant> combined via AND and OR. When range statistics are present for all such simple predicates, predicates involving more complex expressions (like LIKE and IN) are estimated via heuristics which are oblivious to the actual distribution of data in the table.
  • Since histograms are one-dimensional, they cannot capture correlation between columns: Suppose column x is uniformly distributed between 0 and 1, and column y = x + 1. If histograms are collected on both x and y, the optimizer will estimate the selectivity of x > 0.5 AND y > 1.5 as P[x > 0.5] * P[y > 1.5] = 0.25. If histograms are absent for x or y, the optimizer will run the query SELECT COUNT(*) FROM t WHERE x > 0.5 AND y > 0.5 and discover that the selectivity is actually 0.5.

In summary, range statistics are inaccurate in the presence of correlated predicates, whereas sampling is inaccurate in the presence of data skew across partitions.

Range statistics are more expensive to collect than column statistics and are only useful on columns that are involved in single-table equality or range predicates. In addition, the storage cost of range statistics is proportional to the size of the values in the column they’re collected over, unlike column statistics which take constant memory to store per column.

Statistics lifetime

Statistics are tied to a base table, and are persisted through ALTER TABLE. Statistics are also persisted through server restarts. Statistics are deleted upon running DROP TABLE or ANALYZE TABLE t DROP.

New automatic statistics features in MemSQL 6.0

MemSQL 6.0 introduces automatic statistics: MemSQL automatically collects and updates column statistics on columnstore tables, incrementally in the background. This greatly reduces the operational burden around maintaining these statistics with ANALYZE, ensures they are available without requiring user action, and keeps them up-to-date as table data changes with less performance impact.

For column statistics on columnstore tables:

  • Statistics are now collected automatically by default. Previously, if you never ran ANALYZE, no statistics would be available.
  • Statistics are automatically and incrementally updated as the table data changes.
    • Previously, ANALYZE re-collected all statistics from scratch, which can be a very expensive operation on large tables, and can impact other workloads on the database. Now the cost of updating statistics is amortized across writes, and there is no unnecessary work to re-collect statistics on parts of the table that aren’t changing.
    • By maintaining statistics incrementally, MemSQL can keep statistics much more up-to-date than with a batch statistics collection approach.

Automatic statistics are not available for range statistics and statistics on rowstore tables - those statistics are still non-automatic as in versions 5.x, and you still use ANALYZE to collect them. You also still use ANALYZE to trigger plan invalidation.

Autostats is enabled by default on tables created in MemSQL 6.0 and later. However, after upgrading an existing MemSQL cluster from 5.8 and earlier to 6.0 and later, autostats will remain disabled by default on existing tables created prior to the upgrade. To enable autostats on existing columnstore tables created prior to the upgrade, which we recommend in most cases, run ALTER TABLE table_name ENABLE AUTOSTATS on each table. See the Autostats support upon upgrading MemSQL section.

For more details, see the Automatic statistics section.