Outdated Version

You are viewing an older version of this section. View current production version.

Statistics and ANALYZE

MemSQL collects statistics on table data to facilitate accurate query optimization. The query optimizer uses these statistics to help choose the best execution plan for each query.

Some statistics are collected and updated automatically in the background, while other statistics are collected by running the ANALYZE command. The ANALYZE command also is used to trigger plan re-optimization when data statistics change.

MemSQL uses two types of statistics (see the Types of statistics section for more detail):

  • Column statistics, which includes information on the cardinality (number of distinct values) of a column.
  • Histograms (also known as range statistics), which provide information on the distribution of data in a column.

Collecting column statistics is strongly recommended for optimal query performance. Collecting histograms is recommended in most situations. See the ANALYZE command section for more details.

Automatic statistics (abbreviated autostats) are collected and updated automatically in the background. Non-automatic statistics are collected by running the ANALYZE command.

The ANALYZE Command

Non-automatic statistics for a table are collected and updated by running the ANALYZE command:

ANALYZE TABLE table_name;

In MemSQL 6.0 and later, non-automatic statistics include column statistics on rowstore tables and all histograms (on both rowstore and columnstore tables). In versions prior to 6.0, all statistics are non-automatic.

ANALYZE also triggers plan invalidation and re-optimization when statistics change, as described in the Plan invalidation section (note that this applies to both automatic and non-automatic statistics).

ANALYZE collects column statistics on every column (unless the table is using automatic statistics), as well as histograms on chosen columns. You may designate which columns to collect histograms on with the ANALYZE TABLE table_name COLUMNS ... {ENABLE | DISABLE} commands, as described below.

Collecting column statistics is strongly recommended for optimal query performance. Collecting histograms is recommended in most situations. See the Histograms vs. sampling section for more details.

Non-automatic statistics are not updated when the table is modified through DML statements, including INSERT, UPDATE, DELETE, LOAD DATA, and TRUNCATE. They are only collected and updated when the ANALYZE command is called, which completely overwrites any previous statistics collected on the table.

Warning
  • ANALYZE may invalidate plans, i.e. cause them to be reoptimized and recompiled the next time they are run, when data statistics change and a new plan could be faster. See the Plan invalidation section for more details.

  • Collecting non-automatic statistics with ANALYZE can be expensive for large tables, especially when collecting histograms. Because non-automatic statistics are collected only when you run ANALYZE, you can choose when to run it to best manage the performance impact.

  • ANALYZE does not block concurrent queries and can be safely used while running concurrent queries. Any query compilations will use the present statistics until the ANALYZE completes, at which point query compilations atomically switch to using the new statistics.

Notes

All forms of ANALYZE are subject to the following restrictions:

  • ANALYZE must be run on the master aggregator node.
  • ANALYZE requires ALTER and SELECT permissions on the target table.
  • table_name must be the name of a table in the chosen database.
  • column_name must refer to a column in the table table_name.

The ANALYZE command has the following forms:

ANALYZE TABLE table_name;

Collect column statistics on all columns in table_name if it is not autostats-enabled, and collect histograms over all previously designated columns. Invalidate all plans on this table which were compiled with significantly different statistics than the newly observed statistics.

ANALYZE TABLE table_name COLUMNS column_name [, ...] ENABLE;

Designate the specified columns to have histograms collected, in addition to any previously designated columns. Then collects statistics and invalidates plans the same way as ANALYZE TABLE table_name.

ANALYZE TABLE table_name COLUMNS ALL ENABLE;

Designate all columns to have histograms collected, in addition to any previously designated columns. If the table contains columns which do not support histograms, a warning will be raised for each such column. Collects statistics and invalidates plans the same way as ANALYZE TABLE table_name.

Because this command creates a histogram on every column of the table, and collects histogram information for each new histogram, please note the following:

  • On very wide tables, this can be expensive to collect, increasing the execution time of the ANALYZE command.

  • The histograms are stored in memory. Make sure to plan for the extra memory enabling those histograms will consume.

  • Histograms are not always useful to have. Query the information_schema.mv_prospective_histograms table to check if a column has been identified from your workload, where having a histogram on that column would be beneficial to your queries.

  • In some cases sampling stats are better than histograms, see Histograms vs. Sampling for more information.

Instead, on clusters on v7.0+ we recommend using the default behavior: AUTOSTATS_HISTOGRAM_MODE=CREATE

With AUTOSTATS_HISTOGRAM_MODE set to CREATE, if a column is identified at least twice in information_schema.mv_prospective_histograms that having a histogram on it could be beneficial, then the histogram will be automatically created, and updated at that time. Like any histograms you create, it is updated when you run ANALYZE, and also as part of the periodic autostats (same interval: 10 minutes, and actually collected as part of the same process).

ANALYZE TABLE table_name COLUMNS column_name [, ...] DISABLE;

Delete histograms on specified columns if present, and designate the specified columns to not have histograms collected in the future. Does not collect any statistics.

ANALYZE TABLE table_name COLUMNS ALL DISABLE;

Delete histograms on all columns of the table if present, and designate all columns to not have histograms collected in the future. If the table contains columns which do not support histograms, a warning will be raised for each such column. Does not collect any statistics.

ANALYZE TABLE table_name DROP;

Delete all stored non-automatic statistics for a table, and disables histograms on any columns for which they have been enabled. Invalidates all plans on the table which were compiled using statistics. Does not delete automatic statistics.

Exporting and Importing Statistics

It is also possible to export and import statistics. This can be useful for testing purposes - for example, you can copy statistics from one MemSQL cluster to another, in order to emulate the optimizer’s behavior on a test system that does not have the same data.

ANALYZE TABLE table_name INTO OUTFILE '/path/file.json';

Save column and histograms recorded on table_name into the file /path/file.json in a JSON representation. You can use absolute or relative paths; relative paths are rooted in the data directory of the MemSQL instance.

Note that this exports the current statistics; it does not collect new statistics. It uses whatever set of statistics would currently be used by for query optimization - for column statistics on autostats-enabled tables, this would be the automatic statistics if they are enabled and up-to-date, and otherwise the statistics last saved by ANALYZE.

ANALYZE TABLE table_name INFILE '/path/file.json';

Import range and column statistics for table_name from the file /path/file.json. You can use absolute or relative paths; relative paths are rooted in the data directory of the MemSQL instance.

This command is disallowed if autostats is enabled on the table. If you wish to use imported column statistics, you should disable autostats and then import the statistics. If you wish to use imported histograms only on an autostats-enabled table, you can temporarily disable autostats, run ANALYZE ... INFILE to import the statistics, and re-enable autostats.

Configuration Parameters Export and Import

You can also export and import configuration parameters (including session and global variables) used by the optimizer via a .json file. Session variables are exported from or imported into the current session, while global variables are saved and set globally.

ANALYZE OPTIMIZER_STATE INTO OUTFILE '</path/file.json>';
ANALYZE OPTIMIZER_STATE INFILE '</path/file.json>';

INTO OUTFILE stores the configuration parameters into the specified file.

INFILE specifies a file from which to import configuration parameters.

Note: Using INFILE to set the configuration will update global variables. This can impact other MemSQL sessions.

Plan Invalidation

When data statistics change, an old query plan optimized based on the old data statistics may no longer be the best for the new data statistics. It is often desirable to re-optimize the query plan after the data statistics change significantly, as a new plan may be significantly better than the old plan. Running ANALYZE triggers re-optimization of any plans based on statistics for that table that are significantly different than the current statistics, by invalidating each such plan and recompiling the corresponding query the next time it is run.

Plans can be invalidated manually by running the ANALYZE command. To automatically invalidate the plan as autostats update, set the enable_background_plan_invalidation variable to ON. By default, this variable is set to OFF to prevent unexpected recompilation of plans when data changes.

MemSQL uses changes in the table rowcount as a heuristic for significant changes in statistics: a plan is eligible for invalidation if there is a table rowcount differing by a factor of 2 between the statistics it was compiled with and the current statistics, i.e. if an old rowcount (from the statistics at the time the query was compiled) is less than 50% or greater than 200% of the current rowcount. When you run ANALYZE, all plans eligible for invalidation (i.e. those where the current statistics and compile-time statistics differ in any table rowcount by at least 2x) are invalidated, i.e. they will be reoptimized and recompiled the next time they are run.

In addition, if autostats on a table were outdated when the plan was compiled, the plan will be eligible for invalidation after autostats become up-to-date (i.e. after autostats become up-to-date and you run ANALYZE, the plan will be invalidated).

Automatic Statistics

MemSQL automatically collects and updates column statistics on columnstore tables, incrementally in the background. This greatly reduces the operational burden around maintaining these statistics with ANALYZE, ensures they are available without requiring user action, and keeps them up-to-date as table data changes with less performance impact.

Automatic statistics (abbreviated autostats) provides automatic maintenance of column statistics. Statistics are collected and incrementally updated in the background as the table is modified with DML statements such as INSERT, UPDATE, DELETE, LOAD DATA, and TRUNCATE, as well as ALTER TABLE.

While statistics are updated automatically, to re-optimize query plans after statistics change, you must trigger plan invalidation by running ANALYZE - see the Plan invalidation section.

Running ANALYZE also saves a copy of the table’s current automatic statistics. If you disable automatic statistics or they become out-of-date (see below), the query optimizer will use the last set of statistics saved by ANALYZE instead.

Autostats in MemSQL Versions

In versions prior to 6.0, all statistics are non-automatic. In MemSQL 6.0 and later, column statistics on columnstore tables are automatic by default. In MemSQL 7.0 and later, all statistics are automatic by default on all tables.

Incremental Autostats

Incremental autostats are available in MemSQL 6.0 and later and are supported only on columnstore tables. The incremental autostats gathering process updates the column statistics of a table incrementally as the table’s data changes. In MemSQL 6.0 and later, incremental autostats are enabled by default on all columnstore tables.

Enhancements to Autostats in MemSQL 7.0

MemSQL 7.0 contains enhancements to column statistics, histograms, and sampling:

Periodic Autostats for Column Statistics

Periodic automatic statistics gathering (periodic autostats) is available in MemSQL 7.0 and later and is supported on both columnstore and rowstore tables. Periodic autostats collects column statistics on a table over a fixed period, but only after the table’s data has changed significantly. To collect these statistics, the MemSQL engine automatically runs ANALYZE on the table, in the background. As ANALYZE can be an expensive operation, total resource usage by periodic autostats is limited according to the system_optimizer_pool variable. In MemSQL 7.0 and later, periodic autostats collection is enabled by default on all rowstore tables.

Incremental autostats are also available in MemSQL 7.0, as they were in previous versions. Since periodic autostats collect the same column statistics as do incremental autostats, only one can be enabled on a single table at a time.

Histogram Enhancements

Starting in MemSQL 7.0, histograms are created and updated automatically on all tables by default. When a table is first created, there are no histograms on any column. When queries are run against the table, the query optimizer queues histogram creation on columns where those histograms would be beneficial for optimizing the queries. As the table’s data changes, the created histograms are kept up to date as described in Periodic Autostats.

Uniform Random Sampling

Uniform random sampling for rowstore tables has been available in MemSQL since version 6.7. Starting in MemSQL 7.0, uniform random sampling is available for columnstore tables. By default, the optimizer uses this sampling on all tables. For rowstore tables, this sampling is implicit and no overhead is required to maintain it. For columnstore tables, MemSQL incrementally maintains the sample and keeps it up to date as the table’s data changes.

Autostats Support upon Upgrading MemSQL

Any table that is upgraded to a newer version of MemSQL retains its autostats settings. If the previous MemSQL version did not have autostats support, then the upgraded tables will have autostats disabled upon upgrade. Also, if a table had autostats enabled prior to upgrade, the same autostats will be enabled on the table post upgrade. Autostats will be enabled by default for any new table created in MemSQL version 7.0. To enable autostats on existing columnstore tables created prior to the upgrade, which we recommend in most cases, run ALTER TABLE table_name ENABLE AUTOSTATS on each table. As discussed in Disabling and enabling autostats, this will begin the process of collecting automatic statistics in the background.

Below is a simple script you can use to enable autostats on all tables (written in bash to minimize dependencies).

Save this script in a file called enable-autostats.sh:

#!/bin/bash

# Enables autostats on all columnstore tables which currently have autostats disabled
#
# Run this bash script with the mysql client arguments you use to connect to the MemSQL master aggregator, like:
# bash enable-autostats.sh -h<host> -P<port> -u<user> -p<password>
#
# Use the --dry-run option to only print the list of tables which would be altered, without actually running the ALTER commands.

set -e

client_args=(mysql --batch --skip-column-names)
dry_run=false

while [[ $# -gt 0 ]]
do
    case "$1" in
        --dry-run)
            dry_run=true
            ;;
        *)
            client_args+=("$1")
            ;;
    esac
    shift
done

# List of table names (fully qualified and quoted)
tables=$(${client_args[@]} -e "select concat(table_schema, '.\`', replace(table_name, '\`', '\`\`'), '\`') from information_schema.tables where table_type = 'BASE TABLE'")

echo "$tables" | while read -r table
do
    show_create_table=$(${client_args[@]} -e "show create table $table")
    alter_stmt=""
    if echo "$show_create_table" | grep -q "AUTOSTATS_CARDINALITY_MODE=OFF"
    then
        if echo "$show_create_table" | grep -q "COLUMNSTORE"
        then
            alter_stmt="${alter_stmt}AUTOSTATS_CARDINALITY_MODE=INCREMENTAL"
        else
            alter_stmt="${alter_stmt}AUTOSTATS_CARDINALITY_MODE=PERIODIC"
        fi
    fi
    if echo "$show_create_table" | grep -q "AUTOSTATS_HISTOGRAM_MODE=OFF"
    then
        if [[ "$alter_stmt" != "" ]]
        then
            alter_stmt="${alter_stmt},"
        fi
        alter_stmt="${alter_stmt}AUTOSTATS_HISTOGRAM_MODE=CREATE"
    fi
    if echo "$show_create_table" | grep -q "AUTOSTATS_SAMPLING=OFF"
    then
        if echo "$show_create_table" | grep -q "COLUMNSTORE"
        then
            if [[ "$alter_stmt" != "" ]]
            then
                alter_stmt="${alter_stmt},"
            fi
        fi
        alter_stmt="${alter_stmt}AUTOSTATS_SAMPLING=ON"
    fi
    if [[ "$alter_stmt" != "" ]]
    then
        alter_stmt="ALTER TABLE $table $alter_stmt"
        echo $alter_stmt
        ${client_args[@]} -e "$alter_stmt"
    fi
done

Run this bash script with the mysql client arguments you use to connect to the MemSQL master aggregator, like:

bash enable-autostats.sh -h<host> -P<port> -u<user> -p<password>

Disabling and Enabling Autostats

You can disable or enable autostats on each table. If you disable autostats on a table, statistics collection for that table becomes non-automatic, i.e. column statistics will be collected with the ANALYZE command. If autostats is re-enabled, automatic statistics collection will resume (see the next section for details). We do not recommend disabling autostats in most situations.

You can disable autostats on an existing table by running

ALTER TABLE table_name DISABLE AUTOSTATS

or

ALTER TABLE table_name AUTOSTATS_ENABLED = OFF

and enable autostats by running

ALTER TABLE table_name ENABLE AUTOSTATS

or

ALTER TABLE table_name AUTOSTATS_ENABLED = ON

You can choose whether autostats are enabled or disabled when creating a table by adding the option

AUTOSTATS_ENABLED=<TRUE|FALSE>

at the end of the CREATE TABLE statement, such as:

CREATE TABLE t (a int, b int, key (a) using clustered columnstore) AUTOSTATS_ENABLED=FALSE

When AUTOSTATS_ENABLED is set to TRUE, autostats categories are set to the following default settings:

  • For rowstore tables,

    AUTOSTATS_CARDINALITY_MODE = PERIODIC
    AUTOSTATS_HISTOGRAM_MODE = CREATE
    
  • For columnstore tables,

    AUTOSTATS_CARDINALITY_MODE = INCREMENTAL
    AUTOSTATS_HISTOGRAM_MODE = CREATE
    AUTOSTATS_SAMPLING = ON
    

MemSQL allows you to independently control how each category of statistics is automatically gathered.

AUTOSTATS_CARDINALITY_MODE = [INCREMENTAL|PERIODIC|OFF]
AUTOSTATS_HISTOGRAM_MODE = [CREATE|UPDATE|OFF]
AUTOSTATS_SAMPLING = [ON|OFF]

These autostats settings can also be specified or modified through the ALTER TABLE and CREATE TABLE commands. Multiple options can be combined in a single CREATE TABLE statement. For example,

ALTER TABLE table_name AUTOSTATS_CARDINALITY_MODE = PERIODIC;
CREATE TABLE t1 (a INT, b INT, KEY(a)) USING CLUSTERED COLUMNSTORE
AUTOSTATS_CARDINALITY_MODE = INCREMENTAL,
AUTOSTATS_HISTOGRAM_MODE = CREATE,
AUTOSTATS_SAMPLING= ON;
Warning

Each type of autostats setting may not work for all types of tables. For example, you cannot have incremental autostats on a rowstore table.

Additionally, you can change whether autostats is enabled or disabled by default on future CREATE TABLE statements by setting the global variable default_autostats_enabled to TRUE or FALSE on the master aggregator. The default is TRUE.

You can also modify the default settings for autostats, individually, by setting the following global variables:

default_autostats_columnstore_cardinality_mode = [INCREMENTAL|PERIODIC|OFF]
default_autostats_rowstore_cardinality_mode = [PERIODIC|OFF]
default_autostats_histogram_mode = [CREATE|UPDATE|OFF]
default_autostats_columnstore_sampling = [ON|OFF]
Info

Limitations associated with using autostats:

  • If a table has autostats enabled (AUTOSTATS_ENABLED = true), modifying autostats settings for the table will result in an error.
  • Enabling AUTOSTATS_HISTOGRAM_MODE (either mode besides OFF) requires enabling AUTOSTATS_CARDINALITY_MODE (either mode besides OFF).
  • INCREMENTAL cardinality mode is only supported on columnstore tables.
  • AUTOSTATS_SAMPLING is not allowed for rowstore tables. On columnstore tables, it controls whether the columnstore sample is built and used for estimation.

You can see whether autostats is enabled or disabled on a table by running SHOW CREATE TABLE.

Possible reasons to disable autostats include:

  • Incremental autostats require a small memory overhead per column to maintain the statistics. If you have a large number of columns across all the columnstore tables in your database, this memory overhead may be significant. In this case you can modify your table to use periodic autostats, as the memory overhead is significantly lower and it will still keep the statistics up to date.
  • In CREATE mode, automatic histograms may end up creating histograms on unwanted columns. As each histogram has a small memory overhead, you may choose to switch the histogram mode to UPDATE so that all existing histograms will be kept up to date, but no new histograms will be automatically created. You can still create any histograms manually using the ANALYZE command.
  • Histograms cannot be deleted in CREATE mode, as the same histogram may be created automatically in the future. To choose between specific columns for creating histograms, a table must be in UPDATE mode.
  • Incremental autostats require a small performance cost on writes to a table. You may wish to consider disabling autostats if you find this write performance cost to be significant on your workload, can achieve better overall performance with non-automatic statistics, and are willing to handle the operational requirements of managing non-automatic statistics.
  • Periodic autostats uses a small fixed amount of resources in the background. If the resources used negatively affect your workload, you may find it beneficial to disable autostats.
  • If you want to “freeze” the statistics, so that the query optimizer continues to use a fixed set of statistics regardless of how the data is changing, you can do so by disabling automatic statistics and using ANALYZE to manage statistics.

Keep in mind that if you disable autostats, there are several potential downsides:

  • In general, PERIODIC autostats run slower and use more CPU than INCREMENTAL autostats. If you do not have a problem with the memory overhead of INCREMENTAL autostats and are using a columnstore table, you should most likely use INCREMENTAL autostats.
  • Autostats collects statistics automatically by default, and updates statistics as the table data changes, but if you disable autostats, you will need to run ANALYZE to collect statistics, and you will have to update statistics as necessary by running ANALYZE.
  • If the statistics become stale, the query optimizer may choose bad query plans with much worse execution performance.
  • Collecting non-automatic statistics with ANALYZE can be very costly for large tables. ANALYZE re-collects all statistics from scratch, which can be a very expensive operation on large tables, and can impact other workloads on the database. With autostats, the cost of updating statistics is amortized across writes, and there is no unnecessary work to re-collect statistics on parts of the table that aren’t changing.
  • By maintaining statistics incrementally, autostats can keep statistics much more up-to-date compared to batch statistics collection with ANALYZE.

Statistics Population after Enabling Autostats

Immediately after enabling autostats on a table where it was previously disabled, MemSQL will start collecting automatic statistics in the background until it catches up with the current table data. During this time, the query optimizer will use the previous non-automatic statistics if available, until the autostats are sufficiently up-to-date - see the next section for details.

This background statistics collection uses limited resources to avoid large impacts on the performance of other workloads. Therefore, collecting automatic statistics may take a long time for large datasets, but this only affects the first time statistics are populated after autostats is enabled on a table where it was previously disabled.

Out-of-date Autostats

If there is too large a difference between the data which MemSQL has gathered statistics on and the actual table data, such as immediately after autostats is enabled, autostats will be considered out-of-date. As soon as the difference becomes small enough, autostats will be considered up-to-date. While autostats are out-of-date, the query optimizer uses the statistics saved the last time ANALYZE was run on the table, if any. EXPLAIN will show a warning that autostats is out of date.

You can accelerate the collection of autostats by running OPTIMIZE TABLE:

OPTIMIZE TABLE <table_name> FLUSH;
OPTIMIZE TABLE <table_name>;

After statistics have been updated, you may run ANALYZE TABLE to trigger plan invalidation and re-optimization of any query plans that were based on stale statistics - see the Plan invalidation section.

Types of Statistics

Column Statistics

Column statistics are collected on every column in a table when the table is processed by autostats or the ANALYZE command. These statistics include the number of distinct values in each column, the number of NULLs in each column, and the number of rows in the table.

You can view the column statistics last collected by querying the information_schema.OPTIMIZER_STATISTICS table.

DESCRIBE INFORMATION_SCHEMA.OPTIMIZER_STATISTICS;
****
+-----------------------+--------------+------+------+---------+-------+
| Field                 | Type         | Null | Key  | Default | Extra |
+-----------------------+--------------+------+------+---------+-------+
| DATABASE_NAME         | varchar(512) | NO   |      | NULL    |       |
| TABLE_NAME            | varchar(512) | NO   |      | NULL    |       |
| COLUMN_NAME           | varchar(512) | NO   |      | NULL    |       |
| ROW_COUNT             | bigint(21)   | YES  |      | NULL    |       |
| NULL_COUNT            | bigint(21)   | YES  |      | NULL    |       |
| CARDINALITY           | bigint(21)   | YES  |      | NULL    |       |
| AUTOSTATS_ENABLED     | tinyint(1)   | YES  |      | NULL    |       |
| AUTOSTATS_OUTDATED    | tinyint(1)   | YES  |      | NULL    |       |
| AUTOSTATS_ROW_COUNT   | bigint(21)   | YES  |      | NULL    |       |
| AUTOSTATS_NULL_COUNT  | bigint(21)   | YES  |      | NULL    |       |
| AUTOSTATS_CARDINALITY | bigint(21)   | YES  |      | NULL    |       |
| ADVANCED_HISTOGRAMS   | bigint(21)   | YES  |      | NULL    |       |
| LEGACY_HISTOGRAMS     | bigint(21)   | YES  |      | NULL    |       |
| RANGE_STATS           | bigint(21)   | YES  |      | NULL    |       |
| SAMPLE_SIZE           | bigint(21)   | YES  |      | NULL    |       |
| LAST_UPDATED          | datetime     | YES  |      | NULL    |       |
+-----------------------+--------------+------+------+---------+-------+

AUTOSTATS_ENABLED indicates if cardinality autostats are enabled.

The LAST_UPDATED column shows the time non-incremental statistics were saved. This occurs when ANALYZE runs, or on periodic autostats collection.

Info

information_schema.OPTIMIZER_STATISTICS.LAST_UPDATED is not accurate for columnstore tables using incremental autostats. This is because autostats are collected upon ingest and maintained per segment, meaning that they are not stored like other statistics collected by ANALYZE.

The best way to check that incremental autostats are up to date is to run EXPLAIN on the table in question. If no error is surfaced, autostats are up to date. See Missing Statistics for more information.

The RANGE_STATS column indicates whether any kind of histogram (advanced or legacy) is stored for this column. The LEGACY_HISTOGRAMS and ADVANCED_HISTOGRAMS columns indicate presence of legacy and new histograms, respectively.

AUTOSTATS_ROW_COUNT, AUTOSTATS_NULL_COUNT, and AUTOSTATS_CARDINALITY are deprecated. The columns remain in the table, but their value is always NULL.

Columns in information_schema.OPTIMIZER_STATISTICS will contain NULL values if autostats haven’t yet been collected, or autostats is disabled and ANALYZE has not yet been run.

Histograms

The ANALYZE command can additionally collect histograms over designated columns. When histograms are enabled for a column, ANALYZE will sample the rows in that column and create a histogram from the samples, recording the endpoints of each bucket and statistical information about the distribution of values within each bucket.

When collected, histograms can be queried through the information_schema.ADVANCED_HISTOGRAMS table:

DESCRIBE INFORMATION_SCHEMA.ADVANCED_HISTOGRAMS;
****
+-------------------------+---------------+------+------+---------+-------+
| Field                   | Type          | Null | Key  | Default | Extra |
+-------------------------+---------------+------+------+---------+-------+
| DATABASE_NAME           | varchar(512)  | NO   |      | NULL    |       |
| TABLE_NAME              | varchar(512)  | NO   |      | NULL    |       |
| COLUMN_NAME             | varchar(512)  | NO   |      | NULL    |       |
| TYPE                    | varchar(255)  | NO   |      | NULL    |       |
| VERSION                 | bigint(21)    | NO   |      | NULL    |       |
| BUCKET_COUNT            | bigint(21)    | NO   |      | NULL    |       |
| BUCKET_INDEX            | bigint(21)    | NO   |      | NULL    |       |
| RANGE_MIN               | varchar(8192) | YES  |      | NULL    |       |
| RANGE_MAX               | varchar(8192) | YES  |      | NULL    |       |
| UNIQUE_COUNT            | double        | YES  |      | NULL    |       |
| CARDINALITY             | double        | YES  |      | NULL    |       |
| UNIQUE_COUNT_CUMULATIVE | double        | YES  |      | NULL    |       |
| CARDINALITY_CUMULATIVE  | double        | YES  |      | NULL    |       |
+-------------------------+---------------+------+------+---------+-------+

The table will display, in human-readable form, the end points of the histogram buckets along with total and unique count for each bucket. Bucket index -1 corresponds to the number of nulls tracked by the histogram.

If you want to find a list of the columns that have histograms, you can use the following query:

SELECT DISTINCT DATABASE_NAME, TABLE_NAME, COLUMN_NAME
FROM information_schema.ADVANCED_HISTOGRAMS;

If you are using legacy histograms (using cardinality_estimation_level set to 6.0), information about the histograms is shown in information_schema.RANGE_STATISTICS.

Histograms vs. Sampling

Collecting column statistics is strongly recommended for optimal query performance and collecting histograms is recommended in most situations. If histograms are unavailable, the optimizer will use dynamic sampling (i.e. sample the table data by scanning part of the table) to compute estimates.

Histograms are much faster to use than sampling during query optimization, since the optimizer merely needs to read the previously collected statistics, while sampling may require scanning a large amount of table data.

Whether histogram-based estimates or sampling-based estimates are more accurate depends on the table and query. For example, histograms are most effective on columns where the overall distribution of data is not changing quickly (even if the data itself is changing quickly). When the data distribution is changing, it is important to periodically update statistics to reflect changes in data by re-running ANALYZE TABLE.

Conversely, histograms are least effective on columns where the data distribution is quickly and continuously changing. For example, a typical case where histograms are generally inaccurate and not recommended is on columns where the range of data is continuously growing, such as a timestamp or ID column where new, ever-increasing values are continuously being added. Histograms only reflect the range of data at the time the histograms were last updated, so new values outside that range are not reflected in the histograms. This leads to inaccurate estimates for queries that filter for this recent data. For these types of columns, leaving histograms disabled, which means sampling is used instead, typically leads to the best results.

Also, since histograms are one-dimensional, they cannot capture correlation between columns. Sampling can estimate predicates that feature a high degree of correlation between columns more accurately than histograms.

Finally, sampling may be inaccurate on predicates that are highly selective, or that have a high degree of skew between partitions. Histograms are typically more accurate on these predicates.

Identifying Missing Statistics

Missing Statistics Warnings in EXPLAIN

When you run EXPLAIN on a query, MemSQL will show information about any missing statistics that the optimizer could take advantage of, and the ANALYZE commands that you can run to collect those statistics.

The missing statistics listed here are statistics that are not present that the query optimizer would be able to use if collected. As discussed above, collecting column statistics is strongly recommended, and collecting histograms is recommended in most situations. See the Histograms vs. sampling section for more details.

For example, for the following query, the EXPLAIN output shows that the optimizer is missing column statistics on both tables t and t2 as well as histograms on t.x.

CREATE TABLE t (id BIGINT PRIMARY KEY, x DOUBLE);

CREATE TABLE t2 (id BIGINT PRIMARY KEY, x DOUBLE);

EXPLAIN SELECT * FROM t, t2 WHERE t.id = t2.id AND t.x > 1;
****
+---------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                     |
+---------------------------------------------------------------------------------------------+
| WARNING: Missing statistics on some tables, consider running the following commands:        |
|     ANALYZE TABLE db.`t2`;                                                                  |
|     ANALYZE TABLE db.`t` COLUMNS `x` ENABLE;                                                |
| See http://docs.memsql.com/v5.0/docs/analyze for more information on statistics collection. |
|                                                                                             |
| Project [t.id, t.id2, t.x, t2.id, t2.x]                                                     |
| Gather partitions:all                                                                       |
| Project [t.id, t.id2, t.x, t2.id, t2.x]                                                     |
| NestedLoopJoin                                                                              |
| |---IndexSeek db.t2, PRIMARY KEY (id) scan:[id = t.id]                                      |
| Filter [t.x > 1]                                                                            |
| TableScan db.t, PRIMARY KEY (id)                                                            |
+---------------------------------------------------------------------------------------------+

Missing Histograms in the Plancache

In addition, you can view information about any recently run queries which were missing histograms, which you can use to help identify which columns to collect histograms on.

The view information_schema.MV_PROSPECTIVE_HISTOGRAMS shows columns which did not have histograms and were used in queries where a histogram could have been used. This view counts over all queries in the in-memory plancache (see Understanding the Plancache), i.e. those queries shown in information_schema.PLANCACHE and information_schema.MV_QUERIES.

SELECT * FROM information_schema.MV_PROSPECTIVE_HISTOGRAMS;
+---------------+------------+-------------+-------------+
| DATABASE_NAME | TABLE_NAME | COLUMN_NAME | USAGE_COUNT |
+---------------+------------+-------------+-------------+
| db            | t          | c           |           3 |
| db            | t          | b           |           3 |
| db            | t          | a           |           3 |
+---------------+------------+-------------+-------------+

The view information_schema.MV_QUERY_PROSPECTIVE_HISTOGRAMS shows precisely which queries in the plancache could have made use of the histograms. You may join this to views such as MV_QUERIES and MV_ACTIVITIES to learn more about the queries.

SELECT * FROM information_schema.MV_QUERY_PROSPECTIVE_HISTOGRAMS;
+-----------------------------------+---------------+------------+-------------+-------------+
| ACTIVITY_NAME                     | DATABASE_NAME | TABLE_NAME | COLUMN_NAME | USAGE_COUNT |
+-----------------------------------+---------------+------------+-------------+-------------+
| Select_symbols_t_349f04a075ae9982 | db            | t          | a           |           1 |
| Select_symbols_t_bf93eac63fe6a2e8 | db            | t          | a           |           1 |
| Select_symbols_t_bf93eac63fe6a2e8 | db            | t          | c           |           1 |
| Select_symbols_t_056d53b2265b58ce | db            | t          | a           |           1 |
| Select_symbols_t_4222228538392d61 | db            | t          | b           |           1 |
| Select_symbols_t_ecb971f633697055 | db            | t          | b           |           1 |
| Select_symbols_t_ecb971f633697055 | db            | t          | c           |           1 |
| Select_symbols_t_d0692ff8d63f22e8 | db            | t          | b           |           1 |
| Select_symbols_t_d0692ff8d63f22e8 | db            | t          | c           |           1 |
+-----------------------------------+---------------+------------+-------------+-------------+

Both of these views display columns which were missing histograms at the time the queries were compiled. If you later collect histograms, the views are not updated unless/until the query is recompiled.

Statistics Lifetime

Statistics are tied to a base table, and are persisted through ALTER TABLE. Statistics are also persisted through server restarts. Statistics are deleted upon running DROP TABLE or ANALYZE TABLE t DROP.

Periodic Statistics Collection

You can configure settings for periodic statistics collection by setting the engine variables background_statistics_collection_interval and background_statistics_collection_threshold.

The following scenario demonstrates the periodic statistics collection process for autostats-enabled tables.

  1. The user creates a new database DB1 and table DB1.T1 with autostats-enabled and the following default settings:
  • background_statistics_collection_interval = 60 (trigger collection every 60 seconds)
  • background_statistics_collection_threshold = 0.5 (trigger collection when the ratio of changes to the number of rows hits 0.5)
  1. The user inserts 5 rows into the DB1.T1 table. The mv_row_change_counts table reports a total of 5 changes across the partitions for DB1.
  2. Within 60 seconds, the background thread wakes up and scans the change counts and tables. The row count is saved as 5, and after the stats collection, the last_analyzed_total_changes field in the mv_collected_row_change_counts table is set to the total_changes field, both of which will sum to 5 across partitions.
  3. The user inserts 3 new rows. The row change counts now sum to 8.
  4. Next time the background thread wakes, it will see 3 changes and observe the last collected row count (5). As 3/5 exceeds the 0.5 ratio, the collection is triggered again.