Outdated Version

You are viewing an older version of this section. View current production version.

Statistics, Sampling and ANALYZE

Introduction

In order to create good query plans, the query optimizer needs information about the size and distribution of table data. The optimizer uses this information to estimate the number of rows produced by different parts of a query plan as it searches to find the best overall plan.

By default, SingleStore DB automatically gathers statistics to give the optimizer the information it needs. The system that gathers statistics automatically is known as autostats. The term autostats is also used to describe statistics that are gathered automatically. See the section on autostats for details about how they are gathered.

It’s almost always best to leave autostats enabled. But if for some reason you don’t want to use autostats, you can disable it. If you turn off autostats, you should use the ANALYZE command to gather statistics manually. Most users can rely on autostats and will never need to run the ANALYZE command.

The ANALYZE command also can be used to trigger plan re-optimization when data statistics change, whether or not autostats is enabled. See the ANALYZE command section for more details.

SingleStore DB uses two types of statistics (see the Types of statistics section for more detail):

  • Column statistics, which include information on the cardinality (number of distinct values) of a column.
  • Histograms (also known as range statistics), which provide information on the distribution of data in a column.

Both types of statistics can be gathered by autostats, or manually with ANALYZE.

If statistics are not available or cannot be used to estimate the number of rows that part of a query plan will produce, the optimizer will sample the data to make an estimate. Because the optimizer can always fall back on sampling, it never needs to operate purely on guesswork when it needs statistical information to estimate the number of rows a part of a query plan will produce.

Sampling

Columnstore and rowstore tables both can be sampled by the query optimizer. Columnstore tables automatically have a random row-level sample built and maintained when autostats is enabled. This random sample is kept up to date in the background. The stored sample is sometimes called the columnstore sample. If a columnstore must be sampled during query optimization, this row-level sample is used.

Rowstore tables are sampled directly. An actual row-level random sample is collected by scanning the table structure and skipping a large fraction of the rows. This is an efficient process that takes time proportional to the sample size.

The ANALYZE Command

Statistics for a table can be manually collected and updated by running the ANALYZE command:

ANALYZE TABLE table_name;

Statistics gathered this way are called manual statistics or non-automatic statistics.

Manual statistics include column statistics on rowstore tables and all histograms (on both rowstore and columnstore tables).

If autostats are enabled for a table, ANALYZE for that table will not gather statistics and will return almost instantly. This is normal, because the autostats system already is maintaining the statistics so they don’t need to be gathered again.

ANALYZE also triggers plan invalidation and re-optimization when statistics change, as described in the Plan invalidation section. This is true whether autostats are enabled or not.

If autostats are disabled for a table, ANALYZE collects column statistics on every column, as well as histograms on chosen columns. You may designate which columns to collect histograms on with the ANALYZE TABLE table_name COLUMNS ... {ENABLE | DISABLE} commands, as described below.

Collecting column statistics is strongly recommended for optimal query performance. Collecting histograms is recommended in most situations. See the Histograms vs. sampling section for more details.

Manual statistics are not updated when the table is modified through DML statements, including INSERT, UPDATE, DELETE, LOAD DATA, and TRUNCATE. They are only collected and updated when the ANALYZE command is called, which completely overwrites any previous statistics collected on the table.

Warning
  • ANALYZE may invalidate plans, i.e. cause them to be reoptimized and recompiled the next time they are run, when data statistics change and a new plan could be faster. See the Plan invalidation section for more details.

  • Collecting statistics manually with ANALYZE can be expensive for large tables, especially when collecting histograms. When autostats are disabled, because statistics are collected only when you run ANALYZE, you can choose when to run it to best manage the performance impact.

  • ANALYZE does not block concurrent queries and can be safely used while running concurrent queries. Any query compilations will use the present statistics until the ANALYZE completes, at which point query compilations atomically switch to using the new statistics.

Notes

All forms of ANALYZE are subject to the following restrictions:

  • ANALYZE must be run on the master aggregator node.
  • ANALYZE requires ALTER and SELECT permissions on the target table.
  • table_name must be the name of a table in the chosen database.
  • column_name must refer to a column in the table table_name.

The ANALYZE command has the following forms:

ANALYZE TABLE table_name;
ANALYZE TABLE table_name COLUMNS column_name [, ...] ENABLE;
ANALYZE TABLE table_name COLUMNS ALL ENABLE;
ANALYZE TABLE table_name COLUMNS column_name [, ...] DISABLE;
ANALYZE TABLE table_name COLUMNS ALL DISABLE;
ANALYZE TABLE table_name DROP;

These are described in detail below.

ANALYZE TABLE table_name;

Collect column statistics on all columns in table_name if it is not autostats-enabled, and collect histograms over all previously designated columns. Invalidate all plans on this table which were compiled with significantly different statistics than the newly observed statistics.

ANALYZE TABLE table_name COLUMNS column_name [, ...] ENABLE;

Designate the specified columns to have histograms collected, in addition to any previously designated columns. Then collects statistics and invalidates plans the same way as ANALYZE TABLE table_name.

ANALYZE TABLE table_name COLUMNS ALL ENABLE;

Designate all columns to have histograms collected, in addition to any previously designated columns. If the table contains columns which do not support histograms, a warning will be raised for each such column. Collects statistics and invalidates plans the same way as ANALYZE TABLE table_name.

Because this command creates a histogram on every column of the table, and collects histogram information for each new histogram, please note the following:

  • On very wide tables, this can be expensive to collect, increasing the execution time of the ANALYZE command.

  • The histograms are stored in memory. Make sure to plan for the extra memory enabling those histograms will consume.

  • Histograms are not always useful to have. Query the information_schema.mv_prospective_histograms table to check if a column has been identified from your workload, where having a histogram on that column would be beneficial to your queries.

  • In some cases sampling stats are better than histograms, see Histograms vs. Sampling for more information.

Instead, on clusters on v7.0+ we recommend using the default behavior: AUTOSTATS_HISTOGRAM_MODE=CREATE

With AUTOSTATS_HISTOGRAM_MODE set to CREATE, if a column is identified at least twice in information_schema.mv_prospective_histograms that having a histogram on it could be beneficial, then the histogram will be automatically created, and updated at that time. Like any histograms you create, it is updated when you run ANALYZE, and also as part of the periodic autostats (same interval: 10 minutes, and actually collected as part of the same process).

ANALYZE TABLE table_name COLUMNS column_name [, ...] DISABLE;

Delete histograms on specified columns if present, and designate the specified columns to not have histograms collected in the future. Does not collect any statistics.

ANALYZE TABLE table_name COLUMNS ALL DISABLE;

Delete histograms on all columns of the table if present, and designate all columns to not have histograms collected in the future. If the table contains columns which do not support histograms, a warning will be raised for each such column. Does not collect any statistics.

ANALYZE TABLE table_name DROP;

Deletes all stored manual statistics for a table, and disables histograms on any columns for which they have been enabled. Invalidates all plans on the table which were compiled using statistics. Does not delete automatic statistics.

Exporting and Importing Statistics

It is also possible to export and import statistics. This can be useful for testing purposes - for example, you can copy statistics from one cluster to another, in order to emulate the optimizer’s behavior on a test system that does not have the same data.

ANALYZE TABLE table_name INTO OUTFILE '</path/file.json>';

Save column and histograms recorded on table_name into the file /path/file.json in a JSON representation. You can use absolute or relative paths; relative paths are rooted in the data directory of the SingleStore DB instance.

Note that this exports the current statistics; it does not collect new statistics. It uses whatever set of statistics would currently be used by for query optimization - for column statistics on autostats-enabled tables, this would be the automatic statistics if they are enabled and up-to-date, and otherwise the statistics last saved by ANALYZE.

ANALYZE TABLE table_name INFILE '/path/file.json';

Import range and column statistics for table_name from the file /path/file.json. You can use absolute or relative paths; relative paths are rooted in the data directory of the SingleStore DB instance.

This command is disallowed if autostats is enabled on the table. If you wish to use imported column statistics, you should disable autostats and then import the statistics. If you wish to use imported histograms only on an autostats-enabled table, you can temporarily disable autostats, run ANALYZE ... INFILE to import the statistics, and re-enable autostats.

Configuration Parameters Export and Import

You can also export and import configuration parameters (including session and global variables) used by the optimizer via a .json file. Session variables are exported from or imported into the current session, while global variables are saved and set globally.

ANALYZE OPTIMIZER_STATE INTO OUTFILE '</path/file.json>';
ANALYZE OPTIMIZER_STATE INFILE '</path/file.json>';

INTO OUTFILE stores the configuration parameters into the specified file.

INFILE specifies a file from which to import configuration parameters.

Note: Using INFILE to set the configuration will update global variables. This can impact other SingleStore DB sessions.

Plan Invalidation

When data statistics change, an old query plan optimized based on the old data statistics may no longer be the best for the new data statistics. It is often desirable to re-optimize the query plan after the data statistics change significantly, as a new plan may be significantly better than the old plan. Running ANALYZE triggers re-optimization of any plans based on statistics for that table that are significantly different than the current statistics, by invalidating each such plan and recompiling the corresponding query the next time it is run.

Plans can be invalidated manually by running the ANALYZE command. To automatically invalidate the plan as autostats update, set the enable_background_plan_invalidation variable to ON. By default, this variable is set to OFF to prevent unexpected recompilation of plans when data changes.

SingleStore DB uses changes in the table rowcount as a heuristic for significant changes in statistics: a plan is eligible for invalidation if there is a table rowcount differing by a factor of 2 between the statistics it was compiled with and the current statistics, i.e. if an old rowcount (from the statistics at the time the query was compiled) is less than 50% or greater than 200% of the current rowcount. When you run ANALYZE, all plans eligible for invalidation (i.e. those where the current statistics and compile-time statistics differ in any table rowcount by at least 2x) are invalidated, i.e. they will be reoptimized and recompiled the next time they are run.

In addition, if autostats on a table were outdated when the plan was compiled, the plan will be eligible for invalidation after autostats become up-to-date (i.e. after autostats become up-to-date and you run ANALYZE, the plan will be invalidated).

Automatic Statistics

SingleStore DB automatically collects and updates column statistics on columnstore tables, incrementally in the background. This greatly reduces the operational burden around maintaining these statistics with ANALYZE, ensures they are available without requiring user action, and keeps them up-to-date as table data changes with less performance impact.

Automatic statistics (abbreviated autostats) provides automatic maintenance of column statistics. Statistics are collected and incrementally updated in the background as the table is modified with DML statements such as INSERT, UPDATE, DELETE, LOAD DATA, and TRUNCATE, as well as ALTER TABLE.

While statistics are updated automatically, to re-optimize query plans after statistics change, you must trigger plan invalidation by running ANALYZE - see the Plan invalidation section.

Running ANALYZE also saves a copy of the table’s current automatic statistics. If you disable automatic statistics or they become out-of-date (see below), the query optimizer will use the last set of statistics saved by ANALYZE instead.

Incremental Autostats

Incremental autostats are supported only on columnstore tables. The incremental autostats gathering process updates the column statistics of a table incrementally as the table’s data changes. Incremental autostats are enabled by default on all columnstore tables.

Enhancements to Autostats in MemSQL 7.0

MemSQL 7.0 contains enhancements to column statistics, histograms, and sampling:

Periodic Autostats for Column Statistics

Periodic automatic statistics gathering (periodic autostats) is available in MemSQL 7.0 and later and is supported on both columnstore and rowstore tables. Periodic autostats collects column statistics on a table over a fixed period, but only after the table’s data has changed significantly. To collect these statistics, the SingleStore DB engine automatically runs ANALYZE on the table, in the background. As ANALYZE can be an expensive operation, total resource usage by periodic autostats is limited according to the system_optimizer_pool variable. In MemSQL 7.0 and later, periodic autostats collection is enabled by default on all rowstore tables

Incremental autostats are also available in MemSQL 7.0, as they were in previous versions. Since periodic autostats collect the same column statistics as do incremental autostats, only one can be enabled on a single table at a time.

Histogram Enhancements

Starting in MemSQL 7.0, histograms are created and updated automatically on all tables by default. When a table is first created, there are no histograms on any column. When queries are run against the table, the query optimizer queues histogram creation on columns where those histograms would be beneficial for optimizing the queries. As the table’s data changes, the created histograms are kept up to date as described in Periodic Autostats.

Uniform Random Sampling

Uniform random sampling for rowstore tables has been available in MemSQL since version 6.7. Starting in MemSQL 7.0, uniform random sampling is available for columnstore tables. By default, the optimizer uses this sampling on all tables. For rowstore tables, this sampling is implicit and no overhead is required to maintain it. For columnstore tables, SingleStore DB incrementally maintains the sample and keeps it up to date as the table’s data changes.

Autostats Support upon Upgrading MemSQL/SingleStore DB

Any table that is upgraded to a newer version of MemSQL/SingleStore DB retains its autostats settings. If the previous MemSQL/SingleStore DB version did not have autostats support, then the upgraded tables will have autostats disabled upon upgrade. Also, if a table had autostats enabled prior to upgrade, the same autostats will be enabled on the table post upgrade. Autostats will be enabled by default for any new table created in MemSQL version 7.0. To enable autostats on existing columnstore tables created prior to the upgrade, which we recommend in most cases, run ALTER TABLE table_name ENABLE AUTOSTATS on each table. As discussed in Disabling and enabling autostats, this will begin the process of collecting automatic statistics in the background.

Below is a simple script you can use to enable autostats on all tables (written in bash to minimize dependencies).

Save this script in a file called enable-autostats.sh:

#!/bin/bash

# Enables autostats on all columnstore tables which currently have autostats disabled
#
# Run this bash script with the mysql client arguments you use to connect to the master aggregator, like:
# bash enable-autostats.sh -h<host> -P<port> -u<user> -p<password>
#
# Use the --dry-run option to only print the list of tables which would be altered, without actually running the ALTER commands.

set -e

client_args=(mysql --batch --skip-column-names)
dry_run=false

while [[ $# -gt 0 ]]
do
    case "$1" in
        --dry-run)
            dry_run=true
            ;;
        *)
            client_args+=("$1")
            ;;
    esac
    shift
done

# List of table names (fully qualified and quoted)
tables=$(${client_args[@]} -e "select concat(table_schema, '.\`', replace(table_name, '\`', '\`\`'), '\`') from information_schema.tables where table_type = 'BASE TABLE'")

echo "$tables" | while read -r table
do
    show_create_table=$(${client_args[@]} -e "show create table $table")
    alter_stmt=""
    if echo "$show_create_table" | grep -q "AUTOSTATS_CARDINALITY_MODE=OFF"
    then
        if echo "$show_create_table" | grep -q "COLUMNSTORE"
        then
            alter_stmt="${alter_stmt}AUTOSTATS_CARDINALITY_MODE=INCREMENTAL"
        else
            alter_stmt="${alter_stmt}AUTOSTATS_CARDINALITY_MODE=PERIODIC"
        fi
    fi
    if echo "$show_create_table" | grep -q "AUTOSTATS_HISTOGRAM_MODE=OFF"
    then
        if [[ "$alter_stmt" != "" ]]
        then
            alter_stmt="${alter_stmt},"
        fi
        alter_stmt="${alter_stmt}AUTOSTATS_HISTOGRAM_MODE=CREATE"
    fi
    if echo "$show_create_table" | grep -q "AUTOSTATS_SAMPLING=OFF"
    then
        if echo "$show_create_table" | grep -q "COLUMNSTORE"
        then
            if [[ "$alter_stmt" != "" ]]
            then
                alter_stmt="${alter_stmt},"
            fi
        fi
        alter_stmt="${alter_stmt}AUTOSTATS_SAMPLING=ON"
    fi
    if [[ "$alter_stmt" != "" ]]
    then
        alter_stmt="ALTER TABLE $table $alter_stmt"
        echo $alter_stmt
        ${client_args[@]} -e "$alter_stmt"
    fi
done

Run this bash script with the mysql client arguments you use to connect to the master aggregator, like:

bash enable-autostats.sh -h<host> -P<port> -u<user> -p<password>

Disabling and Enabling Autostats

You can disable or enable autostats on each table. If you disable autostats on a table, statistics collection for that table becomes a manual process, i.e. column statistics must be collected with the ANALYZE command. If autostats is re-enabled, automatic statistics collection will resume (see the next section for details). We recommend leaving autostats enabled in most situations.

You can disable autostats on an existing table by running

ALTER TABLE table_name DISABLE AUTOSTATS;

or

ALTER TABLE table_name AUTOSTATS_ENABLED = OFF;

and enable autostats by running

ALTER TABLE table_name ENABLE AUTOSTATS;

or

ALTER TABLE table_name AUTOSTATS_ENABLED = ON;

You can choose whether autostats are enabled or disabled when creating a table by adding the option

AUTOSTATS_ENABLED=<TRUE|FALSE>

at the end of the CREATE TABLE statement, such as:

CREATE TABLE t (a int, b int, key (a) using clustered columnstore) AUTOSTATS_ENABLED=FALSE;

When AUTOSTATS_ENABLED is set to TRUE, autostats categories are set to the following default settings:

  • For rowstore tables,

    AUTOSTATS_CARDINALITY_MODE = PERIODIC
    AUTOSTATS_HISTOGRAM_MODE = CREATE
    
  • For columnstore tables,

    AUTOSTATS_CARDINALITY_MODE = INCREMENTAL
    AUTOSTATS_HISTOGRAM_MODE = CREATE
    AUTOSTATS_SAMPLING = ON
    

SingleStore DB allows you to independently control how each category of statistics is automatically gathered.

AUTOSTATS_CARDINALITY_MODE = [ INCREMENTAL | PERIODIC | OFF ]
AUTOSTATS_HISTOGRAM_MODE = [ CREATE | UPDATE | OFF ]
AUTOSTATS_SAMPLING = [ ON | OFF ]

These autostats settings can also be specified or modified through the ALTER TABLE and CREATE TABLE commands. Multiple options can be combined in a single CREATE TABLE statement. For example,

ALTER TABLE table_name AUTOSTATS_CARDINALITY_MODE = PERIODIC;
CREATE TABLE t1 (a INT, b INT, KEY(a) USING CLUSTERED COLUMNSTORE)
AUTOSTATS_CARDINALITY_MODE = INCREMENTAL,
AUTOSTATS_HISTOGRAM_MODE = CREATE,
AUTOSTATS_SAMPLING= ON;
Warning

Each type of autostats setting may not work for all types of tables. For example, you cannot have incremental autostats on a rowstore table.

Additionally, you can change whether autostats is enabled or disabled by default on future CREATE TABLE statements by setting the global variable default_autostats_enabled to TRUE or FALSE on the master aggregator. The default is TRUE.

You can also modify the default settings for autostats, individually, by setting the following global variables:

default_autostats_columnstore_cardinality_mode = [ INCREMENTAL | PERIODIC | OFF ]
default_autostats_rowstore_cardinality_mode = [ PERIODIC | OFF ]
default_autostats_histogram_mode = [ CREATE | UPDATE | OFF ]
default_autostats_columnstore_sampling = [ ON | OFF ]
Info

Limitations associated with using autostats:

  • Enabling AUTOSTATS_HISTOGRAM_MODE (either mode besides OFF) requires enabling AUTOSTATS_CARDINALITY_MODE (either mode besides OFF).
  • INCREMENTAL cardinality mode is only supported on columnstore tables.
  • AUTOSTATS_SAMPLING is not allowed for rowstore tables. On columnstore tables, it controls whether the columnstore sample is built and used for estimation.

You can see whether autostats is enabled or disabled on a table by running SHOW CREATE TABLE.

Possible reasons to disable autostats include:

  • Incremental autostats require a small memory overhead per column to maintain the statistics. If you have a large number of columns across all the columnstore tables in your database, this memory overhead may be significant. In this case you can modify your table to use periodic autostats, as the memory overhead is significantly lower and it will still keep the statistics up to date.
  • In CREATE mode (when default_autostats_histogram_mode is set to CREATE), automatic histograms may end up creating histograms on unwanted columns. As each histogram has a small memory overhead, you may choose to switch the histogram mode to UPDATE so that all existing histograms will be kept up to date, but no new histograms will be automatically created. You can still create any histograms manually using the ANALYZE command.
  • Histograms cannot be deleted in CREATE mode. Deleting a histogram in this mode would have limited impact, as the same histogram may be created automatically in the future. To choose between specific columns for creating histograms, a table must be in UPDATE mode.
  • Incremental autostats require a small performance cost on writes to a table. You may wish to consider disabling autostats if you find this write performance cost to be significant on your workload, can achieve better overall performance with manual statistics, and are willing to handle the operational requirements of managing statistics manually.
  • Periodic autostats uses a small fixed amount of resources in the background. If the resources used negatively affect your workload, you may find it beneficial to disable autostats.
  • If you want to “freeze” the statistics, so that the query optimizer continues to use a fixed set of statistics regardless of how the data is changing, you can do so by disabling automatic statistics and using ANALYZE to manage statistics.

Keep in mind that if you disable autostats, or change the default settings for it, there are several potential downsides:

  • In general, PERIODIC autostats run slower and use more CPU than INCREMENTAL autostats. If you do not have a problem with the memory overhead of INCREMENTAL autostats and are using a columnstore table, you should most likely use INCREMENTAL autostats.
  • Autostats collects statistics automatically by default, and updates statistics as the table data changes, but if you disable autostats, you will need to run ANALYZE to collect statistics, and you will have to update statistics as necessary by running ANALYZE.
  • If the statistics become stale, the query optimizer may choose bad query plans with much worse execution performance.
  • Collecting statistics manually with ANALYZE can be very costly for large tables. ANALYZE re-collects all statistics from scratch, which can be a very expensive operation on large tables, and can impact other workloads on the database. With autostats, the cost of updating statistics is amortized across writes, and there is no unnecessary work to re-collect statistics on parts of the table that aren’t changing.
  • By maintaining statistics incrementally, autostats can keep statistics much more up-to-date compared to batch statistics collection with ANALYZE.

Statistics Population after Enabling Autostats

Immediately after enabling autostats on a table where it was previously disabled, SingleStore DB will start collecting automatic statistics in the background until it catches up with the current table data. During this time, the query optimizer will use the previous manually-gathered statistics if available, until the autostats are sufficiently up-to-date - see the next section for details.

This background statistics collection uses limited resources to avoid large impacts on the performance of other workloads. Therefore, collecting automatic statistics may take a long time for large datasets, but this only affects the first time statistics are populated after autostats is enabled on a table where it was previously disabled.

Out-of-date or Missing Autostats

If there is too large a difference between the data which SingleStore DB has gathered statistics on and the actual table data, such as immediately after autostats is enabled, autostats will be considered out-of-date. As soon as the difference becomes small enough, autostats will be considered up-to-date. While autostats are out-of-date, the query optimizer uses the statistics saved the last time ANALYZE was run on the table, if any. EXPLAIN will show a warning that autostats is out of date. If autostats are missing, sampling will be used to get any needed estimates.

You can accelerate the collection of autostats by running OPTIMIZE TABLE:

OPTIMIZE TABLE <table_name> FLUSH;
OPTIMIZE TABLE <table_name>;

After statistics have been updated, you may run ANALYZE TABLE to trigger plan invalidation and re-optimization of any query plans that were based on stale statistics - see the Plan invalidation section.

Types of Statistics

Column Statistics

Column statistics are collected on every column in a table when the table is processed by autostats or the ANALYZE command. These statistics include the number of distinct values in each column, the number of NULLs in each column, and the number of rows in the table.

You can view the column statistics last collected by querying the information_schema.OPTIMIZER_STATISTICS table.

DESCRIBE INFORMATION_SCHEMA.OPTIMIZER_STATISTICS;
****
+-----------------------+--------------+------+------+---------+-------+
| Field                 | Type         | Null | Key  | Default | Extra |
+-----------------------+--------------+------+------+---------+-------+
| DATABASE_NAME         | varchar(512) | NO   |      | NULL    |       |
| TABLE_NAME            | varchar(512) | NO   |      | NULL    |       |
| COLUMN_NAME           | varchar(512) | NO   |      | NULL    |       |
| ROW_COUNT             | bigint(21)   | YES  |      | NULL    |       |
| NULL_COUNT            | bigint(21)   | YES  |      | NULL    |       |
| CARDINALITY           | bigint(21)   | YES  |      | NULL    |       |
| AUTOSTATS_ENABLED     | tinyint(1)   | YES  |      | NULL    |       |
| AUTOSTATS_OUTDATED    | tinyint(1)   | YES  |      | NULL    |       |
| AUTOSTATS_ROW_COUNT   | bigint(21)   | YES  |      | NULL    |       |
| AUTOSTATS_NULL_COUNT  | bigint(21)   | YES  |      | NULL    |       |
| AUTOSTATS_CARDINALITY | bigint(21)   | YES  |      | NULL    |       |
| ADVANCED_HISTOGRAMS   | bigint(21)   | YES  |      | NULL    |       |
| LEGACY_HISTOGRAMS     | bigint(21)   | YES  |      | NULL    |       |
| RANGE_STATS           | bigint(21)   | YES  |      | NULL    |       |
| SAMPLE_SIZE           | bigint(21)   | YES  |      | NULL    |       |
| LAST_UPDATED          | datetime     | YES  |      | NULL    |       |
+-----------------------+--------------+------+------+---------+-------+

AUTOSTATS_ENABLED indicates if cardinality autostats are enabled.

The LAST_UPDATED column shows the time non-incremental statistics were saved. This occurs when ANALYZE runs, or on periodic autostats collection.

Info

information_schema.OPTIMIZER_STATISTICS.LAST_UPDATED is not accurate for columnstore tables using incremental autostats. This is because autostats are collected upon ingest and maintained per segment, meaning that they are not stored like other statistics collected by ANALYZE.

The best way to check that incremental autostats are up to date is to run EXPLAIN on the table in question. If no error is surfaced, autostats are up to date. See Missing Statistics for more information.

The RANGE_STATS column indicates whether any kind of histogram is stored for this column. Two types of histograms, advanced or legacy, may be present. The LEGACY_HISTOGRAMS and ADVANCED_HISTOGRAMS columns indicate presence of legacy and new histograms, respectively. If you are using legacy histograms (using cardinality_estimation_level set to 6.0), information about the histograms is shown in information_schema.RANGE_STATISTICS.

AUTOSTATS_ROW_COUNT, AUTOSTATS_NULL_COUNT, and AUTOSTATS_CARDINALITY are deprecated. The columns remain in the table, but their value is always NULL.

Columns in information_schema.OPTIMIZER_STATISTICS will contain NULL values if autostats haven’t yet been collected, or autostats is disabled and ANALYZE has not yet been run.

Histograms

The ANALYZE command can additionally collect histograms over designated columns. When histograms are enabled for a column, ANALYZE will sample the rows in that column and create a histogram from the samples, recording the endpoints of each bucket and statistical information about the distribution of values within each bucket.

When collected, histograms can be queried through the information_schema.ADVANCED_HISTOGRAMS table:

DESCRIBE INFORMATION_SCHEMA.ADVANCED_HISTOGRAMS;
****
+-------------------------+---------------+------+------+---------+-------+
| Field                   | Type          | Null | Key  | Default | Extra |
+-------------------------+---------------+------+------+---------+-------+
| DATABASE_NAME           | varchar(512)  | NO   |      | NULL    |       |
| TABLE_NAME              | varchar(512)  | NO   |      | NULL    |       |
| COLUMN_NAME             | varchar(512)  | NO   |      | NULL    |       |
| TYPE                    | varchar(255)  | NO   |      | NULL    |       |
| VERSION                 | bigint(21)    | NO   |      | NULL    |       |
| BUCKET_COUNT            | bigint(21)    | NO   |      | NULL    |       |
| BUCKET_INDEX            | bigint(21)    | NO   |      | NULL    |       |
| RANGE_MIN               | varchar(8192) | YES  |      | NULL    |       |
| RANGE_MAX               | varchar(8192) | YES  |      | NULL    |       |
| UNIQUE_COUNT            | double        | YES  |      | NULL    |       |
| CARDINALITY             | double        | YES  |      | NULL    |       |
| UNIQUE_COUNT_CUMULATIVE | double        | YES  |      | NULL    |       |
| CARDINALITY_CUMULATIVE  | double        | YES  |      | NULL    |       |
+-------------------------+---------------+------+------+---------+-------+

The table will display, in human-readable form, the end points of the histogram buckets along with total and unique count for each bucket. Bucket index -1 corresponds to the number of nulls tracked by the histogram.

If you want to find a list of the columns that have histograms, you can use the following query:

SELECT DISTINCT DATABASE_NAME, TABLE_NAME, COLUMN_NAME
FROM information_schema.ADVANCED_HISTOGRAMS;

Histograms vs. Sampling

If histograms are unavailable, the optimizer will use a sample to compute estimates. For rowstore tables, it will use dynamic sampling (i.e. sample the table data by scanning part of the table) to compute estimates. For columnstore tables, it will use the stored columnstore sample.

Histograms are much faster to use than sampling during query optimization, since the optimizer merely needs to read the previously-collected histograms, which are relatively small objects, while sampling may require scanning a large amount of table data.

Whether histogram-based estimates or sampling-based estimates are more accurate depends on the table and query. For example, histograms are most effective on columns where the overall distribution of data is not changing quickly (even if the data itself is changing quickly). When the data distribution is changing, if autostats is disabled, it is important to periodically update histograms to reflect changes in data by re-running ANALYZE TABLE.

Conversely, histograms are least effective on columns where the data distribution is quickly and continuously changing. For example, a typical case where histograms are generally inaccurate and not recommended is on columns where the range of data is continuously growing, such as a timestamp or ID column where new, ever-increasing values are continuously being added. Histograms only reflect the range of data at the time the histograms were last updated, so new values outside that range are not reflected in the histograms. This leads to inaccurate estimates for queries that filter for this recent data. For these types of columns, leaving histograms disabled, which means sampling is used instead, typically leads to the best results.

Also, since histograms are one-dimensional, they cannot capture correlation between columns. Sampling can estimate predicates that feature a high degree of correlation between columns more accurately than histograms.

Finally, sampling may be inaccurate on predicates that are highly selective, or that have a high degree of skew between partitions. Histograms are typically more accurate on these predicates.

Identifying Missing Statistics

Missing Statistics Warnings in EXPLAIN

When you run EXPLAIN on a query, SingleStore DB will show information about any missing statistics that the optimizer could take advantage of, and the ANALYZE commands that you can run to collect those statistics. With autostats enabled, these messages will not appear or will go away on their own quickly, unless autostats is disabled for the needed columns. So if you see these messages, you typically should double check that autostats is enabled for all tables and columns.

When autostats are disabled, you should normally follow the recommendations in these warnings. The missing statistics listed here are statistics that are not present that the query optimizer would be able to use if collected.

Info

As discussed above, collecting column statistics is strongly recommended, and collecting histograms is recommended in most situations. See the Histograms vs. sampling section for more details.

For example, for the SELECT statement below, the EXPLAIN output shows that the optimizer is missing column statistics on both tables t and t2 as well as histograms on t.x.

CREATE TABLE t (id BIGINT PRIMARY KEY, x DOUBLE);
CREATE TABLE t2 (id BIGINT PRIMARY KEY, x DOUBLE);
EXPLAIN SELECT * FROM t, t2 WHERE t.id = t2.id AND t.x > 1;
****
+---------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                     |
+---------------------------------------------------------------------------------------------+
| WARNING: Missing statistics on some tables, consider running the following commands:        |
|     ANALYZE TABLE db.`t2`;                                                                  |
|     ANALYZE TABLE db.`t` COLUMNS `x` ENABLE;                                                |
| See http://docs.singlestore.com/v5.0/docs/analyze for more information on statistics collection. |
|                                                                                             |
| Project [t.id, t.id2, t.x, t2.id, t2.x]                                                     |
| Gather partitions:all                                                                       |
| Project [t.id, t.id2, t.x, t2.id, t2.x]                                                     |
| NestedLoopJoin                                                                              |
| |---IndexSeek db.t2, PRIMARY KEY (id) scan:[id = t.id]                                      |
| Filter [t.x > 1]                                                                            |
| TableScan db.t, PRIMARY KEY (id)                                                            |
+---------------------------------------------------------------------------------------------+

Missing Histograms in the Plancache

In addition, you can view information about any recently run queries which were missing histograms, which you can use to help identify which columns you may need to enable autostats on, or collect histograms on. Again, it’s highly recommended to use autostats, and if you are, then you will either not see missing histogram information, or such information can be disregarded because soon autostats will gather histograms.

The view information_schema.MV_PROSPECTIVE_HISTOGRAMS shows columns which did not have histograms and were used in queries where a histogram could have been used. This view counts over all queries in the in-memory plancache (see Understanding the Plancache), i.e. those queries shown in information_schema.PLANCACHE and information_schema.MV_QUERIES.

SELECT * FROM information_schema.MV_PROSPECTIVE_HISTOGRAMS;
****
+---------------+------------+-------------+-------------+
| DATABASE_NAME | TABLE_NAME | COLUMN_NAME | USAGE_COUNT |
+---------------+------------+-------------+-------------+
| db            | t          | c           |           3 |
| db            | t          | b           |           3 |
| db            | t          | a           |           3 |
+---------------+------------+-------------+-------------+

The view information_schema.MV_QUERY_PROSPECTIVE_HISTOGRAMS shows precisely which queries in the plancache could have made use of the histograms. You may join this to views such as MV_QUERIES and MV_ACTIVITIES to learn more about the queries.

SELECT * FROM information_schema.MV_QUERY_PROSPECTIVE_HISTOGRAMS;
****
+-----------------------------------+---------------+------------+-------------+-------------+
| ACTIVITY_NAME                     | DATABASE_NAME | TABLE_NAME | COLUMN_NAME | USAGE_COUNT |
+-----------------------------------+---------------+------------+-------------+-------------+
| Select_symbols_t_349f04a075ae9982 | db            | t          | a           |           1 |
| Select_symbols_t_bf93eac63fe6a2e8 | db            | t          | a           |           1 |
| Select_symbols_t_bf93eac63fe6a2e8 | db            | t          | c           |           1 |
| Select_symbols_t_056d53b2265b58ce | db            | t          | a           |           1 |
| Select_symbols_t_4222228538392d61 | db            | t          | b           |           1 |
| Select_symbols_t_ecb971f633697055 | db            | t          | b           |           1 |
| Select_symbols_t_ecb971f633697055 | db            | t          | c           |           1 |
| Select_symbols_t_d0692ff8d63f22e8 | db            | t          | b           |           1 |
| Select_symbols_t_d0692ff8d63f22e8 | db            | t          | c           |           1 |
+-----------------------------------+---------------+------------+-------------+-------------+

Both of these views display columns which were missing histograms at the time the queries were compiled. If you later collect histograms, or histograms are collected automatically, the views are not updated unless/until the query is recompiled.

Statistics Lifetime

Statistics are tied to a base table, and are persisted through ALTER TABLE or by autostats. Statistics are also persisted through server restarts. Statistics are deleted upon running DROP TABLE or ANALYZE TABLE t DROP.

Periodic Statistics Collection

You can configure settings for periodic statistics collection by setting the engine variables background_statistics_collection_interval and background_statistics_collection_threshold.

The following scenario demonstrates the periodic statistics collection process for autostats-enabled tables.

  1. The user creates a new database DB1 and table DB1.T1 with autostats enabled and the following default settings:
  • background_statistics_collection_interval = 60 (trigger collection every 60 seconds)
  • background_statistics_collection_threshold = 0.5 (trigger collection when the ratio of changes to the number of rows hits 0.5)
  1. The user inserts 5 rows into the DB1.T1 table. The mv_row_change_counts table reports a total of 5 changes across the partitions for DB1.
  2. Within 60 seconds, a background thread wakes up and scans the change counts. The row count is saved as 5, and after the stats collection, the last_analyzed_total_changes field in the mv_collected_row_change_counts table is set to the total_changes field, both of which will sum to 5 across partitions.
  3. The user inserts 3 new rows. The row change counts now sum to 8.
  4. Next time the background thread wakes, it will see 3 changes and observe the last collected row count (5). As 3/5 exceeds the 0.5 ratio, the collection is triggered again.