Outdated Version

You are viewing an older version of this section. View current production version.

ANALYZE

The ANALYZE command enables collecting value-level statistics on a table to facilitate accurate query optimization.

The ANALYZE Command

The main forms of the ANALYZE command are:

ANALYZE TABLE table_name;
ANALYZE TABLE table_name COLUMNS column_name [, ...] {ENABLE | DISABLE};

ANALYZE collects two types of statistics: column statistics on every column, and additionally range statistics on designated columns (see the Types of Statistics section). Columns are designated or un-designated with the ANALYZE TABLE table_name COLUMNS ... {ENABLE | DISABLE} commands, as described below.

Warning
  • ANALYZE may invalidate plans, i.e. cause them to be reoptimized and recompiled the next time they are run, when data statistics change and a new plan could be faster. See the Plan Invalidation section for more details.

  • Statistics are collected and updated only when you run the ANALYZE command.

  • The ANALYZE command can be expensive for large tables, especially when collecting range statistics. Because statistics are collected only when you run ANALYZE, you can choose when to run it to best manage performance.

  • ANALYZE does not block concurrent queries and can be safely used while running concurrent queries. Any query compilations will use the present statistics until the ANALYZE completes, at which point query compilations atomically switch to using the new statistics.

  • You may need to collect statistics for optimal performance. See the Missing Statistics Warnings section.

Info

Statistics are tied to a base table. Statistics are persisted and remain after restarts. Statistics are not changed when the table is modified through DML statements, including INSERT, UPDATE, DELETE, LOAD DATA, and TRUNCATE. Statistics are only collected and updated when the ANALYZE command is called, which completely overwrites any previous statistics collected on the table.

Notes

All forms of ANALYZE are subject to the following restrictions:

  • ANALYZE must be run on the master aggregator node.
  • ANALYZE requires ALTER and SELECT permissions on the target table.
  • As usual, you may specify USING db_name prior to ANALYZE, as in [USING db_name] ANALYZE ..., where db_name is a MemSQL database. The context database of the connection is used if the USING clause is omitted.
  • table_name must be the name of a table in the chosen database.
  • column_name must refer to a column in the table table_name, if provided.

Forms of the ANALYZE command

ANALYZE TABLE table_name;

Collect column statistics on all columns in table_name and range statistics over previously designated columns. Note that if statistics have not been collected on this table before, or since the last ALTER on the table, no columns are considered previously designated for range statistics. Invalidate all plans on this table which were compiled with significantly different statistics than the newly observed statistics.

ANALYZE TABLE table_name COLUMNS column_name [, ...] ENABLE;

Designate the specified columns to have range-statistics collected, in addition to any previously designated columns. Collect column statistics on all columns in table_name and range statistics over all designated columns, and invalidate all plans on this table which were compiled with significantly different statistics than the newly observed statistics (the same effects as ANALYZE TABLE table_name).

ANALYZE TABLE table_name COLUMNS column_name [, ...] DISABLE;

Delete range statistics on specified columns if present, and designate the specified columns to not have range statistics collected. No range or column statistics are collected.

ANALYZE TABLE table_name DROP;

Delete all stored statistics for a table if present. This includes disabling histograms on any columns for which they have been enabled.

ANALYZE statistics export and import

It is also possible to export and import statistics, which can be useful for testing purposes.

ANALYZE TABLE table_name INTO OUTFILE 'path/file.json';

Save column and range statistics recorded on table_name in a JSON representation into the file path/file.json. Relative paths are rooted in the data directory of the MemSQL instance. Note that this exports the statistics that are currently saved; it does not collect new statistics.

ANALYZE TABLE table_name INFILE 'path/file.json';

Import range and column statistics for table_name from the file path/file.json. Relative paths are rooted in the data directory of the MemSQL instance.

The fake table name OPTIMIZER_STATE can be used to save the current configuration parameters for the optimizer, useful for reproducing plans. E.g.:

ANALYZE OPTIMIZER_STATE INTO OUTFILE 'file.json';
ANALYZE OPTIMIZER_STATE INFILE 'file.json';

Types of Statistics

Column Statistics

Column statistics are collected on every column in a table when the table is processed by the ANALYZE command. They include the number of NULLs in each column, the number of distinct values in each column, the “density” of each column (a measure of the evenness of the frequency distribution) and the number of rows in the table. When collected, column statistics can be queried through the INFORMATION_SCHEMA.OPTIMIZER_STATISTICS table:

memsql> describe INFORMATION_SCHEMA.OPTIMIZER_STATISTICS;
+----------------------+--------------+------+------+---------+-------+
| Field                | Type         | Null | Key  | Default | Extra |
+----------------------+--------------+------+------+---------+-------+
| DATABASE_NAME        | varchar(512) | NO   |      | NULL    |       |
| TABLE_NAME           | varchar(512) | NO   |      | NULL    |       |
| COLUMN_NAME          | varchar(512) | NO   |      | NULL    |       |
| ROW_COUNT            | bigint(21)   | NO   |      | NULL    |       |
| MIN_VALUE_SERIALIZED | varchar(255) | YES  |      | NULL    |       |
| MAX_VALUE_SERIALIZED | varchar(255) | YES  |      | NULL    |       |
| NULL_COUNT           | bigint(21)   | NO   |      | NULL    |       |
| CARDINALITY          | bigint(21)   | NO   |      | NULL    |       |
| DENSITY              | double       | NO   |      | NULL    |       |
| RANGE_STATS          | bigint(21)   | NO   |      | NULL    |       |
+----------------------+--------------+------+------+---------+-------+
10 rows in set (0.01 sec)",

The MIN_VALUE_SERIALIZED and MAX_VALUE_SERIALIZED columns are currently unused. The RANGE_STATS column indicates whether range statistics are stored for this column.

Range Statistics

The ANALYZE command can additionally collect equi-depth histograms over designated columns. When histograms are enabled for a column, ANALYZE will sample the rows in that column and create a histogram from the samples such that the frequency of each bucket is the same, but their widths vary. Since the frequencies are equal, only the endpoints of the recorded buckets are stored. When collected, range histograms can be queried through the INFORMATION_SCHEMA.RANGE_STATISTICS table:

memsql> describe INFORMATION_SCHEMA.RANGE_STATISTICS
+---------------+--------------+------+------+---------+-------+
| Field         | Type         | Null | Key  | Default | Extra |
+---------------+--------------+------+------+---------+-------+
| DATABASE_NAME | varchar(512) | NO   |      | NULL    |       |
| TABLE_NAME    | varchar(512) | NO   |      | NULL    |       |
| COLUMN_NAME   | varchar(512) | NO   |      | NULL    |       |
| BUCKET_ID     | bigint(21)   | NO   |      | NULL    |       |
| MIN_VALUE     | varchar(255) | NO   |      | NULL    |       |
| MAX_VALUE     | varchar(255) | YES  |      | NULL    |       |
+---------------+--------------+------+------+---------+-------+
6 rows in set (0.00 sec)

The values for the endpoints are stored in an order-preserving mapping from their original type to bytestrings ordered lexicographically and shown in the table in a BASE64-encoded representation in the columns MIN_VALUE and MAX_VALUE. BUCKET_ID stores the index of the bucket in the histogram.

Missing Statistics Warnings

Identifying which columns a query could use range statistics on can be tedious and error-prone to do manually and is complicated by query transformations like predicate transitivity and predicate pushdown. When you run EXPLAIN on a query, MemSQL will show information about any missing statistics that the optimizer could take advantage of, and the ANALYZE commands that you can run to collect those statistics.

For example, for the following query, the explain shows that the optimizer is missing column statistics on both tables t and t2 as well as range statistics on t.x.

memsql> create table t (id bigint primary key, x double);
Query OK, 0 rows affected (0.17 sec)

memsql> create table t2 (id bigint primary key, x double);
Query OK, 0 rows affected (0.17 sec)

memsql> explain select * from t, t2 where t.id = t2.id and t.x > 1;
+---------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                     |
+---------------------------------------------------------------------------------------------+
| WARNING: Missing statistics on some tables, consider running the following commands:        |
|     ANALYZE TABLE db.`t2`;                                                                  |
|     ANALYZE TABLE db.`t` COLUMNS `x` ENABLE;                                                |
| See http://docs.memsql.com/v5.0/docs/analyze for more information on statistics collection. |
|                                                                                             |
| Project [t.id, t.id2, t.x, t2.id, t2.x]                                                     |
| Gather partitions:all                                                                       |
| Project [t.id, t.id2, t.x, t2.id, t2.x]                                                     |
| NestedLoopJoin                                                                              |
| |---IndexSeek db.t2, PRIMARY KEY (id) scan:[id = t.id]                                      |
| Filter [t.x > 1]                                                                            |
| TableScan db.t, PRIMARY KEY (id)                                                            |
+---------------------------------------------------------------------------------------------+
12 rows in set (0.14 sec)

Range Statistics vs. Sampling

Column statistics are used to estimate group-by and join cardinalities and are almost always beneficial to collect. Range statistics (histograms) are used to estimate filter selectivity, but the optimizer will resort to sampling when they are absent. Range statistics can make query optimization much faster and more accurate:

  • Range statistics are more accurate for individual predicates (e.g. x > 1000), since they are collected over all partitions, whereas sampling sends queries to only one partition, potentially leading to very skewed estimates.
  • When present, range statistics are much faster to use than sampling during query optimization, since the optimizer merely needs to read the statistics which were replicated to all aggregators in the cluster by the “ANALYZE” command. In contrast, sampling requires compiling and running separate queries for each table and per-table predicate set on a leaf, which significantly adds to the latency in query optimization.

However, in some cases sampling can yield more accurate results:

  • Range statistics can only estimate predicates of the form <column> =/<>/>/</>=/<= <constant> combined via AND and OR. When range statistics are present for all such simple predicates, predicates involving more complex expressions (like LIKE and IN) are estimated via heuristics which are oblivious to the actual distribution of data in the table.
  • Since histograms are one-dimensional, they cannot capture correlation between columns: Suppose column x is uniformly distributed between 0 and 1, and column y = x + 1. If histograms are collected on both x and y, the optimizer will estimate the selectivity of x > 0.5 AND y > 1.5 as P[x > 0.5] * P[y > 1.5] = 0.25. If histograms are absent for x or y, the optimizer will run the query SELECT COUNT(*) FROM t WHERE x > 0.5 AND y > 0.5 and discover that the selectivity is actually 0.5.

In summary, range statistics are inaccurate in the presence of correlated predicates, whereas sampling is inaccurate in the presence of data skew across partitions.

Range statistics are more expensive to collect than column statistics and are only useful on columns that are involved in single-table equality or range predicates. In addition, the storage cost of range statistics is proportional to the size of the values in the column they’re collected over, unlike column statistics which take constant memory to store per column.

Plan Invalidation

When column statistics are collected on a table, the optimizer will record the row count associated with the statistics on the table into all plans compiled using those statistics. When ANALYZE TABLE is run on the table, all plans which were optimized with significantly different statistics than the current statistics are invalidated, i.e. they will be reoptimized and recompiled the next time they are run, because a new plan may be significantly better after a change in data statistics. Significantly different statistics is defined as having a recorded rowcount less than 50% or greater than 200% of the new rowcount. (1) Plans optimized without column statistics on a table are always invalidated after ANALYZE TABLE is run on the table (with the exception of when the table is empty).

(1): More precisely, when ANALYZE TABLE is run on the table, all in-memory plans with a recorded rowcount less than 50% or greater than 200% of the new rowcount are immediately invalidated. Plans in the persistent plancache are discarded when loaded if their recorded rowcount is less than 50% or greater than 200% of the rowcount of the table as recorded by the most recent statistics available for it.