ANALYZE
The ANALYZE command collects value-level statistics on a table to facilitate accurate query optimization.
The ANALYZE Command
The main forms of the ANALYZE command are:
ANALYZE TABLE table_name;
ANALYZE TABLE table_name COLUMNS column_name [, ...] {ENABLE | DISABLE};
ANALYZE collects two types of statistics: column statistics on every column, and additionally range statistics on designated columns (see the Types of Statistics section). Columns are designated or un-designated with the ANALYZE TABLE table_name COLUMNS ... {ENABLE | DISABLE} commands, as described below.
- ANALYZE may invalidate plans, i.e. cause them to be reoptimized and recompiled the next time they are run, when data statistics change and a new plan could be faster. See the Plan Invalidation section for more details.
- Statistics are collected and updated only when you run the ANALYZE command.
- The ANALYZE command can be expensive for large tables, especially when collecting range statistics. Because statistics are collected only when you run ANALYZE, you can choose when to run it to best manage performance.
- ANALYZE does not block concurrent queries and can safely be run alongside them. Query compilations use the existing statistics until the ANALYZE completes, at which point they atomically switch to using the new statistics.
- You may need to collect statistics for optimal performance. See the Missing Statistics Warnings section.
Statistics are tied to a base table and are persisted across restarts. Statistics are not changed when the table is modified through DML statements, including INSERT, UPDATE, DELETE, LOAD DATA, and TRUNCATE. Statistics are only collected and updated when the ANALYZE command is called, which completely overwrites any previous statistics collected on the table.
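For example, after a large data load the statistics can be refreshed explicitly. The table names below (orders, staging_orders) are hypothetical:

-- DML such as INSERT does not update statistics on the target table,
INSERT INTO orders SELECT * FROM staging_orders;
-- so refresh them once the load is complete:
ANALYZE TABLE orders;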
Notes
All forms of ANALYZE are subject to the following restrictions:
- ANALYZE must be run on the master aggregator node.
- ANALYZE requires ALTER and SELECT permissions on the target table.
- As usual, you may specify USING db_name prior to ANALYZE, as in [USING db_name] ANALYZE ..., where db_name is a MemSQL database. The context database of the connection is used if the USING clause is omitted. (See the example after this list.)
- table_name must be the name of a table in the chosen database.
- column_name must refer to a column in the table table_name, if provided.
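For example, to run ANALYZE against a database other than the connection's context database (the database, table, and column names below are hypothetical):

USING mydb ANALYZE TABLE orders;
USING mydb ANALYZE TABLE orders COLUMNS order_total ENABLE;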
Forms of the ANALYZE command
ANALYZE TABLE table_name;
Collect column statistics on all columns in table_name and range statistics over previously designated columns. Note that if statistics have not been collected on this table before, or since the last ALTER on the table, no columns are considered previously designated for range statistics. Invalidate all plans on this table which were compiled with significantly different statistics than the newly observed statistics.
ANALYZE TABLE table_name COLUMNS column_name [, ...] ENABLE;
Designate the specified columns to have range statistics collected, in addition to any previously designated columns. Collect column statistics on all columns in table_name and range statistics over all designated columns, and invalidate all plans on this table which were compiled with significantly different statistics than the newly observed statistics (the same effects as ANALYZE TABLE table_name).
ANALYZE TABLE table_name COLUMNS column_name [, ...] DISABLE;
Delete range statistics on specified columns if present, and designate the specified columns to not have range statistics collected. No range or column statistics are collected.
ANALYZE TABLE table_name DROP;
Delete all stored statistics for a table if present. This includes disabling histograms on any columns for which they have been enabled.
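A minimal end-to-end sketch of these forms, using a hypothetical table t with a column x:

ANALYZE TABLE t COLUMNS x ENABLE;   -- designate x for range statistics and collect statistics
ANALYZE TABLE t;                    -- later: refresh column statistics and the histogram on x
ANALYZE TABLE t COLUMNS x DISABLE;  -- delete the histogram on x and stop collecting it
ANALYZE TABLE t DROP;               -- discard all stored statistics for t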
ANALYZE statistics export and import
It is also possible to export and import statistics, which can be useful for testing purposes.
ANALYZE TABLE table_name INTO OUTFILE 'path/file.json';
Save column and range statistics recorded on table_name in a JSON representation into the file path/file.json. Relative paths are rooted in the data directory of the MemSQL instance. Note that this exports the statistics that are currently saved; it does not collect new statistics.
ANALYZE TABLE table_name INFILE 'path/file.json';
Import range and column statistics for table_name from the file path/file.json. Relative paths are rooted in the data directory of the MemSQL instance.
The fake table name OPTIMIZER_STATE can be used to save the current configuration parameters for the optimizer, which is useful for reproducing plans. For example:
ANALYZE OPTIMIZER_STATE INTO OUTFILE 'file.json';
ANALYZE OPTIMIZER_STATE INFILE 'file.json';
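For instance, to reproduce a production plan on a test cluster, the saved statistics and optimizer state could be exported on one cluster and imported on another. The table and file names below are hypothetical:

-- On the source cluster: export the saved statistics and the optimizer configuration
ANALYZE TABLE orders INTO OUTFILE 'orders_stats.json';
ANALYZE OPTIMIZER_STATE INTO OUTFILE 'optimizer_state.json';
-- On the target cluster: import them for the same table
ANALYZE TABLE orders INFILE 'orders_stats.json';
ANALYZE OPTIMIZER_STATE INFILE 'optimizer_state.json';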
Types of Statistics
Column Statistics
Column statistics are collected on every column in a table when the table is processed by the ANALYZE command. They include the number of NULLs in each column, the number of distinct values in each column, the “density” of each column (a measure of the evenness of the frequency distribution), and the number of rows in the table. When collected, column statistics can be queried through the INFORMATION_SCHEMA.OPTIMIZER_STATISTICS table:
memsql> describe INFORMATION_SCHEMA.OPTIMIZER_STATISTICS;
+----------------------+--------------+------+------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------------------+--------------+------+------+---------+-------+
| DATABASE_NAME | varchar(512) | NO | | NULL | |
| TABLE_NAME | varchar(512) | NO | | NULL | |
| COLUMN_NAME | varchar(512) | NO | | NULL | |
| ROW_COUNT | bigint(21) | NO | | NULL | |
| MIN_VALUE_SERIALIZED | varchar(255) | YES | | NULL | |
| MAX_VALUE_SERIALIZED | varchar(255) | YES | | NULL | |
| NULL_COUNT | bigint(21) | NO | | NULL | |
| CARDINALITY | bigint(21) | NO | | NULL | |
| DENSITY | double | NO | | NULL | |
| RANGE_STATS | bigint(21) | NO | | NULL | |
+----------------------+--------------+------+------+---------+-------+
10 rows in set (0.01 sec)
The MIN_VALUE_SERIALIZED and MAX_VALUE_SERIALIZED columns are currently unused. The RANGE_STATS column indicates whether range statistics are stored for this column.
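For example, to inspect the column statistics collected for a hypothetical table t in a hypothetical database db:

SELECT COLUMN_NAME, ROW_COUNT, NULL_COUNT, CARDINALITY, DENSITY, RANGE_STATS
FROM INFORMATION_SCHEMA.OPTIMIZER_STATISTICS
WHERE DATABASE_NAME = 'db' AND TABLE_NAME = 't';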
Range Statistics
The ANALYZE command can additionally collect equi-depth histograms over designated columns. When histograms are enabled for a column, ANALYZE will sample the rows in that column and create a histogram from the samples such that the frequency of each bucket is the same, but their widths vary. Since the frequencies are equal, only the endpoints of the recorded buckets are stored. When collected, range histograms can be queried through the INFORMATION_SCHEMA.RANGE_STATISTICS table:
memsql> describe INFORMATION_SCHEMA.RANGE_STATISTICS;
+---------------+--------------+------+------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------+--------------+------+------+---------+-------+
| DATABASE_NAME | varchar(512) | NO | | NULL | |
| TABLE_NAME | varchar(512) | NO | | NULL | |
| COLUMN_NAME | varchar(512) | NO | | NULL | |
| BUCKET_ID | bigint(21) | NO | | NULL | |
| MIN_VALUE | varchar(255) | NO | | NULL | |
| MAX_VALUE | varchar(255) | YES | | NULL | |
+---------------+--------------+------+------+---------+-------+
6 rows in set (0.00 sec)
The values for the endpoints are stored in an order-preserving mapping from their original type to bytestrings ordered lexicographically, and are shown in the table in a BASE64-encoded representation in the columns MIN_VALUE and MAX_VALUE. BUCKET_ID stores the index of the bucket in the histogram.
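For example, to view the histogram buckets stored for a hypothetical column x of table t in database db:

SELECT BUCKET_ID, MIN_VALUE, MAX_VALUE
FROM INFORMATION_SCHEMA.RANGE_STATISTICS
WHERE DATABASE_NAME = 'db' AND TABLE_NAME = 't' AND COLUMN_NAME = 'x'
ORDER BY BUCKET_ID;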
Missing Statistics Warnings
Identifying which columns of a query could benefit from range statistics can be tedious and error-prone to do manually, and it is complicated by query transformations like predicate transitivity and predicate pushdown. When you run EXPLAIN on a query, MemSQL will show information about any missing statistics that the optimizer could take advantage of, along with the ANALYZE commands that you can run to collect those statistics.
For example, for the following query, the EXPLAIN output shows that the optimizer is missing column statistics on both tables t and t2, as well as range statistics on t.x.
memsql> create table t (id bigint primary key, x double);
Query OK, 0 rows affected (0.17 sec)
memsql> create table t2 (id bigint primary key, x double);
Query OK, 0 rows affected (0.17 sec)
memsql> explain select * from t, t2 where t.id = t2.id and t.x > 1;
+---------------------------------------------------------------------------------------------+
| EXPLAIN |
+---------------------------------------------------------------------------------------------+
| WARNING: Missing statistics on some tables, consider running the following commands: |
| ANALYZE TABLE db.`t2`; |
| ANALYZE TABLE db.`t` COLUMNS `x` ENABLE; |
| See http://docs.memsql.com/v5.0/docs/analyze for more information on statistics collection. |
| |
| Project [t.id, t.x, t2.id, t2.x] |
| Gather partitions:all |
| Project [t.id, t.x, t2.id, t2.x] |
| NestedLoopJoin |
| |---IndexSeek db.t2, PRIMARY KEY (id) scan:[id = t.id] |
| Filter [t.x > 1] |
| TableScan db.t, PRIMARY KEY (id) |
+---------------------------------------------------------------------------------------------+
12 rows in set (0.14 sec)
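Running the suggested commands collects the missing statistics, after which a subsequent EXPLAIN of the same query should no longer show the warning:

ANALYZE TABLE db.`t2`;
ANALYZE TABLE db.`t` COLUMNS `x` ENABLE;
-- Re-run the EXPLAIN; the missing-statistics warning should no longer appear
EXPLAIN SELECT * FROM t, t2 WHERE t.id = t2.id AND t.x > 1;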
Range Statistics vs. Sampling
Column statistics are used to estimate group-by and join cardinalities and are almost always beneficial to collect. Range statistics (histograms) are used to estimate filter selectivity, but the optimizer will resort to sampling when they are absent. Range statistics can make query optimization much faster and more accurate:
- Range statistics are more accurate for individual predicates (e.g. x > 1000), since they are collected over all partitions, whereas sampling sends queries to only one partition, potentially leading to very skewed estimates.
- When present, range statistics are much faster to use than sampling during query optimization, since the optimizer merely needs to read the statistics, which were replicated to all aggregators in the cluster by the ANALYZE command. In contrast, sampling requires compiling and running separate queries for each table and per-table predicate set on a leaf, which significantly adds to the latency of query optimization.
However, in some cases sampling can yield more accurate results:
- Range statistics can only estimate predicates of the form <column> =/<>/>/</>=/<= <constant> combined via AND and OR. When range statistics are present for all such simple predicates, predicates involving more complex expressions (like LIKE and IN) are estimated via heuristics which are oblivious to the actual distribution of data in the table.
- Since histograms are one-dimensional, they cannot capture correlation between columns. Suppose column x is uniformly distributed between 0 and 1, and column y = x + 1. If histograms are collected on both x and y, the optimizer will estimate the selectivity of x > 0.5 AND y > 1.5 as P[x > 0.5] * P[y > 1.5] = 0.25. If histograms are absent for x or y, the optimizer will run the query SELECT COUNT(*) FROM t WHERE x > 0.5 AND y > 1.5 and discover that the selectivity is actually 0.5. (A sketch of this scenario follows this list.)
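A minimal sketch of this correlated-columns scenario, assuming a hypothetical table t populated so that x is uniform between 0 and 1 and y = x + 1:

CREATE TABLE t (x DOUBLE, y DOUBLE);
-- ... populate t so that y = x + 1 for every row ...
ANALYZE TABLE t COLUMNS x, y ENABLE;  -- collect histograms on both x and y
-- With histograms present, the optimizer estimates the selectivity of this filter
-- as roughly 0.25; with sampling it would discover the true selectivity of about 0.5.
EXPLAIN SELECT * FROM t WHERE x > 0.5 AND y > 1.5;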
In summary, range statistics are inaccurate in the presence of correlated predicates, whereas sampling is inaccurate in the presence of data skew across partitions.
Range statistics are more expensive to collect than column statistics and are only useful on columns that are involved in single-table equality or range predicates. In addition, the storage cost of range statistics is proportional to the size of the values in the column they’re collected over, unlike column statistics which take constant memory to store per column.
Plan Invalidation
When column statistics are collected on a table, the optimizer will record the row count associated with the statistics on the table into all plans compiled using those statistics. When ANALYZE TABLE is run on the table, all plans which were optimized with significantly different statistics than the current statistics are invalidated, i.e. they will be reoptimized and recompiled the next time they are run, because a new plan may be significantly better after a change in data statistics. Significantly different statistics is defined as having a recorded rowcount less than 50% or greater than 200% of the new rowcount (1). For example, a plan compiled when the table's recorded rowcount was 1,000 is invalidated if a later ANALYZE records more than 2,000 or fewer than 500 rows. Plans optimized without column statistics on a table are always invalidated after ANALYZE TABLE is run on the table (with the exception of when the table is empty).
(1): More precisely, when ANALYZE TABLE is run on the table, all in-memory plans with a recorded rowcount less than 50% or greater than 200% of the new rowcount are immediately invalidated. Plans in the persistent plancache are discarded when loaded if their recorded rowcount is less than 50% or greater than 200% of the rowcount of the table as recorded by the most recent statistics available for it.