Outdated Version

You are viewing an older version of this section. View current production version.

APPROX_COUNT_DISTINCT

Aggregate function. This does a probabilistic count of the number of distinct values in the given column or expression, using a variant of the HyperLogLog algorithm.

Calculating an exact count of distinct values (i.e. the cardinality) of a big table can consume large amounts of memory and time. In many cases, an approximation to within a percent or two is just as useful and can be calculated very efficiently.

Briefly, the algorithm runs each value through a hash function and looks for hash values with lots of leading zeros. The probability of a binary hash value having, say, 20 leading zeros in a row is two to the twentieth power. If you find one of those in your set of hash values, this means there are probably at least 2**20 unique values in your set; roughly a million.

However, trusting that estimate naively can lead to inflated results. If it just happens that a very unlikely hash value appears in a very small set, your estimate will be far too high. To correct for this, HLL makes an estimation of many subsets of the values, then averages those estimates together. In this way, a single unlikely value will not greatly skew the estimate.

Syntax

APPROX_COUNT_DISTINCT(<expression>)

Arguments

  • The argument to be counted can be a column name or a simple expression (eg, col_a - col_b).

Return Type

An integer.

Examples

memsql> select approx_count_distinct(id) as approx, count(distinct id) as exact from my_table;
+-----------+----------+
| approx    | exact    |
+-----------+----------+
|   4982279 |  5000000 |
+-----------+----------+
1 row in set (0.00 sec)