Outdated Version

You are viewing an older version of this section. View current production version.

Distinct Count Estimation Functions

Imagine that you work at an airline that is marketing air travel to families. As a data analyst, you would like to count the number of families who have booked flights. In your total, you count a family if two or more passengers belonging to that family have booked a flight on the airline. The passengers do not need to be traveling on the same flight. Also, if a family has booked more than one flight, you count that family only once.

You run your calculation on the passenger_booking table, which has the columns flight_number, flight_date, passenger_id and family_id. Passengers who belong to the same family share the same family_id. The table does not include bookings for passengers traveling alone.

To find the unique family total, you run the query SELECT COUNT(DISTINCT family_id) AS result FROM passenger_booking.

The code below creates the passenger_booking table and populates it with sample data.

DELIMITER //
CREATE PROCEDURE populate_booking_table(tbl VARCHAR(40), number_of_records INT,
  _flight_date DATE, _family_id INT) AS
BEGIN
  FOR i IN 1..number_of_records LOOP
    /* This example doesn't populate the booking table with a
       flight number or a passenger ID; these values aren't needed to
       demonstrate the distinct count estimate. */    
    INSERT INTO passenger_booking (flight_date, family_id)
      VALUES (_flight_date, _family_id);
  END LOOP;
END
//
DELIMITER ;

/* The following procedure creates the passenger_booking table and populates
   it with sample values for four days. The procedure intentionally populates
   family_id with duplicate values to demonstrate the distinct count estimate
   later. */

DELIMITER //
CREATE PROCEDURE create_sample_data() AS
BEGIN
  CREATE TABLE passenger_booking (flight_number INT, flight_date DATE,
    passenger_id INT, family_id INT);

  FOR family_id IN 1..10000 LOOP
    CALL populate_booking_table('passenger_booking', 1,
      TO_DATE('01/01/2018','MM/DD/YYYY'), family_id);
  END LOOP;
  FOR family_id IN 5001..10000 LOOP
    CALL populate_booking_table('passenger_booking', 2,
      TO_DATE('01/01/2018','MM/DD/YYYY'), family_id);
  END LOOP;
  FOR family_id IN 8001..18000 LOOP
    CALL populate_booking_table('passenger_booking', 2,
      TO_DATE('01/02/2018','MM/DD/YYYY'), family_id);
  END LOOP;
  FOR family_id IN 15001..35000 LOOP
    CALL populate_booking_table('passenger_booking', 3,
      TO_DATE('01/03/2018','MM/DD/YYYY'), family_id);
  END LOOP;
  FOR family_id IN 30001..50000 LOOP
    CALL populate_booking_table('passenger_booking', 2,
      TO_DATE('01/04/2018','MM/DD/YYYY'), family_id);
  END LOOP;
END
//
DELIMITER ;

CALL create_sample_data();

Find family_id's distinct count estimate:

SELECT COUNT(DISTINCT family_id) AS result FROM passenger_booking;

****
+--------+
| result |
+--------+
| 100000 |
+--------+

Getting a Distinct Count Estimate (Method 1)

Calculating the distinct count of a large column can be time-consuming and memory-intensive. To reduce the cost of this calculation, you run an alternative function that gets an estimate of family_id's distinct count:

SELECT APPROX_COUNT_DISTINCT(family_id) AS result FROM passenger_booking;

****
+--------+
| result |
+--------+
| 100498 |
+--------+

APPROX_COUNT_DISTINCT is implemented using the HyperLogLog algorithm.

As shown in the following table, family_id's estimate calculation runs faster and consumes less memory, as compared to the exact calculation.

Query Execution Time (After query compilation) Memory Use (After query compilation)
SELECT COUNT(DISTINCT family_id) FROM passenger_booking 32 775614
SELECT APPROX_COUNT_DISTINCT(family_id) FROM passenger_booking 9 128453

Getting a Distinct Count Estimate (Method 2)

Suppose that every day, you want to estimate the number of unique family_ids for flights booked up to and including that date. Rather than calling APPROX_COUNT_DISTINCT to recalculate the count for all family_id values in the passenger_booking table, you call the functions APPROX_COUNT_DISTINCT_ACCUMULATE and APPROX_COUNT_DISTINCT_ESTIMATE. These are low-level functions that allow you to obtain incremental estimates, consuming less time and memory than APPROX_COUNT_DISTINCT.

In general, you work with the low-level functions as follows: You call APPROX_COUNT_DISTINCT_ACCUMULATE multiple times. On each invocation, you provide a data subset to summarize into a state. You then call APPROX_COUNT_DISTINCT_ESTIMATE to provide an estimate from the collection of input states.

The following code snippets illustrate how to use the low-level functions to generate incremental family_id estimates.

First, create a state table that the low-level functions will use.

CREATE TABLE distinct_family_day_states (flight_date DATE,
  state VARBINARY(16384));
Info

Use the data type VARBINARY(16384) to store states that you use with the low-level APPROX_COUNT_DISTINCT functions.

At the end of the first day, create a state for that day and insert it into the table containing the states:

INSERT INTO distinct_family_day_states(flight_date, state)
  SELECT flight_date, APPROX_COUNT_DISTINCT_ACCUMULATE(family_id)
  FROM passenger_booking
  WHERE flight_date = TO_DATE('01/01/2018', 'MM/DD/YYYY')
  GROUP BY flight_date;

Then get the distinct family estimate for the first day:

SELECT APPROX_COUNT_DISTINCT_ESTIMATE(state)
  FROM distinct_family_day_states;

****
+----------------------------------------+
| APPROX_COUNT_DISTINCT_ESTIMATE(state)  |
+----------------------------------------+
|                                  10009 |
+----------------------------------------+  

At the end of the second day, create a state for that day and insert it into the table containing the states:

INSERT INTO distinct_family_day_states(flight_date, state)
  SELECT flight_date, APPROX_COUNT_DISTINCT_ACCUMULATE(family_id)
  FROM passenger_booking
  WHERE flight_date = TO_DATE('01/02/2018', 'MM/DD/YYYY')
  GROUP BY flight_date;

Then get the distinct family estimate for the first and second day:

SELECT APPROX_COUNT_DISTINCT_ESTIMATE(state)
  FROM distinct_family_day_states;

****
+----------------------------------------+
| APPROX_COUNT_DISTINCT_ESTIMATE(state)  |
+----------------------------------------+
|                                  18227 |
+----------------------------------------+

To find the distinct family estimate for the third and fourth day, repeat the steps above. The results are shown in the table below.

Distinct family estimate for the third day 34856
Distinct family estimate for the fourth day 49933

Getting a Distinct Count Estimate (Method 3): Using APPROX_COUNT_DISTINCT_COMBINE

In the previous section, you found a daily rolling estimate of the number of families who booked flights. Imagine that you want to make this estimate every day for multiple years. You can optimize this calculation by taking the following steps, in order, at the end of every month:

  • Call APPROX_COUNT_DISTINCT_COMBINE to merge the daily states into a monthly state. Insert the monthly state into a monthly states table. The code below shows how to create the monthly state for January 2018. Note that a WHERE clause is not needed in the SELECT statement; the daily states table contains data for one month only.
INSERT INTO distinct_family_month_states
  SELECT 1, 2018, APPROX_COUNT_DISTINCT_COMBINE(state)
  FROM distinct_family_day_states;
  • Truncate the distinct_family_day_states table. Prior to truncating, the table should not contain daily states for any other month.

At the end of each day, insert the states from the distinct_family_day_states and distinct_family_month_states tables into a new temporary result set. Find the distinct family count estimate by running APPROX_COUNT_DISTINCT_ESTIMATE on the states in the result set:

WITH all_states AS (
  SELECT state FROM distinct_family_day_states
  UNION ALL
  SELECT state FROM distinct_family_month_states
)
SELECT APPROX_COUNT_DISTINCT_ESTIMATE(state) FROM all_states;