You are viewing an older version of this section. View current production version.
Distinct Count Estimation Functions
Imagine that you work at an airline that is marketing air travel to families. As a data analyst, you would like to count the number of families who have booked flights. In your total, you count a family if two or more passengers belonging to that family have booked a flight on the airline. The passengers do not need to be traveling on the same flight. Also, if a family has booked more than one flight, you count that family only once.
You run your calculation on the passenger_booking
table, which has the columns flight_number
, flight_date
, passenger_id
and family_id
. Passengers who belong to the same family share the same family_id
. The table does not include bookings for passengers traveling alone.
To find the unique family total, you run the query SELECT COUNT(DISTINCT family_id) AS result FROM passenger_booking
.
The code below creates the passenger_booking
table and populates it with sample data.
DELIMITER //
CREATE PROCEDURE populate_booking_table(tbl VARCHAR(40), number_of_records INT,
_flight_date DATE, _family_id INT) AS
BEGIN
FOR i IN 1..number_of_records LOOP
/* This example doesn't populate the booking table with a
flight number or a passenger ID; these values aren't needed to
demonstrate the distinct count estimate. */
INSERT INTO passenger_booking (flight_date, family_id)
VALUES (_flight_date, _family_id);
END LOOP;
END
//
DELIMITER ;
/* The following procedure creates the passenger_booking table and populates
it with sample values for four days. The procedure intentionally populates
family_id with duplicate values to demonstrate the distinct count estimate
later. */
DELIMITER //
CREATE PROCEDURE create_sample_data() AS
BEGIN
CREATE TABLE passenger_booking (flight_number INT, flight_date DATE,
passenger_id INT, family_id INT);
FOR family_id IN 1..10000 LOOP
CALL populate_booking_table('passenger_booking', 1,
TO_DATE('01/01/2018','MM/DD/YYYY'), family_id);
END LOOP;
FOR family_id IN 5001..10000 LOOP
CALL populate_booking_table('passenger_booking', 2,
TO_DATE('01/01/2018','MM/DD/YYYY'), family_id);
END LOOP;
FOR family_id IN 8001..18000 LOOP
CALL populate_booking_table('passenger_booking', 2,
TO_DATE('01/02/2018','MM/DD/YYYY'), family_id);
END LOOP;
FOR family_id IN 15001..35000 LOOP
CALL populate_booking_table('passenger_booking', 3,
TO_DATE('01/03/2018','MM/DD/YYYY'), family_id);
END LOOP;
FOR family_id IN 30001..50000 LOOP
CALL populate_booking_table('passenger_booking', 2,
TO_DATE('01/04/2018','MM/DD/YYYY'), family_id);
END LOOP;
END
//
DELIMITER ;
CALL create_sample_data();
Find family_id
's distinct count estimate:
SELECT COUNT(DISTINCT family_id) AS result FROM passenger_booking;
****
+--------+
| result |
+--------+
| 100000 |
+--------+
Getting a Distinct Count Estimate (Method 1)
Calculating the distinct count of a large column can be time-consuming and memory-intensive. To reduce the cost of this calculation, you run an alternative function that gets an estimate of family_id
's distinct count:
SELECT APPROX_COUNT_DISTINCT(family_id) AS result FROM passenger_booking;
****
+--------+
| result |
+--------+
| 100498 |
+--------+
APPROX_COUNT_DISTINCT
is implemented using the HyperLogLog algorithm.
As shown in the following table, family_id
's estimate calculation runs faster and consumes less memory, as compared to the exact calculation.
Query | Execution Time (After query compilation) | Memory Use (After query compilation) |
---|---|---|
SELECT COUNT(DISTINCT family_id) FROM passenger_booking |
32 |
775614 |
SELECT APPROX_COUNT_DISTINCT(family_id) FROM passenger_booking |
9 |
128453 |
Getting a Distinct Count Estimate (Method 2)
Suppose that every day, you want to estimate the number of unique family_id
s for flights booked up to and including that date. Rather than calling APPROX_COUNT_DISTINCT
to recalculate the count for all family_id
values in the passenger_booking
table, you call the functions APPROX_COUNT_DISTINCT_ACCUMULATE
and APPROX_COUNT_DISTINCT_ESTIMATE
. These are low-level functions that allow you to obtain incremental estimates, consuming less time and memory than APPROX_COUNT_DISTINCT
.
In general, you work with the low-level functions as follows: You call APPROX_COUNT_DISTINCT_ACCUMULATE
multiple times. On each invocation, you provide a data subset to summarize into a state. You then call APPROX_COUNT_DISTINCT_ESTIMATE
to provide an estimate from the collection of input states.
The following code snippets illustrate how to use the low-level functions to generate incremental family_id
estimates.
First, create a state table that the low-level functions will use.
CREATE TABLE distinct_family_day_states (flight_date DATE,
state VARBINARY(16384));
Use the data type VARBINARY(16384)
to store states that you use with the low-level APPROX_COUNT_DISTINCT
functions.
At the end of the first day, create a state for that day and insert it into the table containing the states:
INSERT INTO distinct_family_day_states(flight_date, state)
SELECT flight_date, APPROX_COUNT_DISTINCT_ACCUMULATE(family_id)
FROM passenger_booking
WHERE flight_date = TO_DATE('01/01/2018', 'MM/DD/YYYY')
GROUP BY flight_date;
Then get the distinct family estimate for the first day:
SELECT APPROX_COUNT_DISTINCT_ESTIMATE(state)
FROM distinct_family_day_states;
****
+----------------------------------------+
| APPROX_COUNT_DISTINCT_ESTIMATE(state) |
+----------------------------------------+
| 10009 |
+----------------------------------------+
At the end of the second day, create a state for that day and insert it into the table containing the states:
INSERT INTO distinct_family_day_states(flight_date, state)
SELECT flight_date, APPROX_COUNT_DISTINCT_ACCUMULATE(family_id)
FROM passenger_booking
WHERE flight_date = TO_DATE('01/02/2018', 'MM/DD/YYYY')
GROUP BY flight_date;
Then get the distinct family estimate for the first and second day:
SELECT APPROX_COUNT_DISTINCT_ESTIMATE(state)
FROM distinct_family_day_states;
****
+----------------------------------------+
| APPROX_COUNT_DISTINCT_ESTIMATE(state) |
+----------------------------------------+
| 18227 |
+----------------------------------------+
To find the distinct family estimate for the third and fourth day, repeat the steps above. The results are shown in the table below.
Distinct family estimate for the third day | 34856 |
Distinct family estimate for the fourth day | 49933 |
Getting a Distinct Count Estimate (Method 3): Using APPROX_COUNT_DISTINCT_COMBINE
In the previous section, you found a daily rolling estimate of the number of families who booked flights. Imagine that you want to make this estimate every day for multiple years. You can optimize this calculation by taking the following steps, in order, at the end of every month:
- Call
APPROX_COUNT_DISTINCT_COMBINE
to merge the daily states into a monthly state. Insert the monthly state into a monthly states table. The code below shows how to create the monthly state for January 2018. Note that aWHERE
clause is not needed in theSELECT
statement; the daily states table contains data for one month only.
INSERT INTO distinct_family_month_states
SELECT 1, 2018, APPROX_COUNT_DISTINCT_COMBINE(state)
FROM distinct_family_day_states;
- Truncate the
distinct_family_day_states
table. Prior to truncating, the table should not contain daily states for any other month.
At the end of each day, insert the states from the distinct_family_day_states
and distinct_family_month_states
tables into a new temporary result set. Find the distinct family count estimate by running APPROX_COUNT_DISTINCT_ESTIMATE
on the states in the result set:
WITH all_states AS (
SELECT state FROM distinct_family_day_states
UNION ALL
SELECT state FROM distinct_family_month_states
)
SELECT APPROX_COUNT_DISTINCT_ESTIMATE(state) FROM all_states;