Outdated Version

You are viewing an older version of this section. View current production version.

PERCENTILE_CONT and Median

In MemSQL, percentile functions are only available as window functions. There are two percentile functions: PERCENTILE_CONT and PERCENTILE_DISC. PERCENTILE_CONT is an inverse distribution function. It assumes a continuous distribution between values of the expression in the sort specification. Then, it interpolates the value of that expression at the given percentile, performing a linear interpolation. For an overview of window functions in MemSQL, see Window Functions Guide.

Syntax

PERCENTILE_CONT(percentile)
WITHIN GROUP (ORDER BY (col | expr))
[OVER (
    [PARTITION BY (col | expr), ...]
    [frame_clause]
)]

or equivalently

PERCENTILE_CONT(percentile)
OVER (
        ORDER BY (col | expr)
        [PARTITION BY {col | expr}, ...]
        [frame_clause]
)

Arguments

percentile

A numeric value between 0 and 1.

col

A column of a numeric data type.

expr

An expression that evaluates to a numeric data type.

Remarks

The OVER() clause defines how to partition the input set. If no PARTITION BY clause is specified, the entire input set is treated as a single partition.

Only one column or expression is allowed in the ORDER BY clause.

Example

The following example creates a table containing SAT scores. The resulting window frame shows the hypothetical 90th percentile SAT score given the existing scores.

CREATE TABLE SAT (reading_writing int, math int);

INSERT INTO SAT VALUES (800,800),(650,300),(700,500);

SELECT reading_writing, math, reading_writing + math AS total,
PERCENTILE_CONT(.9)
WITHIN GROUP (ORDER BY reading_writing + math)
OVER () AS `90th Percentile` FROM SAT;
****
+-----------------+------+-------+-----------------+
| reading_writing | math | total | 90th Percentile |
+-----------------+------+-------+-----------------+
|             650 |  300 |   950 |       1520.0000 |
|             700 |  500 |  1200 |       1520.0000 |
|             800 |  800 |  1600 |       1520.0000 |
+-----------------+------+-------+-----------------+

Median

Compute the median using percentile = 0.5.

PERCENTILE_CONT( 0.5 ) ...

Example:

CREATE TABLE t (a int);
INSERT INTO t VALUES (1),(2),(3),(166),(1024);

SELECT a, AVG(a) OVER () AS mean,
PERCENTILE_CONT(.5) WITHIN GROUP (ORDER BY a)
OVER () AS median FROM t;
****
+------+----------+--------+
| a    | mean     | median |
+------+----------+--------+
|    1 | 239.2000 | 3.0000 |
|    2 | 239.2000 | 3.0000 |
|    3 | 239.2000 | 3.0000 |
|  166 | 239.2000 | 3.0000 |
| 1024 | 239.2000 | 3.0000 |
+------+----------+--------+

Median and Percentiles with Regular Aggregates and Grouping

Suppose you have some data with columns you want to group by and another column you want to aggregate using several aggregate functions, including standard ones like AVG, MIN, MAX etc., as well as percentiles and median. In other words, you want to use percentile functions and median just like regular aggregates. This is not supported directly, but there is a query pattern you can follow to make this possible.

For example, suppose you have the following data:

CREATE TABLE t (id int, category varchar(10), year int, value int);
INSERT t VALUES
  (1, "ABC", 2019, 100),
  (2, "ABC", 2019, 150),
  (3, "ABC", 2019, 200),
  (4, "ABC", 2020, 100),
  (5, "ABC", 2020, 175),
  (6, "ABC", 2020, 250),
  (7, "XYZ", 2020, 200),
  (8, "XYZ", 2020, 275),
  (9, "XYZ", 2020, 350),
  (10, "XYZ", 2020, 450);

Suppose you want to compute the following aggregates of value grouped by category and year for this data: MIN, MAX, AVG, 10th percentile, MEDIAN, and 90th percentile. Logically, the query you want to specify is:

SELECT category, year, SUM(value), AVG(value),
  PERCENTILE_CONT(0.1, value), MEDIAN(value), PERCENTILE_CONT(0.9, value)
FROM t
GROUP BY category, year
ORDER BY category, year;

The above query is not directly supported because PERCENTILE_CONT is not allowed as a standard aggregate function, and there is no MEDIAN aggregate function. However, the following query gives the result intended by the logical query above.

WITH s AS
(
  SELECT category, year, value,
    percentile_cont(0.1) OVER w AS pct10,
    percentile_cont(0.5) OVER w AS median,
    percentile_cont(0.9) OVER w AS pct90
  FROM t
  WINDOW w AS (PARTITION BY category, year ORDER BY value
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
)
SELECT category, year, SUM(value), AVG(value),
  ANY_VALUE(pct10) AS pct10,
  ANY_VALUE(median) AS median,
  ANY_VALUE(pct90) AS pct90
FROM s
GROUP BY category, year
ORDER BY category, year;
****
+----------+------+------------+------------+----------+----------+----------+
| category | year | SUM(value) | AVG(value) | pct10    | median   | pct90    |
+----------+------+------------+------------+----------+----------+----------+
| ABC      | 2019 |        450 |   150.0000 | 110.0000 | 150.0000 | 190.0000 |
| ABC      | 2020 |        525 |   175.0000 | 115.0000 | 175.0000 | 235.0000 |
| XYZ      | 2020 |       1275 |   318.7500 | 222.5000 | 312.5000 | 420.0000 |
+----------+------+------------+------------+----------+----------+----------+

This query gets the desired result by first creating a rowset that has one row for each input row that includes the columns of the input row as well as the 10th percentile, MEDIAN value calculated using a 50th percentile, and 90th percentile. Then it groups by id and category, peforms the standard aggregates (SUM, AVG) as usual, and uses the ANY_VALUE aggregate to pick any one of the values for pct10, median, and pct90. Since thay are all the same within each group, it doesn’t matter which one is selected, so ANY_VALUE is appropriate.