sql – Simple way to calculate median with MySQL

The Question :

218 people think this question is useful

What’s the simplest (and hopefully not too slow) way to calculate the median with MySQL? I’ve used AVG(x) for finding the mean, but I’m having a hard time finding a simple way of calculating the median. For now, I’m returning all the rows to PHP, doing a sort, and then picking the middle row, but surely there must be some simple way of doing it in a single MySQL query.

Example data:

id | val
--------
 1    4
 2    7
 3    2
 4    2
 5    9
 6    8
 7    3

Sorting on val gives 2 2 3 4 7 8 9, so the median should be 4, versus SELECT AVG(val) which == 5.

The Question Comments :
  • am I the only one nauseated by the fact that MySQL doesn’t have a function to calculate a median? Ridiculous.
  • MariaDB since version 10.3 has one, see mariadb.com/kb/en/library/median

The Answer 1

235 people think this answer is useful

In MariaDB / MySQL:

SELECT AVG(dd.val) as median_val
FROM (
SELECT d.val, @rownum:=@rownum+1 as `row_number`, @total_rows:=@rownum
  FROM data d, (SELECT @rownum:=0) r
  WHERE d.val is NOT NULL
  -- put some where clause here
  ORDER BY d.val
) as dd
WHERE dd.row_number IN ( FLOOR((@total_rows+1)/2), FLOOR((@total_rows+2)/2) );

Steve Cohen points out, that after the first pass, @rownum will contain the total number of rows. This can be used to determine the median, so no second pass or join is needed.

Also AVG(dd.val) and dd.row_number IN(...) is used to correctly produce a median when there are an even number of records. Reasoning:

SELECT FLOOR((3+1)/2),FLOOR((3+2)/2); -- when total_rows is 3, avg rows 2 and 2
SELECT FLOOR((4+1)/2),FLOOR((4+2)/2); -- when total_rows is 4, avg rows 2 and 3

Finally, MariaDB 10.3.3+ contains a MEDIAN function

The Answer 2

67 people think this answer is useful

I just found another answer online in the comments:

For medians in almost any SQL:

SELECT x.val from data x, data y
GROUP BY x.val
HAVING SUM(SIGN(1-SIGN(y.val-x.val))) = (COUNT(*)+1)/2

Make sure your columns are well indexed and the index is used for filtering and sorting. Verify with the explain plans.

select count(*) from table --find the number of rows

Calculate the “median” row number. Maybe use: median_row = floor(count / 2).

Then pick it out of the list:

select val from table order by val asc limit median_row,1

This should return you one row with just the value you want.

Jacob

The Answer 3

34 people think this answer is useful

I found the accepted solution didn’t work on my MySQL install, returning an empty set, but this query worked for me in all situations that I tested it on:

SELECT x.val from data x, data y
GROUP BY x.val
HAVING SUM(SIGN(1-SIGN(y.val-x.val)))/COUNT(*) > .5
LIMIT 1

The Answer 4

26 people think this answer is useful

Unfortunately, neither TheJacobTaylor’s nor velcrow’s answers return accurate results for current versions of MySQL.

Velcro’s answer from above is close, but it does not calculate correctly for result sets with an even number of rows. Medians are defined as either 1) the middle number on odd numbered sets, or 2) the average of the two middle numbers on even number sets.

So, here’s velcro’s solution patched to handle both odd and even number sets:

SELECT AVG(middle_values) AS 'median' FROM (
  SELECT t1.median_column AS 'middle_values' FROM
    (
      SELECT @row:=@row+1 as `row`, x.median_column
      FROM median_table AS x, (SELECT @row:=0) AS r
      WHERE 1
      -- put some where clause here
      ORDER BY x.median_column
    ) AS t1,
    (
      SELECT COUNT(*) as 'count'
      FROM median_table x
      WHERE 1
      -- put same where clause here
    ) AS t2
    -- the following condition will return 1 record for odd number sets, or 2 records for even number sets.
    WHERE t1.row >= t2.count/2 and t1.row <= ((t2.count/2) +1)) AS t3;

To use this, follow these 3 easy steps:

  1. Replace “median_table” (2 occurrences) in the above code with the name of your table
  2. Replace “median_column” (3 occurrences) with the column name you’d like to find a median for
  3. If you have a WHERE condition, replace “WHERE 1” (2 occurrences) with your where condition

The Answer 5

12 people think this answer is useful

I propose a faster way.

Get the row count:

SELECT CEIL(COUNT(*)/2) FROM data;

Then take the middle value in a sorted subquery:

SELECT max(val) FROM (SELECT val FROM data ORDER BY val limit @middlevalue) x;

I tested this with a 5x10e6 dataset of random numbers and it will find the median in under 10 seconds.

The Answer 6

9 people think this answer is useful

Install and use this mysql statistical functions: http://www.xarg.org/2012/07/statistical-functions-in-mysql/

After that, calculate median is easy:

SELECT median(val) FROM data;

The Answer 7

8 people think this answer is useful

A comment on this page in the MySQL documentation has the following suggestion:

-- (mostly) High Performance scaling MEDIAN function per group
-- Median defined in http://en.wikipedia.org/wiki/Median
--
-- by Peter Hlavac
-- 06.11.2008
--
-- Example Table:

DROP table if exists table_median;
CREATE TABLE table_median (id INTEGER(11),val INTEGER(11));
COMMIT;


INSERT INTO table_median (id, val) VALUES
(1, 7), (1, 4), (1, 5), (1, 1), (1, 8), (1, 3), (1, 6),
(2, 4),
(3, 5), (3, 2),
(4, 5), (4, 12), (4, 1), (4, 7);



-- Calculating the MEDIAN
SELECT @a := 0;
SELECT
id,
AVG(val) AS MEDIAN
FROM (
SELECT
id,
val
FROM (
SELECT
-- Create an index n for every id
@a := (@a + 1) mod o.c AS shifted_n,
IF(@a mod o.c=0, o.c, @a) AS n,
o.id,
o.val,
-- the number of elements for every id
o.c
FROM (
SELECT
t_o.id,
val,
c
FROM
table_median t_o INNER JOIN
(SELECT
id,
COUNT(1) AS c
FROM
table_median
GROUP BY
id
) t2
ON (t2.id = t_o.id)
ORDER BY
t_o.id,val
) o
) a
WHERE
IF(
-- if there is an even number of elements
-- take the lower and the upper median
-- and use AVG(lower,upper)
c MOD 2 = 0,
n = c DIV 2 OR n = (c DIV 2)+1,

-- if its an odd number of elements
-- take the first if its only one element
-- or take the one in the middle
IF(
c = 1,
n = 1,
n = c DIV 2 + 1
)
)
) a
GROUP BY
id;

-- Explanation:
-- The Statement creates a helper table like
--
-- n id val count
-- ----------------
-- 1, 1, 1, 7
-- 2, 1, 3, 7
-- 3, 1, 4, 7
-- 4, 1, 5, 7
-- 5, 1, 6, 7
-- 6, 1, 7, 7
-- 7, 1, 8, 7
--
-- 1, 2, 4, 1

-- 1, 3, 2, 2
-- 2, 3, 5, 2
--
-- 1, 4, 1, 4
-- 2, 4, 5, 4
-- 3, 4, 7, 4
-- 4, 4, 12, 4


-- from there we can select the n-th element on the position: count div 2 + 1 

The Answer 8

6 people think this answer is useful

Most of the solutions above work only for one field of the table, you might need to get the median (50th percentile) for many fields on the query.

I use this:

SELECT CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(
 GROUP_CONCAT(field_name ORDER BY field_name SEPARATOR ','),
  ',', 50/100 * COUNT(*) + 1), ',', -1) AS DECIMAL) AS `Median`
FROM table_name;

You can replace the “50” in example above to any percentile, is very efficient.

Just make sure you have enough memory for the GROUP_CONCAT, you can change it with:

SET group_concat_max_len = 10485760; #10MB max length

More details: http://web.performancerasta.com/metrics-tips-calculating-95th-99th-or-any-percentile-with-single-mysql-query/

The Answer 9

6 people think this answer is useful

I have this below code which I found on HackerRank and it is pretty simple and works in each and every case.

SELECT M.MEDIAN_COL FROM MEDIAN_TABLE M WHERE  
  (SELECT COUNT(MEDIAN_COL) FROM MEDIAN_TABLE WHERE MEDIAN_COL < M.MEDIAN_COL ) = 
  (SELECT COUNT(MEDIAN_COL) FROM MEDIAN_TABLE WHERE MEDIAN_COL > M.MEDIAN_COL );

The Answer 10

4 people think this answer is useful

Building off of velcro’s answer, for those of you having to do a median off of something that is grouped by another parameter:

SELECT grp_field, t1.val FROM (
   SELECT grp_field, @rownum:=IF(@s = grp_field, @rownum + 1, 0) AS row_number,
   @s:=IF(@s = grp_field, @s, grp_field) AS sec, d.val
  FROM data d,  (SELECT @rownum:=0, @s:=0) r
  ORDER BY grp_field, d.val
) as t1 JOIN (
  SELECT grp_field, count(*) as total_rows
  FROM data d
  GROUP BY grp_field
) as t2
ON t1.grp_field = t2.grp_field
WHERE t1.row_number=floor(total_rows/2)+1;

The Answer 11

3 people think this answer is useful

You could use the user-defined function that’s found here.

The Answer 12

3 people think this answer is useful

Takes care about an odd value count – gives the avg of the two values in the middle in that case.

SELECT AVG(val) FROM
  ( SELECT x.id, x.val from data x, data y
      GROUP BY x.id, x.val
      HAVING SUM(SIGN(1-SIGN(IF(y.val-x.val=0 AND x.id != y.id, SIGN(x.id-y.id), y.val-x.val)))) IN (ROUND((COUNT(*))/2), ROUND((COUNT(*)+1)/2))
  ) sq

The Answer 13

2 people think this answer is useful

If MySQL has ROW_NUMBER, then the MEDIAN is (be inspired by this SQL Server query):

WITH Numbered AS 
(
SELECT *, COUNT(*) OVER () AS Cnt,
    ROW_NUMBER() OVER (ORDER BY val) AS RowNum
FROM yourtable
)
SELECT id, val
FROM Numbered
WHERE RowNum IN ((Cnt+1)/2, (Cnt+2)/2)
;

The IN is used in case you have an even number of entries.

If you want to find the median per group, then just PARTITION BY group in your OVER clauses.

Rob

The Answer 14

2 people think this answer is useful

My code, efficient without tables or additional variables:

SELECT
((SUBSTRING_INDEX(SUBSTRING_INDEX(group_concat(val order by val), ',', floor(1+((count(val)-1) / 2))), ',', -1))
+
(SUBSTRING_INDEX(SUBSTRING_INDEX(group_concat(val order by val), ',', ceiling(1+((count(val)-1) / 2))), ',', -1)))/2
as median
FROM table;

The Answer 15

2 people think this answer is useful

Optionally, you could also do this in a stored procedure:

DROP PROCEDURE IF EXISTS median;
DELIMITER //
CREATE PROCEDURE median (table_name VARCHAR(255), column_name VARCHAR(255), where_clause VARCHAR(255))
BEGIN
  -- Set default parameters
  IF where_clause IS NULL OR where_clause = '' THEN
    SET where_clause = 1;
  END IF;

  -- Prepare statement
  SET @sql = CONCAT(
    "SELECT AVG(middle_values) AS 'median' FROM (
      SELECT t1.", column_name, " AS 'middle_values' FROM
        (
          SELECT @row:=@row+1 as `row`, x.", column_name, "
          FROM ", table_name," AS x, (SELECT @row:=0) AS r
          WHERE ", where_clause, " ORDER BY x.", column_name, "
        ) AS t1,
        (
          SELECT COUNT(*) as 'count'
          FROM ", table_name, " x
          WHERE ", where_clause, "
        ) AS t2
        -- the following condition will return 1 record for odd number sets, or 2 records for even number sets.
        WHERE t1.row >= t2.count/2
          AND t1.row <= ((t2.count/2)+1)) AS t3
    ");

  -- Execute statement
  PREPARE stmt FROM @sql;
  EXECUTE stmt;
END//
DELIMITER ;


-- Sample usage:
-- median(table_name, column_name, where_condition);
CALL median('products', 'price', NULL);

The Answer 16

2 people think this answer is useful

My solution presented below works in just one query without creation of table, variable or even sub-query. Plus, it allows you to get median for each group in group-by queries (this is what i needed !):

SELECT `columnA`, 
SUBSTRING_INDEX(SUBSTRING_INDEX(GROUP_CONCAT(`columnB` ORDER BY `columnB`), ',', CEILING((COUNT(`columnB`)/2))), ',', -1) medianOfColumnB
FROM `tableC`
-- some where clause if you want
GROUP BY `columnA`;

It works because of a smart use of group_concat and substring_index.

But, to allow big group_concat, you have to set group_concat_max_len to a higher value (1024 char by default). You can set it like that (for current sql session) :

SET SESSION group_concat_max_len = 10000; 
-- up to 4294967295 in 32-bits platform.

More infos for group_concat_max_len: https://dev.mysql.com/doc/refman/5.1/en/server-system-variables.html#sysvar_group_concat_max_len

The Answer 17

2 people think this answer is useful

Another riff on Velcrow’s answer, but uses a single intermediate table and takes advantage of the variable used for row numbering to get the count, rather than performing an extra query to calculate it. Also starts the count so that the first row is row 0 to allow simply using Floor and Ceil to select the median row(s).

SELECT Avg(tmp.val) as median_val
    FROM (SELECT inTab.val, @rows := @rows + 1 as rowNum
              FROM data as inTab,  (SELECT @rows := -1) as init
              -- Replace with better where clause or delete
              WHERE 2 > 1
              ORDER BY inTab.val) as tmp
    WHERE tmp.rowNum in (Floor(@rows / 2), Ceil(@rows / 2));

The Answer 18

2 people think this answer is useful
SELECT 
    SUBSTRING_INDEX(
        SUBSTRING_INDEX(
            GROUP_CONCAT(field ORDER BY field),
            ',',
            ((
                ROUND(
                    LENGTH(GROUP_CONCAT(field)) - 
                    LENGTH(
                        REPLACE(
                            GROUP_CONCAT(field),
                            ',',
                            ''
                        )
                    )
                ) / 2) + 1
            )),
            ',',
            -1
        )
FROM
    table

The above seems to work for me.

The Answer 19

2 people think this answer is useful

Single query to archive the perfect median:

SELECT 
COUNT(*) as total_rows, 
IF(count(*)%2 = 1, CAST(SUBSTRING_INDEX(SUBSTRING_INDEX( GROUP_CONCAT(val ORDER BY val SEPARATOR ','), ',', 50/100 * COUNT(*)), ',', -1) AS DECIMAL), ROUND((CAST(SUBSTRING_INDEX(SUBSTRING_INDEX( GROUP_CONCAT(val ORDER BY val SEPARATOR ','), ',', 50/100 * COUNT(*) + 1), ',', -1) AS DECIMAL) + CAST(SUBSTRING_INDEX(SUBSTRING_INDEX( GROUP_CONCAT(val ORDER BY val SEPARATOR ','), ',', 50/100 * COUNT(*)), ',', -1) AS DECIMAL)) / 2)) as median, 
AVG(val) as average 
FROM 
data

The Answer 20

1 people think this answer is useful

I used a two query approach:

  • first one to get count, min, max and avg
  • second one (prepared statement) with a “LIMIT @count/2, 1” and “ORDER BY ..” clauses to get the median value

These are wrapped in a function defn, so all values can be returned from one call.

If your ranges are static and your data does not change often, it might be more efficient to precompute/store these values and use the stored values instead of querying from scratch every time.

The Answer 21

1 people think this answer is useful

as i just needed a median AND percentile solution, I made a simple and quite flexible function based on the findings in this thread. I know that I am happy myself if I find “readymade” functions that are easy to include in my projects, so I decided to quickly share:

function mysql_percentile($table, $column, $where, $percentile = 0.5) {

    $sql = "
            SELECT `t1`.`".$column."` as `percentile` FROM (
            SELECT @rownum:=@rownum+1 as `row_number`, `d`.`".$column."`
              FROM `".$table."` `d`,  (SELECT @rownum:=0) `r`
              ".$where."
              ORDER BY `d`.`".$column."`
            ) as `t1`, 
            (
              SELECT count(*) as `total_rows`
              FROM `".$table."` `d`
              ".$where."
            ) as `t2`
            WHERE 1
            AND `t1`.`row_number`=floor(`total_rows` * ".$percentile.")+1;
        ";

    $result = sql($sql, 1);

    if (!empty($result)) {
        return $result['percentile'];       
    } else {
        return 0;
    }

}

Usage is very easy, example from my current project:

...
$table = DBPRE."zip_".$slug;
$column = 'seconds';
$where = "WHERE `reached` = '1' AND `time` >= '".$start_time."'";

    $reaching['median'] = mysql_percentile($table, $column, $where, 0.5);
    $reaching['percentile25'] = mysql_percentile($table, $column, $where, 0.25);
    $reaching['percentile75'] = mysql_percentile($table, $column, $where, 0.75);
...

The Answer 22

1 people think this answer is useful

Here is my way . Of course, you could put it into a procedure 🙂

SET @median_counter = (SELECT FLOOR(COUNT(*)/2) - 1 AS `median_counter` FROM `data`);

SET @median = CONCAT('SELECT `val` FROM `data` ORDER BY `val` LIMIT ', @median_counter, ', 1');

PREPARE median FROM @median;

EXECUTE median;

You could avoid the variable @median_counter, if you substitude it:

SET @median = CONCAT( 'SELECT `val` FROM `data` ORDER BY `val` LIMIT ',
                      (SELECT FLOOR(COUNT(*)/2) - 1 AS `median_counter` FROM `data`),
                      ', 1'
                    );

PREPARE median FROM @median;

EXECUTE median;

The Answer 23

1 people think this answer is useful

This way seems include both even and odd count without subquery.

SELECT AVG(t1.x)
FROM table t1, table t2
GROUP BY t1.x
HAVING SUM(SIGN(t1.x - t2.x)) = 0

The Answer 24

1 people think this answer is useful

Based on @bob’s answer, this generalizes the query to have the ability to return multiple medians, grouped by some criteria.

Think, e.g., median sale price for used cars in a car lot, grouped by year-month.

SELECT 
    period, 
    AVG(middle_values) AS 'median' 
FROM (
    SELECT t1.sale_price AS 'middle_values', t1.row_num, t1.period, t2.count
    FROM (
        SELECT 
            @last_period:=@period AS 'last_period',
            @period:=DATE_FORMAT(sale_date, '%Y-%m') AS 'period',
            IF (@period<>@last_period, @row:=1, @row:=@row+1) as `row_num`, 
            x.sale_price
          FROM listings AS x, (SELECT @row:=0) AS r
          WHERE 1
            -- where criteria goes here
          ORDER BY DATE_FORMAT(sale_date, '%Y%m'), x.sale_price
        ) AS t1
    LEFT JOIN (  
          SELECT COUNT(*) as 'count', DATE_FORMAT(sale_date, '%Y-%m') AS 'period'
          FROM listings x
          WHERE 1
            -- same where criteria goes here
          GROUP BY DATE_FORMAT(sale_date, '%Y%m')
        ) AS t2
        ON t1.period = t2.period
    ) AS t3
WHERE 
    row_num >= (count/2) 
    AND row_num <= ((count/2) + 1)
GROUP BY t3.period
ORDER BY t3.period;

The Answer 25

1 people think this answer is useful

Often, we may need to calculate Median not just for the whole table, but for aggregates with respect to our ID. In other words, calculate median for each ID in our table, where each ID has many records. (good performance and works in many SQL + fixes problem of even and odds, more about performance of different Median-methods https://sqlperformance.com/2012/08/t-sql-queries/median )

SELECT our_id, AVG(1.0 * our_val) as Median
FROM
( SELECT our_id, our_val, 
  COUNT(*) OVER (PARTITION BY our_id) AS cnt,
  ROW_NUMBER() OVER (PARTITION BY our_id ORDER BY our_val) AS rn
  FROM our_table
) AS x
WHERE rn IN ((cnt + 1)/2, (cnt + 2)/2) GROUP BY our_id;

Hope it helps

The Answer 26

1 people think this answer is useful

MySQL has supported window functions since version 8.0, you can use ROW_NUMBER or DENSE_RANK (DO NOT use RANK as it assigns the same rank to same values, like in sports ranking):

SELECT AVG(t1.val) AS median_val
  FROM (SELECT val, 
               ROW_NUMBER() OVER(ORDER BY val) AS rownum
          FROM data) t1,
       (SELECT COUNT(*) AS num_records FROM data) t2
 WHERE t1.row_num IN
       (FLOOR((t2.num_records + 1) / 2), 
        FLOOR((t2.num_records + 2) / 2));

The Answer 27

1 people think this answer is useful

A simple way to calculate Median in MySQL

set @ct := (select count(1) from station);
set @row := 0;

select avg(a.val) as median from 
(select * from  table order by val) a
where (select @row := @row + 1)
between @ct/2.0 and @ct/2.0 +1;

The Answer 28

0 people think this answer is useful

After reading all previous ones they didn’t match with my actual requirement so I implemented my own one which doesn’t need any procedure or complicate statements, just I GROUP_CONCAT all values from the column I wanted to obtain the MEDIAN and applying a COUNT DIV BY 2 I extract the value in from the middle of the list like the following query does :

(POS is the name of the column I want to get its median)

(query) SELECT
SUBSTRING_INDEX ( 
   SUBSTRING_INDEX ( 
       GROUP_CONCAT(pos ORDER BY CAST(pos AS SIGNED INTEGER) desc SEPARATOR ';') 
    , ';', COUNT(*)/2 ) 
, ';', -1 ) AS `pos_med`
FROM table_name
GROUP BY any_criterial

I hope this could be useful for someone in the way many of other comments were for me from this website.

The Answer 29

0 people think this answer is useful

Knowing exact row count you can use this query:

SELECT <value> AS VAL FROM <table> ORDER BY VAL LIMIT 1 OFFSET <half>

Where <half> = ceiling(<size> / 2.0) - 1

The Answer 30

0 people think this answer is useful

I have a database containing about 1 billion rows that we require to determine the median age in the set. Sorting a billion rows is hard, but if you aggregate the distinct values that can be found (ages range from 0 to 100), you can sort THIS list, and use some arithmetic magic to find any percentile you want as follows:

with rawData(count_value) as
(
    select p.YEAR_OF_BIRTH
        from dbo.PERSON p
),
overallStats (avg_value, stdev_value, min_value, max_value, total) as
(
  select avg(1.0 * count_value) as avg_value,
    stdev(count_value) as stdev_value,
    min(count_value) as min_value,
    max(count_value) as max_value,
    count(*) as total
  from rawData
),
aggData (count_value, total, accumulated) as
(
  select count_value, 
    count(*) as total, 
        SUM(count(*)) OVER (ORDER BY count_value ROWS UNBOUNDED PRECEDING) as accumulated
  FROM rawData
  group by count_value
)
select o.total as count_value,
  o.min_value,
    o.max_value,
    o.avg_value,
    o.stdev_value,
    MIN(case when d.accumulated >= .50 * o.total then count_value else o.max_value end) as median_value,
    MIN(case when d.accumulated >= .10 * o.total then count_value else o.max_value end) as p10_value,
    MIN(case when d.accumulated >= .25 * o.total then count_value else o.max_value end) as p25_value,
    MIN(case when d.accumulated >= .75 * o.total then count_value else o.max_value end) as p75_value,
    MIN(case when d.accumulated >= .90 * o.total then count_value else o.max_value end) as p90_value
from aggData d
cross apply overallStats o
GROUP BY o.total, o.min_value, o.max_value, o.avg_value, o.stdev_value
;

This query depends on your db supporting window functions (including ROWS UNBOUNDED PRECEDING) but if you do not have that it is a simple matter to join aggData CTE with itself and aggregate all prior totals into the ‘accumulated’ column which is used to determine which value contains the specified precentile. The above sample calcuates p10, p25, p50 (median), p75, and p90.

-Chris

Add a Comment