Tuesday, March 28, 2006

SQL: Calculating the geometric mean (GEOMEAN)

I had a user at the last place I worked who needed to calculate the geometric mean of a column in a table. The geometric mean is like a regular mean, but has the practical effect of throwing out outliers (unusual spikes in the data). This is a standard function in Excel, but not in most databases.

The standard method of calculating the geometric mean is by multiplying all of the terms together, then taking the n-th root of the product, where n is the number of terms. So if you had 5, 7, 55, 6, and 3 as terms, you would take the fifth root of 34,650, giving you 8.0899. Notice how the final result is not very much affected by the outlier (55).

This technique won't work in most databases for non-trivial datasets. Why? Because if you take 100 terms and multiply them, it will likely overflow even the largest column type. In fact, calculating the geometric mean this way with a calculator can quickly become impossible for large data sets.

There is a solution. If you take advantage of the magic of logarithms, you can calculate the geometric mean for very large data sets without overloading a standard relational database data type.

For a given set of terms, sum the log (I will use the natural log) of all terms. Then divide this sum by the number of terms. Finally, take the anti-log of the result (since I use a natural log, this means raising e to the result). Using our example:

ln( 5) =  1.6094
ln( 7) =  1.9459
ln(55) =  4.0073
ln( 6) =  1.7918
ln( 3) =  1.0986
         -------
         10.4531

10.4531/5 = 2.0906

e^2.0906 = 8.0899

which is what we got using the standard technique. This can be translated into SQL this way:

select exp(sum(log(sample)) / count(*))
  from sample_table

-- or, depending on your flavor of SQL...
select exp(sum(ln(sample)) / count(*))
  from sample_table
UPDATE: Isabelle pointed out that there is a neater way to do this. Since the sum(FOO)/count(*) is an average, the expression can be simplified to this:

SQL Server

select exp(avg(log(sample)))
  from sample_table

Oracle, PostGreSQL, MySQL

select exp(avg(ln(sample)))
  from sample_table

Nicely figured out, Isabelle.

As far as I can tell, this will work in Oracle, SQL Server, and Access, and I can't think of why it would not work in MySQL or PosgreSQL.

5 comments:

  1. I came across this blog when looking for a way of doing geometric mean in PostgreSQL. The query itself works nicely, but breaks horribly if the sample happens to be zero or negative, as log-functions are not defined for <= 0.0. But if such values are irrelevant (as may be the case with for example durations), then they can be fairly easily "ignored" by replacing the invalid values by NULL:

    select exp(avg(case when sample > 0.0 then ln(sample) else NULL end))
    from sample_table;

    Note that the above works only with PostgreSQL (as far as I know).

    ReplyDelete
  2. What I don't understand, is why Oracle returns non-whole number results for a single value geo mean.

    SELECT exp(avg(ln(160)))
    from dual;

    ReplyDelete
    Replies
    1. I get 160.000000000000000000000000000000000001 when I run it in SQL Developer. However, a coworker ran it in Toad and it returned 160. The difference is puzzling - practically it doesn't make an impact on our business case. But it does cause one to wonder why...

      Delete
    2. My guess is that this is a rounding error. Gotta be careful-- that's going to cause your books to be 1 penny off in Q3 213,000 AD!
      Thanks for the post, though! ;-)

      Delete

I moderate comments blog posts over 14 days old. This keeps a lot of spam away. I generally am all right about moderating. Thanks for understanding.