SQL: Calculating the geometric mean (GEOMEAN)
I had a user at the last place I worked who needed to calculate the geometric mean of a column in a table. The geometric mean is like a regular mean, but has the practical effect of throwing out outliers (unusual spikes in the data). This is a standard function in Excel, but not in most databases.
The standard method of calculating the geometric mean is by multiplying all of the terms together, then taking the n-th root of the product, where n is the number of terms. So if you had 5, 7, 55, 6, and 3 as terms, you would take the fifth root of 34,650, giving you 8.0899. Notice how the final result is not very much affected by the outlier (55).
This technique won't work in most databases for non-trivial datasets. Why? Because if you take 100 terms and multiply them, it will likely overflow even the largest column type. In fact, calculating the geometric mean this way with a calculator can quickly become impossible for large data sets.
There is a solution. If you take advantage of the magic of logarithms, you can calculate the geometric mean for very large data sets without overloading a standard relational database data type.
For a given set of terms, sum the log (I will use the natural log) of all terms. Then divide this sum by the number of terms. Finally, take the anti-log of the result (since I use a natural log, this means raising e to the result). Using our example:
ln( 5) = 1.6094
ln( 7) = 1.9459
ln(55) = 4.0073
ln( 6) = 1.7918
ln( 3) = 1.0986
-------
10.4531
10.4531/5 = 2.0906
e^2.0906 = 8.0899
which is what we got using the standard technique. This can be translated into SQL this way:
select exp(sum(log(sample)) / count(*))
from sample_table
-- or, depending on your flavor of SQL...
select exp(sum(ln(sample)) / count(*))
from sample_table
UPDATE: Isabelle pointed out that there is a neater way to do this. Since the sum(FOO)/count(*) is an average, the expression can be simplified to this:SQL Server
select exp(avg(log(sample)))
from sample_table
Oracle, PostGreSQL, MySQL
select exp(avg(ln(sample)))
from sample_table
Nicely figured out, Isabelle.
As far as I can tell, this will work in Oracle, SQL Server, and Access, and I can't think of why it would not work in MySQL or PosgreSQL.