Wednesday, December 24, 2014

When do people worry about hurricanes?


I recently read an article in the December issue of Significance titled “Does Christmas really come earlier every year?” by Nathan Cunningham of University College Dublin.  His premise was that, by using cluster analysis of Google Trends data, we can see how people have begun thinking about the holidays earlier and earlier each year.  It’s a good read: http://www.statslife.org.uk/significance/1892.  I should note that Nathan graciously answered my emails asking for clarification and saw real value in this technique for emergency management work.


I decided to replicate his results using a FEMA-related search term: “hurricane”. 

Google Trends

Google Trends (http://www.google.com/trends/) allows you to view the relative volume of searches on particular terms.  The units are not raw counts or a percentage of all Google searches: scores are scaled from 0 to 100 relative to the highest point in the selected period, so 100 marks the peak search interest for the chosen terms.  For example, the week that Hurricane Katrina made landfall, “hurricane” scored almost 100, which means search interest in the term was close to its peak for the period.  If you sign in with your Google ID, you can also download the data to CSV.  Cunningham used Google Trends to analyze search volumes on holiday-related terms (“Christmas”, “Santa Claus”, etc).  Here I’ve compared the search terms “hurricane” and “tornado”.  You can see that there is a somewhat repetitive pattern of increase mid-year.  I wanted to explore this pattern.
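As a rough sketch of how the downloaded CSV might be turned into the gtrends data frame used in the R code later in this post: the column names, the date-range format, and the numbers below are assumptions for illustration (a few made-up inline rows stand in for the real download), not the actual Google Trends export.

```r
# Hypothetical rows standing in for a Google Trends weekly export.
# The layout ("Week" as a date range, one column per search term) is assumed.
csv_text <- "Week,hurricane
2012-10-14 - 2012-10-20,5
2012-10-21 - 2012-10-27,21
2012-10-28 - 2012-11-03,48
2012-11-04 - 2012-11-10,9"

raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Keep the first date of each range and parse it
week_start <- as.Date(substr(raw$Week, 1, 10))

# Build the year / week / score columns that the clustering code expects
gtrends <- data.frame(
  year      = as.integer(format(week_start, "%Y")),
  week      = as.integer(format(week_start, "%U")),  # week of year, Sunday start
  hurricane = raw$hurricane
)
print(gtrends)
```

With a real download you would pass the file path to read.csv instead of text, and you may need to skip any header lines that precede the column names.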

Cluster Analysis

Cluster Analysis looks at data and organizes it into groups that share similarities.  Once Cunningham had each year’s data, he used cluster analysis to determine in which week of the year the volume of holiday-related searches began to increase.  Similar analysis can be done on FEMA-related search terms; a cluster analysis of the Google Trends data for the search term “hurricane” reveals continuous periods of increased interest for the following weeks from 2004-2014.  This was simple to implement using R (see code below).  The accompanying graphic shows the “shape” of the cluster; the x-axis is the week number of the year, and the y-axis is the Google Trends score for the term “hurricane”.  In hindsight, it is possible to find explanations for these clusters; for example, 2005 and 2012 had periods of exceptionally high interest corresponding to the hurricane activity of those years.  2009 and 2013 had little activity (look at the y-axis), corresponding to light hurricane years.
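One way to turn a clustering like this into a "when does interest start?" answer is to look for the first week assigned to the high-interest cluster.  The sketch below is only an illustration under assumptions: the weekly scores are synthetic, and it swaps in base R's kmeans on the score alone in place of the mclust model used for the results in this post.

```r
# Synthetic weekly scores for one made-up year: low baseline, raised in weeks 30-40
set.seed(1)
week  <- 1:52
score <- ifelse(week >= 30 & week <= 40, 60, 5) + runif(52, 0, 3)

# Two clusters on the score only (a simplification; not the mclust fit)
km <- kmeans(score, centers = 2, nstart = 10)

# Identify which cluster label has the higher mean score
high <- which.max(km$centers[, 1])

# Weeks that fall in the high-interest cluster, and the first of them
high_weeks <- week[km$cluster == high]
first_week <- min(high_weeks)
print(first_week)
```

Clustering on the score alone ignores the week number, so it cannot tell one long period of interest from several short ones; that is part of what the two-column mclust fit is for.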

Further Investigation

This simple example shows how cluster analysis can illustrate the behavior of data that have more than one pattern.  This could find application in data that vary from Region to Region or JFO to JFO, or that change with disaster type.

Although Cunningham used cluster analysis to look at Google Trends data, it is easy to see that the data returned also lend themselves to Time Series Analysis.
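As a minimal sketch of that idea, weekly scores can be treated as a time series with a 52-week season and split into seasonal, trend, and remainder parts.  The data below are synthetic (a made-up mid-year bump plus noise), and using stl from base R's stats package is my assumption about one reasonable approach, not something taken from Cunningham's article.

```r
# Synthetic weekly "interest" for 6 made-up years: a bump around week 35 plus noise
set.seed(42)
n_years <- 6
week    <- rep(1:52, times = n_years)
score   <- 5 + 40 * exp(-((week - 35)^2) / 30) + rnorm(length(week), sd = 2)

# Weekly time series with a yearly (52-week) season
interest <- ts(score, start = c(2008, 1), frequency = 52)

# Seasonal-trend decomposition using loess (stats::stl)
fit <- stl(interest, s.window = "periodic")

# The seasonal component repeats each year; find the week where it peaks
seasonal  <- fit$time.series[, "seasonal"]
peak_week <- cycle(interest)[which.max(seasonal)]
print(peak_week)
```

Calling plot(fit) afterwards draws the original series together with its seasonal, trend, and remainder components.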


R Code used in this example


## Crow's nest Clustering example – Tim Allen

# Adapted from http://www.statslife.org.uk/significance/1892

# Nathan Cunningham - Does Christmas really come earlier every year?
# Significance Magazine 11 November 2014
# Allow multiple plots (2 rows x 6 columns)
par(mfrow=c(2,6))
# You have to install and load the mclust package
library(mclust)
# Calculate clusters for each year
# gtrends must already be a data frame with columns year, week and hurricane
# (built from the Google Trends CSV download)
for (yr in 2007:2013) {
  # 1) load this year's data in a matrix
  observations <- as.matrix(subset(gtrends, year==yr, select=c("week","hurricane")))
  # 2) fit a two-cluster model; Mclust picks the covariance model by BIC
  fit <- Mclust(observations, 2)
  # 3) Plot the clusters and print the model summary
  plot(fit, what="classification", xlab=yr)
  print(summary(fit))
}

Acknowledgement

My sincere appreciation to Nathan Cunningham of University College Dublin for his kind help in the preparation of this article.  Please read his article, "Does Christmas really come earlier every year?"

Sunday, March 16, 2014

How to calculate a p-value for an ANOVA F-Statistic using R or a TI-84

At the end of calculating an Analysis of Variance (ANOVA), you have an F-statistic.  To get the p-value of the F-Statistic, you can use R or the TI-84:

For example, in an ANOVA with treatment degrees of freedom = 1 and error degrees of freedom = 10, you calculate the F-Statistic as 2.81 and want to know its p-value (that is, what is the probability of observing an F-Statistic at least this large under the Null Hypothesis?):

1) R
# The following calculates the p-value of an F-statistic
pf(q=2.81, df1=1, df2=10, lower.tail=FALSE)

 You'll get the answer:
[1] 0.1246126

2) TI-84
The syntax for the Fcdf command is:
Fcdf(lower limit, upper limit, numerator degrees of freedom, denominator degrees of freedom).
Fcdf(2.81, 99999, 1, 10)
 You'll get the same answer as in R, about 0.1246 (the calculator does not print R's [1] prefix).
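
To tie the two steps together, here is a small sketch that runs a one-way ANOVA in R and then recomputes the p-value from the F-statistic with pf.  The data are made up (two groups of six, which gives the same 1 and 10 degrees of freedom as the example above), so the F-statistic will not equal 2.81.

```r
# Made-up data: two treatment groups with six observations each
set.seed(7)
group    <- factor(rep(c("A", "B"), each = 6))
response <- c(rnorm(6, mean = 10), rnorm(6, mean = 11))

# One-way ANOVA table
tab <- anova(lm(response ~ group))

# Pull the F-statistic and degrees of freedom out of the table
f_stat <- tab[["F value"]][1]
df1    <- tab[["Df"]][1]   # treatment degrees of freedom
df2    <- tab[["Df"]][2]   # error degrees of freedom

# Upper-tail probability of the F distribution = the ANOVA p-value
p_manual <- pf(q = f_stat, df1 = df1, df2 = df2, lower.tail = FALSE)

print(c(reported = tab[["Pr(>F)"]][1], recomputed = p_manual))
```

The two printed values should agree, because the p-value in the ANOVA table is this same upper-tail F probability.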

I'm a student in the MS in Applied Statistics program at the University of the District of Columbia.