It’s safe to assume that you’ve heard the old adage, “correlation does not imply causation,” but if you haven’t, just know this: it’s not safe to assume that correlation and causation go hand-in-hand.
Oftentimes, when comparing two data points or sets, it is abundantly clear that a correlation has no causal basis. And yet, other times, correlation can trick you into drawing some murky, hilarious, and downright incorrect conclusions about causation. Before we dive into an example, let’s take a look at the tool we’ll be using to explore correlation.
Google offers a free, fun, and useful tool called Google Correlate that is simple and intuitive to use.
Here’s what it can do:
Compare Time Series
- Upload your own time series data set and Google Correlate will find search terms that vary in popularity in a similar way to your own.
- Enter a search query to identify other search terms with a similar activity pattern
Compare US States
- Upload your own US states dataset to find terms with a pattern of activity similar to yours across the United States.
- Enter a search query to find what other search queries correlate state-by-state
Compare Web Search Activity Over Time to a Random Line You Draw
- This is a Correlate Labs tool that is fun to play around with, but should by no means be used for serious matters
- Fun Fact: My randomly drawn line has a 0.8979 correlation coefficient with searches for “Verizon Fios.” (What does it mean!?)
I could elaborate, but Google made a clever comic book with all the details so you should just read that instead.
Here’s how it can help:
Let’s pretend you own a Vietnamese restaurant in New York and you are searching for keywords to use in your Google AdWords campaign. You search for correlations with the keyword “pho,” and see that the search query “wok” has a high 0.9830 correlation coefficient value.
Adding “wok,” “pho saigon,” or “best pho” as keywords into your AdWords campaign is a promising strategy (depending on keyword search volume, competition, and the like).
A quick look at the time series, further down on the page, shows that searches for both “pho” and “wok” climb toward a clear peak in January and decline afterward. You would be wise to increase your budget for these keywords over the winter.
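If you are curious what a number like 0.9830 measures, here is a minimal sketch, assuming the coefficient is the standard Pearson correlation. The monthly values are made up for illustration (they are not Google Correlate data), and the `pearson` helper is my own, not part of any Google tool.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # How much the two series move together around their own means.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each series moves on its own.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariation / sqrt(spread_x * spread_y)


# Made-up monthly search interest, January through December (not real data).
pho_searches = [90, 80, 65, 50, 40, 35, 30, 35, 45, 60, 75, 85]
wok_searches = [88, 78, 66, 52, 41, 33, 32, 36, 47, 58, 74, 86]

r = pearson(pho_searches, wok_searches)
print(f"r = {r:.4f}")
```

Two series that rise and fall in the same months score close to 1, two that move in opposite directions score close to -1, and unrelated series land near 0. Notice that nothing in the calculation knows or cares why the two series move together.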
Next, let’s look at the correlation among searches by state. Select any of the many highly correlated search queries Google provides. I picked “how do you pronounce pho.”
The left map represents search activity for “pho” and the right map represents search activity for “how do you pronounce pho.” The darker states represent a higher popularity for these search terms.
The scatter plot option presents an easier way to grasp the data. In both the state map and scatter plot options, each measurement is the number of standard deviations by which a state’s search activity sits above or below that term’s mean.
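The data points (in this case, states) in the upper right quadrant are above average for both search queries. The data points (again, states here) in the bottom left quadrant are below average for both. Points in either of these quadrants support a positive correlation; points in the upper left or bottom right, where a state is above average for one query and below average for the other, pull the correlation down.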
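Here is a rough sketch of how the scatter plot's axes and quadrants could be reproduced by hand. The per-state numbers are invented for illustration, and the `z_scores` and `quadrant` helpers are hypothetical names of my own; I am also assuming the population standard deviation, which may not match what Google Correlate uses internally.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Convert raw values to standard deviations above or below their mean."""
    center = mean(values)
    spread = pstdev(values)  # population standard deviation (an assumption)
    return [(v - center) / spread for v in values]


def quadrant(zx, zy):
    """Name the scatter plot quadrant for one state's pair of z-scores."""
    vertical = "upper" if zy >= 0 else "lower"
    horizontal = "right" if zx >= 0 else "left"
    return f"{vertical} {horizontal}"


# Made-up search activity per state (not real Google Correlate data).
states = ["WA", "CO", "TX", "OH", "MS"]
pho_activity = [1.8, 1.5, 1.0, 0.6, 0.3]
pronounce_activity = [2.0, 1.9, 0.9, 0.7, 0.2]

zx = z_scores(pho_activity)
zy = z_scores(pronounce_activity)
placement = {s: quadrant(x, y) for s, x, y in zip(states, zx, zy)}

# The Pearson coefficient equals the average product of paired z-scores.
r = mean(x * y for x, y in zip(zx, zy))

for state in states:
    print(state, placement[state])
print(f"r = {r:.3f}")
```

A state in the upper right or lower left contributes a positive product (both z-scores share a sign), which pushes the coefficient up. A state in the upper left or lower right contributes a negative product, which pushes it down.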
So, what can we assume? (Notice, I use the word assume here.)
- The writer of this blog post is hungry and may be craving pho
- Residents of Washington and Colorado are having a particularly hard time pronouncing pho
What don’t we know?
Oftentimes, additional variables, called confounding variables, play into the big picture without being accounted for. In this example, some possible confounding variables include, but are not limited to, the following:
- Quantity of Vietnamese restaurants by state
- Appetites for Vietnamese cuisine
- Altitude and climate (which may encourage greater pho consumption)
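To see how a confounding variable can manufacture a correlation, here is a small simulation. Everything in it is hypothetical: the "restaurant density" driver, the noise levels, and the region count are assumptions chosen for illustration, not measurements of anything.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)


rng = random.Random(42)  # fixed seed so the run is repeatable
n_regions = 5000

# Hidden driver: a hypothetical "Vietnamese restaurant density" per region.
restaurants = [rng.gauss(0, 1) for _ in range(n_regions)]

# Each search term depends on the hidden driver plus its own noise.
# Neither search term influences the other.
pho = [z + rng.gauss(0, 0.5) for z in restaurants]
pronounce = [z + rng.gauss(0, 0.5) for z in restaurants]

r_raw = pearson(pho, pronounce)

# Subtract the hidden driver's contribution and correlate what is left.
# This works exactly here only because the simulation adds the driver as-is.
pho_left = [x - z for x, z in zip(pho, restaurants)]
pronounce_left = [y - z for y, z in zip(pronounce, restaurants)]
r_adjusted = pearson(pho_left, pronounce_left)

print(f"raw correlation:      {r_raw:.3f}")
print(f"after removing driver: {r_adjusted:.3f}")
```

The two search terms look strongly related until the shared driver is accounted for, at which point the relationship all but disappears. With real data you rarely get to observe the confounder this cleanly, which is exactly the problem.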
Is there any hope for establishing causation?
The only way to prove causation is through the scientific method and rigorously controlled experiments. You must demonstrate, through experiment, that your hypothesis is true and that the experiment’s results are reproducible. For example, you could test a hypothesis with a consumer survey or focus group that seeks to provide an explanation, and then replicate the result in a new group to be sure. If you succeed in this quest, go here.
What can’t we assume?
- Residents of Washington and Colorado are struggling to pronounce pho, or lack confidence in their ability to correctly do so, because they are residents of Washington and Colorado
Correlation does not imply causation. Correlation shows only that two data sets move together; it does not explain why. It can hint at causation, but any conclusion about causation drawn from correlation alone remains an assumption. So just pho-get-about-it.
When it comes to playing around with Google Correlate for fun (hey, some people might), assumptions are fairly harmless. However, if you’re going to gamble with real business decisions, and their impacts, on the basis of assumptions, I’d make a bet that there’s real causation between your decision to do so, and its negative consequences. For example, starting a pho education workshop in Portland, Oregon may not prove to be so fruitful, but it does sound like an excellent plot for a Portlandia episode.
All jokes aside, it’s important to keep the “correlation does not imply causation” adage in mind when interpreting or acting upon data. Google Correlate can provide a plethora of interesting and practical information to guide your PR and marketing strategies; just be smart about how you use it. I highly recommend reading the Google Correlate Tutorial before using this tool. The tutorial breaks down how the tool works and how to interpret the data in user-friendly, non-statistics-professor language.
As a parting note, and final example(s) of the sheer ludicrousness of confusing correlation with causation, I leave you this graph and a link to many other hilarious correlations curated by Tyler Vigen.
But, just in case, I’m going to stay off of Amazon until I can get my hands on some pho.
Follow Natalie Cullings on Twitter at @NatalieCullings