A social media data warning from Sherlock Holmes

In the literary classic A Scandal in Bohemia, consulting detective Sherlock Holmes warns us of a grave error that far too many commit, not only in forensic science, but in understanding the various claims made using data in the worlds of PR and marketing.

“I have no data yet. It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”

Photo credit: BBC One

Several people called to my attention an article yesterday stating that there was no correlation between social sharing and the actual consumption of content. This is quite a bold statement, and I’m certain a number of people shared the article (possibly without reading it). The first question we must ask ourselves isn’t what we should do about it, but whether the facts support the theory.

How would you go about proving this?

“Data! Data! Data! I can’t make bricks without clay!” – Sherlock Holmes, The Adventure of the Copper Beeches

The answer lies in the text of the article itself: “Chartbeat’s lead data scientist Josh Schwartz later clarified to The Verge that Haile was talking specifically about tweets.” Let’s get some social media data to work with from Twitter. I downloaded all of the tweets I’ve posted with links to my own website over the past two months. The export gives me each tweet along with its number of favorites, retweets, and replies. This will tell me how many people are sharing, with or without reading the content.


So far, so good. From there, I went into Google Analytics and created a filter to show only traffic from Twitter. I took the visits to each URL from Twitter and lined them up next to each of the corresponding Tweets.
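For readers who would rather script this step than do it in a spreadsheet, here is a minimal sketch of lining the two exports up by URL. Everything in it is an assumption for illustration: the URLs, the numbers, the row layout, and the `line_up` helper are hypothetical, and summing retweets per URL is one possible choice rather than a description of how I matched them by hand.

```python
from collections import defaultdict

# Hypothetical tweet export rows: (url, retweets, favorites, replies).
tweets = [
    ("https://example.com/post-a", 3, 1, 0),
    ("https://example.com/post-b", 0, 2, 1),
    ("https://example.com/post-a", 2, 0, 0),  # same URL tweeted twice
]

# Hypothetical analytics export: visits from Twitter per landing URL.
twitter_visits = {
    "https://example.com/post-a": 41,
    "https://example.com/post-b": 12,
}


def line_up(tweets, visits):
    """Sum retweets per URL and pair each total with that URL's Twitter visits."""
    retweets_by_url = defaultdict(int)
    for url, retweets, _favorites, _replies in tweets:
        retweets_by_url[url] += retweets
    # Keep only URLs that appear in both exports.
    return [
        (url, retweets_by_url[url], visits[url])
        for url in sorted(retweets_by_url)
        if url in visits
    ]


rows = line_up(tweets, twitter_visits)
for url, retweets, visits in rows:
    print(url, retweets, visits)
```

Each resulting row holds one URL, its total retweets, and its visits from Twitter, which is the shape a correlation calculation needs.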


If the theory is correct, there should be no correlation between the number of social shares and the number of people who visited each article from Twitter. Let’s find out by running a standard Pearson correlation analysis. Any statistics tool, including your favorite spreadsheet, can do this. The answer is:


In this particular dataset, there is a modest positive correlation of 0.247 between visits to the URL and retweets. It is not “no correlation”, which would be a value of 0, nor is it a strong correlation, which would be a value close to 1.
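If you want to see what the spreadsheet is doing under the hood, the Pearson coefficient can be sketched in a few lines of Python. The numbers below are made up for illustration and are not the data behind the 0.247 figure; the `pearson` function is a plain textbook implementation, not the exact tool I used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up numbers for illustration only.
retweets = [0, 1, 2, 3, 4]
visits = [10, 14, 9, 20, 15]
r = pearson(retweets, visits)
print(round(r, 3))
```

A result near 0 would support the "no correlation" claim, while values approaching 1 or -1 indicate a stronger linear relationship.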

Updated: Ethan Jewett pointed out in the comments and on Twitter that one of my Tweets promoting my book is a significant outlier that unduly influences the correlation. Using a different correlation method (Spearman, which compares ranks, instead of Pearson), we get a correlation of .14, which is noticeably weaker.
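To see why a single outlier matters so much, here is a small sketch comparing the two methods on made-up data where one tweet dwarfs the rest. The numbers are invented for illustration and do not reproduce my dataset; Spearman is computed here as Pearson applied to ranks, with ties given their average rank.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up data: the last tweet is an outlier on both measures.
retweets = [1, 2, 3, 4, 5, 40]
visits = [12, 9, 11, 8, 13, 300]

r_pearson = pearson(retweets, visits)
r_spearman = spearman(retweets, visits)
print(round(r_pearson, 3), round(r_spearman, 3))
```

On this toy data the outlier pulls Pearson very close to 1, while Spearman, which only looks at the ordering, comes out much lower. That is the same kind of gap the update above describes.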

So what does this mean? For this particular audience, there is a weak correlation between retweets and people actually consuming the content, or at least getting to the content. Thus, you can’t make a global, generalized declaration that social shares and content consumption have no relationship. In this dataset, at least, they appear to have one. The next logical step would be to test a different dataset or increase the sample size to reach a firmer conclusion, and to test it with both correlation methods.

What you can definitively say is that every brand and every publisher must do their own work to find out whether their particular audience does or does not consume the content they share. Don’t rely on someone else’s data when you have your own data to look at – and certainly don’t make business decisions about the future of your company from someone else’s dataset.

Christopher S. Penn
Vice President, Marketing Technology

