19 Feb 2014

A social media data warning from Sherlock Holmes

In the literary classic A Scandal in Bohemia, consulting detective Sherlock Holmes warns us of a grave error that far too many commit, not only in forensic science, but in understanding the various claims made using data in the worlds of PR and marketing.

“I have no data yet. It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”

Photo Credit: BBC One

Several people called to my attention an article yesterday stating that there was no correlation between social sharing and the actual consumption of content. This is quite a bold statement, and I’m certain a number of people shared the article (possibly without reading it). The first question we must ask ourselves isn’t what we should do about it, but whether the facts support the theory.

How would you go about proving this?

“Data! Data! Data! I can’t make bricks without clay!” – Sherlock Holmes, The Adventure of the Copper Beeches

The answer lies in the text of the article itself: “Chartbeat’s lead data scientist Josh Schwartz later clarified to The Verge that Haile was talking specifically about tweets.” Let’s get some social media data to work with from Twitter. I downloaded all of the tweets I’ve posted with links to my own website over the past two months, which gives me each tweet along with its number of favorites, retweets, and replies. This will tell me how many people are sharing, with or without reading the content.


So far, so good. From there, I went into Google Analytics and created a filter to show only traffic from Twitter. I took the visits to each URL from Twitter and lined them up next to each of the corresponding Tweets.
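As a rough sketch of that lining-up step, the snippet below pairs each tweeted URL with its Twitter-referred visits. The URLs, counts, and the `pair_by_url` helper are all made up for illustration; they are not the data or the tooling used in this post.

```python
# Hypothetical rows from a Twitter export: (url, retweets).
tweets = [
    ("/post-a", 3),
    ("/post-b", 0),
    ("/post-c", 7),
]

# Hypothetical visits per URL from an analytics report filtered to Twitter traffic.
twitter_visits = {"/post-a": 12, "/post-b": 5, "/post-c": 9}


def pair_by_url(tweet_rows, visits_by_url):
    """Return (url, retweets, visits) per tweeted URL; a URL with no recorded visits gets 0."""
    return [(url, rts, visits_by_url.get(url, 0)) for url, rts in tweet_rows]


paired = pair_by_url(tweets, twitter_visits)
for url, rts, visits in paired:
    print(f"{url}\t{rts}\t{visits}")
```

Treating a missing URL as zero visits is a choice made for this sketch; depending on how the analytics report is filtered, dropping those rows may be the safer option.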


If the theory is correct, there should be no correlation between the number of social shares and the number of people who visited each article from Twitter. Let’s find out by running a standard Pearson correlation analysis. Any statistics tool, including your favorite spreadsheet, can do this. The answer is:

[Image: SOFA Statistics report]

In this particular dataset, there is a weak positive correlation of 0.247 between visits to the URL and retweets. It is not “no correlation”, which would be a value of 0, nor is it a perfect correlation, which would be a value of 1.
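For readers who want to see what a statistics tool is computing here, this is a minimal sketch of Pearson's r in plain Python. The `pearson_r` function name and the sample numbers are assumptions for illustration only; they are not the dataset from this post and will not reproduce the 0.247 figure.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length, at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)


# Made-up example: visits rise in lockstep with retweets, so r is at the top of the scale.
retweets = [1, 2, 3, 4, 5]
visits = [2, 4, 6, 8, 10]
print(pearson_r(retweets, visits))

# Made-up example: no linear relationship at all, so r is 0.
print(pearson_r([-1, 0, 1], [1, 0, 1]))
```

Real data almost always lands somewhere between those two extremes, which is why the size of r matters more than whether it is exactly zero.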

Updated: Ethan Jewett pointed out in the comments and on Twitter that one of my tweets promoting my book is a significant outlier that unduly influences the correlation. Using a different correlation method (Spearman’s rank correlation instead of Pearson), we get a correlation of 0.14, which is noticeably weaker.
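To illustrate why a single outlier matters, here is a hedged sketch comparing Pearson's r with Spearman's rank correlation on invented numbers that include one 80-retweet point. The function names and the data are assumptions for illustration; they are not the tweets analyzed above and do not reproduce the 0.247 or 0.14 figures.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson's r computed on the ranks instead of the raw values."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Invented data: five ordinary tweets plus one outlier with 80 retweets.
retweets = [0, 1, 2, 3, 4, 80]
visits = [5, 3, 6, 2, 4, 60]

print("Pearson, all points:     ", round(pearson_r(retweets, visits), 3))
print("Spearman, all points:    ", round(spearman_rho(retweets, visits), 3))
print("Pearson, outlier dropped:", round(pearson_r(retweets[:5], visits[:5]), 3))
```

In this invented example the one extreme point pulls Pearson's r close to 1, while the rank-based measure stays low and the five ordinary points on their own are slightly negatively correlated. Ranking reduces the outlier to "the largest value" rather than "a value twenty times larger", which is why the two methods can disagree so sharply.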

So what does this mean? For this particular audience, there is a weak correlation between retweets and people actually consuming the content, or at least getting to the content. Thus, you can’t make a global, generalized declaration that social shares and content consumption have no relationship. In this dataset they may be related, if only weakly. The next logical step would be to test a different dataset or increase the sample size to reach a firmer conclusion, and to test it with both correlation methods.

What you can definitively say is that every brand and every publisher must do their own work to find out whether their particular audience does or does not consume the content they share. Don’t rely on someone else’s data when you have your own data to look at – and certainly don’t make business decisions about the future of your company from someone else’s dataset.

Christopher S. Penn
Vice President, Marketing Technology


21 comments
esjewett

@cspenn @ScottMonty Just eye-balling it, I don’t think there’s much of a correlation at all in your sample data set.

esjewett

@cspenn @ScottMonty Did you know Pearson’s r is sensitive to outliers? Bad data set to use it on with that 80 retweet point.

ScottMonty

I always appreciate a relevant Sherlock Holmes quote. Another, buried a little deeper in the Canon, is from "The Adventure of Wisteria Lodge":

"Still, it is an error to argue in front of your data. You find yourself insensibly twisting them round to fit your theories."

ScottMonty

@cspenn Well played, sir. Another is "Data! Data! Data! I cannot make bricks without clay." - 'The Copper Beeches'

cspenn

@BenZee Yes. I jumped up one level to ask if anyone even made it to the site in the first place.

cspenn moderator

@ScottMonty  I haven't read that one in a while. Recently cruised through Adventures and Memoirs, Last Bow is next up :)

esjewett

@cspenn @ScottMonty No, there’s not. Outliers can create the appearance of false correlations in a Pearson regression.

cspenn

@esjewett Post text is updated with your feedback. Thank you :)

esjewett

@cspenn Glad it’s appreciated. Mis-used statistical tests are really rife in the industry. Really bothers me. They are powerful tools.

esjewett

@cspenn They aren’t comparable values, IIRC. Spearman takes a lot more work to determine significance.

esjewett

@cspenn Trimming the tops/bottoms could be OK, though it will skew the data a little.

cspenn

@esjewett It's definitely weaker than a .247. Thanks for the feedback - it's genuinely appreciated.

esjewett

@cspenn Sounds more like the reality of the data set. You’d have to do some significance tests, but .14 is pretty close to 0.

cspenn

@esjewett Spearman came in at .14 but with a much higher p. Might just be more sensible to trim top/bottoms, no?