# The Citizen Analyst Manifesto, Part 8: Bring Persistence

In the Japanese martial art I practice, persistence is one of our core values. “Keep going!” exhorts our head teacher. Keep going! Play! His instruction is equally valuable for the Citizen Analyst. Data rarely yields its full value on the first pass, at surface level. Why? Three things make data difficult to analyze, especially in larger quantities: hidden correlations, obfuscation, and interference. How can we convince data to yield its secrets to us? We play. We keep going.

## Hidden Correlations

Hidden correlations are logical and narrative connections within our data that we can’t see. Once we start working with data sets that have many variables, many rows, and many columns, we lose the ability to discern patterns and connections by eye. Even simple calculations like Pearson correlations become unhelpful, because a pairwise measure of linear association says little about relationships that are non-linear or that involve several variables at once.
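Here is one way a relationship can hide from a Pearson correlation, as a minimal sketch in Python. The toy data and the `pearson` helper are invented for illustration and are not drawn from any real data set. The first series depends on `x` in a straight line, the second depends on `x` completely but along a curve.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data: x runs symmetrically from -10 to 10.
xs = list(range(-10, 11))
linear = [2 * x + 1 for x in xs]   # straight-line dependence on x
curved = [x * x for x in xs]       # y is fully determined by x, but not linearly

r_linear = pearson(xs, linear)
r_curved = pearson(xs, curved)

print(f"linear relationship: r = {r_linear:.3f}")
print(f"curved relationship: r = {r_curved:.3f}")
```

The curved series is perfectly predictable from `x`, yet its Pearson coefficient comes out at essentially zero, because the positive and negative halves cancel. A scatter plot would reveal the pattern at once, which is one more reason to keep going past the first number.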

We see hidden correlations at work with very large phenomena like weather and the stock market. What makes a stock go up? What causes El Niño? We’ll find no simple, easy answer to complex questions like these. To find the truth in our data, we must keep going, keep digging.

To defeat hidden correlations, we use tools like IBM’s Watson Analytics, which can perform multiple regressions, linear analysis of variance, and many other statistically complex operations to uncover hidden gems. Watson Analytics can do far more sophisticated math, at a much larger scale, than most of us can do on our personal computers.

## Obfuscation

Obfuscation, our second challenge, occurs when others accidentally or intentionally hide important data. Politicians are renowned for obfuscating data, willfully ignoring large contradictory data sets or cherry-picking only data that fits their agenda. In other cases, we may be working with data that’s incomplete or conflated.

For example, we often hear politicians intentionally mix up median and mean (average) statistics, such as household income. In the United States, the average household income according to the US Census Bureau was \$72,641. The median household income was \$51,939. A politician might say, “This extra tax of \$3,500 per year is only 4.8% of the average household’s income; they’ll barely feel it.” However, the median household would lose about 6.7% of its income. The median is the more reliable measure for any dataset with significant outliers like super-wealthy or super-poor households, because a few extreme values can shift the mean without moving the median much.
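The arithmetic behind that comparison is easy to check for ourselves. The sketch below uses the income and tax figures quoted above; the `toy_incomes` list is a made-up example, not Census data, added only to show how a single outlier moves the mean while leaving the median alone.

```python
from statistics import mean, median

# Figures quoted in the text.
MEAN_INCOME = 72_641
MEDIAN_INCOME = 51_939
TAX = 3_500

def share_of_income(tax, income):
    """Return the tax as a percentage of income."""
    return 100 * tax / income

mean_share = share_of_income(TAX, MEAN_INCOME)
median_share = share_of_income(TAX, MEDIAN_INCOME)

print(f"Share of mean income:   {mean_share:.1f}%")    # 4.8%
print(f"Share of median income: {median_share:.1f}%")  # 6.7%

# Hypothetical toy incomes: four ordinary households and one very wealthy one.
toy_incomes = [30_000, 40_000, 50_000, 60_000, 1_000_000]
toy_mean = mean(toy_incomes)      # 236000, pulled up by the outlier
toy_median = median(toy_incomes)  # 50000, the middle household

print(f"Toy mean: {toy_mean}, toy median: {toy_median}")
```

In the toy list, four of the five households earn far less than the mean, so a claim framed around "the average household" describes almost nobody in it.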

To defeat obfuscation, we search for additional data. What else could we find to complete our puzzle? What data is a crafty politician hiding? Services like Kaggle Datasets and Data.gov provide rich resources for us to supplement the data we’ve been given with additional context.

## Interference

Interference, our third challenge, occurs when a hidden variable interferes with our data in ways we cannot see from data alone. Analysis without insight will draw erroneous conclusions. We can explain what happened in the data, but the data alone cannot explain why.

The textbook example of interference is the strong correlation between ice cream consumption and drowning deaths. A machine can identify the correlation, but it cannot identify the cause.

To defeat interference, we apply our insight, repeatedly asking why for every conclusion and data point until we find the root cause of our correlation. Parents of children ages 4-8 understand this process perfectly: ask why, over and over again. Why does ice cream sell better during certain months? Why do drownings increase during those same months? Both questions lead to the same hidden variable: hot summer weather.
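We can watch interference at work in a small simulation. Everything below is an assumption made for illustration: the data is randomly generated, the units are made up, and hot weather is built in as the hidden variable that drives both ice cream sales and drownings. The `pearson` and `partial_corr` helpers are written out by hand; the second applies the standard first-order partial correlation formula, which asks how strongly two series move together once a third is accounted for.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y after controlling for z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

random.seed(8)  # fixed seed so the run is repeatable
n = 2000

# Simulated data in made-up units: temperature drives both series,
# and neither series influences the other.
temps = [random.uniform(0, 35) for _ in range(n)]
ice_cream = [2.0 * t + random.gauss(0, 10) for t in temps]
drownings = [0.5 * t + random.gauss(0, 3) for t in temps]

r_ice_drown = pearson(ice_cream, drownings)
r_ice_temp = pearson(ice_cream, temps)
r_drown_temp = pearson(drownings, temps)
r_partial = partial_corr(r_ice_drown, r_ice_temp, r_drown_temp)

print(f"ice cream vs drownings:            r = {r_ice_drown:.2f}")
print(f"same, controlling for temperature: r = {r_partial:.2f}")
```

The raw correlation between the two series is strong, but it shrinks toward zero once temperature is taken into account. The catch is that we only knew to control for temperature because we asked why; the data never volunteered that column on its own.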

Data whispers its secrets to few. Only through persistence will we convince our data to whisper its secrets to us. Defeating the demons of hidden correlation, obfuscation, and interference requires us to keep going, to play, to explore, to add more context. Once we do, the secrets of data will be ours.


Christopher S. Penn
Vice President, Marketing Technology
