We all too often treat data as a source of indisputable “truth.” Yet we fail to ask key questions of the analysis behind it, and so we end up blindly following “data-driven truths.”
This is not to say that analysis should be scrutinised down to the last line of code or row, but that we should probe to understand how fit-for-purpose the data insight is. For instance, you wouldn’t make consequential business decisions from a statistical model with a small sample size (statistically significant or not).
But how is one to know whether an insight is fit for purpose without asking or being informed?
We obviously can’t review raw data manually every time, so where do things go wrong?
Data Visualisation - How the Same Data Can Show Two Different Stories
From a young age, we are all taught to read graphs a certain way so that we can interpret one in seconds, and the best graphs should allow just that. However, this speed of reading is also why graphs are one of the most obvious ways in which data can be misinterpreted (or, unfortunately, manipulated).
Look at the graphs below. Both use the same data, but by changing the range of the y-axis, we can tell a remarkably different story. Neither is incorrect, but both are extremes that require the interpreter to come to their own conclusion.
Is the graph on the left too blasé about a profit decline, or is the graph on the right too alarming? Both will need more context and data for the reader to discern the most appropriate visualisation.
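The effect of the y-axis range can be put into numbers. The sketch below is only an illustration: the profit figures and axis limits are hypothetical, not taken from the graphs above, and `apparent_drop` is a helper name invented here. It measures how much of the plotted height a decline appears to wipe out, depending on where the y-axis starts.

```python
def apparent_drop(first, last, axis_min):
    """Share of the plotted height lost between two values
    when the y-axis starts at axis_min."""
    if axis_min >= first:
        raise ValueError("axis_min must be below the first value")
    return (first - last) / (first - axis_min)


# Hypothetical profit figures, purely for illustration.
profit_start, profit_end = 100.0, 95.0

# Same data, two different y-axis starting points.
zero_based = apparent_drop(profit_start, profit_end, axis_min=0.0)
truncated = apparent_drop(profit_start, profit_end, axis_min=94.0)

print(f"Axis from 0:  the line falls by {zero_based:.0%} of its height")
print(f"Axis from 94: the line falls by {truncated:.0%} of its height")
```

The underlying decline is 5% in both cases; only the share of the picture it occupies changes.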
Using the Right Tool for Interpreting Data
Often, danger strikes when we don’t understand how a specific tool functions, or when we use a tool simply for the sake of using it. The tool is taking our data and outputting some results, so how wrong can it be?
Again, the results aren’t wrong, but is this the best tool to provide the best solution?
Selecting the right tool and understanding its caveats is pivotal to making informed decisions.
Below is the London Fire Brigade (LFB) data on incidents within Camden from 2009 to 2017 (left). One useful insight for the LFB would be to locate high concentrations of incidents so that resources can be allocated to nearby stations.
We can apply a k-means algorithm (some machine learning because everyone is doing it these days!) to find clusters of fire incidents.
Even though we’ve applied machine learning and used a sufficiently large dataset, is this enough analysis to pass on to someone else? Probably not.
Only by understanding how the algorithm works would you know that you have to specify in advance how many clusters you expect to find (in this case, six), a choice that introduces bias in itself. The algorithm also has a random element when finding clusters, so if we were to run the process twice more, we would arrive at the two different outputs below.
So which one is correct?
None of them is fit for purpose. Regardless of how complex the algorithm is, it should be clear that we are using the wrong tool for the job.
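The two caveats above (the number of clusters is chosen up front, and the starting points are random) can be seen in a minimal sketch of k-means. This is a plain-Python illustration of the standard Lloyd's iteration, not the implementation used for the LFB analysis; the `kmeans` function, the seeds, and the uniformly scattered points are all assumptions made for the example.

```python
import random


def kmeans(points, k, seed, max_iter=100):
    """Plain Lloyd's k-means on 2-D points.

    Returns (centroids, labels, inertia), where inertia is the sum of
    squared distances from each point to its assigned centroid.
    """
    rng = random.Random(seed)
    # Random start: k of the points are picked as initial centroids.
    # This is the random element that can make runs differ.
    centroids = rng.sample(points, k)

    def sq_dist(p, c):
        return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2

    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append((sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members)))
            else:
                updated.append(centroids[j])  # leave an empty cluster in place
        if updated == centroids:
            break  # nothing moved, so the run has converged
        centroids = updated

    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, inertia


# Hypothetical incident locations: scattered points with no built-in clusters.
data_rng = random.Random(0)
points = [(data_rng.random(), data_rng.random()) for _ in range(300)]

# k=6 is our choice, not something the data told us.
for seed in (1, 2, 3):
    _, _, inertia = kmeans(points, k=6, seed=seed)
    print(f"seed={seed}: inertia={inertia:.3f}")
```

The algorithm always returns exactly `k` clusters, whether or not the data contains that many, and different seeds may settle on different groupings. Comparing the inertia across runs is one rough way to see how stable a result is.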
The Dreaded Data Dredging
Also known as p-hacking, data dredging is a common pitfall when working with data, especially in academia. Data dredging is when one looks for any statistical significance within data and selectively pursues significant results rather than testing a single hypothesis.
It often arises from pressure from employers or funders to publish statistically significant research.
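A short calculation shows why dredging is so effective at producing “findings.” The sketch below assumes every test is independent and run on pure noise, with a conventional 5% significance threshold; both are simplifying assumptions, and `chance_of_false_positive` is a name made up for this example.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n_tests independent tests on
    pure noise comes out 'significant' at threshold alpha."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 20, 100):
    p = chance_of_false_positive(n_tests)
    print(f"{n_tests:>3} tests: {p:.0%} chance of at least one false positive")
```

Under these assumptions, testing twenty unrelated hypotheses gives roughly a 64% chance that at least one looks significant by luck alone.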
A common phrase is “correlation does not imply causation”.
Just because a relationship is statistically significant at the 99% confidence level does not make it true. For example, the chart below shows a correlation between the consumption of mozzarella cheese and the number of civil engineering doctorates awarded. We could (blindly) conclude that having more civil engineers causes the consumption of cheese to increase. This is unlikely to be true.
However, let’s consider underlying factors of both.
The more affluent one is, the more disposable income one has to buy cheese, and affluence also increases the likelihood of attending university. So, there may be a connection between engineering degrees and cheese consumption, but any causal relationship is more likely to lie with affluence than between the two variables themselves, whatever the data and statistical tests appear to say.
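The affluence argument can be illustrated with a small simulation. Everything here is synthetic and assumed for the sake of the example: a made-up “affluence” score drives both a “cheese” and a “doctorates” variable, and by construction neither of the two affects the other. The sketch compares the raw correlation with the correlation left once the straight-line effect of affluence is removed from both (a partial correlation).

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing a straight-line fit on xs."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - my - slope * (x - mx) for x, y in zip(xs, ys)]


# Synthetic data: affluence influences both variables; they do not
# influence each other.
rng = random.Random(42)
n = 5000
affluence = [rng.gauss(0, 1) for _ in range(n)]
cheese = [a + rng.gauss(0, 1) for a in affluence]
doctorates = [a + rng.gauss(0, 1) for a in affluence]

raw = pearson(cheese, doctorates)
controlled = pearson(residuals(cheese, affluence),
                     residuals(doctorates, affluence))

print(f"Raw correlation:             {raw:.2f}")
print(f"Controlling for affluence:   {controlled:.2f}")
```

In this toy setup the raw correlation is clearly positive, while the controlled one sits close to zero. Real data is rarely this tidy, and the confounder is often not measured at all, which is why the theory behind the numbers matters.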
So, what questions should you be asking about data interpretation?
How was the data obtained?
Why are we doing this type of analysis?
How does the tool work?
What are the limitations of this analysis?
What’s the theory or rationale behind the numbers?
Next time, ask us about our analysis. We’d love to prove it.