Is data special?

This post is inspired by a twitter thread on whether you should trust a summary statistic (mean, standard deviation, Pearson’s correlation coefficient) without seeing a plot. Most people voted “no”, apparently motivated by the sentiment that accepting summary statistics without seeing the plot is trusting too much. See the full thread below.
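As an aside, the classic demonstration of why people worry about summaries without plots is Anscombe’s quartet. Here is a minimal sketch (mine, not from the thread) checking that the quartet’s four datasets share nearly identical means, standard deviations, and Pearson correlations, even though their scatterplots look nothing alike:

```python
# A minimal sketch (mine, not from the thread): Anscombe's quartet,
# four datasets whose means, standard deviations, and Pearson
# correlations (nearly) coincide, even though their plots look
# nothing alike.
import numpy as np

quartet = {
    "I":   ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
    print(f"{name:>3}: mean_y={y.mean():.2f}  sd_y={y.std(ddof=1):.2f}  r={r:.3f}")
# Every dataset prints roughly mean_y=7.50, sd_y=2.03, r=0.816, yet a
# scatterplot reveals a line, a curve, an outlier-driven slope, and a
# vertical cluster with one leverage point.
```

The numbers match; only the plot distinguishes the four stories, which is precisely the intuition behind the “no” votes.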

The discussion is intriguing. As someone who studies methodology, my immediate response to “you can’t trust statistics without looking at data” was, naturally, “you can’t trust data without looking at methods, either”. One can, of course, extend the skepticism further and question the interpretability of methods without a good grasp of the sociopolitical context. Following Matt’s terminology, I’d like to discuss the idea of “consuming information” a little further.

First, let’s talk about trust. As Matt points out, “With any data solution, you are trusting SOMEONE ELSE”. Researchers who reject certain summary statistics presumably do not trust the people who are producing those statistics. Why not? Or, rather, what does it mean to trust or not trust others in this context?

We can distinguish two kinds of trust at play here. The first is “ethical trust”: trusting someone to report honestly. Even though part of the skepticism over statistics without a plot is motivated by a desire to check for misrepresentation of data, I don’t think ethical trust is the right conception. One can fake a plot, after all, and dishonesty in the form of deception is extremely difficult to spot by reading the (possibly fake) report. If I don’t trust someone to report a mean honestly, I have no reason to trust them to present a plot honestly. Moreover, judgments of ill intent and dishonesty are always hard to defend.

Instead of not trusting that someone is reporting honestly, we may not trust that someone has made the right inference. We can call this “epistemic trust”. Distrust of this kind may arise if I don’t trust your epistemic ability, or if I don’t trust that you are trying to solve the same problem as me. But it can also arise when my job is to double-check you, so that I’m not supposed to trust you. For example, if I’m working on a project showing how a certain accounting method misrepresents the financial assets a firm holds, then it makes little sense for me to trust existing studies that use this accounting method to assess financial assets, even if I hold their authors in high epistemic regard.

From this perspective, the lack of trust no longer appears so depressing. In particular, just because many researchers do not “trust” a process, it does not necessarily mean that the process or the people using it are not “trustworthy” (in the colloquial sense). Instead, it means something more like this: there are a lot of ways to draw inferences; there are a lot of circumstances that call for different inferences drawn in different ways; and the chance that two scholars happen to share the same research context, one that calls for the same kind of inference, is pretty low. Consequently, I should not “trust” another scholar’s inference, because the chance that their strategy is the best fit for my purpose is slim. It’s in my best interest to double-check the inference process.

None of this is to say that poor scholarship does not exist, or that certain inferences are not more prone to error or misuse than others. What I do think is true, however, is that mistrust in the epistemic sense is often healthy.

Putting trust in this way, as an attitude towards a process rather than towards an epistemic agent, also highlights how unavoidable distrust is. If I mistrust a field in the ethical sense, then perhaps there exists one person who has gained my wholehearted trust, and as long as that person approves of something, I would trust it. That person then acts as an information filter for me: everything going through them would be honest. If I mistrust a field in the epistemic sense, however, it’s not obvious at all that such a filter exists or is desirable. Moreover, if a field has no foolproof inference rules that are both general and generally applicable, as in the social sciences, then epistemic distrust is desirable: it is a sign of academic care.

Another advantage of conceiving of distrust in this way is that it helps us see that the exercise of interpreting data and drawing conclusions is one of filtering and organizing information. There are always a number of different inferences one can make from a given dataset, depending on what one’s research goal is or which “big picture” one sees this dataset as filling in. To be clear, I’m not saying that a dataset always supports multiple competing theories (though this is probably not far from the truth). I’m simply saying that a dataset can do multiple jobs in multiple research settings. This claim shouldn’t be controversial.

What is interesting, however, is that once things are put this way, there really isn’t anything special about the data stage. We can see a research process as something like this: one reads the background literature to come up with a theory or hypothesis; one settles on an experimental design, informed both by the existing literature and by practical constraints; one carries out this design, making numerous practical decisions regarding implementation; one conducts certain statistical analyses, arriving at certain conclusions; finally, one writes up these conclusions into a full-fledged story that tells us something about the subject matter. Matt’s tweet highlights the level of (epistemic) distrust concerning the researcher’s choice to conduct certain statistical analyses or to interpret results in certain ways. But these choices do not differ in kind from the researcher’s earlier choices concerning the framing of a problem or the carrying out of a research design. If we take the final product of the research to be “we have learned X about Y”, then all these other choices have just as much potential as the statistics to make this product inadequate.

When someone says “my study has shown X about Y” and (we think) they are wrong, there can be many reasons. Misrepresenting data and misinterpreting statistics are perhaps the two discussed most in the metascience literature. (Either that, or it’s a result of my following too many statisticians on twitter.) It’s important to talk about the other ways we can be wrong, too.

Kino