Life sciences and social sciences are producing an ever-increasing number of unreliable results, as suggested by the disturbingly low reproducibility of their published experiments.
The problem is as follows.
The possibility for third-party researchers to replicate a scientific study published by others is one of the founding pillars of science. Remove the replicability constraint, and no distinction remains between real science and fake science.
In some fields, such as epidemiology, large clinical trials, or huge particle-accelerator physics experiments, it is impractical (not least because of cost) to replicate experiments in their entirety. In these cases a weaker control strategy is applied, called reproducibility: the data sets and computer code produced by, and employed in, the experiment are made available to others, so that published results can be verified and alternative analyses carried out.
About ten years ago, Stanford professor John Ioannidis noted that, increasingly, biomedical research experiments were becoming irreproducible (“Why Most Published Research Findings Are False”, PLoS Medicine, 2005): if you studied a published experiment and set out to analyse its data, you would seldom come up with the same conclusions as the original authors.
This awareness grew within the scientific community, with Ioannidis becoming one of the most cited authors, until it made it to the New York Times in 2014. From then on, the dirty little secret could no longer be concealed.
Today, scientific journals like Nature maintain lists of best practices and publish special issues on reproducible research.
The fact that social-science and life-science research (think of the grandiose studies exploring the effects of a new drug or medical procedure) has become largely irreproducible has been attributed to the combination of three causes.
One is the complexity of the experiments, which often entail the scrutiny of large samples of living individuals (humans or other animals), each of which is itself a complex organism. The second factor is the publish-or-perish atmosphere that dominates academia, instilling a sense of urgency to disseminate results, sometimes long before the authors themselves are confident of their validity.
The third factor being blamed is what we could call the “data science myth”: the idea that as long as I have a huge set of useful data taken from the real world, some way or another I am bound to find a routine that will run over it and infer a sound conclusion.
One example of this fallacy that has gained much attention of late is the abuse of p-values, a tool of inferential statistics from which researchers… infer much more than logic allows. For instance, a p-value can tell us whether a drug’s measured effect differs from placebo’s; it cannot tell us whether the drug produces the intended effects, and further statistical tests are needed to reach that conclusion. Yet pharma studies often draw conclusions from the p-value alone.
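To make this concrete, here is an illustrative simulation (all numbers are hypothetical, and the permutation test stands in for whatever test a real trial would use): a “drug” that shifts an outcome by a mere 0.1 standard deviations is compared to placebo with a large sample. The difference comes out statistically significant, yet the p-value says nothing about whether a 0.1-sd shift matters to a patient.

```python
import random

random.seed(0)

# Hypothetical trial data: the "drug" shifts the outcome by a tiny
# 0.1 standard deviations relative to placebo.  A large enough sample
# makes even this minuscule effect statistically detectable.
n = 5000
placebo = [random.gauss(0.0, 1.0) for _ in range(n)]
drug = [random.gauss(0.1, 1.0) for _ in range(n)]

observed = sum(drug) / n - sum(placebo) / n

# Permutation test: how often does randomly relabelling the two groups
# produce a mean difference at least as large as the observed one?
pooled = placebo + drug
extreme = 0
trials = 1000
for _ in range(trials):
    random.shuffle(pooled)
    diff = sum(pooled[:n]) / n - sum(pooled[n:]) / n
    if abs(diff) >= abs(observed):
        extreme += 1
p_value = extreme / trials

# A small p-value licenses only "the drug differs from placebo".
# Whether a ~0.1-sd difference is clinically meaningful is a separate
# question the test does not answer.
print(f"difference = {observed:.3f}, p = {p_value:.3f}")
```

The point is not the mechanics of the permutation test, but the gap between “statistically distinguishable from placebo” and “produces the intended effect”.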
These distortions concern disciplines ranging from medicine to sociology, from economics to biology. In one semi-serious study aimed at debunking the mythology of p-values, listening to music by the Beatles was found to make undergraduates younger; in another, eating chocolate helped people to lose weight. All supported by robust p-values…
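The arithmetic behind such absurdities is easy to reproduce. In the hypothetical sketch below, one hundred “studies” test pure noise (fair-coin flips) with an exact binomial test; at the conventional 0.05 threshold, a handful of them come out “significant” by chance alone.

```python
import math
import random

random.seed(42)

def two_sided_p(heads, n):
    """Exact two-sided binomial test against a fair coin."""
    k = min(heads, n - heads)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# One hundred "studies", each flipping a fair coin 100 times.
# There is no effect anywhere: every null hypothesis is true.
n_studies, flips = 100, 100
significant = 0
for _ in range(n_studies):
    heads = sum(random.random() < 0.5 for _ in range(flips))
    if two_sided_p(heads, flips) < 0.05:
        significant += 1

# By construction, roughly 5% of null studies clear the 0.05 bar.
# Publish only those, and the literature fills with Beatles-and-chocolate
# findings backed by "robust" p-values.
print(f"{significant} of {n_studies} null studies look significant")
```

Selective reporting does the rest: the ninety-odd null results stay in the drawer, and only the lucky few reach print.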
The problem has become so serious that in 2016 the American Statistical Association issued a formal statement, explaining how and why the p-value is so often abused.
Fixing this, however, is not going to be easy. People with mathematical-logic backgrounds who have happened to work side by side with colleagues from the social or life sciences have sometimes found certain subtleties of inference to go unappreciated. In fact, statistics is one of mathematics’ most difficult domains, if not the toughest, and it often challenges hard-science folks too, mathematicians not excluded.
And there is more. “Decisions that are made earlier in data analysis have a much greater impact [than p-value analysis] on results — from experimental design to batch effects, lack of adjustment for confounding factors, or simple measurement error. Arbitrary levels of statistical significance can be achieved by changing the ways in which data are cleaned, summarized or modeled.” Or, to state it in one of my favorite mottos: if tortured long enough, data will confess to anything…
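The motto can be demonstrated with a toy simulation (the cleaning rules and names below are hypothetical): take a single data set containing no effect at all, apply several “defensible” cleaning and slicing rules, and keep the analysis with the smallest p-value. By construction, the best-looking variant can never do worse than the honest analysis, and the more variants you try, the better your chances of slipping under whatever significance threshold you fancy.

```python
import math
import random

random.seed(7)

def sign_test_p(xs):
    """Exact two-sided sign test: is the median different from zero?"""
    pos = sum(x > 0 for x in xs)
    n = len(xs)
    k = min(pos, n - pos)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# One null data set: pure noise, no effect whatsoever.
data = [random.gauss(0.0, 1.0) for _ in range(60)]

# "Torturing" the data: each rule below could be defended in a methods
# section, but trying all of them and reporting only the best is p-hacking.
variants = {
    "raw data": data,
    "outliers |x| > 2 removed": [x for x in data if abs(x) <= 2],
    "outliers |x| > 1.5 removed": [x for x in data if abs(x) <= 1.5],
    "first half only": data[:30],
    "second half only": data[30:],
    "warm-up samples dropped": data[10:],
}
p_values = {name: sign_test_p(xs) for name, xs in variants.items()}
best = min(p_values, key=p_values.get)

# The chosen variant's p-value is at least as small as the honest one,
# even though nothing real is in the data.
print(f"honest p = {p_values['raw data']:.3f}, "
      f"best p = {p_values[best]:.3f} ({best})")
```

This is exactly the “garden of forking paths” the quoted passage describes: no single step is fraudulent, yet the ensemble manufactures significance.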
This reproducibility crisis is a symptom of the data-science myth and its abuse: the idea that we can automatically infer conclusions from data, which leads to declining attention to the mechanics of logical inference.
Ultra-powerful artificial-intelligence software helps consolidate this myth. Deep Blue showed long ago that it could take all the logical steps needed to beat a chess grandmaster. AlphaGo demonstrates intuition by inferring a good Go move from a pattern-matching exercise. Watson infers the meaning of a human phrase using more or less the same strategy…
The extensive, and by now irreplaceable, use of computers in fields such as computational biology and many others may fool even very skilled humans, such as researchers, into thinking that the answer is always in the data. And that the data is, of course, Big.