One of the fundamental principles of science is reproducibility – the idea that a discovery is valid only if any scientist in any lab can conduct the same experiment under the same conditions and obtain the same results. Without reproducibility, we could not distinguish scientific fact from error or chance, and scientific “laws” would vary from place to place and scientist to scientist.
While reproducibility is an essential principle of the scientific process, it isn’t always easy to achieve. Recent studies1,2 in the field of biomedicine show that findings from an alarming percentage of scientific papers in even the top journals cannot be reliably reproduced by other researchers. Why does science fail to meet the basic standard of reproducibility? The current state of affairs results from a combination of the complex nature of modern scientific research, a lack of accountability for researchers, and the incentives created by a publish-or-perish culture in academia.
When I began my Ph.D. in Biological Sciences at Columbia in 2007, I had no biomedical research experience, and my understanding of conducting and publishing research was, like that of many non-scientists, limited. I knew that researchers design and perform experiments, write up their findings, and submit them to a journal for publication. In the review process, peers with relevant expertise evaluate the experimental design and findings and deem a paper acceptable for publication only if it meets standards of quality and novelty. These standards include proper study design and appropriate methods, analysis, and statistical testing.
What I did not appreciate as a new graduate student was that the generation and evaluation of data are extremely complex and subject to error and bias. While the scientific method and peer review are the best tools we have for studying the world around us, they are far from perfect. Identical studies by different labs should yield the same results, yet they often fail to do so. And while competing models or theories in the scientific literature are healthy byproducts of the scientific process, a lack of reproducibility is not.
Unfortunately, basic biomedical research(a) has a reproducibility problem, now widely acknowledged after systematic analyses by two biotechnology companies revealed that the major findings of fewer than a quarter of the published papers they reviewed could be reproduced. One study examined 67 articles; the authors were able to replicate the results of only 25% of them.1 While the reproducible results were robust – that is, they held up under a variety of tests – the results of the other 75% of studies could not be reproduced even when the methods outlined in the original papers were followed exactly.
In the other study, when a result could not be reproduced, the lab that conducted the original experiment was consulted about their methods and, in some cases, asked to repeat the experiment themselves.2 Even with this outreach to the original researchers, the major findings of just 10% of the 53 papers analyzed could be reproduced. In both studies, reproducibility did not correlate with the quality or rank of the journal that published the research.(b) Reproducibility problems were identified even in top journals like Nature, Science, and Cell, which tend to publish groundbreaking studies and have special clout within the scientific community.
The aforementioned studies collectively examined 120 papers relevant to the companies’ drug discovery programs, which represents only a tiny fraction of the overall field.(c) However, there is a consensus among nearly every researcher I know, and increasingly the scientific community as a whole, that reproducibility is a common problem. The scope of the problem is documented in a recent survey in which half of researchers reported being unable to reproduce a published finding at some point.3
In my experience, most researchers agree that the principal cause is not outright fraud, but rather the highly technical and disjointed nature of scientific research.(d) Biomedical researchers study extremely complex, often microscopic, biological systems. Experimental protocols frequently span days and require sophisticated instruments and hundreds of reagents – the chemical and biological ingredients for an experiment. Every lab uses different equipment, different standard protocols, different brands of reagents, and different scientists. While descriptions of materials and methods are included in published papers, these sections often lack sufficient detail about all these potential sources of variation.
The complexity of experimental procedures can even prevent researchers from reproducing their own data. For example, I recently found that a protein whose structure I had previously characterized (i.e. analyzed) was showing different characteristics in later analyses. I often produce proteins by inducing their expression (i.e. production) in a harmless strain of E. coli, then isolating the proteins from the bacteria. In examining my notes, I noticed that in my recent preparations, I was chilling the bacteria prior to inducing protein expression, which I had not done in my earlier experiments. After further testing, I found that the change in protocol caused this particular protein to misfold. It is likely that variations of this sort, which scientists may not expect to alter results, lead to different outcomes across experiments.
Human error can also cause significant variation, especially when measurements are sensitive or results rely on subtle differences in those measurements. Biomedical research is increasingly specific, focused on isolating ever-smaller components of complex biological systems and detecting very minor changes to these systems.(e) The smaller the measurement, the more it is affected by error, and the more likely it is that small variations in experimental protocols or analyses will alter the outcome.
While these problems are inherent to complex scientific experimentation, researchers and publishers also take a very relaxed attitude toward the rigorous reporting of materials and methods, information that is essential for other scientists to accurately reproduce experiments. One study found that only half of all reagents mentioned in over 200 recent articles from a range of journals and fields could be adequately identified, indicating a failure of researchers to comprehensively report the reagents they use and of editors and reviewers to require such reporting.5 Again, these shortcomings were found across a wide range of journals and were not correlated with journal impact factor. Many researchers can also attest to an infinite regress of materials and methods: a protocol in Smith et al. references Jones et al., which references Frank et al., which references Scott et al., which was published in 1967. In this game of telephone, important details are lost along the way, making it difficult to repeat experiments precisely in a different lab.
In addition to a lack of precision in reporting materials and methods, some scientists display a lack of rigor in their use of statistics. In my experience, not enough attention is paid to statistics when designing and analyzing experiments, in part because little formal training is provided in the use of the specific statistics appropriate for a given field or research environment. A common standard is to perform three independent iterations of a given experiment. Beyond this, there are no guidelines regarding how many data points each experiment should measure, and the number is often chosen according to what is most convenient for the researcher.
When designing experiments, researchers may neglect to consider the statistical implications of choices like the number of subjects being studied. For example, a recent review found that the average statistical power in neuroscience studies was “very low” due to small sample sizes and effect sizes.6 At the other end of the spectrum, statistical analyses become extremely complex in research involving large data sets, such as studies of gene expression profiles for the approximately 30,000 human genes. When datasets are large and statistical analyses are complex, it is challenging for the average scientist reading a paper to evaluate their quality. These problems are exacerbated by editors and reviewers who may not closely examine study design or statistical tests. Reviewing a manuscript thoroughly requires a significant investment of time – a commodity in short supply for most reviewers, who are busy running their own labs.
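To make the consequences of low statistical power concrete, here is a small simulation of my own (not from the cited review; the effect size, sample size, and cutoff are invented for illustration). It generates thousands of hypothetical two-group studies with a small true effect and small samples, and shows that few studies reach significance – and that the ones that do tend to exaggerate the true effect:

```python
import math
import random
import statistics

random.seed(0)

TRUE_EFFECT = 0.3   # small true group difference, in standard-deviation units
N_PER_GROUP = 10    # small sample size, typical of underpowered studies
N_STUDIES = 5000    # number of simulated studies
T_CRIT = 2.1        # approximate two-sided 5% cutoff for ~18 degrees of freedom

significant_effects = []
for _ in range(N_STUDIES):
    control = [random.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
    treated = [random.gauss(TRUE_EFFECT, 1.0) for _ in range(N_PER_GROUP)]
    diff = statistics.mean(treated) - statistics.mean(control)
    se = math.sqrt(statistics.variance(treated) / N_PER_GROUP
                   + statistics.variance(control) / N_PER_GROUP)
    if abs(diff / se) > T_CRIT:  # the study "finds" a significant effect
        significant_effects.append(diff)

power = len(significant_effects) / N_STUDIES
avg_sig = statistics.mean(significant_effects)
print(f"fraction of studies reaching significance (power): {power:.2f}")
print(f"average effect among significant studies: {avg_sig:.2f} "
      f"(true effect: {TRUE_EFFECT})")
```

In runs like this, only a small minority of studies detect the effect, and the average effect reported by the "successful" studies is roughly triple the true value – a toy version of why underpowered literatures are hard to reproduce.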
While some failures of rigor on the part of scientists may be due to busyness, lack of training, or relaxed attitudes in the field, there are also less benign reasons for lack of reproducibility, such as the cutthroat culture of “publish or perish.” The advancement of scientists at all stages of their careers rests heavily on their publication records. Graduate students need published papers to receive their doctorates and postdoctoral researchers and principal investigators (leaders of labs) need publications to secure funding grants, which they need to acquire jobs and eventually tenure. A scarcity of funding and academic positions has intensified the competition to publish in high-impact journals. Even for individuals leaving academia for the private sector, publication records are used to measure productivity and gauge potential in other professions.
Unfortunately, scientists are typically evaluated based on the number of papers they have published and the quality of the journals in which they have published, but not on whether their findings can be reproduced. The “publish or perish” culture drives researchers to dig for significant results they can publish, and in the process may introduce subtle biases toward reporting results in a manner that inflates the importance of a study and, by proxy, its authors. Whole sets of experiments that do not fit squarely with a hypothesis may be omitted from the published work to make the findings seem more convincing. The motivation to find significant results can also creep into the more subjective parts of data analysis, such as quantifying subtle changes in protein levels or visually evaluating the distribution of a protein within a cell. To obscure such ambiguity – and retain possession of valuable techniques – researchers may intentionally include vague materials and methods sections.(f)
While the catalog of challenges to ensuring that scientific findings are reproducible may seem daunting, there are also inherent incentives for researchers to do high-quality, reproducible science. No one wants to publish hastily only to later be proven wrong. More importantly, scientists want to accurately understand natural processes in order to develop therapeutic possibilities and satisfy their curiosities about the natural world. Reproducibility ensures that we approximate some truth about the complex systems we study, and this quest for truth is the ultimate purpose of scientific inquiry. In my next article, I will examine how we can come closer to the truth we seek by exploring steps that individual scientists and the field as a whole can take to address obstacles to reproducibility and encourage the production of high-quality science.
This article is the first in a two-part series. It addresses the causes of the reproducibility problem in biomedical research, while the second article examines potential solutions.
1. Florian Prinz, Thomas Schlange, and Khusru Asadullah (2011) Believe it or not: how much can we rely on published data on potential drug targets? Nature Reviews Drug Discovery, 10(9): 712.
2. C. Glenn Begley and Lee M. Ellis (2012) Drug development: Raise standards for preclinical cancer research, Nature, 483: 531-533.
3. Aaron Mobley, Suzanne K. Linder, Russell Braeuer, Lee M. Ellis, and Leonard Zwelling (2013) A Survey on Data Reproducibility in Cancer Research Provides Insights into Our Limited Ability to Translate Findings from the Laboratory to the Clinic, PLoS ONE, 8.
4. Ferric C. Fang, R. Grant Steen, and Arturo Casadevall (2012) Misconduct accounts for the majority of retracted scientific publications, Proceedings of the National Academy of Sciences, 109(42): 17028-17033.
5. Nicole A. Vasilevsky, Matthew H. Brush, Holly Paddock, Laura Ponting, Shreejoy J. Tripathy, Gregory M. LaRocca, and Melissa A. Haendel (2013) On the reproducibility of science: unique identification of research resources in the biomedical literature, PeerJ, 1: e148.
6. Katherine S. Button, John P. A. Ioannidis, Claire Mokrysz, Brian A. Nosek, Jonathan Flint, Emma S. J. Robinson, and Marcus R. Munafò (2013) Power failure: why small sample size undermines the reliability of neuroscience, Nature Reviews Neuroscience, 14: 365-376.
- (a) Basic biomedical research examines the fundamental mechanisms of molecular, cellular, and physiological processes and generally does not involve human subjects. At the other end of the spectrum are clinical studies that test specific treatments for disorders in humans.
- (b) Journal quality is typically evaluated using a measure called the impact factor, which ranks journals based on the frequency with which other journals cite their articles.
- (c) One of the main catalogs of published biomedical research, PubMed, contains over 23 million citations.
- (d) When falsified data or gross errors that invalidate findings come to light, researchers often choose or are pressured to retract a paper. The overall retraction rate is very low: only 0.01% of scientific papers have been withdrawn after publication, though this number has increased tenfold since the 1970s.4 The infrequency of retraction suggests that the kind of fraud that results in retraction is a much smaller problem than lack of reproducibility, which is generally not seen as grounds for retraction.
- (e) In 2008, for example, Nature’s Method of the Year and the runner-up for Science’s Breakthrough of the Year both involved advances in microscopy that allowed for unprecedented resolutions in viewing intracellular biology and vertebrate development, respectively.
- (f) The “overly honest methods” hashtag on Twitter and other social media provides a humorous glimpse into the role convenience can play in methodology choices.