Science is facing a crisis in which error, omission, and fraud threaten the quality of the evidence we rely on to make decisions such as which medical treatments to use and which social programs to fund.1–3 One evidence-based strategy for improving the quality of scientific research is the adoption of reproducible research practices.
Using reproducible research practices speeds up scientific discovery, fosters greater exchange of ideas among scientists, and reduces research waste.4,5 Research papers that share data contain fewer errors, papers that share data or code are cited more often, and shared datasets are reused in more publications.5,6
The modules in this toolkit will guide you through building a reproducible research project. Pastry chefs know that to re-create that most excellent soufflé, they must have the same ingredients and a clear and complete recipe to follow. Likewise, to reproduce scientific research, scientists need access to the data and a clear and complete recipe for data management and analysis.4,7,8 After this brief introductory module, the other four modules in our toolkit will help you learn:
We invited several researchers to help test and improve the toolkit; each used it to format, organize, and post materials from a recent publication. The GitHub repositories for these projects are available as examples:
For best results, we suggest reviewing each module in its entirety before implementing the changes. Use our reproducibility checklist to keep track of your progress as you go through the modules.
If you want to know more about reproducible research, there are many other publications9–19 and online resources available.
We know that everyone has a favorite, go-to statistical software platform. Throughout the modules, we provide reproducible examples using four platforms popular in public health: R, SAS, SPSS, and Stata. We hope this makes our modules useful and accessible to everyone.
One overarching recommendation we make is to use non-proprietary file formats when making data and code public. In most platforms, data can be saved or exported as a comma-separated values (CSV) file, and in many, code or codebooks can be saved as a PDF or in a plain-text native format. If you use a platform where codebooks cannot be saved in a non-proprietary format, see the examples in Module 4: How to document data and code for sharing. Files like these (CSV, PDF, TXT) can then be imported back into most other platforms, so end users can reproduce your work in their preferred software. We have two main motivations for promoting non-proprietary formats for reproducibility: 1) they increase access to your code and data and widen the audience of potential reproducers of your awesome work; and 2) money. If you are no longer a student, think back to when you found the perfect dataset but did not have access to Stata or SAS and could not even think of buying a license.
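As a quick illustration, here is a minimal sketch in base R of the export-and-reimport round trip we have in mind; the data frame study_data and the file name study_data.csv are hypothetical placeholders, not materials from the toolkit itself:

# Minimal sketch; 'study_data' and 'study_data.csv' are hypothetical
# placeholders used only to illustrate the round trip.

# A small stand-in data frame representing study data
study_data <- data.frame(
  id  = 1:3,
  age = c(34, 51, 29),
  bmi = c(22.4, 27.8, 31.0)
)

# Export to a non-proprietary CSV file that any platform can read
write.csv(study_data, "study_data.csv", row.names = FALSE)

# An end user, in R or elsewhere, can then import the CSV;
# reading it back in R looks like this:
reimported <- read.csv("study_data.csv")
str(reimported)

We use base R's write.csv() and read.csv() here because they ship with R and require no extra packages; SAS, SPSS, and Stata offer comparable commands for exporting to and importing from CSV.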
1. Fang FC, Steen RG, Casadevall A. Misconduct accounts for the majority of retracted scientific publications. Proc Natl Acad Sci USA. 2012;109(42):17028-17033.
2. Steen RG. Retractions in the scientific literature: Is the incidence of research fraud increasing? J Med Ethics. 2011;37(4):249-253.
3. Prinz F, Schlange T, Asadullah K. Believe it or not: How much can we rely on published data on potential drug targets? Nat Rev Drug Discov. 2011;10(9):712.
4. Peng RD, Dominici F, Zeger SL. Reproducible epidemiologic research. Am J Epidemiol. 2006;163(9):783-789.
5. McKiernan EC, Bourne PE, Brown CT, et al. How open science helps researchers succeed. eLife. 2016;5:e16800.
6. Piwowar HA, Day RS, Fridsma DB. Sharing detailed research data is associated with increased citation rate. PloS One. 2007;2(3):e308.
7. Peng R. The reproducibility crisis in science: A statistical counterattack. Significance. 2015;12(3):30-32.
8. Towards transparency. Nat Geosci. 2014;7:777.
9. Ali-Khan SE, Harris LW, Gold ER. Motivating participation in open science by examining researcher incentives. eLife. 2017;6:e29319.
10. Anderson CJ, Bahník Š, Barnett-Cowan M, et al. Response to comment on "Estimating the reproducibility of psychological science." Science. 2016;351(6277):1037.
11. Begley CG, Ioannidis JP. Reproducibility in science: Improving the standard for basic and preclinical research. Circ Res. 2015;116(1):116-126.
12. Camerer CF, Dreber A, Forsell E, et al. Evaluating replicability of laboratory experiments in economics. Science. 2016;351(6280):1433-1436.
13. Goodman S, Greenland S. Assessing the unreliability of the medical literature: A response to "Why most published research findings are false." Johns Hopkins University, Department of Biostatistics Working Paper 135; 2007. Available at: http://www.bepress.com/jhubiostat/paper135. Accessed 1 December 2017.
14. Hirschhorn JN, Lohmueller K, Byrne E, Hirschhorn K. A comprehensive review of genetic association studies. Genet Med. 2002;4(2):45-61.
15. Ioannidis JP. Why most published research findings are false. PLoS Med. 2005;2(8):e124.
16. Ioannidis JP. How to make more published research true. PLoS Med. 2014;11(10):e1001747.
17. Laine C, Goodman SN, Griswold ME, Sox HC. Reproducible research: Moving toward research the public can really trust. Ann Intern Med. 2007;146(6):450-453.
18. Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015;349(6251):aac4716.
19. Peng RD. Reproducible research in computational science. Science. 2011;334(6060):1226-1227.