It’s a common story. A bright young graduate student starts their research program with high ambitions. Six months later they’re staring at hundreds of genomes, thousands of pages of digital text, or hundreds of thousands of environmental measurements and wondering how to even begin to analyze them.
Their academic training hasn’t prepared them for the day-to-day challenge of organizing, managing, and analyzing large datasets – they’ve hit their data pain point. Even researchers working with relatively small datasets (for example, hundreds of survey responses) face challenges scaling up their fields’ traditional data management and analysis techniques for today’s highly technical, data-rich research landscape. In a recent survey of 704 principal investigators for National Science Foundation biology grants, the majority said their most important unmet data needs were not software or infrastructure, but training in data integration and data management.[1]
This lack of data skills is holding back progress toward more reproducible research by making it harder for researchers to share, review, and reanalyze one another’s data. In a recent survey by Nature, the top four solutions that scientists identified for improving reproducibility related either to a better understanding of statistics and research design or to improved mentoring, supervision, and teaching of researchers.[2] Data skills need to be an integral part of academic training in order to ensure that research is reliable, transparent, and reproducible.
My organization, Data Carpentry, and our sister organization, Software Carpentry, are among the groups filling this gap by training researchers in the latest technologies and techniques for cleaning, organizing, cataloging, analyzing, and managing their data. We see this training as an important part of a larger project to transform academic culture to make research more reproducible and transparent.
Drowning in Data
New tools for gathering, storing, and sharing information have made an unprecedented amount of data available to researchers. For example, sequencing a full human genome now costs less than $1,000, and data repositories house massive amounts of genetic data for use by researchers and clinicians.[3] The integration of technology into our day-to-day lives also produces a massive amount of data: A widely publicized study on the emotional tone of people’s Facebook activity involved nearly 700,000 subjects and millions of online posts.
Most researchers will need to interact with large datasets at some point in their careers. When they do, many realize they’re unprepared for the challenge. Being unfamiliar with computational tools and workflows, they may find themselves carrying out repetitive and error-prone tasks by hand. If they write in-house scripts for cleaning or analyzing their data, they may fail to document their code in a way that allows it to be checked and used by other researchers. If using code written by others, they may not properly test its utility for their dataset. They may fail to document the parameters they select and the software version they use, information that is important for other researchers seeking to replicate their results. Combined, these and other issues impose an enormous cost on both researchers’ productivity and the reliability and reproducibility of their results.
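To make the point about undocumented parameters and software versions concrete, here is a minimal sketch in Python of how a researcher might record them alongside an analysis. It is an illustration only, not a Data Carpentry lesson or a prescribed method: the function names (`build_run_record`, `save_run_record`), the example parameter names, and the package names are all hypothetical.

```python
import json
import platform
from datetime import datetime, timezone
from importlib import metadata


def build_run_record(parameters, packages=()):
    """Collect details another researcher would need to repeat an analysis.

    `parameters` is a dict of the settings chosen for this run; `packages`
    is a list of installed package names whose versions should be recorded.
    Both are hypothetical inputs chosen for this sketch.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap explicitly rather than silently skipping it.
            versions[name] = "not installed"
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "parameters": dict(parameters),
        "package_versions": versions,
    }


def save_run_record(record, path):
    """Write the record as human-readable JSON next to the analysis outputs."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Illustrative parameters only; a real analysis would pass its own.
    example = build_run_record(
        parameters={"min_quality_score": 30, "trim_adapters": True},
        packages=["numpy", "pandas"],
    )
    print(json.dumps(example, indent=2, sort_keys=True))
```

Saving a small file like this with every set of results is one low-effort way to give other researchers the information they need to attempt a replication.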
Our current approach to training academics doesn’t provide dedicated space for learning how to organize, clean, store, and otherwise manage data, because our model developed before this type of training was needed. Datasets were either small enough to be analyzed using simple computational tools or were handed off to data specialists. This is no longer the case, and researchers who aren’t prepared to handle data are forced either to teach themselves data skills piecemeal or limit themselves to questions that can be answered with smaller datasets and computationally simpler approaches. Without proper training, they may practice poor data hygiene and produce results that other researchers can’t understand or replicate.
Preparing a New Generation of Researchers
Ideally, training in how to organize, clean, store, and analyze data in reproducible and computationally sound ways would be an ongoing part of a researcher’s education starting early in their academic career. However, two major barriers have kept this ideal from becoming a reality: The need for data skills training isn’t widely recognized by those responsible for setting undergraduate and graduate curricula, and there is a shortage of instructors with the expertise to teach data skills.
Overcoming these barriers requires changing the culture at universities and research institutions. We need a large body of early career researchers with the skills to be competent and confident in their data management and analysis and the passion to act as advocates for the importance of data skills, reproducible research, and data transparency at their institutions. Data Carpentry and Software Carpentry are training this new generation of data champions and, along the way, building an army of devoted instructors who can train others.
Data Carpentry creates and delivers hands-on, interactive workshops providing fundamental data skills to researchers around the world. Our goal is to empower researchers to manage and analyze their data in reproducible ways and to make their data and analyses available for others to review and reuse. Together with our sister organization, Software Carpentry, we’ve reached more than 6,000 learners in the past year in over 25 countries. We’ve also trained more than 800 volunteer instructors in evidence-based teaching practices and active learning strategies. These instructors often adapt our teaching practices and curricula (openly available under a Creative Commons license) for other contexts, spreading our impact further.
By focusing on helping data novices develop basic familiarity with a core toolkit and cultivate strategies for future self-directed education, we hope to establish the foundation for lifelong learning. This is essential because data management and analysis tools are continuously evolving, meaning that new techniques will need to be learned over the course of a researcher’s career. In addition, lifelong learners are more likely to become advocates for educating others.
Transforming Academic Culture to Value Data Sharing & Reproducibility
Our goal is to create an academic world where our organization is no longer necessary because researchers receive training in best practices for research and data management throughout their careers. Our strategy is to transform academic culture from the ground up by taking advantage of what we know about how cultural change works.[4] Some of these strategies may be useful for others trying to create a lasting shift in academia toward reproducibility and transparency.
We know that it’s hard to change people’s attitudes, but it’s necessary in order to truly change their practices in a lasting way. Many researchers come to us because they’ve hit a point in their workflow where they can’t move forward without certain data skills. In addition to giving these ready-made allies the skills they need, we attempt to change their attitudes and turn them into advocates for data practices that promote reproducibility and transparency. By targeting researchers early in their careers, we ensure that our impact grows over time as they pass along these principles in their labs and classrooms.
We also know that people aren’t content to implement ready-made solutions, but want to modify strategies to match their own needs and contexts. Our curricula are collaboratively developed and extensively tested by our community, but we encourage our instructors to modify them to suit the students. Our lessons are also tailored to specific academic domains in order to reduce cognitive load and enable learners to directly apply the principles and techniques they learn to their own data. We want students to be able to walk out of our workshops and immediately use what they’ve learned, so they can see the lasting value of data skills for their field.
Catalyzing Long-Term Change
By turning early career researchers with specific skill needs into long-term advocates for data practices that support transparency and reproducibility, Data Carpentry hopes to catalyze change in research and instructional culture worldwide. Our workshops are in high demand, usually filling up within days of opening, and people are lining up to volunteer to teach with us. We’re expanding into new regions, including Central and South America and Africa, and into new academic domains in the social sciences and humanities. We’re also developing ways for local communities of researchers to continue learning after our workshops.
Together we’re working to make that common story of hitting a “data pain point” a less common one. That bright new graduate student might instead be a Carpentry instructor, bringing data literacy to a campus near you. They could also be a powerful advocate for an academic system where every researcher is equipped to organize and share data so that others can reproduce and reuse it, improving the quality and reliability of research.
This article is part of a series on how scholars are addressing the “reproducibility crisis” by making research more transparent and rigorous. The series was produced by Footnote and Stephanie Wykstra with support from the Laura and John Arnold Foundation. It was published on Footnote and Inside Higher Ed.
1. Barone, L., Williams, J., and Micklos, D. (2017) “Unmet Needs for Analyzing Biological Big Data: A Survey of 704 NSF Principal Investigators [preprint],” BioRxiv.
2. Baker, M. (2016) “1,500 scientists lift the lid on reproducibility,” Nature, 533(7604).
3. Marx, V. (2013) “Biology: The big challenges of big data,” Nature, 498(7453): 255–260.
4. Henderson, C., Beach, A., and Finkelstein, N. (2011) “Facilitating Change in Undergraduate STEM Instructional Practices: An Analytic Review of the Literature,” Journal of Research in Science Teaching, 48(8): 952–984.