Reproducible data analysis workflows with R and RStudio

A talk for postgraduate students in the College of Medicine and Health at University College Cork (by Darren Dahly and Brendan Palmer)

Warning: This post is “under-development” so please mind your head.

Most of you are at the start of your academic research career. I hope you all have the opportunity to stay in this field as long as you want, and get the same enjoyment from it that we do. There is probably not another job where you can enjoy this much autonomy (eventually) while making a good salary (eventually), all while being in a position to constantly learn new things and apply that knowledge to solving real problems. For those of us in medicine and health, this latter point is especially salient, and one day you might even see your work directly contribute to interventions and policies that change people’s lives for the better.

When we talk about academic research, we tend to romanticize, as we have just done, our brilliant ideas and their inevitably positive impact on society. Unfortunately we often forget about all the stuff in the middle, stuff that is decidedly less exciting, but absolutely crucial for research to have meaningful impact. This is a talk about one of these things, your study’s data, and in particular how you can safely turn it into your study’s results.

When we talk about research data, we have to consider its entire life cycle. We often start with some source data, which can usually be directly verified (e.g. a medical record). Then we will usually need to merge, manipulate, and modify the data so that it can be analyzed. Then once the analysis is done, the results need to be accurately shared with others. Importantly, errors introduced at any point in this life cycle can invalidate your results, often without you knowing it, and even lead to harms (more on this below). Despite this, academics often pay little attention to their data practices (see Leek and Peng, 2015, Nature 520, 612).

What’s the worst that can happen?

Since we will be asking you to pay more attention to your data practices than most other academics, including perhaps your PIs and other supervisors, it’s important to first share examples of how things can go wrong.

Sometimes researchers engage in problematic (and potentially fraudulent) research activities that encompass multiple aspects of how they manage and analyze data. For example, the long saga of Brian Wansink, or more recent concerns about Jonathan Pruitt. However, this is not really the kind of behavior we are trying to improve — if you want to go out and cheat your way to fame and fortune, there probably isn’t anything the rest of us can do to stop you (except for the steady progress people seem to be making in the field of error and fraud detection).

Instead, we are interested in preventing relatively simple, and entirely avoidable, errors in the data pipeline that can lead to catastrophic results — even results that lead us to conclude the opposite of what the data actually support. Examples are depressingly easy to find:

To be clear, I am not highlighting these cases to shame anyone. In each case the authors correctly responded when the errors were found, to their credit. The point of course is that these simple but impactful errors can and do happen.

Seek and ye shall find.

Many academics view the above examples as outliers, but I think data errors like this are much more common. I think this for one simple reason — many scientists, maybe even most of them, don’t employ safety or quality management systems for their data practices; and if you aren’t looking for errors, you aren’t going to find them.

I spent the first 10 years of my own career like this. I always thought I was pretty cautious, and thus blissfully assumed my data analyses were error free. This changed when I started working in clinical trials, many of which are regulated, meaning that there are actual laws in place that govern their conduct. This of course makes sense — clinical trials of medicines often involve exposing humans to unknown but potentially severe health risks. As such, regulated clinical trials employ quality and safety management systems that are largely aimed at safeguarding human life.

However, these systems (and the mindset they require) also extend to other aspects of trials, including data management and the software used for analyzing data. Thus, one of my first tasks as a trial statistician was to develop our local standard operating procedures that would satisfy the regulatory requirements aimed at preventing, detecting and correcting data errors.

Only you can prevent forest fires.

I want to make one more important point before we move on, which is that nobody else is going to catch these errors for you. Many of you will work years without a research supervisor asking you to “show your work”. You might have SOPs for using lab equipment or for certain experimental procedures, but none for analyzing the resulting data. You will send off your papers for review, and not a single reviewer or editor will ask for any evidence whatsoever that you handled and analyzed your data correctly, or in a manner that minimized risks of errors (this is thankfully changing, though slowly, through things like the Peer Reviewers’ Openness Initiative). Even when we do manage to find errors (and even fraud) it can take a herculean effort to correct the record. It is thus imperative that you take complete responsibility for preventing errors in your own work. It is very unlikely that anyone else will do it for you.

How to prevent errors in your data pipeline

What follows is a loose progression of things you can do to make your data pipeline more resilient to errors. Perfection is the enemy of the good though, so you should abandon any desire to employ all of these things right away. Just try to incorporate one or two at a time into your workflow.

Most of these things touch on three key themes: professionalism, transparency, and the (hopefully obvious) fact that humans make mistakes. By professionalism, I just mean that you take your data practices seriously, as if they were an important part of your job, and approach everything you do with intention. Transparency then just means doing things in an open manner where other humans could see exactly what you’ve actually done with as few barriers as possible (ideally zero). These two aspects also tend to reinforce each other — the mere threat of someone actually seeing my work (whether they actually do or not) tends to make me approach it more professionally; and I’m much happier to share professional-quality work while keeping the more amateurish stuff hidden away. Finally, because humans make mistakes, it’s usually a good idea to find ways to let the computer do as much as possible.

The most important step: start using scripted analyses

If you are serious about analyzing data, you should always script your analyses. By scripting, I simply mean writing out all the code needed to completely replicate a given piece of work, starting with the source data (or as close to it as you can get) and ending with clearly communicated results following from the analysis of the data. This is the single most important change you need to make to improve quality control for your analyses. You can’t possibly catch mistakes if you don’t write them down, and scripting unlocks pretty much everything else that follows.

People seem to think that scripting an analysis requires years of training as a software engineer or a “computer coder”. It doesn’t. Don’t even think of it as code. It’s just writing down exactly what you would do using a GUI (i.e. “I pointed at the little button next to the big button and clicked it”). The only difference is that you might need to learn the correct terms and syntax, but this shouldn’t be any more of a challenge than learning what buttons to push in the first place.

Importantly, your code doesn’t need to be perfect. There is no such thing as perfect code anyway. Remember, the primary purpose of code in this context is reproducibility, not elegance or speed.
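To make this concrete, here is a sketch of what a complete, if tiny, scripted analysis might look like in R. The file name, variable names, and model are all hypothetical — the point is only that one file carries you from source data to results:

```r
# A minimal scripted analysis: everything from source data to results
# lives in this one file, so the whole analysis can be re-run at any time.
# Note: "trial_data.csv" and all column names here are hypothetical.

library(tidyverse)

# 1. Read the source data (never edit the source file itself)
trial <- read_csv("data/trial_data.csv")

# 2. Derive any variables needed for the analysis
trial <- trial %>%
  mutate(bmi = weight_kg / (height_m ^ 2))

# 3. Fit the model
model <- lm(sbp ~ bmi + age, data = trial)

# 4. Report the results
summary(model)
```

Nothing here is beyond someone who has only ever used menus and buttons — each line corresponds to a click you would otherwise have made, except now it is written down and repeatable.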

Then work on improving your scripts

That said, once you make that critical commitment to start scripting your data analyses, you should work to improve your scripts. The first and most important thing to work on is how you organize and annotate your scripts. The code itself is for the computer, but to make your work truly reproducible, the script overall needs to be “human-readable” as well. So just like any other form of communication, your script(s) should be well written. This means clear, concise, and well-organized, just like a middle-school essay. You should add notes everywhere, explaining exactly what it is that you are doing. You should use headings and subheadings, to divide scripts up into manageable pieces of work, and further guide the reader. Note: this isn’t just so other people can read your code, though this is an admirable goal — it’s for you too, when a reviewer asks you to check something in the analysis of the paper you submitted eight months ago.
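In RStudio, for example, comment lines ending in four or more dashes create collapsible sections that also appear in the editor’s outline pane. A sketch of what this kind of organization and annotation might look like (the study, files, and rules mentioned are invented for illustration):

```r
# Data cleaning script for the (hypothetical) ACME trial
# Author: ...
# Last updated: ...

# Load packages ----------------------------------------------------------

library(tidyverse)

# Read source data -------------------------------------------------------

# The raw export from the data management system; treat it as read-only.
raw <- read_csv("data/raw_export.csv")

# Derive analysis variables ----------------------------------------------

# Site IDs above 100 are pilot sites and are excluded per the protocol.
analysis <- raw %>%
  filter(site_id <= 100) %>%
  mutate(age_group = if_else(age >= 65, "Older", "Younger"))
```

Notice that the comments explain *why* (pilot sites excluded per the protocol), not just *what* — that is the information future-you will need eight months from now.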

Finally, once you get into the swing of organizing and annotating your scripts, you can start thinking about little ways to make the machine-readable code (i.e. the technical bits of code that are actually executed by the machine) as human-readable as it can be as well. First, start using a style guide. Consistently styled code is more readable, and makes it easier to spot errors since they will stand out more in a less chaotic body of code. There are no absolutes, so just choose and/or adapt one to your preferences. Here are some examples for R: the Advanced R Style Guide by Hadley Wickham; Google’s R Style Guide; and the Tidyverse Style Guide.

I am also a big fan of alignment, even if it creates a lot of white space in your code, simply because it makes it easier to scan a script for errors.
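For instance, aligning the assignment operators lets the values form a single column that your eye can run straight down (the parameter names below are invented for illustration):

```r
# Unaligned: functionally identical, but harder to scan
alpha <- 0.05
n_per_arm <- 120
followup_weeks <- 52

# Aligned: a stray or missing value stands out immediately
alpha          <- 0.05
n_per_arm      <- 120
followup_weeks <- 52
```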

You will also, with experience, start to see opportunities to write code whose purpose is easier to understand at a glance. Again, aim for readability and reproducibility over brevity.

Now start thinking like a software engineer

Software engineers often have to write thousands of lines of code, where a single typo might completely “break” the entire program they are writing. Much worse though is the threat of a “silent error”, one that erroneously changes the result of the program, but not in a way that is obvious. Given the potential costs of such an error, including bankruptcy for the software company, it shouldn’t be surprising that software engineers go to considerable lengths to prevent such errors altogether, and devise processes for finding them quickly when they do occur (and they do occur). This is, quite simply, an important part of their job.

Since an error in your script could result in something similarly catastrophic (assuming you do meaningful, serious research), surely then we should make error prevention and detection an important part of our job too, right?

Don’t repeat yourself

A useful principle in software engineering that you should try to follow is Don’t Repeat Yourself (DRY). The logic behind this principle is pretty simple. The more you type, the more opportunity there is for typos to creep into your code, and the more work it takes to go back and make changes to your code. As it happens, computers are very good at repeating things for you, so anytime you find yourself doing the same thing over and over, you can write yourself a little program or function that represents that entire process, and then use iteration to repeat the process for different inputs. If this sounds complicated, it really isn’t. When you are ready, have a look at the material here and here.
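As a sketch of what this looks like in practice, suppose you need the same summary for several outcome variables (the data frame `trial` and the column names below are hypothetical). Rather than copy-pasting the summary code three times, write it once as a function and let the computer repeat it:

```r
library(tidyverse)

# The repeated task, captured once as a function
summarize_outcome <- function(data, var) {
  data %>%
    summarize(
      mean = mean(.data[[var]], na.rm = TRUE),
      sd   = sd(.data[[var]],   na.rm = TRUE)
    )
}

# Iterate over the outcomes instead of repeating the code
# (sbp, dbp, and hr are hypothetical column names in `trial`)
outcomes <- c("sbp", "dbp", "hr")
results  <- map(outcomes, ~ summarize_outcome(trial, .x)) %>%
  set_names(outcomes)
```

If you later decide the summary should also report the median, you change one function rather than hunting down every copy-pasted block — which is exactly where errors creep in.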

Organize and Compartmentalize
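(This section is still a stub, but the core idea can be sketched. One common convention — and it is only a convention — is to keep each project self-contained, with raw data, scripts, and output kept strictly apart:)

```
my-project/
├── data/        # read-only source data; never edited by hand
├── scripts/     # numbered scripts: 01_clean.R, 02_model.R, ...
├── output/      # figures and tables, all regenerated by the scripts
└── report.Rmd   # the write-up that pulls everything together
```

The key property is that everything in output/ can be deleted and perfectly regenerated by re-running the scripts against the untouched source data.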

Test everything
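(This section is also still being written, but the gist can be sketched with base R’s stopifnot(): build explicit checks into your scripts so the pipeline halts loudly instead of carrying a silent error downstream. The data frame and expected values below are hypothetical:)

```r
# Defensive checks sprinkled through a cleaning script.
# If any assertion fails, the script stops with an error,
# rather than silently passing a bad value downstream.
# (`trial` and the expectations here are hypothetical.)

stopifnot(
  nrow(trial) == 240,                 # expected sample size
  !any(duplicated(trial$subject_id)), # IDs must be unique
  all(trial$age >= 18),               # adults-only trial
  all(trial$arm %in% c("A", "B"))     # only valid arm labels
)
```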


R Markdown

You can use scripts with pretty much any statistical program or language. If you are a Stata user, for example, there is no reason not to follow the above advice using a “do” file. However, from this point forward we are going to be talking specifically about RStudio, an integrated development environment for the statistical programming language R. This is because RStudio includes a number of tools that facilitate our approach to maintaining a high quality data pipeline (or perhaps it’s more accurate to say that our approach developed as a consequence of working with RStudio and the tools it facilitates).

The first of these tools is called R Markdown (Rmd), which is basically a type of script that combines analysis code, output from that code, and other formatted text elements, which can be rendered as various document types, such as an HTML page, a PDF, or a Word document. This means that I can combine the statistical methods and results sections for a paper, as well as the tables and plots, in a single document — and when something changes, such as the data, I can simply run the Rmd file again and everything that is dependent on the data (figures, plots, tables) will automatically update with the push of a button. The other amazing thing about R Markdown is that you can also use it to do things like write blog posts, books, and GitHub pages.
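A minimal Rmd file, as a sketch, interleaves formatted text with code “chunks”; when the document is rendered, each chunk runs and its output appears in place (the title and dataset here are just placeholders):

````markdown
---
title: "Analysis report"
output: html_document
---

## Results

Some formatted text describing the analysis goes here.

```{r}
# This chunk runs when the document is rendered,
# and its output is embedded in the report.
summary(mtcars$mpg)
```
````

Because the numbers in the rendered report come straight from the code, there is no error-prone copy-pasting of results from the statistics program into the manuscript.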


Other useful resources

JHU ADS 2020 Week 3 - Organizing a data analysis