Module 3: How to prepare statistical code to share

If you have ever had to go back to a statistical code file you wrote several months earlier and try to figure out what the heck you were doing, you are not alone. Writing well-formatted, organized code is key to making future you happy and to facilitating reproducibility. Sharing clear statistical code does more than support reproducibility: research papers with shared code are cited more,^1,2 benefitting you and your field. This module summarizes current recommendations for organizing and formatting statistical code to prepare it for sharing and to improve research collaboration, quality, and reproducibility.

Formatting and organizing code

There are recommendations for clear coding that apply regardless of the statistical language you are using. These practices fall into three categories:

Use space wisely
Choose meaningful names with consistent formatting
Introduce and explain the code

Follow the recommendations below to format your code. We recommend reading through all the recommendations before you begin to format your code.

Use space wisely

Recommendation #1: Use white space to separate processes

Code that is written without blank lines between procedures can run together and be difficult to interpret. Add an extra line of space before new procedures to make the code easier to follow. For example, this code is dense with little white space, making it difficult to read:

code_data_avail <- cbind(table(r$Q25_2_1),table(r$Q25_2_2))
colnames(code_data_avail) <- c("Did you make your data publicly available?","Did you make your code publicly available?"); code_data_avail <- melt(code_data_avail); colnames(code_data_avail) <- c("avail","data_or_code","number")
fig2 <- ggplot(code_data_avail, aes(x=data_or_code, y=number, fill=avail)) + geom_col(position="dodge") +  coord_flip() +   theme(legend.position = 'top') +   labs(y="Number of participants", x="", fill="") + scale_fill_manual(values=fills)
fig2

Adding some space between procedures starts to make it easier to read:

code_data_avail <- cbind(table(r$Q25_2_1),table(r$Q25_2_2))
colnames(code_data_avail) <- c("Did you make your data publicly available?","Did you make your code publicly available?"); code_data_avail <- melt(code_data_avail); colnames(code_data_avail) <- c("avail","data_or_code","number")

fig2 <- ggplot(code_data_avail, aes(x=data_or_code, y=number, fill=avail)) + geom_col(position="dodge") +  coord_flip() +   theme(legend.position = 'top') +   labs(y="Number of participants", x="", fill="") + scale_fill_manual(values=fills)
fig2

Recommendation #2: Limit line length to 80 characters

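Long lines of code are difficult or impossible to see on some computer screens, especially for people using laptop computers. Instead of continuing code on a single line, add hard returns so that code does not go beyond about 80 characters.

For example, starting each statement on its own line and breaking the plotting code at some of the plus signs shortens most of the lines, although a few still run past 80 characters: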

code_data_avail <- cbind(table(r$Q25_2_1),table(r$Q25_2_2))
colnames(code_data_avail) <- c("Did you make your data publicly available?", "Did you make your code publicly available?") 
code_data_avail <- melt(code_data_avail)
colnames(code_data_avail) <- c("avail","data_or_code","number")

fig2 <- ggplot(code_data_avail, aes(x=data_or_code, y=number, fill=avail)) + geom_col(position="dodge") +  coord_flip() +   
theme(legend.position = 'top') +   
labs(y="Number of participants", x="", fill="") + scale_fill_manual(values=fills)
fig2
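
One way to check a script against this limit is to count the characters on each line. The sketch below is a minimal illustration in base R, not part of the original recommendations: find_long_lines is a hypothetical helper name, and the three lines it checks are made up for the example.

```r
# Hypothetical helper (not from any package): report which lines
# of code are wider than the chosen limit.
find_long_lines <- function(code_lines, max_width = 80) {
  line_widths <- nchar(code_lines, type = "chars")
  too_long <- which(line_widths > max_width)
  data.frame(line = too_long, width = line_widths[too_long])
}

# Made-up lines of 79, 80, and 81 characters for illustration
code_lines <- c(strrep("a", 79), strrep("b", 80), strrep("c", 81))

# Only lines wider than 80 characters are reported
long_lines <- find_long_lines(code_lines)
print(long_lines)
```

For a real script, code_lines could instead be read in with readLines() and the path to the code file.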

Recommendation #3: Indent to group lines of code that belong together

When a function or procedure is very long, it can take multiple lines of code. To signal that several lines of code go together, indent each line after the first one.

Revise the code to shorten each line and indent lines that go together:

code_data_avail <- cbind(table(r$Q25_2_1),table(r$Q25_2_2))
colnames(code_data_avail) <- c("Did you make your data publicly available?",
                               "Did you make your code publicly available?")
code_data_avail <- melt(code_data_avail)
colnames(code_data_avail) <- c("avail","data_or_code","number")


fig2 <- ggplot(code_data_avail, aes(x=data_or_code, 
                                    y=number, 
                                    fill=avail)) +
  geom_col(position="dodge") +  
  coord_flip() +   
  theme(legend.position = 'top') +   
  labs(y="Number of participants", x="", fill="") +  
  scale_fill_manual(values=fills)
fig2

Choose meaningful names with consistent formatting

Recommendation #4: Use meaningful names for objects

It is common in research for variables to be named something like var1 or vm1q26b during data collection. Often these names are generated automatically from survey software or from using a numbered list in a word processing program. Once data have been collected, however, assigning meaningful names for variables, functions, and files will result in code that is easier to read and use for you and for others.

Meaningful names describe what information is stored in the variable, function, or file. For example, a variable measuring whether or not a participant has ever used a gun would be easier to understand if it were called gun_user or gunUse compared to var1 or even gu. Likewise, a function that multiplies age by packs smoked per day to find pack years for smokers might be called find_pack_years rather than funcpy or even just f or the commonly used foo.

For example, other than -99 representing missing values, there is really nothing to be learned from this code that includes meaningless names:

r$Q11_2[r$Q11_2==-99] <- NA
prop.table(table(r$Q11_2))

Replace r with HEALTH_SURVEY for the data name and Q11_2 with race for the variable name. Now someone otherwise unfamiliar with your code might guess that this bit of code creates a table of race from the health survey data:

HEALTH_SURVEY$race[HEALTH_SURVEY$race==-99] <- NA
prop.table(table(HEALTH_SURVEY$race))
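
The renaming itself can be done in code so that the change is documented. This is a minimal sketch in base R under stated assumptions: the small data frame r is made up to stand in for the survey data, and the values in it are invented for illustration.

```r
# Made-up stand-in for the survey data; real data would be imported
r <- data.frame(Q11_2 = c(1, 2, -99, 1, 2, 1))

# Copy the data to a meaningfully named object and rename the variable
HEALTH_SURVEY <- r
names(HEALTH_SURVEY)[names(HEALTH_SURVEY) == "Q11_2"] <- "race"

# Recode the missing-value code, then tabulate proportions
HEALTH_SURVEY$race[HEALTH_SURVEY$race == -99] <- NA
race_proportions <- prop.table(table(HEALTH_SURVEY$race))
print(race_proportions)
```

Renaming in the code file rather than by hand in the data file keeps a record of which original name each meaningful name replaced.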

Recommendation #5: Use dot.case, camelCase, or snake_case for multi-part names

Multiword names for variables, functions, and files are more easily read by humans when they are formatted using dot.case, upper CamelCase, lower camelCase, snake_case, or UPPER_SNAKE_CASE to separate words. Choosing one of these options for variables, another for functions, and another for files will further clarify your code for your collaborators. For example, say your group decides that variables should be named using dot.case, functions with snake_case, data frames with UPPER_SNAKE_CASE, and other objects with lower camelCase.

Even with limited R and programming experience, a new team member could read this code and identify find_mode as a function, HEALTH_SURVEY as data, and insure.status as a variable without knowing anything else:

find_mode(HEALTH_SURVEY$insure.status)

Please note that specific naming conventions may not work in all software. Resources at the bottom of this module provide software-specific recommendations on naming and other conventions.
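
To make the example convention concrete, here is a small sketch in R. It is an illustration only: find_mode is not a built-in R function, so a hypothetical version is defined here, and the insurance data are made up.

```r
# Hypothetical implementation of find_mode: return the most common value.
# If there is a tie, the value that sorts first is returned.
find_mode <- function(values) {
  value_counts <- table(values)
  names(value_counts)[which.max(value_counts)]
}

# Made-up data frame in UPPER_SNAKE_CASE with a dot.case variable
HEALTH_SURVEY <- data.frame(
  insure.status = c("Private", "Public", "Private", "None"),
  stringsAsFactors = FALSE
)

# Other objects use lower camelCase under the example convention
modalStatus <- find_mode(HEALTH_SURVEY$insure.status)
print(modalStatus)
```

Each kind of object can be recognized from its format alone: the function in snake_case, the data frame in UPPER_SNAKE_CASE, the variable in dot.case, and the stored result in lower camelCase.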

Recommendation #6: Add meta-data to file names

Consistently including meta-data such as the date and project name in file names makes it easier to search for and process files. There are three key principles for file names:

Machine readable
Human readable
Works with default ordering

Machine readable file names are formatted in a way that can be read into a program and interpreted by the program. This is especially useful if a project includes many data or code files that are used together. Machine readable file names typically include separators, usually an underscore, to separate parts of the file name. For example, a CSV data file collected on January 23 of 2017 might be saved as 01232017_projectName.csv.

Note that this file name includes a 0 at the beginning rather than starting with 1 for January. This is important due to the default ordering of numbers and letters. If January were coded as simply 1, the file name would begin with 1232017, and default ordering would place it after data files from October or November, months 10 and 11.

Human readable file names include dates and words formatted in ways that are familiar to people, similar to the meaningful variable names in the section above.
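
The effect of the leading zero on default ordering can be checked directly. This is a minimal sketch in base R; the October and November file names are made up to match the January example.

```r
# File names without a leading zero for January (made-up examples)
unpadded <- c("1232017_projectName.csv",
              "10052017_projectName.csv",
              "11152017_projectName.csv")

# The same files with two-digit months
padded <- c("01232017_projectName.csv",
            "10052017_projectName.csv",
            "11152017_projectName.csv")

# Default ordering compares the names character by character
sorted_unpadded <- sort(unpadded)
sorted_padded <- sort(padded)
print(sorted_unpadded)
print(sorted_padded)
```

Without the leading zero the January file sorts last, after the October and November files. With the leading zero it sorts first.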

Consider, for example, raw and cleaned data files collected before and after some program was implemented:

rprogbdata.csv
progbdfinal.csv
progdatar.csv
cleanprogrdata.csv

The names differ, and you might try to guess which file is which from those differences and perhaps the file save dates. However, writing the file names like this would allow more certainty:

013018_raw_preProgram.csv
013118_clean_preProgram.csv
022818_raw_postProgram.csv
030218_clean_postProgram.csv

It is clear from these file names which files are raw, which are clean, which are before the program, which are after, and the date the file was saved.
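
Because the parts of each name are separated by underscores, a program can also read the meta-data. The following is a minimal sketch in base R, assuming the dates are in MMDDYY format as in the file names above; the object names are chosen for illustration.

```r
file_names <- c("030218_clean_postProgram.csv",
                "013018_raw_preProgram.csv",
                "022818_raw_postProgram.csv",
                "013118_clean_preProgram.csv")

# Drop the extension, then split each name at the underscores
name_parts <- strsplit(sub("\\.csv$", "", file_names), "_")

# Collect the pieces; dates are assumed to be in MMDDYY format
file_info <- data.frame(
  file   = file_names,
  saved  = as.Date(sapply(name_parts, "[", 1), format = "%m%d%y"),
  status = sapply(name_parts, "[", 2),
  phase  = sapply(name_parts, "[", 3),
  stringsAsFactors = FALSE
)

# Select files using the meta-data rather than guessing from names
raw_files <- file_info$file[file_info$status == "raw"]
print(file_info)
print(raw_files)
```

In a real project, file_names might come from list.files() pointed at the data directory instead of being typed in.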

Introduce and explain the code

Recommendation #7: Write a prolog to introduce the code

A prolog is a block of comments at the beginning of a code file offset with special characters as appropriate in the software being used. A prolog is used to provide sufficient information about code for a collaborator (maybe future you!) or someone outside your project to understand the who, what, and why of your code.

For example, here is a prolog formatted for a statistical code file in SAS:

/* PROLOG   ################################################################

   PROJECT: NAME OF PROJECT HERE 
   PURPOSE: MAJOR POINT(S) OF WHAT I AM DOING WITH THE DATA HERE 
   DIR:     list directory(-ies) for files here 
   DATA:    list dataset file names/availability here, e.g., 
            filename.correctextension
            somewebaddress.com 
   AUTHOR:  AUTHOR NAME(S) 
   CREATED: MONTH dd, YEAR
   LATEST:  MONTH dd, YEAR
   NOTES:   indent all additional lines under each heading, 
            & begin the prolog with a forward slash followed by an asterisk,
            end with an asterisk followed by a forward slash.  
            KEEP PURPOSE, AUTHOR, CREATED & LATEST ENTRIES IN UPPER CASE,  
            with appropriate case for DIR & DATA, lower case for notes 
            If multiple lines become too much, 
            simplify and write code book and readme. 
            HINT #1: Decide what a long prolog is. 
            HINT #2: copy & paste this into new script & replace text.

   PROLOG   ############################################################### */

Here is a prolog formatted for a code file in R:

# PROLOG   ################################################################'

# PROJECT: NAME OF PROJECT HERE
# PURPOSE: MAJOR POINT(S) OF WHAT I AM DOING WITH THE DATA HERE
# DIR:     list directory(-ies) for files here
# DATA:    list dataset file names/availability here, e.g.,
#          filename.correctextension
#          somewebaddress.com 
# AUTHOR:  AUTHOR NAME(S) 
# CREATED: MONTH dd, YEAR 
# LATEST:  MONTH dd, YEAR 
# NOTES:   indent all additional lines under each heading, 
#          & use the apostrophe hashmark bookends that appear  
#          KEEP PURPOSE, AUTHOR, CREATED & LATEST ENTRIES IN UPPER CASE, 
#          with appropriate case for DIR & DATA, lower case for notes 
#          If multiple lines become too much, 
#          simplify and write code book and readme. 
#          HINT #1: Decide what a long prolog is. 
#          HINT #2: copy & paste this into new script & replace text. 

# PROLOG   ###############################################################

Additional prolog templates for SAS, SPSS, Stata, and R are available on GitHub at https://github.com/coding2share/Prolog-templates
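
If you start new code files often, the main prolog headings can also be generated rather than copied. The sketch below is one possible approach in base R and is not taken from the coding2share templates: make_prolog is a hypothetical function name, the project details are made up, and only some of the template headings are included.

```r
# Hypothetical helper: build prolog comment lines for an R code file
make_prolog <- function(project, purpose, author, created = Sys.Date()) {
  rule <- paste0("# PROLOG   ", strrep("#", 64))
  c(rule,
    "",
    paste("# PROJECT:", project),
    paste("# PURPOSE:", purpose),
    paste("# AUTHOR: ", author),
    paste("# CREATED:", format(created, "%B %d, %Y")),
    paste("# LATEST: ", format(Sys.Date(), "%B %d, %Y")),
    "",
    rule)
}

# Made-up project details for illustration
prolog_lines <- make_prolog(project = "GUN USE POLICY BRIEF",
                            purpose = "BAR GRAPH OF PERCENTAGE OF GUN USE",
                            author  = "AUTHOR NAME(S)",
                            created = as.Date("2017-11-28"))
writeLines(prolog_lines)
```

The printed lines can be pasted at the top of a new script, and the DIR, DATA, and NOTES entries added by hand.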

Recommendation #8: Annotate to clarify code purpose

While a prolog contains general information about the code, annotation is used throughout the code to describe what is going on. There are different schools of thought on how much or how little annotation is needed. The general goal would be to write clear code that only needs a small amount of annotation.

When commenting, think about what you would want to explain to a new collaborator or how the code reads for someone completely outside your project. With those audiences in mind, add comments to:

explain the reason for the code (if needed)
explain functionality or choices that are not obvious or are different from expected
identify hacks or errors that should be fixed or rewritten; consistently use a word or phrase so this code is easy to find (e.g., HACK or BROKEN)

Avoid using comments to:

explain poorly named objects; improve the object name instead (see Recommendation #4)
repeat things that can be easily understood from the code

For example, instead of a confusing graph name and a comment stating the obvious:

#ha stands for histogram of age
ha <- hist(age)
ha

Name your graph something logical and comment on why the code is needed:

#check normality assumption for age variable
histoAge <- hist(age)
histoAge
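
Using a consistent word such as HACK or BROKEN in comments pays off when you search for code that needs attention. This is a minimal sketch in base R; the lines of code being searched are made up and stored as text for the example.

```r
# Made-up lines of code; a real script could be read in with readLines()
code_lines <- c(
  "#check normality assumption for age variable",
  "histoAge <- hist(age)",
  "#HACK: drop ages over 110 until data entry errors are corrected",
  "age <- age[age <= 110]",
  "#BROKEN: mean is wrong while missing values are still coded -99",
  "meanAge <- mean(age)"
)

# Find comment lines tagged with the agreed-upon words
flagged_lines <- grep("^#.*(HACK|BROKEN)", code_lines)
print(code_lines[flagged_lines])
```

The search only works if everyone on the project uses the same tags, which is why the recommendation stresses consistency.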

Complete code examples

R example

This R code does not follow the promising practices above and is difficult to read:

library(RNHANES); library(ggplot2)
dat <- nhanes_load_data("AUQ_G", "2011-2012", demographics = TRUE)
summary(dat$AUQ300)
dat$AUQ300[dat$AUQ300 > 2] <- NA
dat$AUQ300 <- factor(dat$AUQ300,levels=c(1,2),labels=c("Yes","No"))
summary(dat$AUQ300)
ggplot(subset(dat, !is.na(AUQ300)),aes(x=AUQ300, y=(..count..)/sum(..count..),fill=AUQ300)) + geom_bar() + theme_minimal() + scale_y_continuous(labels=scales::percent) +  scale_fill_manual(values=c("orange","gray"), guide=FALSE)

First, use space wisely: add white space between processes, limit line length, and indent lines that belong together:

library(RNHANES)
library(ggplot2)

dat <- nhanes_load_data("AUQ_G", "2011-2012", 
                        demographics = TRUE)
summary(dat$AUQ300)

dat$AUQ300[dat$AUQ300 > 2] <- NA
dat$AUQ300 <- factor(dat$AUQ300,
                     levels=c(1,2),            
                     labels=c("Yes","No"))
summary(dat$AUQ300)

ggplot(subset(dat, !is.na(AUQ300)),
       aes(x=AUQ300, 
           y=(..count..)/sum(..count..),
           fill=AUQ300)) + 
  geom_bar() + 
  theme_minimal() + 
  scale_y_continuous(labels=scales::percent) +  
  scale_fill_manual(values=c("orange","gray"), 
                    guide=FALSE)

Second, choose meaningful names for objects. Since the code examines data from the National Health and Nutrition Examination Survey (NHANES), naming the data object nhanes makes sense. Likewise, the variable AUQ300 measures whether the participant has ever used a gun, so try naming it gun.user:

library(RNHANES)
library(ggplot2)

nhanes <- nhanes_load_data("AUQ_G", "2011-2012", 
                           demographics = TRUE)
summary(nhanes$AUQ300)

nhanes$AUQ300[nhanes$AUQ300 > 2] <- NA
nhanes$gun.user <- factor(nhanes$AUQ300,
                          levels=c(1,2),
                          labels=c("Yes","No"))
summary(nhanes$gun.user)

ggplot(subset(nhanes, !is.na(gun.user)),
       aes(x=gun.user, 
           y=(..count..)/sum(..count..),
           fill=gun.user)) + 
  geom_bar() + 
  theme_minimal() + 
  scale_y_continuous(labels=scales::percent) +  
  scale_fill_manual(values=c("orange","gray"), 
                    guide=FALSE)

Finally, introduce and explain the code by adding a prolog and annotation. Compare this final version of the code with the first version above:

################################################################
# PROJECT: Gun use policy brief 
# PURPOSE: Bar graph of percentage of gun use for policy brief 
# DIR:     C:/Users/jenine/Desktop/datacamp
# DATA:    NHANES 2011-2012 data available via RNHANES package
# AUTHOR:  Jenine Harris 
# CREATED: 11/28/17 
# LATEST:  11/28/17 
# NOTES:   For coding2share formatting code module
################################################################

#open NHANES package to bring in data
#open ggplot2 for graphing
library(RNHANES)
library(ggplot2)

#bring in NHANES 2011-12 audiology data
#AUQ300 question asks ever used gun
#where 1 = Yes, 2 = No, 7 = Refused, 9 = Don't know
nhanes <- nhanes_load_data("AUQ_G", "2011-2012", 
                           demographics = TRUE)
summary(nhanes$AUQ300)

#recode Refused and Don't know responses as missing
#create gun.user from AUQ300, add labels to levels
nhanes$AUQ300[nhanes$AUQ300 > 2] <- NA
nhanes$gun.user <- factor(nhanes$AUQ300,
                          levels=c(1,2),
                          labels=c("Yes","No"))
summary(nhanes$gun.user)

#plot bar graph of percent of 2011-2012 NHANES 
#participants who ever used gun
ggplot(subset(nhanes, 
              !is.na(gun.user)),
       aes(x=gun.user, 
           y=(..count..)/sum(..count..),
           fill=gun.user)) + 
  geom_bar() + 
  theme_minimal() + 
  scale_y_continuous(labels=scales::percent) +  
  scale_fill_manual(values=c("orange","gray"), 
                    guide=FALSE)

SPSS example

This SPSS code file example does not follow the promising practices above and it is difficult to decipher where one line of code stops and the next begins.

GET FILE = 'C:\Your\Filepath\nhanes.sav'.
FREQUENCIES VARIABLES = AUQ300.
MISSING VALUES AUQ300 (7,9).
VARIABLE LEVEL AUQ300 (NOMINAL).
VALUE LABELS AUQ300 1 'Yes' 2 'No'.
FREQUENCIES VARIABLES = AUQ300.
GRAPH /BAR(SIMPLE) = PCT BY AUQ300.

GET FILE = 'C:\Your\Filepath\nhanes.sav'.