Module 3: How to prepare statistical code to share
If you have ever gone back to a statistical code file you wrote several months earlier and tried to figure out what the heck you were doing, you are not alone. Writing well-formatted and organized code is key to making future you happy and to facilitating reproducibility. Sharing clear statistical code has a second benefit as well: research papers with shared code are cited more often [1,2], which benefits you and your field. This module summarizes current recommendations for organizing and formatting statistical code to prepare it for sharing and to improve research collaboration, quality, and reproducibility.
Formatting and organizing code
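There are recommendations for clear coding that apply regardless of the statistical language you are using. These practices fall into three categories: using space wisely, choosing meaningful names with consistent formatting, and introducing and explaining the code.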
Follow the recommendations below to format your code. We recommend reading through all the recommendations first before you begin to format your code.
Use space wisely
Recommendation #1: Use white space to separate processes
Code that is written without blank lines between procedures can run together and be difficult to interpret. Add an extra line of space before new procedures to make the code easier to follow. For example, this code is dense with little white space, making it difficult to read:
code_data_avail <- cbind(table(r$Q25_2_1),table(r$Q25_2_2))
colnames(code_data_avail) <- c("Did you make your data publicly available?","Did you make your code publicly available?"); code_data_avail <- melt(code_data_avail); colnames(code_data_avail) <- c("avail","data_or_code","number")
fig2 <- ggplot(code_data_avail, aes(x=data_or_code, y=number, fill=avail)) + geom_col(position="dodge") + coord_flip() + theme(legend.position = 'top') + labs(y="Number of participants", x="", fill="") + scale_fill_manual(values=fills)
fig2
Adding some space between procedures starts to make it easier to read:
code_data_avail <- cbind(table(r$Q25_2_1),table(r$Q25_2_2))

colnames(code_data_avail) <- c("Did you make your data publicly available?","Did you make your code publicly available?"); code_data_avail <- melt(code_data_avail); colnames(code_data_avail) <- c("avail","data_or_code","number")

fig2 <- ggplot(code_data_avail, aes(x=data_or_code, y=number, fill=avail)) + geom_col(position="dodge") + coord_flip() + theme(legend.position = 'top') + labs(y="Number of participants", x="", fill="") + scale_fill_manual(values=fills)

fig2
Recommendation #2: Limit line length to 80 characters
Long lines of code are difficult or impossible to see on some computer screens, especially for people using laptop computers. Instead of continuing code on a single line, add hard returns so that code does not go beyond about 80 characters.
code_data_avail <- cbind(table(r$Q25_2_1),table(r$Q25_2_2))

colnames(code_data_avail) <- c("Did you make your data publicly available?",
"Did you make your code publicly available?")
code_data_avail <- melt(code_data_avail)
colnames(code_data_avail) <- c("avail","data_or_code","number")

fig2 <- ggplot(code_data_avail, aes(x=data_or_code, y=number, fill=avail)) +
geom_col(position="dodge") + coord_flip() +
theme(legend.position = 'top') +
labs(y="Number of participants", x="", fill="") +
scale_fill_manual(values=fills)

fig2
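One way to find lines that run past the limit is to count characters per line. The following is a minimal sketch in base R, not part of the original module: the object names script_lines, line_widths, and too_long are illustrative, and in practice the lines would likely come from readLines() on your own script file.

```r
# Illustrative script lines; in practice these could come from
# readLines("my_script.R") (hypothetical file name)
script_lines <- c(
  "code_data_avail <- cbind(table(r$Q25_2_1),table(r$Q25_2_2))",
  paste0("fig2 <- ggplot(code_data_avail, ",
         "aes(x=data_or_code, y=number, fill=avail)) + ",
         "geom_col(position=\"dodge\") + coord_flip()"),
  "fig2"
)

# Count the characters in each line
line_widths <- nchar(script_lines)

# Positions of the lines that go beyond 80 characters
too_long <- which(line_widths > 80)
too_long
```

Here only the second line is flagged, so that is the line to break with hard returns.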
Recommendation #3: Indent to group lines of code that belong together
When a function or procedure is very long, it can take multiple lines of code. To signal that several lines of code go together, indent each line after the first one.
Revise the code to shorten each line and indent lines that go together:
code_data_avail <- cbind(table(r$Q25_2_1),table(r$Q25_2_2))

colnames(code_data_avail) <- c("Did you make your data publicly available?",
                               "Did you make your code publicly available?")
code_data_avail <- melt(code_data_avail)
colnames(code_data_avail) <- c("avail","data_or_code","number")

fig2 <- ggplot(code_data_avail, aes(x=data_or_code,
                                    y=number,
                                    fill=avail)) +
  geom_col(position="dodge") +
  coord_flip() +
  theme(legend.position = 'top') +
  labs(y="Number of participants", x="", fill="") +
  scale_fill_manual(values=fills)

fig2
Choose meaningful names with consistent formatting
Recommendation #4: Use meaningful names for objects
It is common in research for variables to be named something like var1 or vm1q26b during data collection. Often these names are generated automatically from survey software or from using a numbered list in a word processing program. Once data have been collected, however, assigning meaningful names for variables, functions, and files will result in code that is easier to read and use for you and for others.
Meaningful names describe what information is stored in the variable, function, or file. For example, a variable measuring whether or not a participant has ever used a gun would be easier to understand if it were called gun_user or gunUse compared to var1 or even gu. Likewise, a function that multiplies years of smoking by packs smoked per day to find pack-years for smokers might be called find_pack_years rather than funcpy or even just f or the commonly used foo.
For example, other than -99 representing missing values, there is really nothing to be learned from this code that includes meaningless names:
r$Q11_2[r$Q11_2==-99] <- NA
prop.table(table(r$Q11_2))
Replace r with HEALTH_SURVEY for the data name and Q11_2 with race for the variable name. Now someone otherwise unfamiliar with your code might guess that this bit of code creates a table of race from the health survey data:
HEALTH_SURVEY$race[HEALTH_SURVEY$race==-99] <- NA
prop.table(table(HEALTH_SURVEY$race))
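The same idea applies to functions. As a minimal sketch, here is how the find_pack_years function mentioned above might look, assuming pack-years are computed as packs smoked per day times years of smoking. The data frame smokers and its values are made up for illustration.

```r
# Hypothetical function: pack-years = packs per day * years of smoking
find_pack_years <- function(packs_per_day, years_smoked) {
  packs_per_day * years_smoked
}

# Made-up data for three smokers
smokers <- data.frame(
  packs_per_day = c(0.5, 1, 2),
  years_smoked  = c(10, 20, 15)
)

# The function and argument names say what is being computed
smokers$pack_years <- find_pack_years(smokers$packs_per_day,
                                      smokers$years_smoked)
smokers
```

A reader can tell what the last assignment does without opening the function definition, which would not be true of a call like f(x, y).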
Recommendation #5: Use dot.case, camelCase, or snake_case for multi-part names
Multiword names for variables, functions, and files are more easily read by humans when they are formatted using dot.case, upper CamelCase, lower camelCase, snake_case, or UPPER_SNAKE_CASE to separate words. Choosing one of these options for variables, another for functions, and another for files will further clarify your code for your collaborators. For example, say your group decides that variables should be named using dot.case, functions with snake_case, data frames with UPPER_SNAKE_CASE, and other objects with lower camelCase.
Even with limited R and programming experience, a new team member could read this code and identify find_mode as a function, HEALTH_SURVEY as data, and insure.status as a variable without knowing anything else:
find_mode(HEALTH_SURVEY$insure.status)
Please note that specific naming conventions may not work in all software. Resources at the bottom of this module provide software-specific recommendations on naming and other conventions.
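To make the convention concrete, here is a small runnable sketch. R has no built-in function for the statistical mode, so find_mode below is a hypothetical helper written for this illustration, and the HEALTH_SURVEY data are made up.

```r
# Hypothetical helper (snake_case marks a function):
# returns the most common value(s) of x
find_mode <- function(x) {
  value_counts <- table(x)   # table() leaves out missing values by default
  names(value_counts)[value_counts == max(value_counts)]
}

# Made-up data frame (UPPER_SNAKE_CASE) with one variable (dot.case)
HEALTH_SURVEY <- data.frame(
  insure.status = c("private", "public", "private", "none", "private", NA)
)

# Store the result in another object (lower camelCase)
modeInsure <- find_mode(HEALTH_SURVEY$insure.status)
modeInsure
```

Each kind of object can be recognized from its formatting alone: the function, the data frame, the variable, and the stored result all look different.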
Recommendation #6: Add meta-data to file names
Including meta-data like the date and project name in file names consistently makes it easier to search for files and process files. There are three key principles for file names:
Machine readable
Human readable
Works with default ordering
Machine readable file names are formatted so that a program can read and interpret them. This is especially useful if a project includes many data or code files that are used together. Machine readable file names typically use a separator, usually an underscore, between the parts of the file name. For example, a CSV data file collected on January 23, 2017 might be saved as 01232017_projectName.csv. Note that this file name writes January as 01 rather than 1. This matters because of the default ordering of numbers and letters: if January were coded as simply 1, the file name would begin with 1232017 and default ordering would place it after data files from October or November, months 10 and 11.
Human readable file names include dates and words formatted in ways that are familiar to people, similar to the meaningful variable names in the section above.
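The effect of the leading zero can be checked directly. This is a small sketch in base R: the October and November file names are made-up examples, and method = "radix" is used so that the sort compares characters the same way on every computer.

```r
# January written as 1: the name starts with 1232017
unpadded <- c("1232017_projectName.csv",
              "10052017_projectName.csv",
              "11152017_projectName.csv")

# January written as 01: the name starts with 01232017
padded <- c("01232017_projectName.csv",
            "10052017_projectName.csv",
            "11152017_projectName.csv")

# Character-by-character ordering, as a file browser would typically use
sorted_unpadded <- sort(unpadded, method = "radix")
sorted_padded   <- sort(padded, method = "radix")

sorted_unpadded   # the January file lands after October and November
sorted_padded     # the January file comes first
```

Note that month-first dates sort chronologically only within a single year. A year-first format such as 20170123 would also keep files in order across years.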
Consider, for example, raw and cleaned data files collected before and after some program was implemented:
rprogbdata.csv
progbdfinal.csv
progdatar.csv
cleanprogrdata.csv
You can see some differences in the file names and might try to guess which file is which based on these features and maybe the file save date. However, writing the file names like this allows more certainty:
013018_raw_preProgram.csv
013118_clean_preProgram.csv
022818_raw_postProgram.csv
030218_clean_postProgram.csv
It is clear from these file names which files are raw, which are clean, which are before the program, which are after, and the date the file was saved.
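Because the parts are separated by underscores, a program can also pull the meta-data back out of the names. The following is a minimal sketch in base R, assuming the dates are written as two-digit month, day, and year. The object names file_names, name_parts, and file_info are illustrative.

```r
file_names <- c("013018_raw_preProgram.csv",
                "013118_clean_preProgram.csv",
                "022818_raw_postProgram.csv",
                "030218_clean_postProgram.csv")

# Drop the .csv extension, then split each name at the underscores
name_parts <- strsplit(tools::file_path_sans_ext(file_names),
                       "_", fixed = TRUE)

# Collect the three parts of each name in a data frame
file_info <- data.frame(
  file   = file_names,
  saved  = as.Date(vapply(name_parts, `[`, character(1), 1),
                   format = "%m%d%y"),
  stage  = vapply(name_parts, `[`, character(1), 2),
  period = vapply(name_parts, `[`, character(1), 3),
  stringsAsFactors = FALSE
)
file_info

# Keep only the raw files, for example
file_info$file[file_info$stage == "raw"]
```

The same splitting would not be possible with names like rprogbdata.csv, where nothing marks where one piece of information ends and the next begins.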
Introduce and explain the code
Recommendation #7: Write a prolog to introduce the code
A prolog is a block of comments at the beginning of a code file offset with special characters as appropriate in the software being used. A prolog is used to provide sufficient information about code for a collaborator (maybe future you!) or someone outside your project to understand the who, what, and why of your code.
For example, here is a prolog formatted for a statistical code file in SAS:
/* PROLOG ################################################################
PROJECT: NAME OF PROJECT HERE
PURPOSE: MAJOR POINT(S) OF WHAT I AM DOING WITH THE DATA HERE
DIR: list directory(-ies) for files here
DATA: list dataset file names/availability here, e.g.,
          filename.correctextension
          somewebaddress.com
AUTHOR: AUTHOR NAME(S)
CREATED: MONTH dd, YEAR
LATEST: MONTH dd, YEAR
NOTES: indent all additional lines under each heading,
          & begin the prolog with a forward slash followed by an asterisk,
          end with an asterisk followed by a forward slash.
          KEEP PURPOSE, AUTHOR, CREATED & LATEST ENTRIES IN UPPER CASE,
          with appropriate case for DIR & DATA, lower case for notes
          If multiple lines become too much,
          simplify and write code book and readme.
          HINT #1: Decide what a long prolog is.
          HINT #2: copy & paste this into new script & replace text.
PROLOG ############################################################### */
Here is a prolog formatted for a code file in R:
# PROLOG ################################################################'
# PROJECT: NAME OF PROJECT HERE
# PURPOSE: MAJOR POINT(S) OF WHAT I AM DOING WITH THE DATA HERE
# DIR: list directory(-ies) for files here
# DATA: list dataset file names/availability here, e.g.,
#          filename.correctextension
#          somewebaddress.com
# AUTHOR: AUTHOR NAME(S)
# CREATED: MONTH dd, YEAR
# LATEST: MONTH dd, YEAR
# NOTES: indent all additional lines under each heading,
#          & use the apostrophe hashmark bookends that appear
#          KEEP PURPOSE, AUTHOR, CREATED & LATEST ENTRIES IN UPPER CASE,
#          with appropriate case for DIR & DATA, lower case for notes
#          If multiple lines become too much,
#          simplify and write code book and readme.
#          HINT #1: Decide what a long prolog is.
#          HINT #2: copy & paste this into new script & replace text.
# PROLOG ###############################################################
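If you start new scripts often, a small helper can print a prolog skeleton for you to paste in and complete. This is only a sketch: make_prolog is a hypothetical function, not part of R or of any package, and it fills in a shortened version of the template above, leaving DIR, DATA, and NOTES to be added by hand.

```r
# Hypothetical helper: builds the lines of a short prolog
make_prolog <- function(project, purpose, author, created = Sys.Date()) {
  rule  <- strrep("#", 60)                # row of hashmarks
  stamp <- format(created, "%B %d, %Y")   # e.g., month name, day, year
  c(
    paste("# PROLOG", rule),
    paste("# PROJECT:", project),
    paste("# PURPOSE:", purpose),
    paste("# AUTHOR:", author),
    paste("# CREATED:", stamp),
    paste("# LATEST:", stamp),
    paste("# PROLOG", rule)
  )
}

prolog_lines <- make_prolog(
  project = "NAME OF PROJECT HERE",
  purpose = "MAJOR POINT(S) OF WHAT I AM DOING WITH THE DATA HERE",
  author  = "AUTHOR NAME(S)"
)

# Print the prolog so it can be copied to the top of a script
writeLines(prolog_lines)
```

Every line begins with a hashmark, so the printed block can be pasted into an R script without affecting the code that follows.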
Recommendation #8: Annotate to clarify code purpose
While a prolog contains general information about the code, annotation is used throughout the code to describe what is going on. There are different schools of thought on how much or how little annotation is needed. The general goal would be to write clear code that only needs a small amount of annotation.
When commenting, think about what you would want to explain to a new collaborator or how the code reads for someone completely outside your project. With those audiences in mind, add comments to:
explain the reason for the code (if needed)
explain functionality or choices that are not obvious or are different from expected
identify hacks or errors that should be fixed or rewritten; consistently use a word or phrase so this code is easy to find (e.g., HACK or BROKEN)
Avoid using comments to:
explain poorly named objects; improve the object name instead (see Recommendation #4)
repeat things that can be easily understood from the code
For example, instead of a confusing graph name and a comment stating the obvious:
#ha stands for histogram of age
ha <- hist(age)
ha
Name your graph something logical and comment on why the code is needed:
#check normality assumption for age variable
histoAge <- hist(age)
histoAge
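Comments can also explain a choice that is not obvious and flag code that needs another look. In this minimal sketch the ages are made up, -99 is assumed to be the survey's missing-value code, and HACK is the search word the team is assumed to have agreed on.

```r
# Made-up ages; -99 is assumed to be the survey's missing-value code
age <- c(23, 35, 41, 29, 52, 47, 38, -99)

# Recode -99 to NA so it is not treated as a real age
age[age == -99] <- NA

# HACK: missing ages are dropped without any check;
# revisit once the team decides how to handle missing data
meanAge <- mean(age, na.rm = TRUE)
meanAge
```

Searching the file for HACK later will lead straight to the line that still needs a decision.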
Complete code examples
R example
This R code does not follow the promising practices above and is difficult to read:
First, use space wisely. Adding white space to separate processes, limiting line length to 80 characters, and indenting to group lines of code that belong together can help make the code easier to follow:
Second, choose meaningful names for objects. Since the code examines the National Health and Nutrition Examination Survey (NHANES) data, naming the data object nhanes makes sense. Likewise, the variable AUQ300 measures whether the participant has ever used a gun, so try naming it gun.user:
Finally, introduce and explain the code by adding a prolog and annotation. Compare this final version of the code with the first version above:
################################################################
# PROJECT: Gun use policy brief
# PURPOSE: Bar graph of percentage of gun use for policy brief
# DIR: C:/Users/jenine/Desktop/datacamp
# DATA: NHANES 2011-2012 data available via RNHANES package
# AUTHOR: Jenine Harris
# CREATED: 11/28/17
# LATEST: 11/28/17
# NOTES: For coding2share formatting code module
################################################################
#open RNHANES package to bring in data
#open ggplot2 for graphing
library(RNHANES)
library(ggplot2)

#bring in NHANES 2011-12 audiology data
#AUQ300 question asks ever used gun
#where 1 = Yes, 2 = No, 7 = Refused, 9 = Don't know
nhanes <- nhanes_load_data("AUQ_G", "2011-2012",
                           demographics = TRUE)
summary(nhanes$AUQ300)

#recode Refused and Don't know responses as missing
#rename variable to gun.user, add labels to levels
nhanes$AUQ300[nhanes$AUQ300 > 2] <- NA
nhanes$gun.user <- factor(nhanes$AUQ300,
                          levels=c(1,2),
                          labels=c("Yes","No"))
summary(nhanes$gun.user)

#plot bar graph of percent of 2011-2012 NHANES
#participants who ever used gun
ggplot(subset(nhanes,
              !is.na(gun.user)),
       aes(x=gun.user,
           y=after_stat(count/sum(count)),
           fill=gun.user)) +
  geom_bar() +
  theme_minimal() +
  scale_y_continuous(labels=scales::percent) +
  scale_fill_manual(values=c("orange","gray"),
                    guide="none")
SPSS example
This SPSS code file example does not follow the promising practices above and it is difficult to decipher where one line of code stops and the next begins.
First, use space wisely. Adding white space to separate processes, limiting line length to 80 characters, and indenting to group lines of code that belong together can help make the code easier to follow.
Second, choose meaningful names for the data set and variable labels. Since the code examines the National Health and Nutrition Examination Survey (NHANES) data, naming the data set nhanes makes sense. Naming the data set allows several data sets to be open at the same time. Explicitly activate the data set before continuing so it’s clear which one is being modified or used. Likewise, variable labels enable access to more detail about what the variable actually is, so we can include enough of the questionnaire language for it to make sense: “Ever used firearms for any reason?”
GET FILE = 'C:\Your\Filepath\nhanes.sav'.
DATASET NAME nhanes.
DATASET ACTIVATE nhanes.

VARIABLE LABELS
    AUQ300 'Ever used firearms for any reason?'.

FREQUENCIES VARIABLES = AUQ300.

GRAPH /BAR(SIMPLE) =
    PCT BY AUQ300.
Finally, introduce and explain the code by adding a prolog and annotation. Be sure to include a period at the end of each comment so that the following code command will not get “commented out” by accident. Compare this final version of the code with the first version above:
*** PROLOG ***********************************************************************
*** PROJECT: Gun use policy brief
*** PURPOSE: Bar graph of percentage of gun use for policy brief
*** DIR: C:\Your\Filepath\nhanes.sav
*** DATA: https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2011
*** AUTHOR: Bobbi Carothers
*** CREATED: 02/06/2018
*** LATEST: 02/06/2018
*** NOTES: For coding2share formatting code module
*** PROLOG *********************************************************************.
* Open, name, and activate data.
GET FILE = 'C:\Your\Filepath\nhanes.sav'.
DATASET NAME nhanes.
DATASET ACTIVATE nhanes.

* 1 = Yes, 2 = No, 7 = Refused, 9 = Don't know.
FREQUENCIES VARIABLES = AUQ300.

* Clean and format.
* Set 7 and 9 as missing values.
MISSING VALUES AUQ300 (7, 9).
VARIABLE LABELS
    AUQ300 'Ever used firearms for any reason?'.
FREQUENCIES VARIABLES = AUQ300.

* Plot bar graph of percent of 2011-2012 NHANES participants who ever used gun.
GRAPH /BAR(SIMPLE) =
    PCT BY AUQ300.
Stata example
This Stata code file example does not follow the promising practices above and it is difficult to decipher where one line of code stops and the next begins.
sysuse uslifeexp des tsset year twoway (tsline le) (tsline le_male)
(tsline le_female) (tsline le_w)
(tsline le_wmale) ///
(tsline le_wfemale) (tsline le_b)
(tsline le_bmale) (tsline le_bfemale) label var le "Life expectancy, overall" twoway (tsline le)
(tsline le_male) (tsline le_female)
(tsline le_w) (tsline le_wmale)
///
(tsline le_wfemale) (tsline le_b)
(tsline le_bmale) (tsline le_bfemale) twoway (tsline le)
(tsline le_male) (tsline le_female)
(tsline le_w) (tsline le_wmale)
///
(tsline le_wfemale) (tsline le_b)
(tsline le_bmale) (tsline le_bfemale),
yti("Age in years") ///
ti("Life expectancy in the U.S. by population group, 1900-1999")
First, use space wisely. Adding white space to separate processes, limiting line length to 80 characters, and indenting to group lines of code that belong together can help make the code easier to follow. Notice that Stata’s do-file editor shows a faint line at 80 characters as a guide for keeping lines of code to this length.
Second, notice that our recommendation to choose meaningful names for objects (in this case, variables) is already met by the stock data set in Stata. The describe command shows us that “le” stands for “life expectancy” and that “b” and “w” stand for “black/African American” and “white”. With a small data set like this, these short variable names are informative and quick to type.
sysuse uslifeexp
describe
tsset year

twoway (tsline le) (tsline le_male) (tsline le_female) ///
       (tsline le_w) (tsline le_wmale) (tsline le_wfemale) ///
       (tsline le_b) (tsline le_bmale) (tsline le_bfemale), ///
       yti("Age in years") ///
       ti("Life expectancy in the U.S. by population group, 1900-1999")
SAS example
This SAS code file example does not follow the promising practices above and it is difficult to decipher where one line of code stops and the next begins.
libname OpenSci '\\tsclient\G\CPHSS\OpenScience\Modules\Coding\'; data OpenSci.cor; set OpenSci.new_w4bmi; if H4WP25 in (0,7) then H4WP25_1=0; else if H4WP25 in (1,2,3) then H4WP25_1=1; if H4WP39 in (0,7) then H4WP39_1=0; else if H4WP39 in (6,8) then H4WP39_1=.; else H4WP39_1=1; if H4WP25_1=1 or H4WP39_1=1then new=1; else if H4WP25_1=0 and H4WP39_1=0then new=0; label new="Cash from parents"; run; proc format; value new 0="NO"1="YES";run; proc freq; tables new*H4WP25_1*H4WP39_1 /list missprint; run; proc corr; var new; with fastfood; run;
First, use space wisely. Limit line length to 80 characters, indent to group lines of code that belong together, use all caps for the DATA and PROC keywords, and end every DATA and PROC step with a RUN statement. Notice that the SAS editor automatically draws a faint line signaling that the lines above go together. This makes it easier to scan a page for step boundaries and can help make the code easier to follow.
Second, choose meaningful names for objects. Since the code examines data about youth health, naming the data object Youth_Health makes sense. Likewise, the variables H4WP25 and H4WP39 measure the cash that mom and dad gave to the youths, so try naming them Mom_Cash and Dad_Cash, and naming the new combined variable Parent_Cash.