Introduction

This document builds:

- A machine-readable XML codebook
- A human-readable HTML codebook

Why would I do this to myself? XML looks a lot like HTML, except that you get to make up your own tag names, such as <otherStuff>. This makes the information nested inside of those tags selectable and readable by machines. The advantage is that if you’re sharing a large dataset in a non-proprietary format (such as CSV) that can’t carry variable and value labels the way SPSS and SAS files can, you can also share an XML codebook that can be queried to pull the labels.
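To make "queried to pull the labels" concrete, here is a minimal sketch using the xml2 package. The XML fragment is made up for illustration; its element names (var, name, codes, pair, val, codeLabel) follow the structure this walkthrough builds further down, and the XPath expression is just one way to phrase the lookup.

```r
library(xml2)

# A tiny, made-up codebook fragment (illustration only)
cb_txt <- '<codebook>
  <var>
    <name>Source</name>
    <codes>
      <pair><val>1</val><codeLabel>Email</codeLabel></pair>
      <pair><val>2</val><codeLabel>JPHMP</codeLabel></pair>
    </codes>
  </var>
</codebook>'
cb_doc <- read_xml(cb_txt)

# XPath: the label paired with value 2 for the variable named "Source"
lab <- xml_text(xml_find_first(
  cb_doc, "//var[name='Source']/codes/pair[val='2']/codeLabel"
))
lab  # "JPHMP"
```

Someone who only has the CSV and the XML codebook could use a lookup like this to relabel coded values without ever touching SPSS.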

But XML files are ugly! Yes. Yes they are. That’s why we’ll also walk through how to write an XSL file (which is really just another XML file) that will arrange things the way we want them. Pair it to the XML file, add a little CSS if you really wanna go crazy and spice things up, and they produce a beautiful HTML baby that is more pleasing to the human eye.

You’ll need a solid working knowledge of R and HTML and/or a willingness to get in over your head for a bit before you eventually figure it out.

For this exercise, we’ll walk through the SPSS dataset from the survey we conducted to guide our module development; it can be pulled from the coding2share GitHub page. Fire up a new Rmd file and dive in.

library(haven) # read/write SPSS
library(XML) # convert lists to xml
library(xslt) # compile xml and xsl to html

## Loading required package: xml2

library(htmltools) # embed HTML content
dat <- read_sav("https://github.com/coding2share/OpenSciSurveyPaper/blob/master/OpenScienceAim1.sav?raw=true") # Pull data off of GitHub

Pull lists from data

SPSS files, when set up properly, contain metadata that are useful for building codebooks. A call such as attr(dat$var, "label") pulls one characteristic from one variable; wrapping the same function in lapply iterates over the entire dataset, pulls that characteristic for every variable, and stacks the results into an orderly list.
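Here is a small sketch of that pattern on a toy data frame, so you can see the shape of the result before running it on real data. The toy data frame and its attribute values are made up; the attribute names "label" and "labels" are the ones used throughout this walkthrough. One thing worth knowing: base R's attr() matches attribute names partially by default, so this sketch passes exact = TRUE to keep "label" from matching "labels" on a variable that has value labels but no variable label.

```r
# Toy stand-in for a dataset read with read_sav(); all values are made up
toy <- data.frame(id = 1:3, src = c(1, 2, 1))
attr(toy$src, "label")  <- "Recruitment source"    # variable label
attr(toy$src, "labels") <- c(Email = 1, JPHMP = 2) # value labels

# One attribute, one variable
attr(toy$src, "label", exact = TRUE)

# Same attribute, every variable; NULL where the attribute is missing
toyLabs    <- lapply(toy, attr, "label",  exact = TRUE)
toyValLabs <- lapply(toy, attr, "labels", exact = TRUE)

str(toyLabs)
names(toyValLabs[["src"]])
```

The result is a named list with one element per variable, which is the shape the later XML-building steps loop over.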

Note that you’ll need to deal with any special characters that may be present in any of the text that gets pulled, as seen with the gsub function below. If you have a lot of variables, they may not be worth hunting for until you look at the finished XML file. If you get an “XML Parsing Error: not well-formed” message or something similar when trying to open the file, it’s probably a special character issue; it should tell you where the problem is and you can go back and replace the characters. This is the easiest place to do so.

# Variable names
varNames <- as.list(names(dat))
names(varNames) <- names(dat)
# Variable labels
varLabs <- lapply(dat, attr, "label")
varLabs <- lapply(varLabs, function(x) gsub("[\u2019]", "'", x)) # replace curly apostrophes with straight ones
# Formats
varForm <- lapply(dat, class)
# Value labels
valLabs <- lapply(dat, attr, "labels")
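Two optional checks on lists like these, sketched with base R only. The helper hasNonAscii and the toy objects are hypothetical names introduced for illustration. The first check flags labels containing characters outside the ASCII range, which may help you find special characters before the XML file complains. The second collapses multi-element class vectors into a single string; the code above does not do this, which is why date-time variables show up as "POSIXctPOSIXt" in the finished codebook.

```r
# Made-up labels standing in for varLabs; Q2 contains a curly apostrophe
toyLabs <- list(Q1 = "Plain label",
                Q2 = "Google\u2019s R Style Guide",
                Q3 = NULL)

# TRUE when a single string contains any character outside the ASCII range
hasNonAscii <- function(x) {
  length(x) == 1 && !is.na(x) && any(utf8ToInt(x) > 127L)
}
flagged <- names(toyLabs)[vapply(toyLabs, hasNonAscii, logical(1))]
flagged  # "Q2"

# Collapse multi-element class vectors into one string per variable
toyDat  <- data.frame(n = 1:2, when = as.POSIXct("2017-09-01", tz = "UTC"))
toyForm <- lapply(toyDat, function(x) paste(class(x), collapse = ", "))
toyForm[["when"]]  # "POSIXct, POSIXt"
```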

Build the codebook XML tree

The file is easiest to build from the bottom up. The hierarchy is as follows:

study title
study summary
variable 1
- name
- format
- label
- value codes (if applicable)
  - value-label pair 1
    - value
    - label
  - value-label pair 2 (etc)
variable 2 (etc)

The following code will loop through all of the variables and value codes to pull everything that is needed.

First level

Build the value and value-label pairs:

# Code values
vval <- lapply(valLabs, function(x){
  lapply(x, function(x) newXMLNode(name="val",x))
} )
# Value labels
# Pull label strings
varValLabs <- lapply(valLabs, attr, "names")
vvlabs <- lapply(varValLabs, function(x){
  lapply(x, function(x) newXMLNode(name="codeLabel",x))
})
# Check Source example
vval[["Source"]]

## $Email
## <val>1</val> 
## 
## $JPHMP
## <val>2</val> 
## 
## $RWJF
## <val>3</val>

vvlabs[["Source"]]

## [[1]]
## <codeLabel>Email</codeLabel> 
## 
## [[2]]
## <codeLabel>JPHMP</codeLabel> 
## 
## [[3]]
## <codeLabel>RWJF</codeLabel>

Second level

Nest val and label inside <pair> tags:

# Add pair as a parent to the values
pairs <- lapply(vval, function(x){
  lapply(x, function(x) newXMLNode(name="pair", .children=list(x)))
}) 
# Add labels as a child to pairs
for(i in seq_along(pairs)){
  # seq_along() yields no iterations for vars with no value/label pairs,
  # whereas 1:length() would count 1, 0 and try to index pairs that don't exist
  for(j in seq_along(pairs[[i]])){
    addChildren(pairs[[i]][[j]], kids=list(vvlabs[[i]][[j]]))
  }
}
# Check Source example
pairs[["Source"]]

## $Email
## <pair>
##   <val>1</val>
##   <codeLabel>Email</codeLabel>
## </pair> 
## 
## $JPHMP
## <pair>
##   <val>2</val>
##   <codeLabel>JPHMP</codeLabel>
## </pair> 
## 
## $RWJF
## <pair>
##   <val>3</val>
##   <codeLabel>RWJF</codeLabel>
## </pair>

The important part of the structure here is that each value and each label are appropriately nested in their own pairs. Note also that the code has done this for all variables that have labels for their values.
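As an alternative sketch (not the approach used above), the value node and the label node can be handed to their parent <pair> in a single pass with Map, which avoids the separate addChildren loop. The toy value labels are made up; newXMLNode accepts child nodes as unnamed arguments, the same way the summary node is built later in this walkthrough.

```r
library(XML)

# Made-up value labels in the same shape as one element of valLabs
toyVals <- c(Email = 1, JPHMP = 2)

# Build <pair><val/><codeLabel/></pair> for each value/label in one pass
toyPairs <- Map(function(v, l) {
  newXMLNode("pair",
             newXMLNode("val", as.character(v)),
             newXMLNode("codeLabel", l))
}, toyVals, names(toyVals))

toyPairs[["JPHMP"]]
```

Whether this reads better than the two-step version is a matter of taste; the two-step version makes it easier to inspect the values and labels separately.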

Third level

Wrap up value codes, build name, format, and labels:

# Add value label pairs to parents called "codes"
vcodes <- lapply(pairs, function(x) 
  newXMLNode(name="codes", .children=list(x))
  )

# Pull variable names, formats, and labels
vnames <- lapply(varNames, function(x) newXMLNode(name="name",x))
vform <- lapply(varForm, function(x) newXMLNode(name="format",x))
vlabs <- lapply(varLabs, function(x) newXMLNode(name="varLabel",x))

Fourth level

Wrap up the variables:

# Add vnames,vform, vlabs, and codes to parents called "var"
vars <- mapply(function(w,x,y,z) 
  newXMLNode(name="var", .children=list(w,x,y,z)), 
  vnames,vform,vlabs,vcodes)
# Check Source example
vars[["Source"]]

## <var>
##   <name>Source</name>
##   <format>labelled</format>
##   <varLabel>Source</varLabel>
##   <codes>
##     <pair>
##       <val>1</val>
##       <codeLabel>Email</codeLabel>
##     </pair>
##     <pair>
##       <val>2</val>
##       <codeLabel>JPHMP</codeLabel>
##     </pair>
##     <pair>
##       <val>3</val>
##       <codeLabel>RWJF</codeLabel>
##     </pair>
##   </codes>
## </var>

Here you can see how all of the information for each variable is wrapped up in nested levels.

Add title and summary:

# Title
cTitle <- newXMLNode(name="studyTitle", "Open Science Aim 1 Codebook")

# Summary
cSum <- newXMLNode(name="summary", 
                   newXMLNode(name="lin", "Project contact person: Jenine Harris, harrisj@wustl.edu"),
                   newXMLNode(name="lin", "Project description: Surveyed public health practitioners on reproducible research practices"),
                   newXMLNode(name="lin", "Sample and survey procedures: Email invitation for a web-based survey to members of the American Public Health Association statistics section, posted web link on project Twitter feed coding2share, RWJF Twitter feed, and JPHMP Twitter feed"),
                   newXMLNode(name="lin", "Sample size: 247 participants; 207 complete surveys"),
                   newXMLNode(name="lin", "Data collection time period: Sept-Dec 2017")
                   )
cSum

## <summary>
##   <lin>Project contact person: Jenine Harris, harrisj@wustl.edu</lin>
##   <lin>Project description: Surveyed public health practitioners on reproducible research practices</lin>
##   <lin>Sample and survey procedures: Email invitation for a web-based survey to members of the American Public Health Association statistics section, posted web link on project Twitter feed coding2share, RWJF Twitter feed, and JPHMP Twitter feed</lin>
##   <lin>Sample size: 247 participants; 207 complete surveys</lin>
##   <lin>Data collection time period: Sept-Dec 2017</lin>
## </summary>

Here you can see how each line of the summary is wrapped in a <lin> tag.

Top level

Wrap title, summary, and variables into an overall codebook node:

# Add vars to main codebook node
cb <- newXMLNode(name="codebook", .children=list(cTitle, cSum, vars))

Export the XML file:

saveXML(cb, file="OpenScienceAim1Codebook.xml", encoding="UTF-8")

At this point, it’s a good idea to open the XML file in the web browser of your choice and see if things generally look like they should.
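If you would rather check from R than from a browser, a rough sketch is to read the file back in with xml2, since read_xml stops with an error when a file is not well-formed. The helper name is_well_formed and the two temporary files are made up for illustration; with the real codebook you would pass "OpenScienceAim1Codebook.xml" instead.

```r
library(xml2)

# Two throwaway files: one well-formed, one with a missing closing tag
good <- tempfile(fileext = ".xml")
writeLines("<codebook><var><name>ID</name></var></codebook>", good)
bad <- tempfile(fileext = ".xml")
writeLines("<codebook><var><name>ID</name></codebook>", bad)

# Hypothetical helper: TRUE if the file parses, FALSE (plus the parser's
# message) if it does not
is_well_formed <- function(path) {
  tryCatch({
    read_xml(path)
    TRUE
  }, error = function(e) {
    message("Parse problem: ", conditionMessage(e))
    FALSE
  })
}

ok_good <- is_well_formed(good)
ok_bad  <- is_well_formed(bad)
c(good = ok_good, bad = ok_bad)
```

The parser's message usually names the offending line, which is a quick way to locate a stray special character.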

Write the XSL formatting file

The XML codebook file created above is great for machines to read, but less great for humans to read. To fix that, we’ll create an XSL file that will arrange the XML file in the layout of our choosing, giving us a much more appealing HTML codebook file.

Having an idea of what you want the layout for the codebook to be is enormously helpful, as there are a number of ways to crumble this cookie. Here’s roughly what I’m shooting for in this example:

Variable 1 Name
Format:	Format of var1
Label:	Label of var1

Value	Label
1	Label 1
2	Label 2
3	Label 3

We’ll build one table for each variable with format and label information, and an additional table for variables with value labels. On the downside, building tables in HTML is tricky. On the upside, we only have to define each kind of table once. The XSL will cycle through all of the variables in the XML file; things will show up if they’re present, and if not (as in the case of variables without value labels), there’s no squawking.
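The "show up only if present" behavior comes down to an XPath predicate. The stylesheet further down selects codes[.!=''], which keeps only <codes> elements whose text content is not empty. Here is a small sketch of that predicate using xml2, on a made-up fragment, assuming a variable without value labels ends up with an empty <codes/> element.

```r
library(xml2)

# Made-up fragment: ID has no value labels, Q5 has one pair
toy_cb <- read_xml("<codebook>
  <var><name>ID</name><codes/></var>
  <var><name>Q5</name>
    <codes><pair><val>1</val><codeLabel>Yes</codeLabel></pair></codes>
  </var>
</codebook>")

# Names of variables whose <codes> element has non-empty content
with_codes <- xml_text(xml_find_all(toy_cb, "//var[codes[.!='']]/name"))
with_codes  # "Q5"
```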

The following tags set up the structure for building tables in HTML:

<tr> for Table Row, which wraps around:
- <th> for Table Head (header cell)
- <td> for Table Data (regular cell)

We’ll also include the usual <h1> and <p> tags where necessary for text outside of the tables.

Within those tags, we’ll take advantage of the machine-readability of the XML file to pull that information into the tables. Within each header or data cell, we’ll use the self-closing <xsl:value-of select="xyz"/> tag, where “xyz” corresponds to one of the tags we set up in the XML file (<format>, <varLabel>, etc.), written without the angle brackets.
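To see xsl:value-of in isolation, here is a minimal sketch with a hand-typed stylesheet and a one-variable XML fragment, both made up for illustration. It uses read_xml from xml2 and xml_xslt from the xslt package, the same functions used in the Compile step later on.

```r
library(xml2)
library(xslt)

# Made-up one-variable fragment
doc <- read_xml("<var><name>Q5</name><format>labelled</format></var>")

# Minimal stylesheet: one table row that pulls the <format> value
style <- read_xml('<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/">
    <table>
      <tr>
        <td>Format:</td>
        <td><xsl:value-of select="var/format"/></td>
      </tr>
    </table>
  </xsl:template>
</xsl:stylesheet>')

out <- xml_xslt(doc, style)
out_txt <- as.character(out)
cat(out_txt)
```

The select path is relative to wherever the template currently sits, which is why the full stylesheet nests its value-of tags inside for-each blocks.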

Instead of building from the bottom up like we did for the XML file, this time we’ll build from the top down. The stylesheet needs the XSL namespace, and it has to be declared on the top-level node so that it covers everything nested beneath it. We’ll also switch to the xmlTree method, as it allows us to nest XML tags within HTML-style tags more easily.

# Doc level
cbxsl <- xmlTree("xsl:stylesheet", namespaces=list(xsl="http://www.w3.org/1999/XSL/Transform"),
                 attrs=c(version="1.0"), doc=newXMLDoc())
cbxsl$addNode("xsl:template", attrs=c(match="/"), close=FALSE)
  cbxsl$addTag("html", close=FALSE)
    cbxsl$addTag("head",
                 cbxsl$addTag("style", # CSS code below
"
h1 {
  color: #0000b2;
}

h2 {
  font-size: 16px;
  color: #0000b2;
}

th, td {
  text-align: left;
  vertical-align: top;
  padding-right: 10px;
}

"
                              
                              
                              )
                 )
    cbxsl$addTag("body", close=FALSE)
      cbxsl$addNode("xsl:for-each", attrs=c(select="codebook"), close=FALSE)
        cbxsl$addTag("h1",
                     cbxsl$addNode("xsl:value-of", attrs=c(select="studyTitle"))
                     )
        cbxsl$addNode("xsl:for-each", attrs=c(select="summary"), close=FALSE)
          cbxsl$addNode("xsl:for-each", 
                        cbxsl$addTag("p", # Take all of the summary elements
                                     cbxsl$addNode("xsl:value-of", attrs=c(select="."))
                                     ),
                        attrs=c(select="./*"))
        cbxsl$closeNode() # close summary node
# Variable level
        cbxsl$addNode("xsl:for-each", attrs=c(select="var"), close=FALSE)
          cbxsl$addTag("h2",
                       cbxsl$addNode("xsl:value-of", attrs=c(select="name"))
                       )
          cbxsl$addTag("table", close=FALSE) # start Format/Label table
            cbxsl$addTag("tr",
                         cbxsl$addTag("td","Format:"),
                         cbxsl$addTag("td",
                                      cbxsl$addNode("xsl:value-of", attrs=c(select="format"))
                                      )
                         )
            cbxsl$addTag("tr",
                         cbxsl$addTag("td", "Label:"),
                         cbxsl$addTag("td",
                                      cbxsl$addNode("xsl:value-of", attrs=c(select="varLabel"))
                                      )
                         )
          cbxsl$closeTag() # close Format/Label table
# Value/code level
          # Only pulls for variables that have value/code pairs
          cbxsl$addNode("xsl:for-each", attrs=c(select="codes[.!='']"), close=FALSE)
            cbxsl$addTag("table", close=FALSE) # start Value/Label table
              cbxsl$addTag("tr", # header row
                           cbxsl$addTag("th","Value"),
                           cbxsl$addTag("th","Label")
                           )
              cbxsl$addNode("xsl:for-each", attrs=c(select="pair"), close=FALSE)
                cbxsl$addTag("tr", # value/label rows
                             cbxsl$addTag("td",
                                          cbxsl$addNode("xsl:value-of", attrs=c(select="val"))
                                          ),
                             cbxsl$addTag("td",
                                          cbxsl$addNode("xsl:value-of", attrs=c(select="codeLabel"))
                                          )
                             )
              cbxsl$closeNode() # close pair node
            cbxsl$closeTag() # close Value/Label table
          cbxsl$closeNode() # close value/code level
        cbxsl$addTag("hr") # horizontal rule between variables
# Take a look
cbxsl$value()

## <?xml version="1.0"?>
## <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
##   <xsl:template match="/">
##     <html>
##       <head>
##         <style>
## h1 {
##   color: #0000b2;
## }
## 
## h2 {
##   font-size: 16px;
##   color: #0000b2;
## }
## 
## th, td {
##   text-align: left;
##   vertical-align: top;
##   padding-right: 10px;
## }
## 
## </style>
##       </head>
##       <body>
##         <xsl:for-each select="codebook">
##           <h1>
##             <xsl:value-of select="studyTitle"/>
##           </h1>
##           <xsl:for-each select="summary">
##             <xsl:for-each select="./*">
##               <p>
##                 <xsl:value-of select="."/>
##               </p>
##             </xsl:for-each>
##           </xsl:for-each>
##           <xsl:for-each select="var">
##             <h2>
##               <xsl:value-of select="name"/>
##             </h2>
##             <table>
##               <tr>
##                 <td>Format:</td>
##                 <td>
##                   <xsl:value-of select="format"/>
##                 </td>
##               </tr>
##               <tr>
##                 <td>Label:</td>
##                 <td>
##                   <xsl:value-of select="varLabel"/>
##                 </td>
##               </tr>
##             </table>
##             <xsl:for-each select="codes[.!='']">
##               <table>
##                 <tr>
##                   <th>Value</th>
##                   <th>Label</th>
##                 </tr>
##                 <xsl:for-each select="pair">
##                   <tr>
##                     <td>
##                       <xsl:value-of select="val"/>
##                     </td>
##                     <td>
##                       <xsl:value-of select="codeLabel"/>
##                     </td>
##                   </tr>
##                 </xsl:for-each>
##               </table>
##             </xsl:for-each>
##             <hr/>
##           </xsl:for-each>
##         </xsl:for-each>
##       </body>
##     </html>
##   </xsl:template>
## </xsl:stylesheet>
##

The strategy here is to be mindful about when you want to nest inside of a node and when you just want to move on to the next node. The default behavior for addNode and addTag is to close the node and move on, so set close=FALSE if you want to nest children within it, then follow with closeNode() or closeTag() when you’re done with the children. Alternatively, you can add children by nesting functions. For example:

cbxsl$addTag("tr",
             cbxsl$addTag("td",
                          cbxsl$addNode("xsl:value-of", attrs=c(select="val"))
                          ),
             cbxsl$addTag("td",
                          cbxsl$addNode("xsl:value-of", attrs=c(select="codeLabel"))
                          )
             )

produces the following:

<tr>
  <td>
    <xsl:value-of select="val"/>
  </td>
  <td>
    <xsl:value-of select="codeLabel"/>
  </td>
</tr>

Note that it’s probably easier to write the XSL file by hand in RStudio, which closes the tags for you and keeps track of your indents. But if you want to keep everything in one Rmd document, xmlTree would be the way to go.

Compile

Merging the XML and XSL files into an HTML file is simple with the xml_xslt function:

# Bring in the external xml codebook file
doc <- read_xml("OpenScienceAim1Codebook.xml")
# Convert the xsl object to an xml_document object
style <- read_xml(saveXML(cbxsl$value()))
# Merge to an HTML document
htmldoc <- xml_xslt(doc,style)

At this point, we could export the htmldoc object as an HTML file and be done with it. However, if we’d like to take advantage of the formatting that knitr does so easily when making HTML documents of our code and comments (like the document you’re reading now), we can simply incorporate the object in the current document without having an extra HTML file sitting around.
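For the first option, exporting to a standalone file, a minimal sketch with xml2's write_html looks like the following. The demo document and the output path are made up so the snippet runs on its own; in this walkthrough you would pass htmldoc and a file name of your choosing.

```r
library(xml2)

# Stand-in for htmldoc; the content is made up
htmldoc_demo <- read_html(
  "<html><body><h1>Demo codebook</h1></body></html>"
)

# Write to a temporary location (swap in your own path)
out_file <- file.path(tempdir(), "codebook_demo.html")
write_html(htmldoc_demo, out_file)

file.exists(out_file)
```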

Here’s the YAML to produce an HTML document with the nifty Table of Contents (TOC) you see on the left side:

---
title: "XML Codebook Example"
output: 
  html_document:
    toc: true
    toc_float: 
      smooth_scroll: true
---

You can also keep your Rmd file from displaying any of the code that produces the XML, XSL, and HTML files by putting the following code chunk at the top of the document:

# Be sure to set {r include=FALSE} so the first chunk does not get displayed
knitr::opts_chunk$set(echo = FALSE)

The code below will seamlessly integrate the htmldoc object, even including the variable names in the TOC. Remember how we coded those within <h2> tags? This is why!

# The next two lines allow the "includeHTML" function to use the 
# htmldoc object, which is otherwise optimized to import external HTML files.
fp <- file()
cat(as.character(htmldoc), file=fp)
# Plug in the HTML codebook
includeHTML(fp)

Open Science Aim 1 Codebook

Project contact person: Jenine Harris, harrisj@wustl.edu

Project description: Surveyed public health practitioners on reproducible research practices

Sample and survey procedures: Email invitation for a web-based survey to members of the American Public Health Association statistics section, posted web link on project Twitter feed coding2share, RWJF Twitter feed, and JPHMP Twitter feed

Sample size: 247 participants; 207 complete surveys

Data collection time period: Sept-Dec 2017

ID

Format:	numeric
Label:

Source

Format:	labelled
Label:	Source

Value	Label
1	Email
2	JPHMP
3	RWJF

StartDate

Format:	POSIXctPOSIXt
Label:	StartDate

EndDate

Format:	POSIXctPOSIXt
Label:	EndDate

Finished

Format:	numeric
Label:	Finished

Q2

Format:	numeric
Label:	Research Reproducibility in Public Health

Q3

Format:	numeric
Label:	Introduction

Q4_1

Format:	numeric
Label:	Primary job responsibilities-Data collection

Q4_2

Format:	numeric
Label:	Primary job responsibilities-Data management

Q4_3

Format:	numeric
Label:	Primary job responsibilities-Descriptive analysis

Q4_4

Format:	numeric
Label:	Primary job responsibilities-Inferential analysis

Q4_5

Format:	numeric
Label:	Primary job responsibilities-Visualization/graph production

Q4_6

Format:	numeric
Label:	Primary job responsibilities-Contributing to publications or reports

Q5

Format:	labelled
Label:	Publication work

Value	Label
1	Yes
2	No

Q7_1_1

Format:	numeric
Label:	Data and code storage-Data-In the cloud (e.g., Dropbox, GitHub)

Q7_1_2

Format:	numeric
Label:	Data and code storage-Data-On a local server at your workplace

Q7_1_3

Format:	numeric
Label:	Data and code storage-Data-On a desktop computer

Q7_1_4

Format:	numeric
Label:	Data and code storage-Data-On a laptop computer

Q7_1_5

Format:	numeric
Label:	Data and code storage-Data-On a removable storage device (e.g., USB drive, CD)

Q7_1_6

Format:	numeric
Label:	Data and code storage-Data-There were no statistical code files (e.g., used point-and-click/menu driven approach in SPSS, Excel, or other software)

Q7_2_1

Format:	numeric
Label:	Data and code storage-Statistical code-In the cloud (e.g., Dropbox, GitHub)

Q7_2_2

Format:	numeric
Label:	Data and code storage-Statistical code-On a local server at your workplace

Q7_2_3

Format:	numeric
Label:	Data and code storage-Statistical code-On a desktop computer

Q7_2_4

Format:	numeric
Label:	Data and code storage-Statistical code-On a laptop computer

Q7_2_5

Format:	numeric
Label:	Data and code storage-Statistical code-On a removable storage device (e.g., USB drive, CD)

Q7_2_6

Format:	numeric
Label:	Data and code storage-Statistical code-There were no statistical code files (e.g., used point-and-click/menu driven approach in SPSS, Excel, or other software)

Q8_1_1

Format:	numeric
Label:	Files storage location-Data-In project-specific folders or directories

Q8_1_2

Format:	numeric
Label:	Files storage location-Data-In folders or directories not specific to the project

Q8_2_1

Format:	numeric
Label:	Files storage location-Statistical code-In project-specific folders or directories

Q8_2_2

Format:	numeric
Label:	Files storage location-Statistical code-In folders or directories not specific to the project

Q9_1

Format:	labelled
Label:	Files organization-Data

Value	Label
1	In a single file
2	In multiple files

Q9_2

Format:	labelled
Label:	Files organization-Statistical code

Value	Label
1	In a single file
2	In multiple files

Q10_1_1

Format:	numeric
Label:	Files accessibility-Data-Shareable with anyone (public)

Q10_1_2

Format:	numeric
Label:	Files accessibility-Data-Shareable but secure (i.e., authorized access only)

Q10_1_3

Format:	numeric
Label:	Files accessibility-Data-Private

Q10_2_1

Format:	numeric
Label:	Files accessibility-Statistical code-Shareable with anyone (public)

Q10_2_2

Format:	numeric
Label:	Files accessibility-Statistical code-Shareable but secure (i.e., authorized access only)

Q10_2_3

Format:	numeric
Label:	Files accessibility-Statistical code-Private

Q11_1

Format:	labelled
Label:	Location ease-Data

Value	Label
1	Very easy
2	Somewhat easy
3	Somewhat difficult
4	Very difficult

Q11_2

Format:	labelled
Label:	Location ease-Statistical code

Value	Label
1	Very easy
2	Somewhat easy
3	Somewhat difficult
4	Very difficult

Q12_1

Format:	labelled
Label:	Files storage process-Data

Value	Label
1	No set process
2	A set process developed in-house
3	A set process developed externally

Q12_2

Format:	labelled
Label:	Files storage process-Statistical code

Value	Label
1	No set process
2	A set process developed in-house
3	A set process developed externally

Q13

Format:	numeric
Label:	Code annotation and documentation

Q14

Format:	labelled
Label:	Variable dictionary

Value	Label
1	Yes
2	No

Q15

Format:	labelled
Label:	Prolog

Value	Label
1	Yes
2	No

Q16_1

Format:	numeric
Label:	Prolog contents-Project name

Q16_2

Format:	numeric
Label:	Prolog contents-Code developer name

Q16_3

Format:	numeric
Label:	Prolog contents-Contact information for the code developer or another author

Q16_4

Format:	numeric
Label:	Prolog contents-The purpose of the code

Q16_5

Format:	numeric
Label:	Prolog contents-Information about the language/program used to run the code (e.g., software package, version of language)

Q16_6

Format:	numeric
Label:	Prolog contents-File paths or other locating information for the data

Q16_7

Format:	numeric
Label:	Prolog contents-File paths or other locating information for the statistical code

Q16_8

Format:	numeric
Label:	Prolog contents-The last date the code was edited

Q16_9

Format:	numeric
Label:	Prolog contents-Something else (please specify):

Q16_9_TEXT

Format:	character
Label:	Prolog contents-Something else (please specify):-TEXT

Q17

Format:	labelled
Label:	Statistical code comments

Value	Label
1	Yes
2	No

Q18

Format:	labelled
Label:	Code comment comprehensiveness

Value	Label
1	Comprehensive: most or all steps of analysis have comments
2	Moderate: comments used regularly throughout the code for new sections and clarification
3	Sparse: few or no comments used

Q19

Format:	labelled
Label:	Guidelines

Value	Label
1	Yes, closely followed guidelines
2	Yes, partially followed guidelines
3	No

Q20_1

Format:	numeric
Label:	Specific guidelines-Google's R Style Guide

Q20_2

Format:	numeric
Label:	Specific guidelines-Bioconductor's Coding Standards (R)

Q20_3

Format:	numeric
Label:	Specific guidelines-Hadley Wickham's Style Guide (R)

Q20_4

Format:	numeric
Label:	Specific guidelines-Henrik Bengtsson's R Coding Conventions

Q20_5

Format:	numeric
Label:	Specific guidelines-Colin Gillespie's R Style Guide

Q20_6

Format:	numeric
Label:	Specific guidelines-SAS Style Guide

Q20_7

Format:	numeric
Label:	Specific guidelines-Guidelines for Coding of SAS Programs

Q20_8

Format:	numeric
Label:	Specific guidelines-Suggestions on Stata Programming Style

Q20_9

Format:	numeric
Label:	Specific guidelines-Google Python Style Guide

Q20_10

Format:	numeric
Label:	Specific guidelines-GNU Mailman Style Guide (Python)

Q20_11

Format:	numeric
Label:	Specific guidelines-Style Guide for Python Code: PEP 8

Q20_12

Format:	numeric
Label:	Specific guidelines-Visual Basic Coding Conventions (Excel, Access)

Q20_13

Format:	numeric
Label:	Specific guidelines-MATLAB Style Guidelines

Q20_14

Format:	numeric
Label:	Specific guidelines-Something else (please specify):

Q20_14_TEXT

Format:	character
Label:	Specific guidelines-Something else (please specify):-TEXT

Q21_1

Format:	numeric
Label:	Reasons for following guidelines-Policy of my research team/lab

Q21_2

Format:	numeric
Label:	Reasons for following guidelines-Required for publication

Q21_3

Format:	numeric
Label:	Reasons for following guidelines-Makes my life easier

Q21_4

Format:	numeric
Label:	Reasons for following guidelines-It produces better code

Q21_5

Format:	numeric
Label:	Reasons for following guidelines-I was taught this way

Q21_6

Format:	numeric
Label:	Reasons for following guidelines-Improves collaboration

Q21_7

Format:	numeric
Label:	Reasons for following guidelines-Increases reproducibility

Q21_8

Format:	numeric
Label:	Reasons for following guidelines-Something else (please specify):

Q21_8_TEXT

Format:	character
Label:	Reasons for following guidelines-Something else (please specify):-TEXT

Q22_1

Format:	numeric
Label:	Code formatting-Used nouns for variables and/or verbs for functions

Q22_2

Format:	numeric
Label:	Code formatting-Limited lines of code to a certain length (e.g., 80 characters maximum)

Q22_3

Format:	numeric
Label:	Code formatting-Included metadata, such as the date or project name, in the file title

Q22_4

Format:	numeric
Label:	Code formatting-Included seed values for analyses that included randomness

Q22_5

Format:	numeric
Label:	Code formatting-Used a consistent way to name variables and functions (e.g., all lower case, camel case, meaningful words, etc)

Q22_6

Format:	numeric
Label:	Code formatting-Separated analysis steps with white space or blank lines

Q22_7

Format:	numeric
Label:	Code formatting-Used indentation to group lines of code within procedures/functions

Q22_8

Format:	numeric
Label:	Code formatting-Wrote functions for tasks repeated multiple times

Q22_9

Format:	numeric
Label:	Code formatting-Included some results within the annotation (e.g., mean age = 39, outcome did not meet normality assumption)

Q22_10

Format:	numeric
Label:	Code formatting-Integrated code with text and results into a single output, also known as literate programming (e.g., Sweave, Jupyter Notebook, R Markdown)

Q22_11

Format:	numeric
Label:	Code formatting-Something else (please specify):

Q22_11_TEXT

Format:	character
Label:	Code formatting-Something else (please specify):-TEXT

Q22_12

Format:	numeric
Label:	Code formatting-None of the above

Q23

Format:	numeric
Label:	Reproducibility of statistical analyses

Q24_1

Format:	numeric
Label:	Publication contents-Software used

Q24_2

Format:	numeric
Label:	Publication contents-Units of analysis (e.g., people, health departments)

Q24_3

Format:	numeric
Label:	Publication contents-Details on missing data handling

Q24_4

Format:	numeric
Label:	Publication contents-Variable recoding details

Q24_5

Format:	numeric
Label:	Publication contents-The name, units, and types of variable analyzed

Q24_6

Format:	numeric
Label:	Publication contents-The statistical approaches used (e.g., logistic regression, mixed-effects regression)

Q24_7

Format:	numeric
Label:	Publication contents-The type of test statistics computed (e.g., chi-squared, F)

Q24_8

Format:	numeric
Label:	Publication contents-The value of test statistics

Q24_9

Format:	numeric
Label:	Publication contents-The specific variables included in each statistical model

Q24_10

Format:	numeric
Label:	Publication contents-Sample sizes for each analysis

Q24_11

Format:	numeric
Label:	Publication contents-Precise p-values when possible (e.g., p=.02 rather than p

Q24_12

Format:	numeric
Label:	Publication contents-None of the above

Q25_1_1

Format:	numeric
Label:	Files created/made publicly available-Created-A clean version of data used for analyses

Q25_1_2

Format:	numeric
Label:	Files created/made publicly available-Created-Clean statistical code for results included in the publication

Q25_1_3

Format:	numeric
Label:	Files created/made publicly available-Created-A project directory with data and/or statistical code used for the publication

Q25_1_4

Format:	numeric
Label:	Files created/made publicly available-Created-A readme file explaining the data and/or statistical code

Q25_1_5

Format:	numeric
Label:	Files created/made publicly available-Created-None of the above

Q25_2_1

Format:	numeric
Label:	Files created/made publicly available-Made publicly available-A clean version of data used for analyses

Q25_2_2

Format:	numeric
Label:	Files created/made publicly available-Made publicly available-Clean statistical code for results included in the publication

Q25_2_3

Format:	numeric
Label:	Files created/made publicly available-Made publicly available-A project directory with data and/or statistical code used for the publication

Q25_2_4

Format:	numeric
Label:	Files created/made publicly available-Made publicly available-A readme file explaining the data and/or statistical code

Q25_2_5

Format:	numeric
Label:	Files created/made publicly available-Made publicly available-None of the above

Q26

Format:	labelled
Label:	Have you ever made your statistical code publicly available?

Value	Label
1	Yes
2	No

Q27

Format:	labelled
Label:	Have you ever published with public data?

Value	Label
1	Yes
2	No

Q28_1_1

Format:	numeric
Label:	Required to make publicly available?-Data-Yes, by a funder

Q28_1_2

Format:	numeric
Label:	Required to make publicly available?-Data-Yes, by the journal

Q28_1_3

Format:	numeric
Label:	Required to make publicly available?-Data-Yes, by my employer

Q28_1_4

Format:	numeric
Label:	Required to make publicly available?-Data-Yes, by my research team

Q28_1_5

Format:	numeric
Label:	Required to make publicly available?-Data-No, it was not required

Q28_2_1

Format:	numeric
Label:	Required to make publicly available?-Statistical code-Yes, by a funder

Q28_2_2

Format:	numeric
Label:	Required to make publicly available?-Statistical code-Yes, by the journal

Q28_2_3

Format:	numeric
Label:	Required to make publicly available?-Statistical code-Yes, by my employer

Q28_2_4

Format:	numeric
Label:	Required to make publicly available?-Statistical code-Yes, by my research team

Q28_2_5

Format:	numeric
Label:	Required to make publicly available?-Statistical code-No, it was not required

Q29

Format:	labelled
Label:	Code development

Value	Label
1	The code was developed by one person working alone
2	The code was developed by one person and other people checked the code (co-pilot strategy)
3	Code was developed in separate files by two or more people independently and then compared before choosing one or a combination of the two (i.e., parallel code development)
4	Code was developed by two or more people working together on the same coding file

Q30

Format:	numeric
Label:	Research reproducibility facilitators and barriers

Q31_1

Format:	numeric
Label:	Reproducible research facilitators-Additional human and financial resources

Q31_2

Format:	numeric
Label:	Reproducible research facilitators-Training on reproducible research practices

Q31_3

Format:	numeric
Label:	Reproducible research facilitators-Requirements by funders to disseminate data and statistical code

Q31_4

Format:	numeric
Label:	Reproducible research facilitators-Requirements by journals to include access to data and statistical code

Q31_5

Format:	numeric
Label:	Reproducible research facilitators-Workplace incentives (e.g., pay increases or more credit toward tenure for reproducible/reproduced work)

Q31_6

Format:	numeric
Label:	Reproducible research facilitators-Something else (please specify):

Q31_6_TEXT

Format:	character
Label:	Reproducible research facilitators-Something else (please specify):-TEXT

Q32_1

Format:	numeric
Label:	Reproducible research barriers-Lack of time

Q32_2

Format:	numeric
Label:	Reproducible research barriers-Lack of knowledge/training on reproducible research practices

Q32_3

Format:	numeric
Label:	Reproducible research barriers-Data privacy

Q32_4

Format:	numeric
Label:	Reproducible research barriers-Intellectual property concerns

Q32_5

Format:	numeric
Label:	Reproducible research barriers-Professional competition

Q32_6

Format:	numeric
Label:	Reproducible research barriers-Concerns of errors being discovered

Q32_7

Format:	numeric
Label:	Reproducible research barriers-Lack of incentive

Q32_8

Format:	numeric
Label:	Reproducible research barriers-Something else (please specify):

Q32_8_TEXT

Format:	character
Label:	Reproducible research barriers-Something else (please specify):-TEXT

Q32_9

Format:	numeric
Label:	Reproducible research barriers-No barriers experienced

Q32_10

Format:	numeric
Label:	Reproducible research barriers-I have not tried to make data or statistical code available

Q33

Format:	labelled
Label:	If training were available on reproducible research practices, how likely are you to participate?

Value	Label
1	Very likely
2	Somewhat likely
3	Somewhat unlikely
4	Very unlikely

Q34_1

Format:	numeric
Label:	Training format preference-In-person workshop or short course

Q34_2

Format:	numeric
Label:	Training format preference-In-person one-on-one assistance

Q34_3

Format:	numeric
Label:	Training format preference-Online asynchronous course

Q34_4

Format:	numeric
Label:	Training format preference-Online live webinar

Q34_5

Format:	numeric
Label:	Training format preference-Documents (e.g., manual, collection of resources, checklists, etc)

Q34_6

Format:	numeric
Label:	Training format preference-Blog or other living web presence

Q34_7

Format:	numeric
Label:	Training format preference-Social media site

Q34_8

Format:	numeric
Label:	Training format preference-Something else (please specify):

Q34_8_TEXT

Format:	character
Label:	Training format preference-Something else (please specify):-TEXT

Q35

Format:	numeric
Label:	Participant characteristics

Q36

Format:	labelled
Label:	Identify the statistical software you use the most often:

Value	Label
1	Excel
2	M-Plus
3	MatLab
4	Python
5	R
6	SAS
7	SPSS
8	Stata
9	Other (please specify)

Q36_TEXT

Format:	character
Label:	Identify the statistical software you use the most often:-TEXT

Q37_1

Format:	numeric
Label:	Top 2 resources-Software user manual(s)

Q37_2

Format:	numeric
Label:	Top 2 resources-Textbook or reference book

Q37_3

Format:	numeric
Label:	Top 2 resources-Internet search (i.e. Google)

Q37_4

Format:	numeric
Label:	Top 2 resources-YouTube video demonstration or something similar

Q37_5

Format:	numeric
Label:	Top 2 resources-In-person one-on-one help

Q37_6

Format:	numeric
Label:	Top 2 resources-Email help

Q37_7

Format:	numeric
Label:	Top 2 resources-Help via a chat feature (gchat or similar)

Q37_8

Format:	numeric
Label:	Top 2 resources-Online training or course

Q37_9

Format:	numeric
Label:	Top 2 resources-In-person training or course

Q37_10

Format:	numeric
Label:	Top 2 resources-Websites, blogs (i.e. Stack Exchange)

Q37_11

Format:	numeric
Label:	Top 2 resources-Something else (specify):

Q37_11_TEXT

Format:	character
Label:	Top 2 resources-Something else (specify):-TEXT

Q38

Format:	labelled
Label:	Degree

Value	Label
1	PhD
2	MD
3	MD/PhD
4	DrPH
5	DSc
6	EdD
7	MPH
8	MSW
9	MS or MSc
10	MRes or MPhil
11	MA
12	MBA
13	MAT
14	BS
15	BA
16	Something else (please specify):

Q38_TEXT

Format:	character
Label:	Degree-TEXT

Q39

Format:	character
Label:	Degree field

Q40

Format:	character
Label:	Job title

Q41

Format:	labelled
Label:	How long have you been in your current position?

Value	Label
1	Less than 1 year
2	1-3 years
3	4-10 years
4	More than 10 years

Q42

Format:	labelled
Label:	What type of organization are you in?

Value	Label
1	University/school
2	For-profit
3	Government
4	Nonprofit
5	Something else (please specify):

Q42_TEXT

Format:	character
Label:	What type of organization are you in?-TEXT

Q43

Format:	labelled
Label:	Gender

Value	Label
1	Male
2	Female
3	Transgender
4	Do not identify as female, male, or transgender

Q44_1

Format:	numeric
Label:	Race/ethnicity-White

Q44_2

Format:	numeric
Label:	Race/ethnicity-Hispanic, Latino, or Spanish origin

Q44_3

Format:	numeric
Label:	Race/ethnicity-Black or African American

Q44_4

Format:	numeric
Label:	Race/ethnicity-Asian

Q44_5

Format:	numeric
Label:	Race/ethnicity-American Indian or Alaska Native

Q44_6

Format:	numeric
Label:	Race/ethnicity-Middle Eastern or North African

Q44_7

Format:	numeric
Label:	Race/ethnicity-Native Hawaiian or Other Pacific Islander

Q44_8

Format:	numeric
Label:	Race/ethnicity-Some other race, ethnicity, or origin

Q45

Format:	numeric
Label:	Age

Q46

Format:	character
Label:	Suggestions

Q47

Format:	numeric
Label:	Thanks message