Introduction
This document builds:
- A machine-readable XML codebook
- A human-readable HTML codebook
Why would I do this to myself? XML looks a lot like HTML, except that you get to make up your own tag names, like <otherStuff>. This makes the information nested inside of those tags selectable and readable by machines. The advantage: if you’re sharing a large dataset in a non-proprietary format (such as CSV) that can’t carry variable and value labels the way SPSS and SAS files can, you can also share an XML codebook that can be queried to pull the labels.
But XML files are ugly! Yes. Yes they are. That’s why we’ll also walk through how to write an XSL file (which is really just another XML file) that will arrange things the way we want them. Pair it to the XML file, add a little CSS if you really wanna go crazy and spice things up, and they produce a beautiful HTML baby that is more pleasing to the human eye.
You’ll need a solid working knowledge of R and HTML and/or a willingness to get in over your head for a bit before you eventually figure it out.
For this exercise, we’ll walk through the SPSS dataset from the survey we conducted to guide us through module development, which can be pulled from the coding2share GitHub page. Fire up a new Rmd file and dive in.
library(haven) # read/write SPSS
library(XML) # convert lists to xml
library(xslt) # compile xml and xsl to html
## Loading required package: xml2
library(htmltools) # embed HTML content
dat <- read_sav("https://github.com/coding2share/OpenSciSurveyPaper/blob/master/OpenScienceAim1.sav?raw=true") # Pull data off of GitHub
Pull lists from data
SPSS files, when set up properly, contain metadata that are useful for building codebooks. Using lapply with the functions that pull these characteristics (such as attr(dat$var, "label") from a particular variable) will iterate through the entire dataset and pull that characteristic for each variable and stack them into an orderly list.
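As a small illustration of the lapply/attr pattern described above, here is a sketch on a toy data frame. The data frame `toy` and its labels are made up for illustration and are not taken from the survey file. The sketch also passes `exact = TRUE`, because `attr()` does partial matching by default, so asking for "label" on a variable that only has a "labels" attribute can hand back the value labels instead.

```r
# Toy data frame standing in for an imported SPSS file (hypothetical values)
toy <- data.frame(age = c(34, 51), src = c(1, 2))
attr(toy$age, "label")  <- "Age in years"
attr(toy$src, "labels") <- c(Email = 1, JPHMP = 2)  # value labels only, no variable label

# Pull the variable label from every column; exact = TRUE avoids
# partial matching of "label" against "labels"
toyLabs <- lapply(toy, attr, "label", exact = TRUE)

# Pull the value labels the same way
toyValLabs <- lapply(toy, attr, "labels", exact = TRUE)

str(toyLabs)
```

Columns without the requested attribute come back as NULL, which is what lets the later steps skip variables that have no value labels.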
Note that you’ll need to deal with any special characters that may be present in any of the text that gets pulled, as seen with the gsub function below. If you have a lot of variables, they may not be worth hunting for until you look at the finished XML file. If you get an “XML Parsing Error: not well-formed” message or something similar when trying to open the file, it’s probably a special character issue; it should tell you where the problem is and you can go back and replace the characters. This is the easiest place to do so.
# Variable names
varNames <- as.list(names(dat))
names(varNames) <- names(dat)
# Variable labels
varLabs <- lapply(dat, attr, "label")
varLabs <- lapply(varLabs, function(x) gsub("[\u2019]", "'", x)) # replace curly apostrophe with a straight one
# Formats
varForm <- lapply(dat, class)
# Value labels
valLabs <- lapply(dat, attr, "labels")
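If more than one kind of special character turns up, the single gsub call above can be generalized. The following is a minimal sketch, assuming a hypothetical helper named `cleanText` and a short list of "smart" punctuation marks; which characters actually cause trouble in a given dataset is an assumption you would need to check against your own labels.

```r
# Hypothetical helper: swap common "smart" punctuation for plain ASCII
cleanText <- function(x) {
  swaps <- c("\u2018" = "'",  "\u2019" = "'",   # curly single quotes
             "\u201C" = "\"", "\u201D" = "\"",  # curly double quotes
             "\u2013" = "-",  "\u2014" = "-")   # en and em dashes
  for (from in names(swaps)) {
    # fixed = TRUE treats the pattern as a literal string, not a regex
    x <- gsub(from, swaps[[from]], x, fixed = TRUE)
  }
  x
}

cleanText("Bioconductor\u2019s \u201CCoding Standards\u201D")
```

It could be applied to the labels the same way as the original call, for example `lapply(varLabs, cleanText)`.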
Build the codebook XML tree
The file is easiest to build from the bottom up. The hierarchy is as follows:
- study title
- study summary
- variable 1
  - name
  - format
  - label
  - value codes (if applicable)
    - value-label pair 1
    - value-label pair 2 (etc)
- variable 2 (etc)
The following code will loop through all of the variables and value codes to pull everything that is needed.
First level
Build the value and value-label pairs:
# Code values
vval <- lapply(valLabs, function(x){
lapply(x, function(x) newXMLNode(name="val",x))
} )
# Value labels
# Pull label strings
varValLabs <- lapply(valLabs, attr, "names")
vvlabs <- lapply(varValLabs, function(x){
lapply(x, function(x) newXMLNode(name="codeLabel",x))
})
# Check Source example
vval[["Source"]]
## $Email
## <val>1</val>
##
## $JPHMP
## <val>2</val>
##
## $RWJF
## <val>3</val>
vvlabs[["Source"]]
## [[1]]
## <codeLabel>Email</codeLabel>
##
## [[2]]
## <codeLabel>JPHMP</codeLabel>
##
## [[3]]
## <codeLabel>RWJF</codeLabel>
Second level
Nest val and label inside <pair> tags:
# Add pair as a parent to the values
pairs <- lapply(vval, function(x){
lapply(x, function(x) newXMLNode(name="pair", .children=list(x)))
})
# Add labels as a child to pairs
for (i in seq_along(pairs)) {
  # seq_along() yields nothing for vars with no value/label pairs, so they are skipped
  for (j in seq_along(pairs[[i]])) {
    addChildren(pairs[[i]][[j]], kids = list(vvlabs[[i]][[j]]))
  }
}
# Check Source example
pairs[["Source"]]
## $Email
## <pair>
## <val>1</val>
## <codeLabel>Email</codeLabel>
## </pair>
##
## $JPHMP
## <pair>
## <val>2</val>
## <codeLabel>JPHMP</codeLabel>
## </pair>
##
## $RWJF
## <pair>
## <val>3</val>
## <codeLabel>RWJF</codeLabel>
## </pair>
The important part of the structure here is that each value and each label are appropriately nested in their own pairs. Note also that the code has done this for all variables that have labels for their values.
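For comparison, the same pair structure can be sketched in a single pass by handing both children to newXMLNode at creation time. This is an alternative, not the approach used in this document; the vector `srcLabs` is a toy stand-in for one variable's value labels.

```r
library(XML)

# Value labels as a named vector (toy values mirroring the Source variable)
srcLabs <- c(Email = 1, JPHMP = 2, RWJF = 3)

# Build each <pair> with its <val> and <codeLabel> children in one step
srcPairs <- Map(function(value, label) {
  newXMLNode("pair",
             newXMLNode("val", as.character(value)),
             newXMLNode("codeLabel", label))
}, unname(srcLabs), names(srcLabs))

srcPairs[[2]]
```

The two-step version above keeps the value nodes and label nodes available as separate lists, which is handy for checking each piece along the way.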
Third level
Wrap up value codes, build name, format, and labels:
# Add value label pairs to parents called "codes"
vcodes <- lapply(pairs, function(x)
newXMLNode(name="codes", .children=list(x))
)
# Pull variable names, formats, and labels
vnames <- lapply(varNames, function(x) newXMLNode(name="name",x))
vform <- lapply(varForm, function(x) newXMLNode(name="format",x))
vlabs <- lapply(varLabs, function(x) newXMLNode(name="varLabel",x))
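One thing to watch with the format nodes: class() can return more than one string, and date-time columns do. That appears to be why the rendered codebook further down shows "POSIXctPOSIXt" for the date variables. A minimal sketch of collapsing the class names into one readable string, using a made-up timestamp:

```r
# A date-time value carries two classes
started <- as.POSIXct("2017-09-01 08:00:00", tz = "UTC")
class(started)

# Collapse multiple class names into a single string
fmt <- paste(class(started), collapse = ", ")
fmt
```

Applied to the whole dataset, this would look something like `lapply(dat, function(x) paste(class(x), collapse = ", "))`, at the cost of the output no longer matching what is shown in this document.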
Fourth level
Wrap up the variables:
# Add vnames,vform, vlabs, and codes to parents called "var"
vars <- mapply(function(w,x,y,z)
newXMLNode(name="var", .children=list(w,x,y,z)),
vnames,vform,vlabs,vcodes)
# Check Source example
vars[["Source"]]
## <var>
## <name>Source</name>
## <format>labelled</format>
## <varLabel>Source</varLabel>
## <codes>
## <pair>
## <val>1</val>
## <codeLabel>Email</codeLabel>
## </pair>
## <pair>
## <val>2</val>
## <codeLabel>JPHMP</codeLabel>
## </pair>
## <pair>
## <val>3</val>
## <codeLabel>RWJF</codeLabel>
## </pair>
## </codes>
## </var>
Here you can see how all of the information for each variable is wrapped up in nested levels.
Add title and summary:
# Title
cTitle <- newXMLNode(name="studyTitle", "Open Science Aim 1 Codebook")
# Summary
cSum <- newXMLNode(name="summary",
newXMLNode(name="lin", "Project contact person: Jenine Harris, harrisj@wustl.edu"),
newXMLNode(name="lin", "Project description: Surveyed public health practitioners on reproducible research practices"),
newXMLNode(name="lin", "Sample and survey procedures: Email invitation for a web-based survey to members of the American Public Health Association statistics section, posted web link on project Twitter feed coding2share, RWJF Twitter feed, and JPHMP Twitter feed"),
newXMLNode(name="lin", "Sample size: 247 participants; 207 complete surveys"),
newXMLNode(name="lin", "Data collection time period: Sept-Dec 2017")
)
cSum
## <summary>
## <lin>Project contact person: Jenine Harris, harrisj@wustl.edu</lin>
## <lin>Project description: Surveyed public health practitioners on reproducible research practices</lin>
## <lin>Sample and survey procedures: Email invitation for a web-based survey to members of the American Public Health Association statistics section, posted web link on project Twitter feed coding2share, RWJF Twitter feed, and JPHMP Twitter feed</lin>
## <lin>Sample size: 247 participants; 207 complete surveys</lin>
## <lin>Data collection time period: Sept-Dec 2017</lin>
## </summary>
Here you can see how each line of the summary is wrapped in a <lin> tag.
Top level
Wrap title, summary, and variables into an overall codebook node:
# Add vars to main codebook node
cb <- newXMLNode(name="codebook", .children=list(cTitle, cSum, vars))
Export the XML file:
saveXML(cb, file="OpenScienceAim1Codebook.xml", encoding="UTF-8")
At this point, it’s a good idea to open the XML file in the web browser of your choice and see if things generally look like they should.
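This is also a good moment to see the machine-readable payoff. The sketch below queries a cut-down stand-in for the codebook with xml2 and XPath; the helper name `lookupLabels` is hypothetical, and the inline XML only mimics the tag names used above rather than reading the exported file.

```r
library(xml2)

# A cut-down stand-in for the exported codebook (same tag names as above)
cbToy <- read_xml("<codebook>
  <var>
    <name>Source</name>
    <codes>
      <pair><val>1</val><codeLabel>Email</codeLabel></pair>
      <pair><val>2</val><codeLabel>JPHMP</codeLabel></pair>
    </codes>
  </var>
</codebook>")

# Hypothetical helper: value labels for one variable as a named vector
lookupLabels <- function(cb, varName) {
  pairs <- xml_find_all(cb, sprintf("/codebook/var[name='%s']/codes/pair", varName))
  labs  <- xml_text(xml_find_first(pairs, "codeLabel"))
  names(labs) <- xml_text(xml_find_first(pairs, "val"))
  labs
}

lookupLabels(cbToy, "Source")
```

Pointing `read_xml` at the saved OpenScienceAim1Codebook.xml instead of the inline string should work the same way, assuming the file was written as shown above.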
Write the XSL formatting file
The XML codebook file created above is great for machines to read, but less great for humans to read. To fix that, we’ll create an XSL file that will arrange the XML file in the layout of our choosing, giving us a much more appealing HTML codebook file.
Having an idea of what you want the layout for the codebook to be is enormously helpful, as there are a number of ways to crumble this cookie. Here’s roughly what I’m shooting for in this example:
| Format: | Format of var1 |
| Label: | Label of var1 |
| 1 | Label 1 |
| 2 | Label 2 |
| 3 | Label 3 |
We’ll build one table for each variable with format and label information, and an additional table for variables with value labels. On the downside, building tables in HTML is tricky. On the upside, we only have to define each kind of table once. The XSL will cycle through all of the variables in the XML file and things will show up if they’re present, and if not (as in the case of variables without value labels), there’s no squawking.
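The "show up only if present" behavior in this section's stylesheet rests on an XPath predicate, codes[.!=''], which keeps only codes nodes that have some text content. Here is a small sketch of that predicate on its own, using xml2 on a toy document. It assumes a variable without value labels ends up with an empty `<codes/>` node.

```r
library(xml2)

cbDemo <- read_xml("<codebook>
  <var><name>ID</name><codes/></var>
  <var><name>Source</name><codes><pair><val>1</val><codeLabel>Email</codeLabel></pair></codes></var>
</codebook>")

# Every <codes> node, empty or not
allCodes <- xml_find_all(cbDemo, "//codes")

# Only <codes> nodes whose text content is non-empty
filled <- xml_find_all(cbDemo, "//codes[.!='']")

length(allCodes)
length(filled)
```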
The following tags set up the structure for building tables in HTML:
- <tr> for Table Row, which wraps around:
  - <th> for Table Head (header cell)
  - <td> for Table Data (regular cell)
We’ll also include the usual <h1> and <p> tags where necessary for text outside of the tables.
Within those tags, we’ll take advantage of the machine-readability of the XML file to pull that information into the tables. Within each header or data cell, we’ll use the <xsl:value-of select="xyz"> tag, where “xyz” corresponds to the tags we set up in the XML file: <format>, <varLabel>, etc.
Instead of building from the bottom up like we did for the XML file, this time we’ll build from the top down. We’ll need to declare the xsl namespace, and that declaration has to happen on the top-level node so that it covers everything nested inside. We’ll also use the xmlTree method this time, as it allows us to nest XML tags within HTML-style tags more easily.
# Doc level
cbxsl <- xmlTree("xsl:stylesheet", namespaces=list(xsl="http://www.w3.org/1999/XSL/Transform"),
attrs=c(version="1.0"), doc=newXMLDoc())
cbxsl$addNode("xsl:template", attrs=c(match="/"), close=FALSE)
cbxsl$addTag("html", close=FALSE)
cbxsl$addTag("head",
cbxsl$addTag("style", # CSS code below
"
h1 {
color: #0000b2;
}
h2 {
font-size: 16px;
color: #0000b2;
}
th, td {
text-align: left;
vertical-align: top;
padding-right: 10px;
}
"
)
)
cbxsl$addTag("body", close=FALSE)
cbxsl$addNode("xsl:for-each", attrs=c(select="codebook"), close=FALSE)
cbxsl$addTag("h1",
cbxsl$addNode("xsl:value-of", attrs=c(select="studyTitle"))
)
cbxsl$addNode("xsl:for-each", attrs=c(select="summary"), close=FALSE)
cbxsl$addNode("xsl:for-each",
cbxsl$addTag("p", # Take all of the summary elements
cbxsl$addNode("xsl:value-of", attrs=c(select="."))
),
attrs=c(select="./*"))
cbxsl$closeNode() # close summary node
# Variable level
cbxsl$addNode("xsl:for-each", attrs=c(select="var"), close=FALSE)
cbxsl$addTag("h2",
cbxsl$addNode("xsl:value-of", attrs=c(select="name"))
)
cbxsl$addTag("table", close=FALSE) # start Format/Label table
cbxsl$addTag("tr",
cbxsl$addTag("td","Format:"),
cbxsl$addTag("td",
cbxsl$addNode("xsl:value-of", attrs=c(select="format"))
)
)
cbxsl$addTag("tr",
cbxsl$addTag("td", "Label:"),
cbxsl$addTag("td",
cbxsl$addNode("xsl:value-of", attrs=c(select="varLabel"))
)
)
cbxsl$closeTag() # close Format/Label table
# Value/code level
# Only pulls for variables that have value/code pairs
cbxsl$addNode("xsl:for-each", attrs=c(select="codes[.!='']"), close=FALSE)
cbxsl$addTag("table", close=FALSE) # start Value/Label table
cbxsl$addTag("tr", # header row
cbxsl$addTag("th","Value"),
cbxsl$addTag("th","Label")
)
cbxsl$addNode("xsl:for-each", attrs=c(select="pair"), close=FALSE)
cbxsl$addTag("tr", # value/label rows
cbxsl$addTag("td",
cbxsl$addNode("xsl:value-of", attrs=c(select="val"))
),
cbxsl$addTag("td",
cbxsl$addNode("xsl:value-of", attrs=c(select="codeLabel"))
)
)
cbxsl$closeNode() # close pair node
cbxsl$closeTag() # close Value/Label table
cbxsl$closeNode() # close value/code level
cbxsl$addTag("hr") # horizontal rule between variables
# Take a look
cbxsl$value()
## <?xml version="1.0"?>
## <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
## <xsl:template match="/">
## <html>
## <head>
## <style>
## h1 {
## color: #0000b2;
## }
##
## h2 {
## font-size: 16px;
## color: #0000b2;
## }
##
## th, td {
## text-align: left;
## vertical-align: top;
## padding-right: 10px;
## }
##
## </style>
## </head>
## <body>
## <xsl:for-each select="codebook">
## <h1>
## <xsl:value-of select="studyTitle"/>
## </h1>
## <xsl:for-each select="summary">
## <xsl:for-each select="./*">
## <p>
## <xsl:value-of select="."/>
## </p>
## </xsl:for-each>
## </xsl:for-each>
## <xsl:for-each select="var">
## <h2>
## <xsl:value-of select="name"/>
## </h2>
## <table>
## <tr>
## <td>Format:</td>
## <td>
## <xsl:value-of select="format"/>
## </td>
## </tr>
## <tr>
## <td>Label:</td>
## <td>
## <xsl:value-of select="varLabel"/>
## </td>
## </tr>
## </table>
## <xsl:for-each select="codes[.!='']">
## <table>
## <tr>
## <th>Value</th>
## <th>Label</th>
## </tr>
## <xsl:for-each select="pair">
## <tr>
## <td>
## <xsl:value-of select="val"/>
## </td>
## <td>
## <xsl:value-of select="codeLabel"/>
## </td>
## </tr>
## </xsl:for-each>
## </table>
## </xsl:for-each>
## <hr/>
## </xsl:for-each>
## </xsl:for-each>
## </body>
## </html>
## </xsl:template>
## </xsl:stylesheet>
##
The strategy here is to be mindful about when you want to nest inside of a node and when you just want to move on to the next node. The default behavior for addNode and addTag is to close the node and move on, so set close=FALSE if you want to nest children within it, then follow with closeNode() or closeTag() when you’re done with the children. Alternatively, you can add children by nesting functions. For example:
cbxsl$addTag("tr",
cbxsl$addTag("td",
cbxsl$addNode("xsl:value-of", attrs=c(select="val"))
),
cbxsl$addTag("td",
cbxsl$addNode("xsl:value-of", attrs=c(select="codeLabel"))
)
)
produces the following:
<tr>
<td>
<xsl:value-of select="val"/>
</td>
<td>
<xsl:value-of select="codeLabel"/>
</td>
</tr>
Note that it’s probably easier to build this structure by hand in RStudio; it’s handy with closing the tags for you and keeping track of your indents. But if you want to keep everything in one Rmd document, xmlTree would be the way to go.
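For the hand-written route, a stylesheet can also live in a plain string (or a separate .xsl file) and be parsed with xml2. The sketch below uses a cut-down stylesheet and a two-variable toy codebook, both made up for illustration, and applies the stylesheet with xml_xslt from the xslt package loaded at the top.

```r
library(xml2)
library(xslt)

# Hand-written stylesheet kept as a string (a cut-down version of the one above)
xslText <- '<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/">
    <html><body>
      <xsl:for-each select="codebook/var">
        <h2><xsl:value-of select="name"/></h2>
      </xsl:for-each>
    </body></html>
  </xsl:template>
</xsl:stylesheet>'

# Toy codebook with two variables
xmlText <- "<codebook><var><name>Source</name></var><var><name>Q5</name></var></codebook>"

demoStyle <- read_xml(xslText)
demoDoc   <- read_xml(xmlText)

# Apply the stylesheet and view the result as text
demoHtml <- xml_xslt(demoDoc, demoStyle)
cat(as.character(demoHtml))
```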
Compile
Merging the XML and XSL files into an HTML file is simple with the xml_xslt function:
# Bring in the external xml codebook file
doc <- read_xml("OpenScienceAim1Codebook.xml")
# Convert the xsl object to an xml_document object
style <- read_xml(saveXML(cbxsl$value()))
# Merge to an HTML document
htmldoc <- xml_xslt(doc,style)
At this point, we could export the htmldoc object as an HTML file and be done with it. However, if we’d like to take advantage of the formatting that knitr does so easily when making HTML documents of our code and comments (like the document you’re reading now), we can simply incorporate the object in the current document without having an extra HTML file sitting around.
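If a standalone file is what you want, a minimal sketch of the export step might look like the following. It uses write_html from xml2 on a stand-in document rather than the real htmldoc object, and writes to a temporary file; whether write_html or write_xml is the better fit for a given xml_xslt result is an assumption worth checking against your own output.

```r
library(xml2)

# Stand-in for the htmldoc object produced by xml_xslt()
demoPage <- read_html("<html><body><h1>Open Science Aim 1 Codebook</h1></body></html>")

# Write the document out; a temp file is used here instead of a project path
outFile <- tempfile(fileext = ".html")
write_html(demoPage, outFile)

file.exists(outFile)
```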
Here’s the YAML to produce an HTML document with the nifty Table of Contents (TOC) you see on the left side:
---
title: "XML Codebook Example"
output:
html_document:
toc: true
toc_float:
smooth_scroll: true
---
You can also set your Rmd file to not display any of the code to produce the XML, XSL, and HTML files by putting the following code chunk at the top of the document:
# Be sure to set {r include=FALSE} so the first chunk does not get displayed
knitr::opts_chunk$set(echo = FALSE)
The code below will seamlessly integrate the htmldoc object, even including the variable names in the TOC. Remember how we coded those within <h2> tags? This is why!
# The next two lines allow the "includeHTML" function to use the
# htmldoc object, which is otherwise optimized to import external HTML files.
fp <- file()
cat(as.character(htmldoc), file=fp)
# Plug in the HTML codebook
includeHTML(fp)
Open Science Aim 1 Codebook
Project contact person: Jenine Harris, harrisj@wustl.edu
Project description: Surveyed public health practitioners on reproducible research practices
Sample and survey procedures: Email invitation for a web-based survey to members of the American Public Health Association statistics section, posted web link on project Twitter feed coding2share, RWJF Twitter feed, and JPHMP Twitter feed
Sample size: 247 participants; 207 complete surveys
Data collection time period: Sept-Dec 2017
ID
Source
| Format: | labelled |
| Label: | Source |
| Value | Label |
| 1 | Email |
| 2 | JPHMP |
| 3 | RWJF |
StartDate
| Format: | POSIXctPOSIXt |
| Label: | StartDate |
EndDate
| Format: | POSIXctPOSIXt |
| Label: | EndDate |
Finished
| Format: | numeric |
| Label: | Finished |
Q2
| Format: | numeric |
| Label: | Research Reproducibility in Public Health |
Q3
| Format: | numeric |
| Label: | Introduction |
Q4_1
| Format: | numeric |
| Label: | Primary job responsibilities-Data collection |
Q4_2
| Format: | numeric |
| Label: | Primary job responsibilities-Data management |
Q4_3
| Format: | numeric |
| Label: | Primary job responsibilities-Descriptive analysis |
Q4_4
| Format: | numeric |
| Label: | Primary job responsibilities-Inferential analysis |
Q4_5
| Format: | numeric |
| Label: | Primary job responsibilities-Visualization/graph production |
Q4_6
| Format: | numeric |
| Label: | Primary job responsibilities-Contributing to publications or reports |
Q5
| Format: | labelled |
| Label: | Publication work |
Q7_1_1
| Format: | numeric |
| Label: | Data and code storage-Data-In the cloud (e.g., Dropbox, GitHub) |
Q7_1_2
| Format: | numeric |
| Label: | Data and code storage-Data-On a local server at your workplace |
Q7_1_3
| Format: | numeric |
| Label: | Data and code storage-Data-On a desktop computer |
Q7_1_4
| Format: | numeric |
| Label: | Data and code storage-Data-On a laptop computer |
Q7_1_5
| Format: | numeric |
| Label: | Data and code storage-Data-On a removable storage device (e.g., USB drive, CD) |
Q7_1_6
| Format: | numeric |
| Label: | Data and code storage-Data-There were no statistical code files (e.g., used point-and-click/menu driven approach in SPSS, Excel, or other software) |
Q7_2_1
| Format: | numeric |
| Label: | Data and code storage-Statistical code-In the cloud (e.g., Dropbox, GitHub) |
Q7_2_2
| Format: | numeric |
| Label: | Data and code storage-Statistical code-On a local server at your workplace |
Q7_2_3
| Format: | numeric |
| Label: | Data and code storage-Statistical code-On a desktop computer |
Q7_2_4
| Format: | numeric |
| Label: | Data and code storage-Statistical code-On a laptop computer |
Q7_2_5
| Format: | numeric |
| Label: | Data and code storage-Statistical code-On a removable storage device (e.g., USB drive, CD) |
Q7_2_6
| Format: | numeric |
| Label: | Data and code storage-Statistical code-There were no statistical code files (e.g., used point-and-click/menu driven approach in SPSS, Excel, or other software) |
Q8_1_1
| Format: | numeric |
| Label: | Files storage location-Data-In project-specific folders or directories |
Q8_1_2
| Format: | numeric |
| Label: | Files storage location-Data-In folders or directories not specific to the project |
Q8_2_1
| Format: | numeric |
| Label: | Files storage location-Statistical code-In project-specific folders or directories |
Q8_2_2
| Format: | numeric |
| Label: | Files storage location-Statistical code-In folders or directories not specific to the project |
Q9_1
| Format: | labelled |
| Label: | Files organization-Data |
| Value | Label |
| 1 | In a single file |
| 2 | In multiple files |
Q9_2
| Format: | labelled |
| Label: | Files organization-Statistical code |
| Value | Label |
| 1 | In a single file |
| 2 | In multiple files |
Q10_1_1
| Format: | numeric |
| Label: | Files accessibility-Data-Shareable with anyone (public) |
Q10_1_2
| Format: | numeric |
| Label: | Files accessibility-Data-Shareable but secure (i.e., authorized access only) |
Q10_1_3
| Format: | numeric |
| Label: | Files accessibility-Data-Private |
Q10_2_1
| Format: | numeric |
| Label: | Files accessibility-Statistical code-Shareable with anyone (public) |
Q10_2_2
| Format: | numeric |
| Label: | Files accessibility-Statistical code-Shareable but secure (i.e., authorized access only) |
Q10_2_3
| Format: | numeric |
| Label: | Files accessibility-Statistical code-Private |
Q11_1
| Format: | labelled |
| Label: | Location ease-Data |
| Value | Label |
| 1 | Very easy |
| 2 | Somewhat easy |
| 3 | Somewhat difficult |
| 4 | Very difficult |
Q11_2
| Format: | labelled |
| Label: | Location ease-Statistical code |
| Value | Label |
| 1 | Very easy |
| 2 | Somewhat easy |
| 3 | Somewhat difficult |
| 4 | Very difficult |
Q12_1
| Format: | labelled |
| Label: | Files storage process-Data |
| Value | Label |
| 1 | No set process |
| 2 | A set process developed in-house |
| 3 | A set process developed externally |
Q12_2
| Format: | labelled |
| Label: | Files storage process-Statistical code |
| Value | Label |
| 1 | No set process |
| 2 | A set process developed in-house |
| 3 | A set process developed externally |
Q13
| Format: | numeric |
| Label: | Code annotation and documentation |
Q14
| Format: | labelled |
| Label: | Variable dictionary |
Q15
| Format: | labelled |
| Label: | Prolog |
Q16_1
| Format: | numeric |
| Label: | Prolog contents-Project name |
Q16_2
| Format: | numeric |
| Label: | Prolog contents-Code developer name |
Q16_3
| Format: | numeric |
| Label: | Prolog contents-Contact information for the code developer or another author |
Q16_4
| Format: | numeric |
| Label: | Prolog contents-The purpose of the code |
Q16_5
| Format: | numeric |
| Label: | Prolog contents-Information about the language/program used to run the code (e.g., software package, version of language) |
Q16_6
| Format: | numeric |
| Label: | Prolog contents-File paths or other locating information for the data |
Q16_7
| Format: | numeric |
| Label: | Prolog contents-File paths or other locating information for the statistical code |
Q16_8
| Format: | numeric |
| Label: | Prolog contents-The last date the code was edited |
Q16_9
| Format: | numeric |
| Label: | Prolog contents-Something else (please specify): |
Q16_9_TEXT
| Format: | character |
| Label: | Prolog contents-Something else (please specify):-TEXT |
Q17
| Format: | labelled |
| Label: | Statistical code comments |
Q18
| Format: | labelled |
| Label: | Code comment comprehensiveness |
| Value | Label |
| 1 | Comprehensive: most or all steps of analysis have comments |
| 2 | Moderate: comments used regularly throughout the code for new sections and clarification |
| 3 | Sparse: few or no comments used |
Q19
| Format: | labelled |
| Label: | Guidelines |
| Value | Label |
| 1 | Yes, closely followed guidelines |
| 2 | Yes, partially followed guidelines |
| 3 | No |
Q20_1
| Format: | numeric |
| Label: | Specific guidelines-Google's R Style Guide |
Q20_2
| Format: | numeric |
| Label: | Specific guidelines-Bioconductor's Coding Standards (R) |
Q20_3
| Format: | numeric |
| Label: | Specific guidelines-Hadley Wickham's Style Guide (R) |
Q20_4
| Format: | numeric |
| Label: | Specific guidelines-Henrik Bengtsson's R Coding Conventions |
Q20_5
| Format: | numeric |
| Label: | Specific guidelines-Colin Gillespie's R Style Guide |
Q20_6
| Format: | numeric |
| Label: | Specific guidelines-SAS Style Guide |
Q20_7
| Format: | numeric |
| Label: | Specific guidelines-Guidelines for Coding of SAS Programs |
Q20_8
| Format: | numeric |
| Label: | Specific guidelines-Suggestions on Stata Programming Style |
Q20_9
| Format: | numeric |
| Label: | Specific guidelines-Google Python Style Guide |
Q20_10
| Format: | numeric |
| Label: | Specific guidelines-GNU Mailman Style Guide (Python) |
Q20_11
| Format: | numeric |
| Label: | Specific guidelines-Style Guide for Python Code: PEP 8 |
Q20_12
| Format: | numeric |
| Label: | Specific guidelines-Visual Basic Coding Conventions (Excel, Access) |
Q20_13
| Format: | numeric |
| Label: | Specific guidelines-MATLAB Style Guidelines |
Q20_14
| Format: | numeric |
| Label: | Specific guidelines-Something else (please specify): |
Q20_14_TEXT
| Format: | character |
| Label: | Specific guidelines-Something else (please specify):-TEXT |
Q21_1
| Format: | numeric |
| Label: | Reasons for following guidelines-Policy of my research team/lab |
Q21_2
| Format: | numeric |
| Label: | Reasons for following guidelines-Required for publication |
Q21_3
| Format: | numeric |
| Label: | Reasons for following guidelines-Makes my life easier |
Q21_4
| Format: | numeric |
| Label: | Reasons for following guidelines-It produces better code |
Q21_5
| Format: | numeric |
| Label: | Reasons for following guidelines-I was taught this way |
Q21_6
| Format: | numeric |
| Label: | Reasons for following guidelines-Improves collaboration |
Q21_7
| Format: | numeric |
| Label: | Reasons for following guidelines-Increases reproducibility |
Q21_8
| Format: | numeric |
| Label: | Reasons for following guidelines-Something else (please specify): |
Q21_8_TEXT
| Format: | character |
| Label: | Reasons for following guidelines-Something else (please specify):-TEXT |
Q22_1
| Format: | numeric |
| Label: | Code formatting-Used nouns for variables and/or verbs for functions |
Q22_2
| Format: | numeric |
| Label: | Code formatting-Limited lines of code to a certain length (e.g., 80 characters maximum) |
Q22_3
| Format: | numeric |
| Label: | Code formatting-Included metadata, such as the date or project name, in the file title |
Q22_4
| Format: | numeric |
| Label: | Code formatting-Included seed values for analyses that included randomness |
Q22_5
| Format: | numeric |
| Label: | Code formatting-Used a consistent way to name variables and functions (e.g., all lower case, camel case, meaningful words, etc) |
Q22_6
| Format: | numeric |
| Label: | Code formatting-Separated analysis steps with white space or blank lines |
Q22_7
| Format: | numeric |
| Label: | Code formatting-Used indentation to group lines of code within procedures/functions |
Q22_8
| Format: | numeric |
| Label: | Code formatting-Wrote functions for tasks repeated multiple times |
Q22_9
| Format: | numeric |
| Label: | Code formatting-Included some results within the annotation (e.g., mean age = 39, outcome did not meet normality assumption) |
Q22_10
| Format: | numeric |
| Label: | Code formatting-Integrated code with text and results into a single output, also known as literate programming (e.g., Sweave, Jupyter Notebook, R Markdown) |
Q22_11
| Format: | numeric |
| Label: | Code formatting-Something else (please specify): |
Q22_11_TEXT
| Format: | character |
| Label: | Code formatting-Something else (please specify):-TEXT |
Q22_12
| Format: | numeric |
| Label: | Code formatting-None of the above |
Q23
| Format: | numeric |
| Label: | Reproducibility of statistical analyses |
Q24_1
| Format: | numeric |
| Label: | Publication contents-Software used |
Q24_2
| Format: | numeric |
| Label: | Publication contents-Units of analysis (e.g., people, health departments) |
Q24_3
| Format: | numeric |
| Label: | Publication contents-Details on missing data handling |
Q24_4
| Format: | numeric |
| Label: | Publication contents-Variable recoding details |
Q24_5
| Format: | numeric |
| Label: | Publication contents-The name, units, and types of variable analyzed |
Q24_6
| Format: | numeric |
| Label: | Publication contents-The statistical approaches used (e.g., logistic regression, mixed-effects regression) |
Q24_7
| Format: | numeric |
| Label: | Publication contents-The type of test statistics computed (e.g., chi-squared, F) |
Q24_8
| Format: | numeric |
| Label: | Publication contents-The value of test statistics |
Q24_9
| Format: | numeric |
| Label: | Publication contents-The specific variables included in each statistical model |
Q24_10
| Format: | numeric |
| Label: | Publication contents-Sample sizes for each analysis |
Q24_11
| Format: | numeric |
| Label: | Publication contents-Precise p-values when possible (e.g., p=.02 rather than p |
Q24_12
| Format: | numeric |
| Label: | Publication contents-None of the above |
Q25_1_1
| Format: | numeric |
| Label: | Files created/made publicly available-Created-A clean version of data used for analyses |
Q25_1_2
| Format: | numeric |
| Label: | Files created/made publicly available-Created-Clean statistical code for results included in the publication |
Q25_1_3
| Format: | numeric |
| Label: | Files created/made publicly available-Created-A project directory with data and/or statistical code used for the publication |
Q25_1_4
| Format: | numeric |
| Label: | Files created/made publicly available-Created-A readme file explaining the data and/or statistical code |
Q25_1_5
| Format: | numeric |
| Label: | Files created/made publicly available-Created-None of the above |
Q25_2_1
| Format: | numeric |
| Label: | Files created/made publicly available-Made publicly available-A clean version of data used for analyses |
Q25_2_2
| Format: | numeric |
| Label: | Files created/made publicly available-Made publicly available-Clean statistical code for results included in the publication |
Q25_2_3
| Format: | numeric |
| Label: | Files created/made publicly available-Made publicly available-A project directory with data and/or statistical code used for the publication |
Q25_2_4
| Format: | numeric |
| Label: | Files created/made publicly available-Made publicly available-A readme file explaining the data and/or statistical code |
Q25_2_5
| Format: | numeric |
| Label: | Files created/made publicly available-Made publicly available-None of the above |
Q26
| Format: | labelled |
| Label: | Have you ever made your statistical code publicly available? |
Q27
| Format: | labelled |
| Label: | Have you ever published with public data? |
Q28_1_1
| Format: | numeric |
| Label: | Required to make publicly available?-Data-Yes, by a funder |
Q28_1_2
| Format: | numeric |
| Label: | Required to make publicly available?-Data-Yes, by the journal |
Q28_1_3
| Format: | numeric |
| Label: | Required to make publicly available?-Data-Yes, by my employer |
Q28_1_4
| Format: | numeric |
| Label: | Required to make publicly available?-Data-Yes, by my research team |
Q28_1_5
| Format: | numeric |
| Label: | Required to make publicly available?-Data-No, it was not required |
Q28_2_1
| Format: | numeric |
| Label: | Required to make publicly available?-Statistical code-Yes, by a funder |
Q28_2_2
| Format: | numeric |
| Label: | Required to make publicly available?-Statistical code-Yes, by the journal |
Q28_2_3
| Format: | numeric |
| Label: | Required to make publicly available?-Statistical code-Yes, by my employer |
Q28_2_4
| Format: | numeric |
| Label: | Required to make publicly available?-Statistical code-Yes, by my research team |
Q28_2_5
| Format: | numeric |
| Label: | Required to make publicly available?-Statistical code-No, it was not required |
Q29
| Format: | labelled |
| Label: | Code development |
| Value | Label |
| 1 | The code was developed by one person working alone |
| 2 | The code was developed by one person and other people checked the code (co-pilot strategy) |
| 3 | Code was developed in separate files by two or more people independently and then compared before choosing one or a combination of the two (i.e., parallel code development) |
| 4 | Code was developed by two or more people working together on the same coding file |
Q30
| Format: | numeric |
| Label: | Research reproducibility facilitators and barriers |
Q31_1
| Format: | numeric |
| Label: | Reproducible research facilitators-Additional human and financial resources |
Q31_2
| Format: | numeric |
| Label: | Reproducible research facilitators-Training on reproducible research practices |
Q31_3
| Format: | numeric |
| Label: | Reproducible research facilitators-Requirements by funders to disseminate data and statistical code |
Q31_4
| Format: | numeric |
| Label: | Reproducible research facilitators-Requirements by journals to include access to data and statistical code |
Q31_5
| Format: | numeric |
| Label: | Reproducible research facilitators-Workplace incentives (e.g., pay increases or more credit toward tenure for reproducible/reproduced work) |
Q31_6
| Format: | numeric |
| Label: | Reproducible research facilitators-Something else (please specify): |
Q31_6_TEXT
| Format: | character |
| Label: | Reproducible research facilitators-Something else (please specify):-TEXT |
Q32_1
| Format: | numeric |
| Label: | Reproducible research barriers-Lack of time |
Q32_2
| Format: | numeric |
| Label: | Reproducible research barriers-Lack of knowledge/training on reproducible research practices |
Q32_3
| Format: | numeric |
| Label: | Reproducible research barriers-Data privacy |
Q32_4
| Format: | numeric |
| Label: | Reproducible research barriers-Intellectual property concerns |
Q32_5
| Format: | numeric |
| Label: | Reproducible research barriers-Professional competition |
Q32_6
| Format: | numeric |
| Label: | Reproducible research barriers-Concerns of errors being discovered |
Q32_7
| Format: | numeric |
| Label: | Reproducible research barriers-Lack of incentive |
Q32_8
| Format: | numeric |
| Label: | Reproducible research barriers-Something else (please specify): |
Q32_8_TEXT
| Format: | character |
| Label: | Reproducible research barriers-Something else (please specify):-TEXT |
Q32_9
| Format: | numeric |
| Label: | Reproducible research barriers-No barriers experienced |
Q32_10
| Format: | numeric |
| Label: | Reproducible research barriers-I have not tried to make data or statistical code available |
Q33
| Format: | labelled |
| Label: | If training were available on reproducible research practices, how likely are you to participate? |
| Value | Label |
| 1 | Very likely |
| 2 | Somewhat likely |
| 3 | Somewhat unlikely |
| 4 | Very unlikely |
Q34_1
| Format: | numeric |
| Label: | Training format preference-In-person workshop or short course |
Q34_2
| Format: | numeric |
| Label: | Training format preference-In-person one-on-one assistance |
Q34_3
| Format: | numeric |
| Label: | Training format preference-Online asynchronous course |
Q34_4
| Format: | numeric |
| Label: | Training format preference-Online live webinar |
Q34_5
| Format: | numeric |
| Label: | Training format preference-Documents (e.g., manual, collection of resources, checklists, etc) |
Q34_6
| Format: | numeric |
| Label: | Training format preference-Blog or other living web presence |
Q34_7
| Format: | numeric |
| Label: | Training format preference-Social media site |
Q34_8
| Format: | numeric |
| Label: | Training format preference-Something else (please specify): |
Q34_8_TEXT
| Format: | character |
| Label: | Training format preference-Something else (please specify):-TEXT |
Q35
| Format: | numeric |
| Label: | Participant characteristics |
Q36
| Format: | labelled |
| Label: | Identify the statistical software you use the most often: |
| Value | Label |
| 1 | Excel |
| 2 | M-Plus |
| 3 | MatLab |
| 4 | Python |
| 5 | R |
| 6 | SAS |
| 7 | SPSS |
| 8 | Stata |
| 9 | Other (please specify) |
Q36_TEXT
| Format: | character |
| Label: | Identify the statistical software you use the most often:-TEXT |
Q37_1
| Format: | numeric |
| Label: | Top 2 resources-Software user manual(s) |
Q37_2
| Format: | numeric |
| Label: | Top 2 resources-Textbook or reference book |
Q37_3
| Format: | numeric |
| Label: | Top 2 resources-Internet search (i.e. Google) |
Q37_4
| Format: | numeric |
| Label: | Top 2 resources-YouTube video demonstration or something similar |
Q37_5
| Format: | numeric |
| Label: | Top 2 resources-In-person one-on-one help |
Q37_6
| Format: | numeric |
| Label: | Top 2 resources-Email help |
Q37_7
| Format: | numeric |
| Label: | Top 2 resources-Help via a chat feature (gchat or similar) |
Q37_8
| Format: | numeric |
| Label: | Top 2 resources-Online training or course |
Q37_9
| Format: | numeric |
| Label: | Top 2 resources-In-person training or course |
Q37_10
| Format: | numeric |
| Label: | Top 2 resources-Websites, blogs (i.e. Stack Exchange) |
Q37_11
| Format: | numeric |
| Label: | Top 2 resources-Something else (specify): |
Q37_11_TEXT
| Format: | character |
| Label: | Top 2 resources-Something else (specify):-TEXT |
Q38
| Format: | labelled |
| Label: | Degree |
| Value | Label |
| 1 | PhD |
| 2 | MD |
| 3 | MD/PhD |
| 4 | DrPH |
| 5 | DSc |
| 6 | EdD |
| 7 | MPH |
| 8 | MSW |
| 9 | MS or MSc |
| 10 | MRes or MPhil |
| 11 | MA |
| 12 | MBA |
| 13 | MAT |
| 14 | BS |
| 15 | BA |
| 16 | Something else (please specify): |
Q38_TEXT
| Format: | character |
| Label: | Degree-TEXT |
Q39
| Format: | character |
| Label: | Degree field |
Q40
| Format: | character |
| Label: | Job title |
Q41
| Format: | labelled |
| Label: | How long have you been in your current position? |
| Value | Label |
| 1 | Less than 1 year |
| 2 | 1-3 years |
| 3 | 4-10 years |
| 4 | More than 10 years |
Q42
| Format: | labelled |
| Label: | What type of organization are you in? |
| Value | Label |
| 1 | University/school |
| 2 | For-profit |
| 3 | Government |
| 4 | Nonprofit |
| 5 | Something else (please specify): |
Q42_TEXT
| Format: | character |
| Label: | What type of organization are you in?-TEXT |
Q43
| Format: | labelled |
| Label: | Gender |
| Value | Label |
| 1 | Male |
| 2 | Female |
| 3 | Transgender |
| 4 | Do not identify as female, male, or transgender |
Q44_1
| Format: | numeric |
| Label: | Race/ethnicity-White |
Q44_2
| Format: | numeric |
| Label: | Race/ethnicity-Hispanic, Latino, or Spanish origin |
Q44_3
| Format: | numeric |
| Label: | Race/ethnicity-Black or African American |
Q44_4
| Format: | numeric |
| Label: | Race/ethnicity-Asian |
Q44_5
| Format: | numeric |
| Label: | Race/ethnicity-American Indian or Alaska Native |
Q44_6
| Format: | numeric |
| Label: | Race/ethnicity-Middle Eastern or North African |
Q44_7
| Format: | numeric |
| Label: | Race/ethnicity-Native Hawaiian or Other Pacific Islander |
Q44_8
| Format: | numeric |
| Label: | Race/ethnicity-Some other race, ethnicity, or origin |
Q45
| Format: | numeric |
| Label: | Age |
Q46
| Format: | character |
| Label: | Suggestions |
Q47
| Format: | numeric |
| Label: | Thanks message |