Introduction
This document builds:
- A machine-readable XML codebook
- A human-readable HTML codebook
Why would I do this to myself? XML looks a lot like HTML, except that you can put <otherStuff> inside of the tags. This makes the information nested inside of those tags selectable and readable by machines. The advantage is that if you’re sharing a large dataset in a non-proprietary format (such as CSV) that can’t carry labels the way SPSS and SAS files can, you can also share the XML codebook, which can be queried to pull the labels.
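To make "queried to pull the labels" concrete, here is a minimal sketch using the xml2 package and XPath. The tiny codebook string is a made-up stand-in that borrows the tag names used later in this document (var, name, varLabel, codes, pair, val, codeLabel); the object names cb_txt, cb_doc, q2_label, and src_label are only for illustration.

```r
library(xml2)

# Hypothetical two-variable codebook, written inline so the example is self-contained
cb_txt <- '<codebook>
  <var>
    <name>Source</name>
    <varLabel>Source</varLabel>
    <codes>
      <pair><val>1</val><codeLabel>Email</codeLabel></pair>
      <pair><val>2</val><codeLabel>JPHMP</codeLabel></pair>
    </codes>
  </var>
  <var>
    <name>Q2</name>
    <varLabel>Research Reproducibility in Public Health</varLabel>
  </var>
</codebook>'
cb_doc <- read_xml(cb_txt)

# Variable label for the variable named Q2
q2_label <- xml_text(xml_find_first(cb_doc, "//var[name='Q2']/varLabel"))

# Value label attached to code 2 of the variable named Source
src_label <- xml_text(
  xml_find_first(cb_doc, "//var[name='Source']/codes/pair[val='2']/codeLabel")
)
```

The same XPath expressions should work against the full codebook file once it has been read in with read_xml.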
But XML files are ugly! Yes. Yes they are. That’s why we’ll also walk through how to write an XSL file (which is really just another XML file) that will arrange things the way we want them. Pair it to the XML file, add a little CSS if you really wanna go crazy and spice things up, and they produce a beautiful HTML baby that is more pleasing to the human eye.
You’ll need a solid working knowledge of R and HTML and/or a willingness to get in over your head for a bit before you eventually figure it out.
For this exercise, we’ll walk through the SPSS dataset from the survey we conducted to guide us through module development, which can be pulled off of the coding2share GitHub page. Fire up a new Rmd file and dive in.
library(haven) # read/write SPSS
library(XML) # convert lists to xml
library(xslt) # compile xml and xsl to html
## Loading required package: xml2
library(htmltools) # embed HTML content
dat <- read_sav("https://github.com/coding2share/OpenSciSurveyPaper/blob/master/OpenScienceAim1.sav?raw=true") # Pull data off of GitHub
Pull lists from data
SPSS files, when set up properly, contain metadata that are useful for building codebooks. Using lapply with the functions that pull these characteristics (such as attr(dat$var, "label") from a particular variable) will iterate through the entire dataset, pull that characteristic for each variable, and stack the results into an orderly list.
Note that you’ll need to deal with any special characters that may be present in any of the text that gets pulled, as seen with the gsub function below. If you have a lot of variables, they may not be worth hunting for until you look at the finished XML file. If you get an “XML Parsing Error: not well-formed” message or something similar when trying to open the file, it’s probably a special character issue; it should tell you where the problem is and you can go back and replace the characters. This is the easiest place to do so.
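If you would rather hunt for the offenders up front, one option is to flag every label that contains a non-ASCII character. This is only a sketch on made-up labels; it assumes non-ASCII characters (such as curly apostrophes) are the culprits, and the names labs, has_special, and labs_clean are hypothetical.

```r
# Hypothetical labels; the second contains a curly apostrophe (U+2019),
# the third mimics a variable with no label at all
labs <- list(Q1 = "Plain label", Q2 = "Google\u2019s R Style Guide", Q3 = NULL)

# Flag labels containing any character outside the ASCII range
has_special <- vapply(labs, function(x) {
  length(x) > 0 && any(grepl("[^\\x01-\\x7F]", x, perl = TRUE))
}, logical(1))
names(labs)[has_special]

# Swap the curly apostrophe for a straight one
labs_clean <- lapply(labs, function(x) gsub("\u2019", "'", x, fixed = TRUE))
```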
# Variable names
varNames <- as.list(names(dat))
names(varNames) <- names(dat)
# Variable labels
varLabs <- lapply(dat, attr, "label")
varLabs <- lapply(varLabs, function(x) gsub("[\u2019]", "'", x)) # replace curly apostrophes with straight ones
# Formats
varForm <- lapply(dat, class)
# Value labels
valLabs <- lapply(dat, attr, "labels")
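One thing to watch for in the formats: class() can return more than one string for a single variable, which appears to be why date-time variables show up as "POSIXctPOSIXt" in the finished codebook. If that bothers you, a possible tweak (not what the code above does) is to collapse each class vector into one string. The toy data frame and the name toyForm below are made up for illustration.

```r
# Stand-in data; `when` plays the role of a date-time variable such as StartDate
toy <- data.frame(n = c(1, 2), s = c("a", "b"), stringsAsFactors = FALSE)
toy$when <- as.POSIXct(c("2017-09-01", "2017-12-01"), tz = "UTC")

# A date-time column carries two classes
lapply(toy, class)$when

# Collapse each class vector to a single, readable string
toyForm <- lapply(toy, function(x) paste(class(x), collapse = ", "))
toyForm$when
```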
Build the codebook XML tree
The file is easiest to build from the bottom up. The hierarchy is as follows:
- study title
- study summary
- variable 1
  - name
  - format
  - label
  - value codes (if applicable)
    - value-label pair 1
    - value-label pair 2 (etc)
- variable 2 (etc)
The following code will loop through all of the variables and value codes to pull everything that is needed.
First level
Build the value and value-label pairs:
# Code values
vval <- lapply(valLabs, function(x){
  lapply(x, function(x) newXMLNode(name="val",x))
} )
# Value labels
# Pull label strings
varValLabs <- lapply(valLabs, attr, "names")
vvlabs <- lapply(varValLabs, function(x){
  lapply(x, function(x) newXMLNode(name="codeLabel",x))
})
# Check Source example
vval[["Source"]]
## $Email
## <val>1</val>
##
## $JPHMP
## <val>2</val>
##
## $RWJF
## <val>3</val>
vvlabs[["Source"]]
## [[1]]
## <codeLabel>Email</codeLabel>
##
## [[2]]
## <codeLabel>JPHMP</codeLabel>
##
## [[3]]
## <codeLabel>RWJF</codeLabel>
Second level
Nest val and label inside <pair> tags:
# Add pair as a parent to the values
pairs <- lapply(vval, function(x){
  lapply(x, function(x) newXMLNode(name="pair", .children=list(x)))
})
# Add labels as a child to pairs
for(i in 1:length(pairs)){
  if (length(pairs[[i]]) > 0){ # otherwise hangs for vars with no value/label pairs
    for(j in 1:length(pairs[[i]])){
      addChildren(pairs[[i]][[j]], kids=list(vvlabs[[i]][[j]]))
    }
  }
}
# Check Source example
pairs[["Source"]]
## $Email
## <pair>
## <val>1</val>
## <codeLabel>Email</codeLabel>
## </pair>
##
## $JPHMP
## <pair>
## <val>2</val>
## <codeLabel>JPHMP</codeLabel>
## </pair>
##
## $RWJF
## <pair>
## <val>3</val>
## <codeLabel>RWJF</codeLabel>
## </pair>
The important part of the structure here is that each value and each label are appropriately nested in their own pairs. Note also that the code has done this for all variables that have labels for their values.
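For a single variable, the same pairing can also be sketched in one pass with Map, building each <pair> with both children at once instead of adding the label afterwards. This is an alternative illustration on made-up values, not the approach used above; vals and pairs_one are hypothetical names.

```r
library(XML)

# Hypothetical value codes for one variable; the names are the value labels
vals <- c(Email = 1, JPHMP = 2, RWJF = 3)

# Build one <pair> per code, with <val> and <codeLabel> as its children
pairs_one <- Map(function(v, l) {
  newXMLNode("pair",
             newXMLNode("val", as.character(v)),
             newXMLNode("codeLabel", l))
}, vals, names(vals))

# Check one pair: text content of each child, in order
unname(xmlSApply(pairs_one$JPHMP, xmlValue))
```

The nested lapply plus for-loop version above does the same job across every variable in the dataset, including those with no value labels.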
Third level
Wrap up value codes, build name, format, and labels:
# Add value label pairs to parents called "codes"
vcodes <- lapply(pairs, function(x)
  newXMLNode(name="codes", .children=list(x))
)
# Pull variable names, formats, and labels
vnames <- lapply(varNames, function(x) newXMLNode(name="name",x))
vform <- lapply(varForm, function(x) newXMLNode(name="format",x))
vlabs <- lapply(varLabs, function(x) newXMLNode(name="varLabel",x))
Fourth level
Wrap up the variables:
# Add vnames,vform, vlabs, and codes to parents called "var"
vars <- mapply(function(w,x,y,z)
  newXMLNode(name="var", .children=list(w,x,y,z)),
  vnames,vform,vlabs,vcodes)
# Check Source example
vars[["Source"]]
## <var>
## <name>Source</name>
## <format>labelled</format>
## <varLabel>Source</varLabel>
## <codes>
## <pair>
## <val>1</val>
## <codeLabel>Email</codeLabel>
## </pair>
## <pair>
## <val>2</val>
## <codeLabel>JPHMP</codeLabel>
## </pair>
## <pair>
## <val>3</val>
## <codeLabel>RWJF</codeLabel>
## </pair>
## </codes>
## </var>
Here you can see how all of the information for each variable is wrapped up in nested levels.
Add title and summary:
# Title
cTitle <- newXMLNode(name="studyTitle", "Open Science Aim 1 Codebook")
# Summary
cSum <- newXMLNode(name="summary",
  newXMLNode(name="lin", "Project contact person: Jenine Harris, harrisj@wustl.edu"),
  newXMLNode(name="lin", "Project description: Surveyed public health practitioners on reproducible research practices"),
  newXMLNode(name="lin", "Sample and survey procedures: Email invitation for a web-based survey to members of the American Public Health Association statistics section, posted web link on project Twitter feed coding2share, RWJF Twitter feed, and JPHMP Twitter feed"),
  newXMLNode(name="lin", "Sample size: 247 participants; 207 complete surveys"),
  newXMLNode(name="lin", "Data collection time period: Sept-Dec 2017")
)
cSum
## <summary>
## <lin>Project contact person: Jenine Harris, harrisj@wustl.edu</lin>
## <lin>Project description: Surveyed public health practitioners on reproducible research practices</lin>
## <lin>Sample and survey procedures: Email invitation for a web-based survey to members of the American Public Health Association statistics section, posted web link on project Twitter feed coding2share, RWJF Twitter feed, and JPHMP Twitter feed</lin>
## <lin>Sample size: 247 participants; 207 complete surveys</lin>
## <lin>Data collection time period: Sept-Dec 2017</lin>
## </summary>
Here you can see how each line of the summary is wrapped in a <lin> tag.
Top level
Wrap title, summary, and variables into an overall codebook node:
# Add vars to main codebook node
cb <- newXMLNode(name="codebook", .children=list(cTitle, cSum, vars))
Export the XML file:
saveXML(cb, file="OpenScienceAim1Codebook.xml", encoding="UTF-8")
At this point, it’s a good idea to open the XML file in the web browser of your choice and see if things generally look like they should.
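If you would rather not rely on the browser alone, a quick programmatic check is to try parsing the file and catch any error. The sketch below uses xml2 on two throwaway files; is_well_formed is a hypothetical helper name, not a function from any package.

```r
library(xml2)

# Returns TRUE if the file parses; otherwise reports the parser message and returns FALSE
is_well_formed <- function(path) {
  tryCatch({
    read_xml(path)
    TRUE
  }, error = function(e) {
    message("Parse failed: ", conditionMessage(e))
    FALSE
  })
}

# Two throwaway files: one fine, one with a mismatched tag
good <- tempfile(fileext = ".xml")
bad  <- tempfile(fileext = ".xml")
writeLines("<codebook><studyTitle>Demo</studyTitle></codebook>", good)
writeLines("<codebook><studyTitle>Demo</codebook>", bad)

is_well_formed(good)
is_well_formed(bad)
```

Pointing the helper at the exported codebook file should surface the same kind of message the browser would show.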
Write the XSL formatting file
The XML codebook file created above is great for machines to read, but less great for humans to read. To fix that, we’ll create an XSL file that will arrange the XML file in the layout of our choosing, giving us a much more appealing HTML codebook file.
Having an idea of what you want the layout for the codebook to be is enormously helpful, as there are a number of ways to crumble this cookie. Here’s roughly what I’m shooting for in this example:
Format: | Format of var1 |
Label: | Label of var1 |

1 | Label 1 |
2 | Label 2 |
3 | Label 3 |
We’ll build one table for each variable with format and label information, and an additional table for variables with value labels. On the downside, building tables in HTML is tricky. On the upside, we only have to define each kind of table once. The XSL will cycle through all of the variables in the XML file and things will show up if they’re present, and if not (as in the case of variables without value labels), there’s no squawking.
The following tags set up the structure for building tables in HTML:
- <tr> for Table Row, which wraps around:
  - <th> for Table Head (header cell)
  - <td> for Table Data (regular cell)
We’ll also include the usual <h1> and <p> tags where necessary for text outside of the tables.
Within those tags, we’ll take advantage of the machine-readability of the XML file to pull that information into the tables. Within each header or data cell, we’ll use the <xsl:value-of select="xyz"> tag, where “xyz” corresponds to the tags we set up in the XML file: <format>, <varLabel>, etc.
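To see xsl:value-of in action before building the full stylesheet, here is a stripped-down sketch: a one-variable XML string and a hand-written stylesheet string, merged with xml_xslt. Both strings and the object names (xml_txt, xsl_txt, out) are made up for illustration; only the tag names come from the codebook built above.

```r
library(xml2)
library(xslt)

# Hypothetical one-variable codebook
xml_txt <- '<codebook><var><name>Source</name><format>labelled</format></var></codebook>'

# Minimal stylesheet: one heading and one paragraph per variable
xsl_txt <- '<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/">
    <html><body>
      <xsl:for-each select="codebook/var">
        <h2><xsl:value-of select="name"/></h2>
        <p><xsl:value-of select="format"/></p>
      </xsl:for-each>
    </body></html>
  </xsl:template>
</xsl:stylesheet>'

# Apply the stylesheet to the XML
out <- xml_xslt(read_xml(xml_txt), read_xml(xsl_txt))

# The select attributes pulled the text out of the matching XML tags
xml_text(xml_find_first(out, "//h2"))
xml_text(xml_find_first(out, "//p"))
```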
Instead of building from the bottom up like we did for the XML file, this time we’ll build from the top down. The stylesheet needs a namespace specification, and it has to be defined on the top-level node so that it covers everything nested below. We’ll also need to use the xmlTree method instead of newXMLNode, as it allows us to nest XML tags within HTML-style tags more easily.
# Doc level
cbxsl <- xmlTree("xsl:stylesheet", namespaces=list(xsl="http://www.w3.org/1999/XSL/Transform"),
                 attrs=c(version="1.0"), doc=newXMLDoc())
cbxsl$addNode("xsl:template", attrs=c(match="/"), close=FALSE)
cbxsl$addTag("html", close=FALSE)
cbxsl$addTag("head",
  cbxsl$addTag("style", # CSS code below
"
h1 {
color: #0000b2;
}
h2 {
font-size: 16px;
color: #0000b2;
}
th, td {
text-align: left;
vertical-align: top;
padding-right: 10px;
}
"
  )
)
cbxsl$addTag("body", close=FALSE)
cbxsl$addNode("xsl:for-each", attrs=c(select="codebook"), close=FALSE)
cbxsl$addTag("h1",
  cbxsl$addNode("xsl:value-of", attrs=c(select="studyTitle"))
)
cbxsl$addNode("xsl:for-each", attrs=c(select="summary"), close=FALSE)
cbxsl$addNode("xsl:for-each",
  cbxsl$addTag("p", # Take all of the summary elements
    cbxsl$addNode("xsl:value-of", attrs=c(select="."))
  ),
  attrs=c(select="./*"))
cbxsl$closeNode() # close summary node
# Variable level
cbxsl$addNode("xsl:for-each", attrs=c(select="var"), close=FALSE)
cbxsl$addTag("h2",
  cbxsl$addNode("xsl:value-of", attrs=c(select="name"))
)
cbxsl$addTag("table", close=FALSE) # start Format/Label table
cbxsl$addTag("tr",
  cbxsl$addTag("td","Format:"),
  cbxsl$addTag("td",
    cbxsl$addNode("xsl:value-of", attrs=c(select="format"))
  )
)
cbxsl$addTag("tr",
  cbxsl$addTag("td", "Label:"),
  cbxsl$addTag("td",
    cbxsl$addNode("xsl:value-of", attrs=c(select="varLabel"))
  )
)
cbxsl$closeTag() # close Format/Label table
# Value/code level
# Only pulls for variables that have value/code pairs
cbxsl$addNode("xsl:for-each", attrs=c(select="codes[.!='']"), close=FALSE)
cbxsl$addTag("table", close=FALSE) # start Value/Label table
cbxsl$addTag("tr", # header row
  cbxsl$addTag("th","Value"),
  cbxsl$addTag("th","Label")
)
cbxsl$addNode("xsl:for-each", attrs=c(select="pair"), close=FALSE)
cbxsl$addTag("tr", # value/label rows
  cbxsl$addTag("td",
    cbxsl$addNode("xsl:value-of", attrs=c(select="val"))
  ),
  cbxsl$addTag("td",
    cbxsl$addNode("xsl:value-of", attrs=c(select="codeLabel"))
  )
)
cbxsl$closeNode() # close pair node
cbxsl$closeTag() # close Value/Label table
cbxsl$closeNode() # close value/code level
cbxsl$addTag("hr") # horizontal rule between variables
# Take a look
cbxsl$value()
## <?xml version="1.0"?>
## <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
## <xsl:template match="/">
## <html>
## <head>
## <style>
## h1 {
## color: #0000b2;
## }
##
## h2 {
## font-size: 16px;
## color: #0000b2;
## }
##
## th, td {
## text-align: left;
## vertical-align: top;
## padding-right: 10px;
## }
##
## </style>
## </head>
## <body>
## <xsl:for-each select="codebook">
## <h1>
## <xsl:value-of select="studyTitle"/>
## </h1>
## <xsl:for-each select="summary">
## <xsl:for-each select="./*">
## <p>
## <xsl:value-of select="."/>
## </p>
## </xsl:for-each>
## </xsl:for-each>
## <xsl:for-each select="var">
## <h2>
## <xsl:value-of select="name"/>
## </h2>
## <table>
## <tr>
## <td>Format:</td>
## <td>
## <xsl:value-of select="format"/>
## </td>
## </tr>
## <tr>
## <td>Label:</td>
## <td>
## <xsl:value-of select="varLabel"/>
## </td>
## </tr>
## </table>
## <xsl:for-each select="codes[.!='']">
## <table>
## <tr>
## <th>Value</th>
## <th>Label</th>
## </tr>
## <xsl:for-each select="pair">
## <tr>
## <td>
## <xsl:value-of select="val"/>
## </td>
## <td>
## <xsl:value-of select="codeLabel"/>
## </td>
## </tr>
## </xsl:for-each>
## </table>
## </xsl:for-each>
## <hr/>
## </xsl:for-each>
## </xsl:for-each>
## </body>
## </html>
## </xsl:template>
## </xsl:stylesheet>
##
The strategy here is to be mindful about when you want to nest inside of a node and when you just want to move on to the next node. The default behavior for addNode and addTag is to close the node and move on, so set close=FALSE if you want to nest children within it, then follow with closeNode() or closeTag() when you’re done with the children. Alternatively, you can add children by nesting functions. For example:
cbxsl$addTag("tr",
  cbxsl$addTag("td",
    cbxsl$addNode("xsl:value-of", attrs=c(select="val"))
  ),
  cbxsl$addTag("td",
    cbxsl$addNode("xsl:value-of", attrs=c(select="codeLabel"))
  )
)
produces the following:
<tr>
  <td>
    <xsl:value-of select="val"/>
  </td>
  <td>
    <xsl:value-of select="codeLabel"/>
  </td>
</tr>
Note that it’s probably easier to build this structure by hand in RStudio; it’s handy with closing the tags for you and keeping track of your indents. But if you want to keep everything in one Rmd document, xmlTree would be the way to go.
Compile
Merging the XML and XSL files into an HTML file is simple with the xml_xslt function:
# Bring in the external xml codebook file
doc <- read_xml("OpenScienceAim1Codebook.xml")
# Convert the xsl object to an xml_document object
style <- read_xml(saveXML(cbxsl$value()))
# Merge to an HTML document
htmldoc <- xml_xslt(doc,style)
At this point, we could export the htmldoc object as an HTML file and be done with it. However, if we’d like to take advantage of the formatting that knitr does so easily when making HTML documents of our code and comments (like the document you’re reading now), we can simply incorporate the object in the current document without having an extra HTML file sitting around.
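For the standalone route, a minimal sketch with xml2's write_html might look like the following. It assumes htmldoc is an xml_document, which is what xml_xslt normally returns for HTML output; a small stand-in document (demo_html) and a temporary path (out_file) are used here so the example runs on its own.

```r
library(xml2)

# Stand-in for the htmldoc object produced by xml_xslt
demo_html <- read_html("<html><body><h1>Open Science Aim 1 Codebook</h1></body></html>")

# Write to a temporary location; swap in a real path to keep the file
out_file <- file.path(tempdir(), "codebook_demo.html")
write_html(demo_html, out_file)

file.exists(out_file)
```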
Here’s the YAML to produce an HTML document with the nifty Table of Contents (TOC) you see on the left side:
---
title: "XML Codebook Example"
output:
html_document:
toc: true
toc_float:
smooth_scroll: true
---
You can also set your Rmd file to not display any of the code to produce the XML, XSL, and HTML files by putting the following code chunk at the top of the document:
# Be sure to set {r include=FALSE} so the first chunk does not get displayed
knitr::opts_chunk$set(echo = FALSE)
The code below will seamlessly integrate the htmldoc object, even including the variable names in the TOC. Remember how we coded those within <h2> tags? This is why!
# The next two lines allow the "includeHTML" function to use the
# htmldoc object, which is otherwise optimized to import external HTML files.
fp <- file()
cat(as.character(htmldoc), file=fp)
# Plug in the HTML codebook
includeHTML(fp)
Open Science Aim 1 Codebook
Project contact person: Jenine Harris, harrisj@wustl.edu
Project description: Surveyed public health practitioners on reproducible research practices
Sample and survey procedures: Email invitation for a web-based survey to members of the American Public Health Association statistics section, posted web link on project Twitter feed coding2share, RWJF Twitter feed, and JPHMP Twitter feed
Sample size: 247 participants; 207 complete surveys
Data collection time period: Sept-Dec 2017
ID
Source
Format: | labelled |
Label: | Source |
Value | Label |
1 | Email |
2 | JPHMP |
3 | RWJF |
StartDate
Format: | POSIXctPOSIXt |
Label: | StartDate |
EndDate
Format: | POSIXctPOSIXt |
Label: | EndDate |
Finished
Format: | numeric |
Label: | Finished |
Q2
Format: | numeric |
Label: | Research Reproducibility in Public Health |
Q3
Format: | numeric |
Label: | Introduction |
Q4_1
Format: | numeric |
Label: | Primary job responsibilities-Data collection |
Q4_2
Format: | numeric |
Label: | Primary job responsibilities-Data management |
Q4_3
Format: | numeric |
Label: | Primary job responsibilities-Descriptive analysis |
Q4_4
Format: | numeric |
Label: | Primary job responsibilities-Inferential analysis |
Q4_5
Format: | numeric |
Label: | Primary job responsibilities-Visualization/graph production |
Q4_6
Format: | numeric |
Label: | Primary job responsibilities-Contributing to publications or reports |
Q5
Format: | labelled |
Label: | Publication work |
Q7_1_1
Format: | numeric |
Label: | Data and code storage-Data-In the cloud (e.g., Dropbox, GitHub) |
Q7_1_2
Format: | numeric |
Label: | Data and code storage-Data-On a local server at your workplace |
Q7_1_3
Format: | numeric |
Label: | Data and code storage-Data-On a desktop computer |
Q7_1_4
Format: | numeric |
Label: | Data and code storage-Data-On a laptop computer |
Q7_1_5
Format: | numeric |
Label: | Data and code storage-Data-On a removable storage device (e.g., USB drive, CD) |
Q7_1_6
Format: | numeric |
Label: | Data and code storage-Data-There were no statistical code files (e.g., used point-and-click/menu driven approach in SPSS, Excel, or other software) |
Q7_2_1
Format: | numeric |
Label: | Data and code storage-Statistical code-In the cloud (e.g., Dropbox, GitHub) |
Q7_2_2
Format: | numeric |
Label: | Data and code storage-Statistical code-On a local server at your workplace |
Q7_2_3
Format: | numeric |
Label: | Data and code storage-Statistical code-On a desktop computer |
Q7_2_4
Format: | numeric |
Label: | Data and code storage-Statistical code-On a laptop computer |
Q7_2_5
Format: | numeric |
Label: | Data and code storage-Statistical code-On a removable storage device (e.g., USB drive, CD) |
Q7_2_6
Format: | numeric |
Label: | Data and code storage-Statistical code-There were no statistical code files (e.g., used point-and-click/menu driven approach in SPSS, Excel, or other software) |
Q8_1_1
Format: | numeric |
Label: | Files storage location-Data-In project-specific folders or directories |
Q8_1_2
Format: | numeric |
Label: | Files storage location-Data-In folders or directories not specific to the project |
Q8_2_1
Format: | numeric |
Label: | Files storage location-Statistical code-In project-specific folders or directories |
Q8_2_2
Format: | numeric |
Label: | Files storage location-Statistical code-In folders or directories not specific to the project |
Q9_1
Format: | labelled |
Label: | Files organization-Data |
Value | Label |
1 | In a single file |
2 | In multiple files |
Q9_2
Format: | labelled |
Label: | Files organization-Statistical code |
Value | Label |
1 | In a single file |
2 | In multiple files |
Q10_1_1
Format: | numeric |
Label: | Files accessibility-Data-Shareable with anyone (public) |
Q10_1_2
Format: | numeric |
Label: | Files accessibility-Data-Shareable but secure (i.e., authorized access only) |
Q10_1_3
Format: | numeric |
Label: | Files accessibility-Data-Private |
Q10_2_1
Format: | numeric |
Label: | Files accessibility-Statistical code-Shareable with anyone (public) |
Q10_2_2
Format: | numeric |
Label: | Files accessibility-Statistical code-Shareable but secure (i.e., authorized access only) |
Q10_2_3
Format: | numeric |
Label: | Files accessibility-Statistical code-Private |
Q11_1
Format: | labelled |
Label: | Location ease-Data |
Value | Label |
1 | Very easy |
2 | Somewhat easy |
3 | Somewhat difficult |
4 | Very difficult |
Q11_2
Format: | labelled |
Label: | Location ease-Statistical code |
Value | Label |
1 | Very easy |
2 | Somewhat easy |
3 | Somewhat difficult |
4 | Very difficult |
Q12_1
Format: | labelled |
Label: | Files storage process-Data |
Value | Label |
1 | No set process |
2 | A set process developed in-house |
3 | A set process developed externally |
Q12_2
Format: | labelled |
Label: | Files storage process-Statistical code |
Value | Label |
1 | No set process |
2 | A set process developed in-house |
3 | A set process developed externally |
Q13
Format: | numeric |
Label: | Code annotation and documentation |
Q14
Format: | labelled |
Label: | Variable dictionary |
Q15
Format: | labelled |
Label: | Prolog |
Q16_1
Format: | numeric |
Label: | Prolog contents-Project name |
Q16_2
Format: | numeric |
Label: | Prolog contents-Code developer name |
Q16_3
Format: | numeric |
Label: | Prolog contents-Contact information for the code developer or another author |
Q16_4
Format: | numeric |
Label: | Prolog contents-The purpose of the code |
Q16_5
Format: | numeric |
Label: | Prolog contents-Information about the language/program used to run the code (e.g., software package, version of language) |
Q16_6
Format: | numeric |
Label: | Prolog contents-File paths or other locating information for the data |
Q16_7
Format: | numeric |
Label: | Prolog contents-File paths or other locating information for the statistical code |
Q16_8
Format: | numeric |
Label: | Prolog contents-The last date the code was edited |
Q16_9
Format: | numeric |
Label: | Prolog contents-Something else (please specify): |
Q16_9_TEXT
Format: | character |
Label: | Prolog contents-Something else (please specify):-TEXT |
Q17
Format: | labelled |
Label: | Statistical code comments |
Q18
Format: | labelled |
Label: | Code comment comprehensiveness |
Value | Label |
1 | Comprehensive: most or all steps of analysis have comments |
2 | Moderate: comments used regularly throughout the code for new sections and clarification |
3 | Sparse: few or no comments used |
Q19
Format: | labelled |
Label: | Guidelines |
Value | Label |
1 | Yes, closely followed guidelines |
2 | Yes, partially followed guidelines |
3 | No |
Q20_1
Format: | numeric |
Label: | Specific guidelines-Google's R Style Guide |
Q20_2
Format: | numeric |
Label: | Specific guidelines-Bioconductor's Coding Standards (R) |
Q20_3
Format: | numeric |
Label: | Specific guidelines-Hadley Wickham's Style Guide (R) |
Q20_4
Format: | numeric |
Label: | Specific guidelines-Henrik Bengtsson's R Coding Conventions |
Q20_5
Format: | numeric |
Label: | Specific guidelines-Colin Gillespie's R Style Guide |
Q20_6
Format: | numeric |
Label: | Specific guidelines-SAS Style Guide |
Q20_7
Format: | numeric |
Label: | Specific guidelines-Guidelines for Coding of SAS Programs |
Q20_8
Format: | numeric |
Label: | Specific guidelines-Suggestions on Stata Programming Style |
Q20_9
Format: | numeric |
Label: | Specific guidelines-Google Python Style Guide |
Q20_10
Format: | numeric |
Label: | Specific guidelines-GNU Mailman Style Guide (Python) |
Q20_11
Format: | numeric |
Label: | Specific guidelines-Style Guide for Python Code: PEP 8 |
Q20_12
Format: | numeric |
Label: | Specific guidelines-Visual Basic Coding Conventions (Excel, Access) |
Q20_13
Format: | numeric |
Label: | Specific guidelines-MATLAB Style Guidelines |
Q20_14
Format: | numeric |
Label: | Specific guidelines-Something else (please specify): |
Q20_14_TEXT
Format: | character |
Label: | Specific guidelines-Something else (please specify):-TEXT |
Q21_1
Format: | numeric |
Label: | Reasons for following guidelines-Policy of my research team/lab |
Q21_2
Format: | numeric |
Label: | Reasons for following guidelines-Required for publication |
Q21_3
Format: | numeric |
Label: | Reasons for following guidelines-Makes my life easier |
Q21_4
Format: | numeric |
Label: | Reasons for following guidelines-It produces better code |
Q21_5
Format: | numeric |
Label: | Reasons for following guidelines-I was taught this way |
Q21_6
Format: | numeric |
Label: | Reasons for following guidelines-Improves collaboration |
Q21_7
Format: | numeric |
Label: | Reasons for following guidelines-Increases reproducibility |
Q21_8
Format: | numeric |
Label: | Reasons for following guidelines-Something else (please specify): |
Q21_8_TEXT
Format: | character |
Label: | Reasons for following guidelines-Something else (please specify):-TEXT |
Q22_1
Format: | numeric |
Label: | Code formatting-Used nouns for variables and/or verbs for functions |
Q22_2
Format: | numeric |
Label: | Code formatting-Limited lines of code to a certain length (e.g., 80 characters maximum) |
Q22_3
Format: | numeric |
Label: | Code formatting-Included metadata, such as the date or project name, in the file title |
Q22_4
Format: | numeric |
Label: | Code formatting-Included seed values for analyses that included randomness |
Q22_5
Format: | numeric |
Label: | Code formatting-Used a consistent way to name variables and functions (e.g., all lower case, camel case, meaningful words, etc) |
Q22_6
Format: | numeric |
Label: | Code formatting-Separated analysis steps with white space or blank lines |
Q22_7
Format: | numeric |
Label: | Code formatting-Used indentation to group lines of code within procedures/functions |
Q22_8
Format: | numeric |
Label: | Code formatting-Wrote functions for tasks repeated multiple times |
Q22_9
Format: | numeric |
Label: | Code formatting-Included some results within the annotation (e.g., mean age = 39, outcome did not meet normality assumption) |
Q22_10
Format: | numeric |
Label: | Code formatting-Integrated code with text and results into a single output, also known as literate programming (e.g., Sweave, Jupyter Notebook, R Markdown) |
Q22_11
Format: | numeric |
Label: | Code formatting-Something else (please specify): |
Q22_11_TEXT
Format: | character |
Label: | Code formatting-Something else (please specify):-TEXT |
Q22_12
Format: | numeric |
Label: | Code formatting-None of the above |
Q23
Format: | numeric |
Label: | Reproducibility of statistical analyses |
Q24_1
Format: | numeric |
Label: | Publication contents-Software used |
Q24_2
Format: | numeric |
Label: | Publication contents-Units of analysis (e.g., people, health departments) |
Q24_3
Format: | numeric |
Label: | Publication contents-Details on missing data handling |
Q24_4
Format: | numeric |
Label: | Publication contents-Variable recoding details |
Q24_5
Format: | numeric |
Label: | Publication contents-The name, units, and types of variable analyzed |
Q24_6
Format: | numeric |
Label: | Publication contents-The statistical approaches used (e.g., logistic regression, mixed-effects regression) |
Q24_7
Format: | numeric |
Label: | Publication contents-The type of test statistics computed (e.g., chi-squared, F) |
Q24_8
Format: | numeric |
Label: | Publication contents-The value of test statistics |
Q24_9
Format: | numeric |
Label: | Publication contents-The specific variables included in each statistical model |
Q24_10
Format: | numeric |
Label: | Publication contents-Sample sizes for each analysis |
Q24_11
Format: | numeric |
Label: | Publication contents-Precise p-values when possible (e.g., p=.02 rather than p |
Q24_12
Format: | numeric |
Label: | Publication contents-None of the above |
Q25_1_1
Format: | numeric |
Label: | Files created/made publicly available-Created-A clean version of data used for analyses |
Q25_1_2
Format: | numeric |
Label: | Files created/made publicly available-Created-Clean statistical code for results included in the publication |
Q25_1_3
Format: | numeric |
Label: | Files created/made publicly available-Created-A project directory with data and/or statistical code used for the publication |
Q25_1_4
Format: | numeric |
Label: | Files created/made publicly available-Created-A readme file explaining the data and/or statistical code |
Q25_1_5
Format: | numeric |
Label: | Files created/made publicly available-Created-None of the above |
Q25_2_1
Format: | numeric |
Label: | Files created/made publicly available-Made publicly available-A clean version of data used for analyses |
Q25_2_2
Format: | numeric |
Label: | Files created/made publicly available-Made publicly available-Clean statistical code for results included in the publication |
Q25_2_3
Format: | numeric |
Label: | Files created/made publicly available-Made publicly available-A project directory with data and/or statistical code used for the publication |
Q25_2_4
Format: | numeric |
Label: | Files created/made publicly available-Made publicly available-A readme file explaining the data and/or statistical code |
Q25_2_5
Format: | numeric |
Label: | Files created/made publicly available-Made publicly available-None of the above |
Q26
Format: | labelled |
Label: | Have you ever made your statistical code publicly available? |
Q27
Format: | labelled |
Label: | Have you ever published with public data? |
Q28_1_1
Format: | numeric |
Label: | Required to make publicly available?-Data-Yes, by a funder |
Q28_1_2
Format: | numeric |
Label: | Required to make publicly available?-Data-Yes, by the journal |
Q28_1_3
Format: | numeric |
Label: | Required to make publicly available?-Data-Yes, by my employer |
Q28_1_4
Format: | numeric |
Label: | Required to make publicly available?-Data-Yes, by my research team |
Q28_1_5
Format: | numeric |
Label: | Required to make publicly available?-Data-No, it was not required |
Q28_2_1
Format: | numeric |
Label: | Required to make publicly available?-Statistical code-Yes, by a funder |
Q28_2_2
Format: | numeric |
Label: | Required to make publicly available?-Statistical code-Yes, by the journal |
Q28_2_3
Format: | numeric |
Label: | Required to make publicly available?-Statistical code-Yes, by my employer |
Q28_2_4
Format: | numeric |
Label: | Required to make publicly available?-Statistical code-Yes, by my research team |
Q28_2_5
Format: | numeric |
Label: | Required to make publicly available?-Statistical code-No, it was not required |
Q29
Format: | labelled |
Label: | Code development |
Value | Label |
1 | The code was developed by one person working alone |
2 | The code was developed by one person and other people checked the code (co-pilot strategy) |
3 | Code was developed in separate files by two or more people independently and then compared before choosing one or a combination of the two (i.e., parallel code development) |
4 | Code was developed by two or more people working together on the same coding file |
Q30
Format: | numeric |
Label: | Research reproducibility facilitators and barriers |
Q31_1
Format: |
numeric |
Label: |
Reproducible research facilitators-Additional human and financial resources |
Q31_2
Format: |
numeric |
Label: |
Reproducible research facilitators-Training on reproducible research practices |
Q31_3
Format: |
numeric |
Label: |
Reproducible research facilitators-Requirements by funders to disseminate data and statistical code |
Q31_4
Format: |
numeric |
Label: |
Reproducible research facilitators-Requirements by journals to include access to data and statistical code |
Q31_5
Format: |
numeric |
Label: |
Reproducible research facilitators-Workplace incentives (e.g., pay increases or more credit toward tenure for reproducible/reproduced work) |
Q31_6
Format: |
numeric |
Label: |
Reproducible research facilitators-Something else (please specify): |
Q31_6_TEXT
Format: |
character |
Label: |
Reproducible research facilitators-Something else (please specify):-TEXT |
Q32_1
Format: | numeric |
Label: | Reproducible research barriers-Lack of time |
Q32_2
Format: | numeric |
Label: | Reproducible research barriers-Lack of knowledge/training on reproducible research practices |
Q32_3
Format: | numeric |
Label: | Reproducible research barriers-Data privacy |
Q32_4
Format: | numeric |
Label: | Reproducible research barriers-Intellectual property concerns |
Q32_5
Format: | numeric |
Label: | Reproducible research barriers-Professional competition |
Q32_6
Format: | numeric |
Label: | Reproducible research barriers-Concerns of errors being discovered |
Q32_7
Format: | numeric |
Label: | Reproducible research barriers-Lack of incentive |
Q32_8
Format: | numeric |
Label: | Reproducible research barriers-Something else (please specify): |
Q32_8_TEXT
Format: | character |
Label: | Reproducible research barriers-Something else (please specify):-TEXT |
Q32_9
Format: | numeric |
Label: | Reproducible research barriers-No barriers experienced |
Q32_10
Format: | numeric |
Label: | Reproducible research barriers-I have not tried to make data or statistical code available |
Q33
Format: | labelled |
Label: | If training were available on reproducible research practices, how likely are you to participate? |
Value | Label |
1 | Very likely |
2 | Somewhat likely |
3 | Somewhat unlikely |
4 | Very unlikely |
Q34_1
Format: | numeric |
Label: | Training format preference-In-person workshop or short course |
Q34_2
Format: | numeric |
Label: | Training format preference-In-person one-on-one assistance |
Q34_3
Format: | numeric |
Label: | Training format preference-Online asynchronous course |
Q34_4
Format: | numeric |
Label: | Training format preference-Online live webinar |
Q34_5
Format: | numeric |
Label: | Training format preference-Documents (e.g., manual, collection of resources, checklists, etc) |
Q34_6
Format: | numeric |
Label: | Training format preference-Blog or other living web presence |
Q34_7
Format: | numeric |
Label: | Training format preference-Social media site |
Q34_8
Format: | numeric |
Label: | Training format preference-Something else (please specify): |
Q34_8_TEXT
Format: | character |
Label: | Training format preference-Something else (please specify):-TEXT |
Q35
Format: | numeric |
Label: | Participant characteristics |
Q36
Format: | labelled |
Label: | Identify the statistical software you use the most often: |
Value | Label |
1 | Excel |
2 | M-Plus |
3 | MatLab |
4 | Python |
5 | R |
6 | SAS |
7 | SPSS |
8 | Stata |
9 | Other (please specify) |
Q36_TEXT
Format: | character |
Label: | Identify the statistical software you use the most often:-TEXT |
Q37_1
Format: | numeric |
Label: | Top 2 resources-Software user manual(s) |
Q37_2
Format: | numeric |
Label: | Top 2 resources-Textbook or reference book |
Q37_3
Format: | numeric |
Label: | Top 2 resources-Internet search (i.e. Google) |
Q37_4
Format: | numeric |
Label: | Top 2 resources-YouTube video demonstration or something similar |
Q37_5
Format: | numeric |
Label: | Top 2 resources-In-person one-on-one help |
Q37_6
Format: | numeric |
Label: | Top 2 resources-Email help |
Q37_7
Format: | numeric |
Label: | Top 2 resources-Help via a chat feature (gchat or similar) |
Q37_8
Format: | numeric |
Label: | Top 2 resources-Online training or course |
Q37_9
Format: | numeric |
Label: | Top 2 resources-In-person training or course |
Q37_10
Format: | numeric |
Label: | Top 2 resources-Websites, blogs (i.e. Stack Exchange) |
Q37_11
Format: | numeric |
Label: | Top 2 resources-Something else (specify): |
Q37_11_TEXT
Format: | character |
Label: | Top 2 resources-Something else (specify):-TEXT |
Q38
Format: | labelled |
Label: | Degree |
Value | Label |
1 | PhD |
2 | MD |
3 | MD/PhD |
4 | DrPH |
5 | DSc |
6 | EdD |
7 | MPH |
8 | MSW |
9 | MS or MSc |
10 | MRes or MPhil |
11 | MA |
12 | MBA |
13 | MAT |
14 | BS |
15 | BA |
16 | Something else (please specify): |
Q38_TEXT
Format: | character |
Label: | Degree-TEXT |
Q39
Format: | character |
Label: | Degree field |
Q40
Format: | character |
Label: | Job title |
Q41
Format: | labelled |
Label: | How long have you been in your current position? |
Value | Label |
1 | Less than 1 year |
2 | 1-3 years |
3 | 4-10 years |
4 | More than 10 years |
Q42
Format: | labelled |
Label: | What type of organization are you in? |
Value | Label |
1 | University/school |
2 | For-profit |
3 | Government |
4 | Nonprofit |
5 | Something else (please specify): |
Q42_TEXT
Format: | character |
Label: | What type of organization are you in?-TEXT |
Q43
Format: | labelled |
Label: | Gender |
Value | Label |
1 | Male |
2 | Female |
3 | Transgender |
4 | Do not identify as female, male, or transgender |
Q44_1
Format: | numeric |
Label: | Race/ethnicity-White |
Q44_2
Format: | numeric |
Label: | Race/ethnicity-Hispanic, Latino, or Spanish origin |
Q44_3
Format: | numeric |
Label: | Race/ethnicity-Black or African American |
Q44_4
Format: | numeric |
Label: | Race/ethnicity-Asian |
Q44_5
Format: | numeric |
Label: | Race/ethnicity-American Indian or Alaska Native |
Q44_6
Format: | numeric |
Label: | Race/ethnicity-Middle Eastern or North African |
Q44_7
Format: | numeric |
Label: | Race/ethnicity-Native Hawaiian or Other Pacific Islander |
Q44_8
Format: | numeric |
Label: | Race/ethnicity-Some other race, ethnicity, or origin |
Q45
Format: | numeric |
Label: | Age |
Q46
Format: | character |
Label: | Suggestions |
Q47
Format: | numeric |
Label: | Thanks message |
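Once you have value/label pairs like the ones listed for the labelled variables, you can stick them back onto plain numeric codes (say, from a CSV export). Here is a minimal base R sketch of that idea using the Q33 pairs from the listing. The `q33_raw` responses are made up purely for illustration and are not taken from the survey data.

```r
# Value/label pairs copied from the Q33 entry in the codebook
q33_labels <- c("Very likely" = 1,
                "Somewhat likely" = 2,
                "Somewhat unlikely" = 3,
                "Very unlikely" = 4)

# Hypothetical raw responses (illustration only, not real survey data)
q33_raw <- c(1, 4, 2, NA, 3)

# Swap the numeric codes for their text labels
q33_factor <- factor(q33_raw,
                     levels = unname(q33_labels),
                     labels = names(q33_labels))

print(q33_factor)
```

Missing responses stay `NA`, and the factor levels keep the same order as the codes in the codebook.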