Module 4: Data documentation

Now that you have well-formatted, shareable code and data, you will need one or two additional documents to ensure that they can be understood and used efficiently:

A read-me file is typically focused on describing the files in a directory and how they might be opened and used. However, sometimes a read-me file also includes a project description and information on sampling, data collection, and how the variables were measured and coded.

Codebooks are typically focused on explaining how variables were measured and coded. Codebooks can also include more general information about the project, data collection, sampling, and other data-related details.

Depending on the guidance you follow, there can be a lot of overlap in the contents of read-me files and codebooks. If it suits your project, you might consider creating a single document that contains all the information recommended for both.

What to include in a read-me

Data and code shared publicly are open to people outside your research group. Others who find your data and code will likely be unfamiliar with important details like the project time frame and whom to contact with questions. A read-me file is a plain text file that contains enough information to allow someone to use your data and statistical code and to contact the project team with questions if needed.

This README.txt blog post has a nice summary of read-me contents consistent with most read-me guidance:

  • Project name
  • Date range
  • Project description
  • Funder
  • Contact information
  • File organization (files, folders, subfolders)
  • File naming
  • File storage location

For example, the read-me file for our data collection in this project would look like this:

Project name: coding2share 

Date range: 3/2017-9/2018

Project description: Surveyed public health practitioners on reproducible research practices 
Funder: Robert Wood Johnson Foundation

Contact information: Jenine Harris, harrisj@wustl.edu

File organization: The data, codebook, and statistical code are all together in the main directory

File naming: File names include date last saved, project name, and file type (022218_coding2share_data.csv)

File storage location: The files are all stored in the project repository at https://github.com/coding2share
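Because a read-me is a plain text file, it can also be written from R so that it is regenerated alongside the data. The following is a minimal sketch using base R only; the field values are copied from the example above, while the object names and the temporary output folder are assumptions made for illustration.

```r
# Build the example file name: date last saved (MMDDYY), project name, file type
date_saved <- as.Date("2018-02-22")
data_file <- paste0(format(date_saved, "%m%d%y"), "_coding2share_data.csv")

# Read-me fields from the example above, stored as a named character vector
readme_fields <- c(
  "Project name" = "coding2share",
  "Date range" = "3/2017-9/2018",
  "Project description" = "Surveyed public health practitioners on reproducible research practices",
  "Funder" = "Robert Wood Johnson Foundation",
  "Contact information" = "Jenine Harris, harrisj@wustl.edu",
  "File organization" = "The data, codebook, and statistical code are all together in the main directory",
  "File naming" = paste0("File names include date last saved, project name, and file type (",
                         data_file, ")"),
  "File storage location" = "The files are all stored in the project repository at https://github.com/coding2share"
)

# One "Field: value" line per entry
readme_lines <- paste0(names(readme_fields), ": ", readme_fields)

# Assumption: write to a temporary folder; point this at your project directory instead
readme_path <- file.path(tempdir(), "README.txt")
writeLines(readme_lines, readme_path)
```

Keeping the fields in a named vector makes it easy to update one entry (for example, the date range) and rewrite the file.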

What to include in a codebook

A codebook includes metadata for understanding and using a data set. Metadata is data about data, such as who collected the data, who funded data collection, how the sample was taken, and how each variable is measured. The Inter-university Consortium for Political and Social Research (ICPSR) summarized the metadata and variable information that should be included in a codebook.

We divided the metadata elements from the ICPSR into two groups: general information and variable information. More detail on each item can be found here.

General information to include in your codebook

There is a lot of detail that could be included as general project information in a codebook. Most resources agree that useful codebooks include a description of the study and information about the sampling, sample size, and timing of data collection. At a minimum:

  • Project contact person
  • Project description
  • Sampling and survey procedures
  • Sample size
  • Data collection time period

For example, the codebook for our data collection in this project would start with a section like this:

Project contact person: Jenine Harris, harrisj@wustl.edu

Project description: Surveyed public health practitioners on reproducible research practices

Sampling and survey procedures: Email invitation for a web-based survey to members of the American Public Health Association statistics section; web link posted on the project Twitter feed (coding2share), the RWJF Twitter feed, and the JPHMP Twitter feed

Sample size: 247 participants; 207 complete surveys

Data collection time period: Sept-Dec 2017 

Variable information to include in your codebook

There is also a lot of detail that could be provided about each variable in a codebook. Most resources agree that a useful codebook includes as much of the following as applies:

  • Variable name
  • Variable label
  • Question text
  • Category values
  • Category value labels
  • Missing data values and labels
  • Relevant skip patterns and summary statistics

For example, in our survey we asked participants whether they had a codebook for the data from a recent paper or report. The codebook entry for this item would include:

Variable name: Q14
Variable label: Variable dictionary

Question text: A variable dictionary, or codebook, lists the variables in a data set and how they were measured and stored. Is there a variable dictionary for the data used for your recent publication or report?

Category values: 1, 2
Category value labels: 1-Yes, 2-No
Missing data values and labels: Blank

Relevant skip patterns and summary statistics: Question skipped for participants with no statistical code (Q7). Of 218 responses, 164 (75.2%) were Yes and 54 (24.8%) were No. 
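The summary statistics in an entry like this can be computed rather than typed by hand. As a small sketch in base R, the counts below are the ones reported for Q14 above; the object names are hypothetical.

```r
# Counts reported in the Q14 codebook entry above
q14_counts <- c(Yes = 164, No = 54)

# Number of valid responses (the denominator for the percentages)
q14_n <- sum(q14_counts)

# Percent of valid responses, rounded to one decimal place
q14_pct <- round(100 * q14_counts / q14_n, 1)

# Collect the pieces into a small table for the codebook entry
q14_summary <- data.frame(
  label = names(q14_counts),
  n = as.integer(q14_counts),
  percent = as.numeric(q14_pct)
)
print(q14_summary)
```

This reproduces the 218 responses and the 75.2% and 24.8% figures in the entry.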

Code to automatically generate a codebook

Some statistical packages have functions that use the properties of the variables in the dataset to provide a codebook in a formatted output. Assuming you have followed the best practices outlined in Module 3 and labeled your variables and their values properly, this method can automate much of the codebook writing process.
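In R, for instance, labeling can be done in base R before any codebook package is involved. The sketch below uses made-up responses (not the survey data) and stores the variable label in a "label" attribute, which is the convention the haven package uses for imported variable labels. How fully a particular codebook package uses that attribute is an assumption you should check against its documentation.

```r
# Hypothetical responses for illustration only (1 = Yes, 2 = No, NA = blank)
q14_raw <- c(1, 2, 1, NA, 1)

# Value labels: turn the numeric codes into a labeled factor
survey <- data.frame(
  Q14 = factor(q14_raw, levels = c(1, 2), labels = c("Yes", "No"))
)

# Variable label: stored as a "label" attribute on the column
attr(survey$Q14, "label") <- "Variable dictionary"

# Inspect the result
str(survey)
table(survey$Q14, useNA = "ifany")
```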

R examples

dataMaid Package

In R, the dataMaid package can be used to automatically generate a codebook in PDF format. To try the package, use one of the data sets included with R, such as mtcars:

# dataMaid package for generating codebooks in R
# open package
library(dataMaid)

# load data
data(mtcars)

# make codebook
makeCodebook(mtcars)

The output from this shows a basic codebook, like this:

codebook Package

The codebook package in R can also automatically produce a codebook, but as an HTML page. Additionally, it makes excellent use of any labeling you have already added in R or imported from SPSS or SAS with the haven package. It also interacts nicely with any table of contents settings you may have set up in your Rmd file.

To get a Table of Contents that will automatically line up with the codebook, include the following in the YAML of the Rmd:

---
title: "Automatic Codebook Example"
output: 
  html_document:
    toc: true
    toc_float: 
      smooth_scroll: true
---

In a chunk in the main body of the Rmd:

# Setup: hide code and warnings in the knitted output.
# If this line sits in its own setup chunk, give that chunk the option
# include=FALSE so that it doesn't appear in the final output.
knitr::opts_chunk$set(echo=FALSE, warning=FALSE)

library(haven) # read/write SPSS files
library(codebook) # create codebook

# read the SPSS file (replace with the path to your own .sav file)
dat <- read_spss("C:\\Wherever\\YouKeepStuff\\SPSSfileOfYourChoice.sav")

# build the codebook and print it so it appears in the knitted HTML
cb <- codebook(dat, survey_repetition="single")
cb

You can also include any additional text before or after your codebook chunk for introductory/summary sections, and whatever headings you use there will be incorporated into the TOC.

As you can see in a section of the output below, the package creates tabbed sections for each variable, including a frequency distribution, summary statistics, and variable and value labels if applicable.

SAS example using R code

There are macros available for SAS to create a codebook; however, we found them to be difficult to use and incomplete. Instead, try creating a codebook from your SAS data file in R: bring in the data, then use the dataMaid or codebook code shown in the previous examples.

The R code to bring SAS data and formats into R uses the haven package. Open R and use the following to read your SAS data into R:

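# install (needed only once) and open haven package
install.packages("haven")
library(haven)

# bring in data; the second argument is the path to the SAS formats catalog file
sasData <- read_sas(data_file = "file path", catalog_file = "path to formats")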

Once the data file is in, use dataMaid to create a codebook:

# install and open dataMaid for codebook development in R
install.packages("dataMaid")
library(dataMaid)

# make codebook
makeCodebook(sasData)

SPSS example

In SPSS, running this example CODEBOOK command in the syntax editor produces a codebook for two variables (Source and Q14) in an .spv output file:

CODEBOOK Source Q14
  /VARINFO LABEL TYPE MEASURE VALUELABELS MISSING
  /STATISTICS COUNT PERCENT.

The output looks like this:

Note that SPSS calculates these percentages out of the entire sample, not just the valid responses, which accounts for the difference from the percentages reported in the Q14 codebook entry above.
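The effect of the two denominators can be illustrated in R, the language of the earlier examples. This is only a sketch with a small made-up vector, not the survey data: the first set of percentages uses valid responses only, and the second counts missing values in the denominator.

```r
# Hypothetical responses for illustration only: 3 Yes, 1 No, 2 missing
q14 <- factor(c("Yes", "Yes", "Yes", "No", NA, NA), levels = c("Yes", "No"))

# Percent of valid responses (missing values are excluded from the denominator)
pct_valid <- round(100 * prop.table(table(q14)), 1)

# Percent of the entire sample (missing values are counted in the denominator)
pct_total <- round(100 * prop.table(table(q14, useNA = "ifany")), 1)

print(pct_valid)
print(pct_total)
```

Whichever version you report, state the denominator in the codebook so readers can reconcile the numbers.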

An .spv file is readable only by those who have SPSS, and it’s missing things like question text and skip pattern information, so a little more work is required to get it in shape:

  1. Copy the tables to a spreadsheet editor of your choice and clean up the formatting.
  2. Copy to a text editor of your choice that can handle tables and insert additional text.
  3. Save out as a PDF to make it readable to everyone.

Code to generate a custom codebook and machine-readable XML files

While the options for automatically generating codebooks work well enough, sometimes they’re overkill or don’t quite look the way you want them to. But, being automatic, they don’t give you much in the way of options. If you really want a custom look and/or also want to produce a machine-readable XML codebook file to aid in sharing your data, we have produced a submodule that will walk you through the process.

Beware: this involves a willingness to roll up your sleeves and get up to your eyeballs in HTML and XML tagging. While not for the faint of heart, you can make some cool-looking documents and demonstrate some serious coding chops.

Still want to give it a try? Take a look at the XML Codebook Example.



