Module 2: Strategies to make your private data shareable
Research papers with shared data have fewer errors, those with shared data or shared code are cited more, and shared data are used in more publications.1,2 Research involving human subjects often has restrictions on data sharing due to the nature of the information being collected. Often research like this will only be approved if data files are password protected, encrypted, or otherwise stored securely and not available outside the project team.
If this is the case for your research, two strategies could allow you to share data publicly without releasing private information: de-identifying the data or creating synthetic data.
Follow the instructions and examples below to de-identify or create synthetic data. To be safe, check with your human subjects review board first to make sure the strategy you choose meets the data sharing limitations for your project. Once you have de-identified or created synthetic data, share it publicly through GitHub by following the steps in our GitHub module.
How to de-identify data
De-identified data is data with personally identifiable information (PII) removed so that there is not enough information to identify any individual person. Professionals in public health and related fields are concerned with one type of PII, protected health information (PHI).
The U.S. Department of Health and Human Services (HHS) provides guidance on de-identifying data files. According to the HHS, there are two ways to de-identify data so that it does not include PHI: safe harbor and expert determination.
The safe harbor method requires removal from the data source of the following for the individual and their relatives, employers, or household members:
Names
Geographic regions smaller than a state
Dates other than year
Telephone numbers
Fax numbers
Email addresses
Social security numbers
Medical record numbers
Health plan beneficiary numbers
Account numbers
Certificate/license numbers
Vehicle identifiers and serial numbers, including license plate numbers
Device identifiers and serial numbers
Web universal resource locators (URLs)
Internet protocol (IP) addresses
Biometric identifiers, including finger and voice prints
Full-face photographs and any comparable images
Any other unique identifying number, characteristic, or code
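As a rough illustration of the safe harbor idea, the base R sketch below drops identifier columns and reduces a date to its year. The data frame, its column names, and its values are all invented for this example, and the sketch covers only a few of the items listed above; it is not a complete safe harbor procedure.

```r
# Hypothetical example data; every column name and value here is invented
patients <- data.frame(
  name        = c("Ann Lee", "Bo Cruz", "Cy Dunn"),
  email       = c("ann@example.org", "bo@example.org", "cy@example.org"),
  zip_code    = c("63110", "63130", "62025"),
  state       = c("MO", "MO", "IL"),
  visit_date  = as.Date(c("2016-03-14", "2016-11-02", "2017-06-30")),
  systolic_bp = c(118, 135, 127),
  stringsAsFactors = FALSE
)

# Columns treated here as direct identifiers or as geography smaller than a state
identifier_cols <- c("name", "email", "zip_code")

# Keep only the year of each date
patients$visit_year <- as.integer(format(patients$visit_date, "%Y"))

# Drop the identifier columns and the full date
drop_cols <- c(identifier_cols, "visit_date")
deidentified <- patients[, !(names(patients) %in% drop_cols), drop = FALSE]

print(deidentified)
```

Only state, systolic_bp, and visit_year remain. In a real project, each remaining variable would still need to be checked against the full list above.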
The expert determination method requires that information in the data cannot be used to identify individuals by someone with skills and knowledge of common analytic methods and access to other reasonably available data (e.g., voter registration records). A more detailed description of this method is available on the HHS website.
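One simple screening check that can help when thinking about re-identification risk is to count how many records share each combination of indirect identifiers. The base R sketch below is only an illustration with invented data and invented variable names; it is not the HHS expert determination method and does not replace it.

```r
# Hypothetical de-identified records; all values are invented
records <- data.frame(
  state      = c("MO", "MO", "MO", "IL", "IL", "IL"),
  birth_year = c(1980, 1980, 1980, 1975, 1975, 1942),
  sex        = c("F", "F", "F", "M", "M", "F"),
  stringsAsFactors = FALSE
)

# Variables that could be matched against outside data such as voter records
quasi_identifiers <- c("state", "birth_year", "sex")

# Count how many records share each combination of these variables
combo_counts <- aggregate(
  data.frame(n = rep(1L, nrow(records))),
  by = records[quasi_identifiers],
  FUN = sum
)

# Combinations held by a single record are the easiest to link to a person
unique_combos <- combo_counts[combo_counts$n == 1, ]
print(unique_combos)
```

Here one record (IL, 1942, F) is unique on these three variables, which would be a reason to coarsen or remove a variable before sharing.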
We strongly recommend visiting the HHS page before sharing your de-identified data to ensure you meet either the safe harbor or the expert determination requirements for de-identification. Several websites also offer ideas and training that might be useful for this work.
We also encourage you to check with your Institutional Review Board before making de-identified data public to ensure you are meeting any requirements to protect human subjects.
How to create synthetic data
When de-identification is difficult, impossible, or not ideal, creating synthetic data that keeps important variables and relationships from the original data may be an option. Similar in concept to imputation, creating synthetic data entails replacing some or all observations using appropriate probability distributions to ensure important properties and relationships are maintained.3
Please note that the examples shown below are fairly simple and do not include maintaining complex relationships among variables. When creating synthetic data for distribution, review the resources listed below and thoroughly test the synthetic data to make sure it maintains logical and relevant properties including complex relationships important to the research.
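To make the idea concrete before turning to packages, here is a minimal base R sketch that replaces every observation with a draw from a distribution estimated from the original data. The data are invented, and the choice of a normal distribution for age is an assumption made only for illustration. Each variable is drawn independently, so relationships between variables are not preserved.

```r
set.seed(123456)  # fixed seed so the draws can be reproduced

# Hypothetical "original" data; values are invented for illustration
original <- data.frame(
  sex = factor(c("Male", "Female", "Female", "Male", "Female",
                 "Female", "Male", "Female", "Male", "Female")),
  age = c(34, 51, 47, 29, 62, 38, 55, 44, 31, 59)
)
n <- nrow(original)

# Categorical variable: draw from the observed category proportions
sex_props <- prop.table(table(original$sex))
synth_sex <- factor(
  sample(names(sex_props), size = n, replace = TRUE,
         prob = as.numeric(sex_props)),
  levels = levels(original$sex)
)

# Numeric variable: draw from a normal distribution fitted to the original
synth_age <- round(rnorm(n, mean = mean(original$age), sd = sd(original$age)))

synthetic <- data.frame(sex = synth_sex, age = synth_age)
print(synthetic)
```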
R
As one example, the synthpop package in R can be used to create a synthetic data set. Using the publicly available National Health and Nutrition Examination Survey (NHANES), here is an example of creating a synthetic data set in R. Note that you will choose a seed value to enter into the code; the syn procedure will use this seed as a starting point for the simulations that generate the data. Using a set seed rather than permitting the algorithm to choose a random number ensures that the synthetic data source can be reproduced by others.
# bring in the original data and name it nhanes
# this is the 2011-2012 NHANES data available from the
# Centers for Disease Control and Prevention
library(RNHANES)
nhanes <- nhanes_load_data("AUQ_G", "2011-2012", demographics = TRUE)
# make a smaller version of nhanes with the variables of interest
vars <- c("RIAGENDR", "RIDAGEYR", "RIDRETH1")
nhanes.small <- nhanes[, vars]
# open the synthpop library (install first)
library(synthpop)
# create nhanes.synth with default method
# use seed value for reproducibility
nhanes.synth.list <- syn(nhanes.small, seed=123456)
# console output from syn():
# Synthesis
# -----------
#  RIAGENDR RIDAGEYR RIDRETH1
nhanes.synth <- nhanes.synth.list$syn
# example of comparing the original and synthetic data sets
# add labels to gender and ethnicity variables
# see codebook if needed
nhanes.small$RIAGENDR <- factor(nhanes.small$RIAGENDR,
levels = c(1, 2),
labels = c("Male", "Female"))
nhanes.small$RIDRETH1 <- factor(nhanes.small$RIDRETH1,
levels = c(1, 2, 3, 4, 5),
labels = c("Mexican American", "Other Hispanic",
"Non-Hispanic White", "Non-Hispanic Black",
"Other Race - Including Multi-Racial"))
nhanes.synth$RIAGENDR <- factor(nhanes.synth$RIAGENDR,
levels = c(1, 2),
labels = c("Male", "Female"))
nhanes.synth$RIDRETH1 <- factor(nhanes.synth$RIDRETH1,
levels = c(1, 2, 3, 4, 5),
labels = c("Mexican American", "Other Hispanic",
"Non-Hispanic White", "Non-Hispanic Black",
"Other Race - Including Multi-Racial"))
# original
table(nhanes.small$RIDRETH1, nhanes.small$RIAGENDR)
#                                       Male Female
#   Mexican American                     645    619
#   Other Hispanic                       496    536
#   Non-Hispanic White                  1449   1411
#   Non-Hispanic Black                  1273   1315
#   Other Race - Including Multi-Racial  800    820
# synthetic
table(nhanes.synth$RIDRETH1, nhanes.synth$RIAGENDR)
#                                       Male Female
#   Mexican American                     663    606
#   Other Hispanic                       498    550
#   Non-Hispanic White                  1468   1403
#   Non-Hispanic Black                  1229   1326
#   Other Race - Including Multi-Racial  793    828
The top table shows the original data frequencies for each category of race/ethnicity across the male and female sex categories. The bottom shows the same variables in the synthetic data set. While the two data sets are not identical, they are similar.
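One way to put a number on "similar" is to compare cell proportions rather than raw counts. The sketch below re-enters the counts from the two tables above by hand and reports the difference in percentage points for each cell. The object names are chosen for this example only.

```r
race_levels <- c("Mexican American", "Other Hispanic", "Non-Hispanic White",
                 "Non-Hispanic Black", "Other Race - Including Multi-Racial")

# Counts copied from the two tables above (rows: race/ethnicity; columns: sex)
orig_counts <- matrix(
  c(645, 619, 496, 536, 1449, 1411, 1273, 1315, 800, 820),
  ncol = 2, byrow = TRUE,
  dimnames = list(race_levels, c("Male", "Female"))
)
synth_counts <- matrix(
  c(663, 606, 498, 550, 1468, 1403, 1229, 1326, 793, 828),
  ncol = 2, byrow = TRUE,
  dimnames = list(race_levels, c("Male", "Female"))
)

# Convert counts to proportions of the whole table
orig_props  <- prop.table(orig_counts)
synth_props <- prop.table(synth_counts)

# Difference in percentage points for each cell, and the largest gap
diff_points <- round(100 * (synth_props - orig_props), 2)
print(diff_points)
max_gap <- max(abs(diff_points))
print(max_gap)
```

Both tables contain 9,364 observations, and the largest gap between any pair of cells is under half a percentage point (the Non-Hispanic Black male cell).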
SAS (using R to create)
There are macros for creating synthetic data from SAS data sources; however, they are difficult to use. Likewise, the synth package for Stata (resource below) will create synthetic treatment observations, but the process is not straightforward for synthesizing datasets. The R synthpop package is one alternative since R can easily open, manipulate, and export SAS or Stata files. Here is the workflow from the R example above, this time with a data source in SAS transport format (the 2016 BRFSS) as input and SAS and Stata export commands added:
# bring in the original data and name it brfss
# this is the 2016 BRFSS data available from the
# Centers for Disease Control and Prevention
# in SAS transport format
# install haven if it is not installed
# open the haven library for importing xpt files
library(haven)
# make a temporary object that holds a zip file
brfss_url <- tempfile(fileext = ".zip")
# download the zip file
# put the zip file in the temporary file
download.file("https://www.cdc.gov/brfss/annual_data/2016/files/LLCP2016XPT.zip", brfss_url)
# unzip the file and read it in using read_xpt
brfss <- read_xpt(brfss_url)
# make a smaller version of brfss with the variables of interest
vars <- c("SEX", "MARITAL", "EDUCA")
brfss.small <- brfss[, vars]
# open the synthpop library (install first)
library(synthpop)
# create brfss.synth with default method
# use seed value for reproducibility
brfss.synth.list <- syn(brfss.small, seed=123456)
# console output from syn():
# Synthesis
# -----------
#  SEX MARITAL EDUCA
brfss.synth <- brfss.synth.list$syn
# example of comparing the original and synthetic data sets
# add labels to sex and marital status variables
# see codebook if needed
brfss.small$SEX <- factor(brfss.small$SEX,
levels = c(1, 2, 9),
labels = c("Male", "Female", "Refused"))
brfss.small$MARITAL <- factor(brfss.small$MARITAL,
levels = c(1, 2, 3, 4, 5, 6, 9),
labels = c("Married", "Divorced",
"Widowed", "Separated",
"Never married", "A member of an unmarried couple",
"Refused"))
brfss.synth$SEX <- factor(brfss.synth$SEX,
levels = c(1, 2, 9),
labels = c("Male", "Female", "Refused"))
brfss.synth$MARITAL <- factor(brfss.synth$MARITAL,
levels = c(1, 2, 3, 4, 5, 6, 9),
labels = c("Married", "Divorced",
"Widowed", "Separated",
"Never married", "A member of an unmarried couple",
"Refused"))
# original
table(brfss.small$SEX, brfss.small$MARITAL)
#          Married Divorced Widowed Separated Never married
# Male      119350    26170   12884      4051         39602
# Female    134177    40213   48729      6103         36719
# Refused       25        4       4         0            14
#          A member of an unmarried couple Refused
# Male                                7139    1409
# Female                              7795    1890
# Refused                                3      11
# synthetic
table(brfss.synth$SEX, brfss.synth$MARITAL)
#          Married Divorced Widowed Separated Never married
# Male      118986    26409   12963      3981         39522
# Female    134286    40382   48755      6122         36620
# Refused       26        6       6         0            13
#          A member of an unmarried couple Refused
# Male                                7232    1425
# Female                              7720    1827
# Refused                                3       4
The tables show that the synthetic data are not exactly the same as the original data, but they are close. Once the synthetic data are ready, use the haven package to write them to a SAS or Stata file:
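# export to SAS file
# uses haven package opened earlier
# add path to write the data to inside the quote marks
# note: recent haven releases deprecate write_sas(); if it is unavailable,
# write_xpt() writes a SAS transport (*.xpt) file instead
write_sas(brfss.synth, path = "")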
# export to Stata (*.dta) file
# uses haven package opened earlier
# add path to name and specify where to write the dataset to inside the quote marks
write_dta(brfss.synth, path = "")
SPSS
In SPSS, the SIMPLAN and SIMRUN commands can create synthetic data based on a real dataset. This example uses the small NHANES data created in the R example above. First, save the data from R in SPSS format:
# Load haven library, which allows saving files to SPSS format
library(haven)
write_sav(nhanes.small, "C:/Your/Directory/NHANESorig.sav")
Run the following in an SPSS syntax editor:
* Open NHANES data saved out from R.
GET FILE = 'C:\Your\Directory\NHANESorig.sav'.
DATASET NAME orig.
DATASET ACTIVATE orig.
* Open simulated file and take a look.
GET FILE = 'C:\Your\Directory\NHANESsynth.sav'.
DATASET NAME synth.
DATASET ACTIVATE synth.
Default procedures for the SIMPLAN command will automatically determine the best-fitting distributions for each variable included in the SIMINPUT subcommands, retain correlations between scale variables, and stop after 100,000 cases have been generated. A few modifications have been made for this data set:
Simulated values for age were beyond the human lifespan with default settings, so minimum and maximum values were set based on the age range in the real data. SPSS might print a warning, but it does produce the desired result.
SPSS does not automatically retain relationships between categorical variables. Specifying the CONTINGENCY subcommand will make that happen.
Given the large influence that sample size has on significance testing, manually limiting the maximum number of cases to the size of the real data is recommended using the STOPCRITERIA subcommand.
The SIMRUN command uses the plan created with SIMPLAN, applies it to the original data, and creates the simulated data set. To ensure the same result each time the plan is applied, specify the CRITERIA subcommand as shown with the same seed.
Go to Help --> Topics and search for SIMPLAN and SIMRUN for more specification details and options.
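For readers who want to see the same three modifications outside SPSS, the base R sketch below mirrors them on invented data: it samples two categorical variables jointly so their relationship is retained, keeps simulated ages inside the observed range, and limits the number of cases to the original sample size. The data, the normal distribution for age, and the clamping step are assumptions made for illustration; this is not how SIMPLAN or SIMRUN work internally.

```r
set.seed(123456)  # same seed gives the same simulated rows on each run

# Hypothetical original data; values are invented for illustration
original <- data.frame(
  sex  = c("Male", "Male", "Female", "Female",
           "Female", "Male", "Female", "Male"),
  race = c("A", "B", "A", "A", "B", "A", "B", "B"),
  age  = c(23, 80, 45, 12, 67, 34, 5, 58),
  stringsAsFactors = FALSE
)
n <- nrow(original)  # cap simulated cases at the original sample size

# Joint cell proportions of the two categorical variables
joint <- as.data.frame(
  prop.table(table(sex = original$sex, race = original$race)),
  stringsAsFactors = FALSE
)

# Draw whole cells (sex and race together) so their relationship is retained
cells <- sample(nrow(joint), size = n, replace = TRUE, prob = joint$Freq)
synthetic <- joint[cells, c("sex", "race")]
rownames(synthetic) <- NULL

# Simulate age, then clamp it to the observed minimum and maximum
age_draws <- rnorm(n, mean = mean(original$age), sd = sd(original$age))
synthetic$age <- round(pmin(pmax(age_draws, min(original$age)),
                            max(original$age)))

print(synthetic)
```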
* Compare original and synthetic data on ethnicity/gender.
DATASET ACTIVATE orig.
CROSSTABS /TABLES=RIDRETH1 BY RIAGENDR /CELLS=COUNT.
DATASET ACTIVATE synth.
CROSSTABS /TABLES=RIDRETH1 BY RIAGENDR /CELLS=COUNT.
[Figures: CROSSTABS output for the original data and for the simulated data]
While not exact, the simulated data show patterns similar to the original data.
Resources to use for creation of synthetic data in R, SPSS, Stata, and SAS are included in the Resources section below.
Resources
Data de-identification and synthetic data generation resources
1. McKiernan EC, Bourne PE, Brown CT, et al. How open science helps researchers succeed. eLife. 2016;5:e16800.
2. Piwowar HA, Day RS, Fridsma DB. Sharing detailed research data is associated with increased citation rate. PLoS One. 2007;2(3):e308.
3. Shepherd BE, Peratikos MB, Rebeiro PF, Duda SN, McGowan CC. A pragmatic approach for reproducible research with sensitive data. Am J Epidemiol. 2017;186(4):387-392.