Automating the documentation of data transformation processes


George Alter, Research Professor and former Director of ICPSR, writes about the new standard (SDTL) coming out of DDI and its relevance to improving the comprehensibility of harmonisation research.

A hallmark of research transparency in the social sciences is sharing both program code and data used in publications. Economists began discussing the importance of sharing program code more than twenty years ago. Leading journals in economics, political science, and education research now encourage or require authors to share their code. This is a significant gain for the movement toward reproducible research, but current practices have important limits.

First, authors are only required to deposit “replication data” (the data needed to replicate the results in a specific paper) and the program code used to produce their published results. They are not asked to provide detailed information about the data processing that led up to the replication data. Replication data policies are a reasonable accommodation to authors who do not want to share data that they plan to analyze in the future. However, important steps in creating the replication data file, such as sample selection criteria, may be poorly documented. When a publication is based on publicly available data at a trusted digital repository (like the UK Data Service and ICPSR), there is no reason for journals to collect “replication data.” Program code deposited with the journal should (where confidentiality and data governance allow) reproduce the published results by starting with the original data. The proliferation of “replication data” tempts subsequent researchers to re-use derived datasets without re-examining the sample selection and processing procedures that created them.

Second, data processing code is not easy to understand. Scripts often run to hundreds or thousands of lines. Finding all of the commands that affected a specific variable may be very difficult. Moreover, every statistical analysis package has a different language. Even though SAS and Python have similar capabilities, their languages are about as similar as English and Greek.

Third, everyone knows that program code should include comments and other documentation, but few of us document our code as well as we should. Writing comments is tedious and time-consuming, especially when the code may only be used once.

The “Continuous Capture of Metadata for Statistical Data” (C2Metadata) Project (National Science Foundation grant ACI-1640575) is addressing these problems by automating the creation of metadata about data transformation scripts. Our software applications read scripts from the five most common statistical analysis packages (SPSS, SAS, Stata, R, and Python) and convert them into a common format called Structured Data Transformation Language (SDTL). SDTL can be inserted into standard metadata files, like DDI (Data Documentation Initiative), which is used by social science data repositories. We also have software that translates SDTL into natural language, so that a data user does not need to understand the statistical analysis package that performed the transformations. Codebooks and data catalogues can now include “variable lineages” that describe every data transformation command that affected each variable. SDTL is also easy for machines to read and query. We can ask which antecedent variables affected a specific derived variable, or which derived variables were affected by a specific program step.
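To make the idea of querying a variable lineage concrete, here is a minimal sketch in Python. It is not C2Metadata code and does not use the real SDTL schema: the command records, their field names (step, command, target, sources), and the two helper functions are all hypothetical simplifications chosen for illustration. It shows the two questions mentioned above: which antecedent variables fed into a derived variable, and which derived variables were affected by a given program step.

```python
from collections import defaultdict

# Hypothetical, heavily simplified stand-in for a list of SDTL commands.
# The real SDTL schema is richer; these field names are assumptions.
commands = [
    {"step": 1, "command": "Compute", "target": "income_total",
     "sources": ["wages", "benefits"]},
    {"step": 2, "command": "Recode", "target": "income_band",
     "sources": ["income_total"]},
    {"step": 3, "command": "Compute", "target": "age_sq",
     "sources": ["age"]},
]


def antecedents(variable, commands):
    """Return every variable that directly or indirectly fed into `variable`."""
    direct = defaultdict(set)
    for cmd in commands:
        direct[cmd["target"]].update(cmd["sources"])

    seen = set()
    stack = [variable]
    while stack:
        current = stack.pop()
        for parent in direct.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def affected_by_step(step, commands):
    """Return the variables changed by `step` or derived from its result later."""
    affected = set()
    for cmd in sorted(commands, key=lambda c: c["step"]):
        if cmd["step"] == step:
            affected.add(cmd["target"])
        elif cmd["step"] > step and affected.intersection(cmd["sources"]):
            affected.add(cmd["target"])
    return affected


print(sorted(antecedents("income_band", commands)))
# ['benefits', 'income_total', 'wages']
print(sorted(affected_by_step(1, commands)))
# ['income_band', 'income_total']
```

Because every command in this toy format names its target and source variables explicitly, both queries reduce to simple graph walks. That is the property that makes a structured representation easier to query than raw script text.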

When these tools are linked in a web service or desktop application, a researcher can submit a data transformation script along with an original metadata file and receive a codebook with annotations about each derived variable. In the background, C2Metadata tools analyze the script, translate it into SDTL, insert SDTL into an XML metadata file, and pass the XML to a program that produces a codebook or other documentation. Most researchers will never see the SDTL version of their command script, because the software translates SDTL into a human-readable description of the transformed data file.
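The last two stages of that workflow, inserting transformation records into an XML metadata file and rendering them as readable text, can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the element names (codebook, variable, derivation), the command fields, and the functions describe and annotate are all hypothetical, and real DDI metadata is far more detailed.

```python
import xml.etree.ElementTree as ET

# Hypothetical original metadata; real DDI XML uses different, richer elements.
ORIGINAL_METADATA = (
    '<codebook>'
    '<variable name="wages"/>'
    '<variable name="benefits"/>'
    '</codebook>'
)

# Hypothetical simplified command records standing in for SDTL.
commands = [
    {"step": 1, "command": "Compute", "target": "income_total",
     "sources": ["wages", "benefits"]},
]


def describe(cmd):
    """Render one command record as a plain-English sentence."""
    sources = ", ".join(cmd["sources"])
    return (f"Step {cmd['step']}: {cmd['target']} was created by a "
            f"{cmd['command']} command using {sources}.")


def annotate(metadata_xml, commands):
    """Attach a derivation note to each target variable, adding new variables as needed."""
    root = ET.fromstring(metadata_xml)
    by_name = {v.get("name"): v for v in root.iter("variable")}
    for cmd in commands:
        var = by_name.get(cmd["target"])
        if var is None:
            # Derived variables are absent from the original metadata.
            var = ET.SubElement(root, "variable", {"name": cmd["target"]})
            by_name[cmd["target"]] = var
        note = ET.SubElement(var, "derivation", {"step": str(cmd["step"])})
        note.text = describe(cmd)
    return root


annotated = annotate(ORIGINAL_METADATA, commands)
for var in annotated.iter("variable"):
    for note in var.iter("derivation"):
        print(note.text)
# Step 1: income_total was created by a Compute command using wages, benefits.
```

In this sketch the researcher only ever sees the printed sentences; the structured records stay in the background, much as the paragraph above describes for SDTL.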

SDTL is currently under review for adoption by the DDI Alliance into its suite of international standards. SDTL version 1.0 is available for comments here.

All software from the C2Metadata Project will be available under an open source license in a public repository. We expect that applications developed by the project will be incorporated into user-friendly tools.

Further information