Automating the documentation of data transformation processes

16 July 2020

George Alter, Research Professor and former Director of ICPSR, writes about the Structured Data Transformation Language (SDTL), the new standard coming out of DDI, and its relevance to improving the comprehensibility of harmonisation research.

A hallmark of research transparency in the social sciences is sharing both the program code and the data used in publications. Economists began discussing the importance of sharing program code more than twenty years ago. Leading journals in economics, political science, and education research now encourage or require authors to share their code. This is an important gain for the movement toward reproducible research, but current practices have significant limits.

First, authors are only required to deposit “replication data” (the data needed to reproduce the results in a specific paper) and the program code used to produce their published results. They are not asked to provide detailed information about the data processing that led up to the replication data. Replication data policies are a reasonable accommodation to authors who do not want to share data that they plan to analyze in the future. However, important steps in creating the replication data file, such as sample selection criteria, may be poorly documented. When a publication is based on publicly available data at a trusted digital repository (like the UK Data Service and ICPSR), there is no reason for journals to collect “replication data.” Program code deposited with the journal should (where confidentiality and data governance allow) reproduce the published results by starting with the original data. The proliferation of “replication data” tempts subsequent researchers to re-use derived datasets without re-examining the sample selection and processing procedures that created them.

Second, data processing code is not easy to understand. Scripts often run to hundreds or thousands of lines. Finding all of the commands that affected a specific variable may be very difficult. Moreover, every statistical analysis package has a different language. Even though SAS and Python have similar capabilities, their languages are about as similar as English and Greek.

Third, everyone knows that program code should include comments and other documentation, but few of us document our code as well as we should. Writing comments is tedious and time-consuming, especially when the code may only be used once.

The “Continuous Capture of Metadata for Statistical Data” (C2Metadata) Project (National Science Foundation grant ACI-1640575) is addressing these problems by automating the creation of metadata about data transformation scripts. Our software applications read scripts from the five most common statistical analysis packages (SPSS, SAS, Stata, R, and Python) and convert them into a common format called Structured Data Transformation Language (SDTL). SDTL can be inserted into standard metadata files, like DDI (Data Documentation Initiative), which is used by social science data repositories. We also have software that translates SDTL into natural language, so that a data user does not need to understand the statistical analysis package that performed the transformations. Codebooks and data catalogues can now include “variable lineages” that describe every data transformation command that affected each variable. SDTL is also easy for machines to read and query. We can ask which antecedent variables affected a specific derived variable, or which derived variables were affected by a specific program step.
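As a rough illustration of the kind of lineage query this enables, the sketch below (in Python) walks a simplified, SDTL-like trace of two hypothetical script commands and asks which antecedent variables fed a derived variable. The field names and structure are illustrative stand-ins, not the actual SDTL schema or the C2Metadata tools.

    # Illustrative sketch only: the objects below are simplified, SDTL-like
    # stand-ins, not the actual SDTL 1.0 schema. They show the kind of
    # variable-lineage query described above.

    # A toy trace of two commands from a hypothetical cleaning script, e.g.
    #   gen bmi = weight / (height^2)
    #   recode bmi ... , gen(bmi_cat)
    commands = [
        {"command": "Compute", "produces": ["bmi"], "consumes": ["weight", "height"]},
        {"command": "Recode", "produces": ["bmi_cat"], "consumes": ["bmi"]},
    ]

    def antecedents(variable, trace):
        """Return every variable that directly or indirectly fed `variable`."""
        found, frontier = set(), {variable}
        while frontier:
            current = frontier.pop()
            for step in trace:
                if current in step["produces"]:
                    new = set(step["consumes"]) - found
                    found |= new
                    frontier |= new
        return found

    print(antecedents("bmi_cat", commands))  # {'bmi', 'weight', 'height'}

Querying in the other direction, from an antecedent variable to everything derived from it, follows the same pattern with the roles of "produces" and "consumes" swapped.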

When these tools are linked in a web service or desktop application, a researcher can submit a data transformation script along with an original metadata file and receive a codebook with annotations about each derived variable.  In the background, C2Metadata tools analyze the script, translate it into SDTL, insert SDTL into an XML metadata file, and pass the XML to a program that produces a codebook or other documentation.  Most researchers will never see the SDTL version of their command script, because the software translates SDTL into a human-readable description of the transformed data file.
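In the same illustrative spirit, the short sketch below shows how one such SDTL-like step could be rendered as the plain-English annotation a codebook might carry. The wording and field names are hypothetical, not the project's actual output format.

    # A toy renderer: turning one SDTL-like step into the kind of plain-English
    # annotation a codebook might carry. The field names and wording are
    # hypothetical, not the C2Metadata project's actual output.

    step = {
        "command": "Compute",
        "produces": ["bmi"],
        "consumes": ["weight", "height"],
        "expression": "weight / (height^2)",
    }

    def describe(step):
        derived = ", ".join(step["produces"])
        sources = " and ".join(step["consumes"])
        return (f"{derived} was created by a {step['command']} command from "
                f"{sources} using the expression {step['expression']}.")

    print(describe(step))
    # -> bmi was created by a Compute command from weight and height using the
    #    expression weight / (height^2).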

SDTL is currently under review for adoption by the DDI Alliance into its suite of international standards.  SDTL version 1.0 is available for comments here.

All software from the C2Metadata Project will be available under an open source license in a public repository. We expect that applications developed by the project will be incorporated into user-friendly tools.

Further information

  • Check out the C2Metadata Project
  • Find out more about ICPSR