How metadata can drive new ways of doing research and collecting data

Doing science in the 21st century is a complicated business: more data, from different sources, each with their own unique characteristics, that take time to acquire and thought to combine. Very often, researchers spend a lot of time collating disparate pieces of information to begin to understand the data, even if they have been documented. With such effort involved, it is remarkable that there is any time left to do analysis.

Social science archives were established in the 1960s and 1970s in the USA and Europe to solve many of these problems. Data archivists quickly settled on common standards for a consistent set of information needed to support research. This covers not just the data, but also the curation and extraction of metadata for discovery and evaluation purposes, and contextual information about the data collection, such as questionnaires and protocols.

Researchers need better access to information about data

Whilst the standards used and the information available have improved significantly over the years, the information provided to archives has not (in general). Researchers still have to put in large amounts of effort to turn it into high quality research resources. One reason is that information about the data collection is very often detached from the datasets made available for research, and high quality data requires the two to be better connected.

Enter CLOSER Discovery

The CLOSER Discovery search engine makes it easier for researchers to explore some of the most used longitudinal studies in the UK, by tying together rich layers of information about the data collected. Not only does this allow for discovery of content, it also opens up new ways of managing and using questions and data that:

ease the manual and time-consuming work that goes into planning research
bring efficiencies into the data collection process

Find questions and build your own surveys

Discovery lets you download questions (including code lists), shows exactly where they came from, and lets you create your own mini-question banks (similar to EndNote for questions) to use in your own questionnaires.

In addition, we will be adding functionality so that you can also get the standardised HASSET or MeSH concepts attached to the questions and, using our API, insert them automatically into a questionnaire editing tool, to track what changes you made and why. Tools such as the Questionnaire Design and Documentation Tool currently being used on the European Social Survey already have this functionality, and we hope other tools will become available to design questionnaires and protocols.

Where we are able to get copyright, you will also be able to access the original questions from a published scale or battery of questions in a machine-readable format. So when you are looking at the data, there should be no more scrabbling around trying to work out whether the question was used as intended, or whether it was changed in some way.

Find variables – and where they come from

Discovery also lets you see not only which question a variable came from, but also what went into derived variables. This gives a clean lineage from data collection to output and derivation, and crucially shows the population from which the data were taken, without laboriously cross-tabbing variables to try to understand what is going on, or manually searching through PDFs.
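
One way to picture this kind of lineage is as a small graph that can be walked from a derived variable back to the questions behind it. The following is a minimal sketch only, not CLOSER Discovery's actual data model or API: the variable names, the question identifiers, and the `sources` mapping are all hypothetical and made up for illustration.

```python
# Hypothetical lineage records (illustrative only, not real CLOSER metadata).
# Each key is a variable; its value lists the items it was built from.
# Items with no entry of their own are treated as original questions.
sources = {
    "bmi": ["height_m", "weight_kg"],  # a derived variable
    "height_m": ["q_height"],          # variable taken from one question
    "weight_kg": ["q_weight"],
}


def trace_to_questions(variable, sources):
    """Return the sorted list of source items a variable ultimately derives from."""
    questions = set()
    seen = set()
    stack = [variable]
    while stack:
        item = stack.pop()
        if item in seen:
            continue  # guard against revisiting shared or circular inputs
        seen.add(item)
        parents = sources.get(item)
        if parents is None:
            # Nothing recorded upstream: treat this item as an original source
            questions.add(item)
        else:
            stack.extend(parents)
    return sorted(questions)


print(trace_to_questions("bmi", sources))  # ['q_height', 'q_weight']
```

The point of the sketch is that once the links between questions, variables and derivations are recorded as metadata, answering "where did this come from?" becomes a lookup rather than a search through documentation.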

Find equivalent measures within and across studies

As we layer on more information, such as other questions that capture an equivalent measure, side-by-side comparison within studies and between studies also becomes possible. We will not make decisions about whether we think they are harmonisable. We will leave that up to you. But you can subset the variables you are interested in for your research question, be it on the variables alone, or refined to look only at particular populations or modes of data collection, in a way that is not straightforward even in the best question and variable databanks.
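
As a rough illustration of this kind of subsetting, the sketch below filters a tiny catalogue of variable records by concept, and optionally by population and mode of data collection. It is an assumption-laden example, not the Discovery interface: the `VariableRecord` fields, the study names, and the `subset` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableRecord:
    """Hypothetical metadata for one variable (fields chosen for illustration)."""
    study: str
    name: str
    concept: str
    population: str
    mode: str


# Made-up catalogue entries; real metadata would be far richer.
catalogue = [
    VariableRecord("Study A", "smoke1", "smoking", "adults", "face-to-face"),
    VariableRecord("Study A", "smoke2", "smoking", "adults", "web"),
    VariableRecord("Study B", "cigs", "smoking", "adolescents", "web"),
    VariableRecord("Study B", "ht", "height", "adults", "face-to-face"),
]


def subset(records, concept, population=None, mode=None):
    """Keep records matching the concept and any optional filters supplied."""
    return [
        r for r in records
        if r.concept == concept
        and (population is None or r.population == population)
        and (mode is None or r.mode == mode)
    ]


# Candidate measures across studies; judging comparability is left to the researcher.
for record in subset(catalogue, "smoking", mode="web"):
    print(record.study, record.name, record.population)
```

The helper only narrows down candidates. It deliberately makes no claim about whether the resulting variables can be harmonised.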

Although such rich information can be invaluable (especially if it were rolled out more widely), it will most likely never achieve the ease of use of a search engine like Google. CLOSER Discovery is more comparable to an enhanced bibliographic search, where you know what you want and the machine can help you find it, but you still have to apply your own judgement with your own criteria.

And it will just keep getting better…

The underlying structure of the standards we are using ‘future-proofs’ Discovery, so that it can cope with rapidly changing technology and new data collection methods. There are other exciting possibilities for using the CLOSER metadata underpinning Discovery, and the standards we employ. We are beginning to have sufficient information to start predicting new metadata, such as what the concepts and relationships may be, based on the content we receive. We can also potentially enhance and populate question banks, and validate incoming survey data. We will begin to explore these in more detail and report back on progress.

Jon Johnson is CLOSER’s Technical Lead. You can follow him on Twitter @spuddybike.

Suggested citation:
Johnson J (2019) “How metadata can drive new ways of doing research and collecting data”, CLOSER blog.
