04/28/23

From Digital To Computational: Metadata For Reproducible Research

notes on structuring unstructured data with metadata frameworks

A few weeks ago, I wrote about digital health being a stepping stone to structuring unstructured data, despite it resulting in unification over personalization in the short term. I’ve been thinking a lot about this in the context of metadata. Metadata refers to data that describes other data; it literally means “data about data”. Metadata shows up in a variety of contexts, including digital media files, scientific research data, and web pages. It serves a variety of purposes: helping to organize and classify information, giving search engines the information they need to help users find what they’re looking for, and ensuring that data is properly formatted and compatible with different systems.
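As a toy sketch of the idea (the field names here are illustrative, not drawn from any particular standard), a metadata record can sit alongside an unstructured piece of data and describe its format, origin, and content:

```python
import json
from datetime import datetime, timezone

# The "data" is an unstructured text note; the metadata is a separate
# record describing its format, origin, and content.
data = "Patient reported improved sleep after 4 weeks."

metadata = {
    "format": "text/plain",        # how the data is encoded
    "source": "intake_survey",     # where it came from (hypothetical)
    "created": datetime(2023, 4, 28, tzinfo=timezone.utc).isoformat(),
    "language": "en",
    "tags": ["sleep", "survey"],   # supports search and classification
}

# Stored together, the metadata is what makes the raw text
# organizable and findable later.
record = {"data": data, "metadata": metadata}
print(json.dumps(record["metadata"], indent=2))
```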

I read this thread by Christine that affirmed my thought about how metadata takes you to the next level in any subject area. She goes on to give examples of its application in different fields. 

Fashion: level 1 is wearing stuff you like, level 2 is having a sense of orientation in how what you wear lands within the context of different frames (Kibbe, color seasons, aesthetics)

Directions: level 1 is having a general sense of the locations you drive through, level 2 is having a map in your head of how it all fits together

Cooking: level 1 is making recipes and trying stuff, level 2 is understanding how ingredients work together, common combinations, types of palates, etc

Metadata has the potential to transform a subject’s understanding from 2D to 3D. However, structuring unstructured data with metadata can be challenging due to several issues:

  1. Lack of standardization: There are many different metadata standards available, and they are often specific to different domains, such as libraries, archives, or scientific research. This can create confusion and inconsistency in metadata creation, leading to difficulties in interoperability and data integration.

  2. Incomplete or inconsistent metadata: Unstructured data often lacks a predefined schema or structure, making it difficult to describe with metadata. Incomplete or inconsistent metadata can lead to difficulties in data discovery, interpretation, and reuse.

  3. Complexity of data: Unstructured data can be complex and heterogeneous, containing multiple data types and formats, such as text, images, audio, and video. Structuring such data with metadata requires careful consideration of the relationships between different data elements and their context.

  4. Dynamic data: Unstructured data can be dynamic and changing, which makes it challenging to maintain accurate and up-to-date metadata. This is particularly relevant in fields such as scientific research, where data is often subject to updates and revisions over time.

The point here is that data is nothing without metadata. We also know that we need structured data to do anything useful. Metadata helps structure data. And to commercialize any computational research, that structured data needs, at the very least, to be reproducible.

To address these issues, it is important to develop and adopt standardized metadata schemas, use automated metadata extraction and annotation tools where appropriate, and involve domain experts in the metadata creation process to ensure accuracy and completeness. It is also crucial to regularly review and update metadata to ensure it remains relevant and useful over time. 
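A minimal sketch of what schema-driven metadata validation might look like; the required fields below are illustrative, not taken from any real standard:

```python
# A shared schema: each required metadata field and its expected type.
# Field names here are hypothetical, not a real metadata standard.
REQUIRED_FIELDS = {"title": str, "creator": str, "date": str, "format": str}

def validate(metadata: dict) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in metadata:
            problems.append(f"missing field: {field}")
        elif not isinstance(metadata[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    return problems

complete = {"title": "RNA-seq run 12", "creator": "lab-a",
            "date": "2023-04-28", "format": "text/csv"}
incomplete = {"title": "RNA-seq run 13"}

print(validate(complete))    # conforms: no problems reported
print(validate(incomplete))  # flags the missing fields
```

Checks like this are what make “incomplete or inconsistent metadata” a catchable problem rather than a silent one.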

Metadata can be used to structure unstructured data by providing additional information about the content and context of the data. Metadata can describe the data in terms of its format, structure, and content, as well as its relationships to other data and its intended use.

There are several ways in which metadata can be used to structure unstructured data:

  1. Data discovery: Metadata can be used to help users discover relevant data by providing information about the data's source, format, and content. This can make it easier to locate and access the data, especially if the data is stored in multiple locations or formats.

  2. Data integration: Metadata can be used to integrate data from different sources by providing information about how the data is related and how it can be combined. This can help to avoid duplication of effort and ensure that the data is consistent and accurate.

  3. Data processing: Metadata can be used to guide data processing by providing information about the structure and format of the data. This can help to ensure that the data is processed correctly and that the results are accurate.

  4. Data analysis: Metadata can be used to support data analysis by providing information about the context of the data, such as its source and intended use. This can help to ensure that the analysis is relevant and meaningful.
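The first of these uses, data discovery, can be sketched as filtering a small catalog of otherwise unstructured documents by their metadata (the catalog and field names are made up for illustration):

```python
# A tiny catalog: each record points at some unstructured document
# and carries metadata describing it. All names are hypothetical.
catalog = [
    {"id": "doc-1", "metadata": {"format": "pdf", "topic": "genomics"}},
    {"id": "doc-2", "metadata": {"format": "csv", "topic": "genomics"}},
    {"id": "doc-3", "metadata": {"format": "csv", "topic": "imaging"}},
]

def discover(catalog, **filters):
    """Return ids of records whose metadata matches every filter."""
    return [r["id"] for r in catalog
            if all(r["metadata"].get(k) == v for k, v in filters.items())]

print(discover(catalog, format="csv"))                    # ['doc-2', 'doc-3']
print(discover(catalog, format="csv", topic="genomics"))  # ['doc-2']
```

The documents themselves could be anything; it is the metadata that makes them queryable.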

Metadata provides an important mechanism for structuring unstructured data and making it more usable and accessible. By providing additional context and information about the data, metadata can help to improve the accuracy, consistency, and relevance of the data for a wide range of applications. 

Metadata takes us from a world of static digital information to a dynamic computational realm, where data can be analyzed, processed, and manipulated in more sophisticated ways than ever before. 

Reproducible Computational Research

We can’t commercialize computational research if it is not reproducible. Despite our increased reliance on digital information, few studies have explicitly described how metadata enables reproducible computational research (RCR).

RCR is fundamental to the scientific method, particularly for in silico analyses. The push for reproducibility in scientific studies has increased support for FAIR principles. FAIR data is data that adheres to the principles of Findability, Accessibility, Interoperability, and Reusability. These principles were first defined in a paper published in the journal Scientific Data in March 2016 by a group of scientists and organizations.

The FAIR principles place a strong emphasis on machine-actionability, meaning that computational systems can locate, access, exchange, and repurpose data with little or no human intervention. This is critical as humans increasingly rely on computational support to manage the sheer volume, complexity, and pace of data generation. Support for FAIR principles for raw data has motivated interest in metadata standards supporting reproducibility. 
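A toy illustration of machine-actionability: a program can use a record’s declared format to decide how to parse the payload, with no human in the loop. The field names are assumptions for the sketch, not part of FAIR itself:

```python
import csv
import io
import json

def load(payload: str, metadata: dict):
    """Dispatch parsing based on the metadata's declared format,
    so a machine can consume the data without human intervention."""
    fmt = metadata["format"]
    if fmt == "text/csv":
        return list(csv.reader(io.StringIO(payload)))
    if fmt == "application/json":
        return json.loads(payload)
    raise ValueError(f"no handler for format: {fmt}")

rows = load("a,b\n1,2", {"format": "text/csv"})
obj = load('{"a": 1}', {"format": "application/json"})
print(rows, obj)
```

The same bytes are useless without the metadata telling the machine what they are.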

I like Whitaker’s matrix of reproducibility, which describes an analysis as “reproducible” in the narrow sense that a user can produce identical results given the original data and code.

[Image: Whitaker’s matrix of reproducibility]

The first principle of FAIR is "Findability", which emphasizes the importance of making data discoverable and easy to find. 

The authors argue that reproducibility is a fundamental aspect of making data findable, as it allows others to discover and verify the data and results of a study. Without reproducibility, it can be difficult or impossible for others to locate and use the data, which can limit its potential impact and value.

One of my favourite example use cases of metadata is around data quality and discoverability for data reuse. This includes something called ontologies. Ontologies are formal descriptions and definitions of relationships: classes, instances, relationships among things, properties of things, functions, processes, constraints, and rules related to things.

An example of an ontological relationship in metadata would be “iPhone” being a subclass of the object class “cell phone”. There is a range of perspectives on what constitutes an ontology, leading to the notion of an “ontology spectrum” that characterizes certain frameworks as weak and others as strong. This spectrum reflects the diversity of views within the field.
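The iPhone example can be sketched as a tiny ontology in code: a subclass relation plus a query that walks the hierarchy. The class names are just the ones from the example, with a made-up parent class added above them:

```python
# A toy ontology: each entry maps a class to its parent class.
# "device" is a hypothetical root added for illustration.
subclass_of = {
    "iPhone": "cell phone",
    "cell phone": "device",
    "landline": "device",
}

def is_a(thing: str, category: str) -> bool:
    """Follow subclass links upward to test class membership."""
    while thing is not None:
        if thing == category:
            return True
        thing = subclass_of.get(thing)
    return False

print(is_a("iPhone", "cell phone"))    # True: direct subclass
print(is_a("iPhone", "device"))        # True: transitive subclass
print(is_a("landline", "cell phone"))  # False: siblings, not related
```

Even this trivial hierarchy lets a machine answer questions the raw labels alone cannot, which is exactly what richer ontology frameworks do at scale.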

Without accurate and comprehensive metadata, it is difficult or impossible to reproduce the results of a study, which can limit the impact and value of the research. It’s imperative we put more effort and resources into metadata standards that ensure FAIR principles for reproducible structured data in computational research.

This piece is 28/50 from my 50 days of writing series. Subscribe to hear about new posts.

References: 

The role of metadata in reproducible computational research:
https://www.sciencedirect.com/science/article/pii/S2666389921001707 

Why is metadata as important as the data itself?
https://www.opendatasoft.com/en/blog/what-is-metadata-and-why-is-it-important-data/ 

Parameter metadata:
https://www.ibm.com/docs/en/cdfsp/7.6.1.1?topic=scripts-parameter-metadata 

FAIR data: 
https://en.wikipedia.org/wiki/FAIR_data 

Ten Simple Rules for Reproducible Computational Research: 
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285