Allotrope 101 - Part A

Don't know what Allotrope is? Get a 5-minute overview here. The crash course provides a basic introduction to the relevant concepts that will help you better understand the Allotrope framework and leverage it within your organization's data strategy and solution.

Written by:
Benjamin J Woolford-Lim (Benjamin.j.woolford-lim@gsk.com)
Kostadin Alargov (kalargov@tetrascience.com)
Vincent Chan (vchan@tetrascience.com)
Spin Wang (swang@tetrascience.com)

Overview

Our recommendation for learning and using Allotrope is to:

  • Learn the basic concepts of the Semantic Web and other fundamental concepts in Allotrope 101 - Part A (this blog)
  • Learn the basic concepts of the Allotrope Framework itself in Allotrope 101 - Part B (coming soon)
  • Read and try out ADF Primer to become familiar with basic coding examples
  • Read and try out ADF Leaf Node Automation to start generating simple ADF files for real-world instruments
  • Start your development and prototyping using Allotrope Documentation Website

If you have questions or get stuck, don’t worry!

Part A: Motivation and Underlying Concepts

Motivation

As more life science companies become data-driven, it is crucial that organizations have access to high-quality data sets.

Historically, scientific instruments were not designed as open systems. Instruments usually produce data in their own proprietary or vendor-specific format.

This creates a huge barrier to using data effectively in the life science industry: each format becomes its own data silo, which makes it harder to gain insights and perform analytics on scientific data.

Allotrope Foundation aims to revolutionize the way we acquire, share, and gain insights from scientific data, through a community and a framework for standardization and linked data.

Underlying Concepts

Semantic Web

The Semantic Web is an extension of the World Wide Web and aims to provide a common framework for data to be shared and reused across applications, enterprises, and community boundaries. The Semantic Web is therefore regarded as an integrator across different content, information applications, and systems.

Leveraging the Semantic Web philosophy to model scientific data.

With a Web of information, anyone may contribute to knowledge about a resource. It was this aspect of the current Web that allowed it to grow at such an unprecedented rate. To implement the Semantic Web, there needs to be a way to distribute information over the Web. One of the most common approaches, defined by the W3C, is called Resource Description Framework (RDF), and it relies on the concept of "triples".

RDF

The Semantic Web promotes common data formats and exchange protocols on the Web. Resource Description Framework (RDF) is the most popular modeling language and the basis on which several others, such as Web Ontology Language (OWL), are built. RDF extends the linking structure of the Web by using URIs to name the relationship between things as well as the two ends of the link (this is usually referred to as a “triple”). Using this simple model, structured and semi-structured data can be mixed, exposed, and shared across different applications. Note that RDF as a data model is distinct from RDF/XML, which is a means of representing RDF in XML and has largely been superseded by other, easier-to-use formats such as Turtle.

Triples

Triples are a standard way of storing elements, or facts. A set of triples can be combined to represent a graph of data and relationships, such as an ontology. Triples consist of a subject, a predicate, and an object. For example,

system-A hasType chromatography-instrument
system-A hasPart detector-1
detector-1 hasType UV-detector

The subject is the resource the fact is about. It is usually either a class from an ontology, or some instance of an entity in the overall graph.

The predicate is the relationship between the subject and the object, such as the type of detector an instrument has.

The object is the value this fact asserts is related to the subject. It can either be another resource like the subject, i.e. an instance of an entity or an ontology class, or a fixed literal value such as 500.14 or "Allotrope".
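
As a rough illustration, the three example facts above can be held in plain Python and matched by pattern. This is only a sketch using the standard library: the `Triple` type and the `match` helper are hypothetical names invented here, not part of any Allotrope or RDF library, and real RDF identifies resources with URIs rather than short strings.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    """One fact: subject, predicate, object (illustrative type, not an RDF API)."""
    subject: str
    predicate: str
    object: str


# The three example facts from the text, held as a small graph (a set of triples).
graph = {
    Triple("system-A", "hasType", "chromatography-instrument"),
    Triple("system-A", "hasPart", "detector-1"),
    Triple("detector-1", "hasType", "UV-detector"),
}


def match(graph, subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.object == obj)
    )


# Everything the graph states about system-A.
facts_about_system_a = match(graph, subject="system-A")

# Follow the hasPart link, then ask for the type of each part.
part_types = [
    t.object
    for part in match(graph, subject="system-A", predicate="hasPart")
    for t in match(graph, subject=part.object, predicate="hasType")
]
print(facts_about_system_a)
print(part_types)
```

Because the object of one triple (detector-1) is the subject of another, following links between facts is just a matter of matching again on the value you found.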

Here is a snippet from the QUDT ontology that contains some facts about the atomic mass unit, namely u or dalton (Da), expressed in Turtle syntax.

unit:AtomicMassUnit
      rdf:type qudt:AtomicMassUnit ;
      rdfs:label "Atomic mass unit"^^xsd:string ;
      qudt:abbreviation "u"^^xsd:string ;
      qudt:code "0486"^^xsd:string ;
      qudt:conversionMultiplier
              1.66053878283e-27 ;
      qudt:conversionOffset
              "0.0"^^xsd:double ;
      qudt:description "The unified atomic mass unit (symbol: u) or dalton (symbol: Da) is a unit that is used for indicating mass on an atomic or molecular scale. It is defined as one-twelfth of the rest mass of an unbound atom of carbon-12 in its nuclear and electronic ground state,[1] and has a value of 1.660538782(83)×10−27 kg.[2] One Da is approximately equal to the mass of one proton or one neutron. The CIPM have categorized it as a \"non-SI unit whose values in SI units must be obtained experimentally\".[1] [Wikipedia]"^^xsd:string ;
      qudt:symbol "u"^^xsd:string .

Once you have a graph, it can be queried using SPARQL, a query language very similar to SQL but designed for semantic content. For every Allotrope Data Format (ADF) file, its Data Description is able to store triples about the data contained in the file, using Allotrope Foundation Ontologies (AFO) to provide consistent terms and relationships across instruments, techniques, disciplines, and vendors.

RDF provides a mechanism that allows anyone to make a basic statement about anything and layer those statements into a single graph.

Now imagine an instrument being able to automatically produce those statements and add them into a big pool of scientific data with consistent terms and relationships between the data sets. The scientists and data analysts can now spend the majority of their time analyzing and gaining insights into the data, instead of simply trying to make sense of it.

That would be pretty powerful, wouldn't it?
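
The idea of layering statements from several sources into one graph can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `example.org` identifiers, the predicate names, and the two "sources" are invented, and a real system would use an RDF library and a triplestore rather than Python sets.

```python
# Hypothetical identifier under example.org, used only for illustration.
INSTRUMENT = "http://example.org/instrument/system-A"

# Statements contributed by two independent sources about the same resource.
from_vendor = {
    (INSTRUMENT, "http://example.org/hasType", "chromatography-instrument"),
    (INSTRUMENT, "http://example.org/hasPart", "http://example.org/detector/1"),
}
from_lab = {
    (INSTRUMENT, "http://example.org/locatedIn", "Lab 3"),
    # The same fact stated twice collapses to one triple in the merged graph.
    (INSTRUMENT, "http://example.org/hasType", "chromatography-instrument"),
}

# Because both sources use the same identifier, merging is a plain set union.
merged = from_vendor | from_lab

predicates = sorted(p for s, p, o in merged if s == INSTRUMENT)
print(len(merged))
print(predicates)
```

No schema migration or key mapping is needed for the merge; agreeing on identifiers and terms is what makes the statements line up.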

URI and IRI

URI stands for Uniform Resource Identifier and IRI stands for Internationalized Resource Identifier.

URIs are a way to uniquely identify a resource or name something. By leveraging URIs in the RDF framework, it is possible to represent ontologies in a unique and even resolvable way.

The difference between URIs and IRIs is that URIs are limited to ASCII characters, whereas IRIs can accommodate the Unicode character set.
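
To see the difference in practice, the sketch below takes a made-up IRI containing non-ASCII characters and percent-encodes it into an ASCII-only URI with Python's standard `urllib.parse`. The IRI itself is hypothetical, and the conversion is simplified: a complete IRI-to-URI mapping also has to deal with internationalized host names, which this example does not attempt.

```python
from urllib.parse import quote, unquote

# Hypothetical IRI with non-ASCII characters (example.org is a reserved example domain).
iri = "http://example.org/ontologie/résultat#température"

# Percent-encode the UTF-8 bytes of each non-ASCII character.
# The `safe` argument keeps the usual URI delimiters untouched.
uri = quote(iri, safe=":/#?&=@+,;$!*'()[]%~")

# Decoding the percent-escapes recovers the original IRI.
round_trip = unquote(uri)

print(uri)
print(round_trip == iri)
```

The resulting URI contains only ASCII characters, so it can be used anywhere a plain URI is expected while still identifying the same resource.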

SPARQL

SPARQL (pronounced "sparkle" and short for SPARQL Protocol and RDF Query Language) is a language for querying triples. It is structurally similar to SQL. See https://jena.apache.org/tutorials/sparql.html for a tutorial.

Here is a set of triples in the Turtle format.
These triples form a graph that describes the fact that a cell counter measures the total cell count to be 1972.0.

<http://purl.allotrope.org/ontologies/result#AFR_0001114>
    <http://www.w3.org/2004/02/skos/core#prefLabel>
            "total cell count" .

<urn:uuid:354fd520-f0b5-410e-abe9-90fb510e8683>
    a       <http://purl.allotrope.org/ontologies/result#AFR_0001114> ;
    <http://qudt.org/schema/qudt#numericValue>
            "1972.0"^^<http://www.w3.org/2001/XMLSchema#double> ;
    <http://qudt.org/schema/qudt#unit>
            <http://purl.allotrope.org/ontology/qudt-ext/unit#Cell> .

Here is an example of a SPARQL query that obtains the total cell count from the previous graph.

prefix qudt: <http://qudt.org/schema/qudt#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?result
WHERE {
    ?o skos:prefLabel "total cell count" .
    ?s rdf:type ?o .
    ?s qudt:numericValue ?result .
}

By leveraging triples and SPARQL, you can perform powerful queries on top of highly connected data sets; this is one of the key benefits of using the Semantic Web.
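
To make the matching behaviour of the query above concrete, here is a toy evaluator written in plain Python. It is a sketch for intuition only, not how a SPARQL engine such as Jena is implemented: the `solve` function is a hypothetical helper, the graph is hand-copied from the Turtle above, and the double literal is represented as a Python float.

```python
# The cell-count graph from the text as (subject, predicate, object) tuples.
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
QUDT_NUMERIC_VALUE = "http://qudt.org/schema/qudt#numericValue"
QUDT_UNIT = "http://qudt.org/schema/qudt#unit"
TOTAL_CELL_COUNT = "http://purl.allotrope.org/ontologies/result#AFR_0001114"
MEASUREMENT = "urn:uuid:354fd520-f0b5-410e-abe9-90fb510e8683"

graph = [
    (TOTAL_CELL_COUNT, SKOS_PREF_LABEL, "total cell count"),
    (MEASUREMENT, RDF_TYPE, TOTAL_CELL_COUNT),
    (MEASUREMENT, QUDT_NUMERIC_VALUE, 1972.0),
    (MEASUREMENT, QUDT_UNIT, "http://purl.allotrope.org/ontology/qudt-ext/unit#Cell"),
]


def solve(patterns, graph, binding=None):
    """Yield variable bindings satisfying every pattern; terms starting with '?' are variables."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in graph:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if isinstance(term, str) and term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # All three positions matched; continue with the remaining patterns.
            yield from solve(rest, graph, candidate)


# The WHERE clause of the query, one tuple per triple pattern.
where = [
    ("?o", SKOS_PREF_LABEL, "total cell count"),
    ("?s", RDF_TYPE, "?o"),
    ("?s", QUDT_NUMERIC_VALUE, "?result"),
]
results = [b["?result"] for b in solve(where, graph)]
print(results)
```

Each pattern narrows the set of possible bindings, and a variable shared between patterns (here `?o` and `?s`) acts like a join condition in SQL.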

SHACL

SHACL (Shapes Constraint Language) is a language for defining and validating constraints on RDF graphs. It is a relatively new standard from the W3C. See https://www.w3.org/TR/shacl/ for more details.

Using triples from RDF allows you to express facts in a manner that can connect to other facts across datasets and domains, but the flexibility of triples can cause inconsistency in how data is represented. This is where SHACL comes in.

A SHACL validator can check whether an RDF graph is valid with respect to a set of constraints or rules expressed in SHACL. By encoding data models in SHACL, we can use this kind of automatic validation to check that our triples and graphs conform to the expected model, with all required information in the correct place and linked in the right way. This ensures that datasets can be searched via the same SPARQL queries, and enables easy and consistent linking to other datasets as the structure of the data is known and validated.

An example SHACL file snippet looks like this:

unnamed:checkForEntityNode
  rdf:type sh:NodeShape ;
  sh:property unnamed:checkNumericValue ;
  sh:property unnamed:checkUnit ;
  sh:property [
      sh:path rdf:type ;
      sh:maxCount 1 ;
      sh:minCount 1 ;
      sh:nodeKind sh:IRI ;
    ] ;
  sh:targetClass <http://purl.allotrope.org/ontologies/result#AFR_0001111> ;
.
unnamed:checkNumericValue
  rdf:type sh:PropertyShape ;
  sh:path <http://qudt.org/schema/qudt#numericValue> ;
  sh:datatype xsd:double ;
  sh:maxCount 1 ;
  sh:minCount 1 ;
  sh:nodeKind sh:Literal ;
.
unnamed:checkUnit
  rdf:type sh:PropertyShape ;
  sh:path <http://qudt.org/schema/qudt#unit> ;
  sh:hasValue <http://qudt.org/vocab/unit#Percent> ;
  sh:nodeKind sh:IRI ;
.

This snippet (unnamed:checkForEntityNode) tries to make sure that an entity node in the graph that belongs to the class viability (defined as http://purl.allotrope.org/ontologies/result#AFR_0001111 in the Allotrope Foundation Ontology) has one and only one rdf:type, one and only one numeric value of type xsd:double, and percent as its unit. Notice that the SHACL file is also presented as a set of triples in the Turtle format.
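
The kind of checks a validator performs for this shape can be approximated by hand. The Python below is a sketch for intuition, not a SHACL implementation: `validate_viability_nodes` and `values_of` are hypothetical helpers, the two sample graphs use made-up `urn:example:` node identifiers, and a Python float stands in for an xsd:double literal. A real project would run an actual SHACL validator against the shapes file instead.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
NUMERIC_VALUE = "http://qudt.org/schema/qudt#numericValue"
UNIT = "http://qudt.org/schema/qudt#unit"
VIABILITY = "http://purl.allotrope.org/ontologies/result#AFR_0001111"
PERCENT = "http://qudt.org/vocab/unit#Percent"


def values_of(graph, node, predicate):
    """All objects attached to `node` through `predicate`."""
    return [o for s, p, o in graph if s == node and p == predicate]


def validate_viability_nodes(graph):
    """Return human-readable violations for every node typed as viability."""
    violations = []
    # sh:targetClass: only nodes that are instances of the viability class are checked.
    targets = sorted({s for s, p, o in graph if p == RDF_TYPE and o == VIABILITY})
    for node in targets:
        # sh:minCount 1 and sh:maxCount 1 on rdf:type
        if len(values_of(graph, node, RDF_TYPE)) != 1:
            violations.append(f"{node}: expected exactly one rdf:type")
        # sh:minCount 1, sh:maxCount 1 and sh:datatype xsd:double on qudt:numericValue
        numeric = values_of(graph, node, NUMERIC_VALUE)
        if len(numeric) != 1 or not isinstance(numeric[0], float):
            violations.append(f"{node}: expected exactly one double numericValue")
        # sh:hasValue on qudt:unit: the percent unit must be among the values
        if PERCENT not in values_of(graph, node, UNIT):
            violations.append(f"{node}: unit must include {PERCENT}")
    return violations


# Hypothetical sample data: one conforming node and one that breaks two constraints.
good_graph = [
    ("urn:example:good", RDF_TYPE, VIABILITY),
    ("urn:example:good", NUMERIC_VALUE, 95.5),
    ("urn:example:good", UNIT, PERCENT),
]
bad_graph = [
    ("urn:example:bad", RDF_TYPE, VIABILITY),
    ("urn:example:bad", NUMERIC_VALUE, 95.5),
    ("urn:example:bad", NUMERIC_VALUE, 96.0),  # two values violate sh:maxCount 1
    # no qudt:unit triple at all
]

print(validate_viability_nodes(good_graph))
print(validate_viability_nodes(bad_graph))
```

The value of expressing these rules in SHACL rather than in code like this is that the constraints live alongside the data as triples and can be shared, versioned, and applied by any conforming validator.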

Taxonomy

A taxonomy is a hierarchical classification of entities that uses the same relationship type, e.g. "is a subclass of", throughout. Taxonomies are typically represented by a tree structure, such as the animal kingdom taxonomies.

Ontology

An ontology is a superclass of taxonomies: it allows several different relationships, e.g. "is a", "has a", "contains a", and permits multiple inheritance within the same ontology. Whilst taxonomies can be represented as a tree due to their hierarchical nature, ontologies have more complex relationships and are modeled as graphs.
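
The structural difference between the two can be sketched with ordinary Python data structures. The class names below are invented for illustration and are not terms from the Allotrope Foundation Ontologies.

```python
# Taxonomy: one relationship type ("is a subclass of"), so a child -> parent map is enough.
subclass_of = {
    "UV-detector": "detector",
    "detector": "instrument-part",
    "pump": "instrument-part",
}


def ancestors(term):
    """Walk up the tree from a term to the root."""
    chain = []
    while term in subclass_of:
        term = subclass_of[term]
        chain.append(term)
    return chain


# Ontology: several relationship types, so every edge carries a label (a graph of triples).
ontology = [
    ("UV-detector", "is a", "detector"),
    ("UV-detector", "is a", "optical-device"),  # second parent: multiple inheritance
    ("chromatography-instrument", "has a", "detector"),
    ("chromatography-instrument", "contains a", "pump"),
]

parents_in_ontology = sorted(
    o for s, p, o in ontology if s == "UV-detector" and p == "is a"
)
relationship_types = sorted({p for s, p, o in ontology})

print(ancestors("UV-detector"))
print(parents_in_ontology)
print(relationship_types)
```

A single parent per term is what keeps the taxonomy a tree; as soon as a term has two parents or edges of different kinds, a labelled graph is the natural representation.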

Graphs and graph databases

Data are most often represented in tabular form (think about your relational databases), so you may naturally wonder why we want to model the data as a graph rather than in the more traditional relational form.

Graphs are a powerful way to store and explore unstructured and semi-structured data, in particular when you need to create relationships between data and quickly query these relationships.

Graph databases have advantages over relational databases for use cases like social networking, recommendation engines, and fraud detection, where the relationships between data are arguably as important as the data itself. If you use traditional relational databases, you would need a large number of tables with multiple foreign keys to store the data, which is difficult to understand and maintain. Furthermore, using SQL to navigate this data would require nested queries and complex joins that quickly become unwieldy, and the queries would not perform well as your data size grows over time.

In graph databases, the relationships are stored as first-class citizens of the data model, as opposed to relational databases, which require us to establish relationships using foreign keys. This allows data in nodes to be directly linked, dramatically improving the performance of queries that navigate relationships in the data. It also enables the model to map closely to our physical world.
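
A small Python sketch can contrast the two access patterns. It uses the standard-library `sqlite3` module for the relational side and a plain dictionary for the graph side; the table name, the part names, and the `reachable` helper are all hypothetical, and the example illustrates only the shape of the queries, not the performance of any real database.

```python
import sqlite3
from collections import deque

# Hypothetical "has part" relationships between pieces of equipment.
edges = [
    ("system-A", "detector-1"),
    ("detector-1", "lamp-1"),
    ("system-A", "pump-1"),
]

# Relational view: relationships live in a table, and each extra hop is another self-join.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE has_part (parent TEXT, child TEXT)")
conn.executemany("INSERT INTO has_part VALUES (?, ?)", edges)
two_hops = [
    row[0]
    for row in conn.execute(
        "SELECT b.child FROM has_part AS a "
        "JOIN has_part AS b ON b.parent = a.child "
        "WHERE a.parent = ? ORDER BY b.child",
        ("system-A",),
    )
]
conn.close()

# Graph view: each node links directly to its neighbours, so traversal follows the links.
adjacency = {}
for parent, child in edges:
    adjacency.setdefault(parent, []).append(child)


def reachable(start):
    """Breadth-first traversal returning every node reachable from start."""
    seen, queue = [], deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.append(neighbour)
                queue.append(neighbour)
    return seen


print(two_hops)
print(reachable("system-A"))
```

The SQL query has to be rewritten with one more join for every additional level of nesting, whereas the traversal works unchanged for any depth.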

Here is an article that compares graph databases and relational databases.

Here is a list of the graph databases and a ranking of their popularity.

Acronyms and glossary

  • BFO: Basic Formal Ontology, an upper level ontology used to ensure consistent usage and linking of terms across different ontologies. It is widely used in the biomedical space, including serving as the basis for every ontology in the Open Biological and Biomedical Ontology Foundry (OBOFoundry). See http://ifomis.uni-saarland.de/bfo/ for more information.
  • HDF5: A binary file format, optimized for high-performance access to large datasets. Used as an underlying technology in ADF. See https://support.hdfgroup.org/HDF5/ for more details.
  • Jena: An Apache open source Java API supporting the use of Semantic Web approaches such as triples and SPARQL queries. Used as an underlying technology for the Data Description layer of ADF. See https://jena.apache.org/ for more details. Resource is a super-type of Property in Jena.
  • Jena Fuseki: A popular tool to easily test SPARQL queries. See https://jena.apache.org/documentation/fuseki2/ for more information and download.
  • Protégé: A standard ontology development and exploration tool, developed by Stanford University and provided free for general use. See http://protege.stanford.edu/ for information and to download.
    See https://wiki.csc.calpoly.edu/OntologyTutorial/wiki/IntroductionToOntologiesWithProtege for a basic tutorial on use of Protégé.
  • SME: Subject Matter Expert. Someone with a good understanding of the specific domain that software development is supporting.
  • Triplestore: A database-like storage mechanism for triples, such as Jena Fuseki.
  • Turtle: A syntax for representing RDF triples in a more human-readable form than the RDF/XML standard. It is structurally similar to the SPARQL language. See https://en.wikipedia.org/wiki/Turtle_(syntax) for more information.

TetraScience

TetraScience is a data platform for the life sciences helping companies connect their lab. Centralize and standardize data across your enterprise with our Data Integration Platform. Integrate, visualize, and analyze your data from internal and external sources with a dedicated API and data lake.

Learn more about TetraScience Integration.

If you would like to have a more in-depth conversation and understand how to leverage Allotrope for your organization, email allotrope@tetrascience.com. We are here to help!