Building the DiSSCo Knowledge Graph

Link building in DiSSCo, to build the unified knowledge graph around and between digital specimens, is something I’ve been thinking about for a few months now. How would you do it authoritatively, with confidence, in an automated way and at scale for 1.5 billion specimens across European natural science collections? This is quite a challenge. My experience of doing it manually (see these examples) can basically be characterized as fishing in deep pools for something you know is there but can’t see!

Inspired by the description of the Event Data service, jointly developed by Crossref and DataCite, I’ve written a description for a DiSSCo service I call the Tahana link builder. Tahana means thirsty in the Marathi language of western India and seems apt, as this is a service thirsty for potential links. Tahana is also a Maori boy’s name meaning ‘chief’ and in some pseudophilosophical interpretations it means to assist or help. These are also apt labels: ‘Tahana – DiSSCo’s chief thirsty link helper’!

Tahana link builder finds and exposes the third-party information associated with natural science specimens. Specifically, Tahana link builder reveals and builds links between specimens and publications about specimens, citations of specimens, derivatives of specimens such as derivative parts (skeleton, skin, tissue samples, etc.), DNA sequences, and other information. This service is developed in collaboration with ????? and is inspired by the Scholix Schema, a joint initiative of the Research Data Alliance and the ICSU World Data System.

Potential links are discovered via the physicalSpecimenIdentifier, institutionCode (or name) and scientificName attributes of digital specimens, and external services (so-called ‘sources’) that collect information about and related to specimens. Examples of sources include Biodiversity Heritage Library, TreatmentBank, European Nucleotide Archive, BioSamples and Wikidata, to name but a few. Sources are probed for evidence of specimens (used, mentioned, referenced), which is then matched to known specimens in natural science collections. Fuzzy logic is used to develop, identify and resolve potential links between objects.

The inspiration from Scholix is the idea, described in the Scholix schema, that packages of information about two objects and the relationship between them can be constructed and treated as atomic entities for further processing. The provenance of the work to construct the package can be recorded in the package itself. It also means that the package can be passed from one application to another to be processed. And it means that Tahana can work independently of, and asynchronously from, the DiSSCo unified knowledge graph it is trying to fill and extend, sending link packages to another DiSSCo process that incorporates them into the graph. Packages can therefore also be constructed by means other than Tahana, including manually, as appropriate. In particular, as new digitisations and accessions arrive in future, link packages can be constructed as natural outputs of institutional digitisation and accession processes. This makes the idea very powerful for digitally building extended specimens and next generation collections.

An observation that won’t be lost on readers familiar with the DiSSCo Digital Specimen (DS) architecture is that link packages are themselves digital objects in digital object architecture, persistently identified and traceable, such that they can easily be corrected when necessary as well as being usable for multiple purposes. By the way, if you like, here’s a link to a philosophical treatise on digital objects as socio-technical objects.

Beyond this reasonably straightforward but first attempt at describing a link builder service, there are several questions to address as part of the early service design:

Scope: What kind of links are we looking for? There’s a basic distinction around links themselves, which can be of two types, actual links and conceptual links. Actual links are links between a specimen and another specimen or piece of information directly derived from that first specimen e.g., a feather sample from a bird skin, a parasite found on a specific bird, a DNA sequence of it. These are the links of interest to Tahana.

The other kind of links, conceptual links, are links between samples/data that were not taken or derived from the specific collected individual specimen but are indirectly related to it in some way e.g., this DNA sequence came from a bird of this same species but not from this specific catalogued individual, or this audio recording of the bird’s song came from another bird like this one. Many examples of conceptual links are to the (taxon) concept the specimen represents rather than to the specimen itself. The range of linked supplementary information that relates to the concept the specimen represents is very broad, and it is very useful in aiding understanding and generating new knowledge about specimens. These types of links are not what Tahana is concerned with. There is still a need for a tool that can build conceptual links and apply them to all specimens representing a specific concept, but that tool is not Tahana.

Actual links: How do we identify them?  Identifying potential links manually shows how difficult it is if you don’t have any prior knowledge. Let’s take a real example and explore that manually before coming back later to see how it might be with automation.

First, you need to know what to look for. Ideally, some kind of identifier of a specimen shows up in a source, like in this DNA sequence FJ788436.1 of a marine worm named Holorchis castex, which can be found both in the European Nucleotide Archive (ENA) and NCBI GenBank (here).

[Image: Diagramma pictum]

This worm is parasitic on a tropical fish, the painted sweetlips (Diagramma pictum), widespread throughout the Indo-West Pacific region. The worm was first recorded and described in 2007 by Bray and Justine [1]. More on that later.

The specimen voucher field on the source features tab of the DNA sequence page shows an identifier ‘BMNH:2006.12.6.4041’ that can be traced back to a record in London’s Natural History Museum (NHMUK). But it’s not easy. First you have to know, or look up, that the code BMNH refers to the ‘British Museum (Natural History) Collection’. The obvious place to look is the Global Registry of Scientific Collections (GRSciColl), but if you look up ‘BMNH’ there you’ll get it wrong: in GRSciColl that code is given as the Burpee Museum of Natural History in Rockford, Illinois. Searching Google for ‘institution code BMNH’ throws up misleading results as well. Sometimes you have to guess, and this short article on standard symbolic codes for institution resource collections in herpetology and ichthyology illuminates just how difficult that might be. In the end, Googling for ‘collection code BMNH’ leads to a Wikispecies page that reveals the truth.

This is corroborated by the original description (circumscription) of the specimen as a new species, published in Zootaxa, the full text of which is available on ResearchGate. It states that the abbreviation BMNH means the British Museum (Natural History) Collection at the Natural History Museum. However, tracking down that original description was itself not easy, as it’s not mentioned on either the NHMUK record page or the ENA’s sequence page. You must know, for example, to look up the species name in the Species 2000 & ITIS Catalogue of Life, or similarly in WoRMS, to see if a literature reference has been associated with the taxon. Another source to try is the collection of taxonomic treatments maintained online as TreatmentBank. Here, Plazi has extracted the highly standardized language of the taxonomic treatment from the original description and provided an identifiable, annotated and readable record of the treatment data, for both humans and machines, in Darwin Core, XML and RDF representations.

Looking at the original description confirms that the literature does relate to the specimen we’ve identified in London, because the specimen number is directly mentioned there as one of the paratype specimens. Another paratype specimen, MNHN JNC1848-D2-3, is mentioned, as is the holotype MNHN JNC1848-D1. Both of these specimens are in the Muséum National d’Histoire Naturelle, Paris, but as yet neither is listed as part of the MNHN online collection. Further investigation of the information on the original ENA sequence page FJ788436.1 reveals that one of the authors of the original description, Bray, was responsible for submitting total genomic DNA (gDNA) from the specimen (among others) for sequencing in early 2009, leading to a further article on phylogenetic relationships that same year [2].

A final point, about a difficulty that arose early on with the specimen identifier itself: On the ENA’s sequence page the specimen_voucher is shown as ‘2006.12.6.4041’ whereas on NHMUK’s record page the specimen’s catalogue number is shown as ‘2006.12.6.40-41’ i.e., with a hyphen inserted. This is a small difference easily detectable by humans but indicative of a much more difficult problem for automated link detection, as you can never be sure of the full set of possible mutations that can exist!
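To make the hyphen problem concrete, here is a minimal Python sketch of one crude way to compare the two forms of the identifier. The function name voucher_key and the rule it applies (drop anything up to a colon, then keep only letters and digits) are assumptions made for illustration. A key this aggressive will also make genuinely different numbers collide, so a match should only ever be treated as a candidate link, not a confirmed one.

```python
import re


def voucher_key(raw: str) -> str:
    """Reduce a voucher string to a rough comparison key (illustrative only)."""
    # Drop a leading institution prefix such as "BMNH:" if one is present.
    # Assumes the 'CODE:number' convention seen on the ENA sequence page.
    _, _, tail = raw.rpartition(":")
    # Keep only letters and digits so that dots and hyphens no longer matter.
    return re.sub(r"[^0-9A-Za-z]", "", tail).upper()


ena_voucher = "BMNH:2006.12.6.4041"
nhmuk_catalogue_number = "2006.12.6.40-41"

# Both forms collapse to the same key, so they become candidates for isSameAs.
print(voucher_key(ena_voucher) == voucher_key(nhmuk_catalogue_number))  # prints True
```

This only covers the one mutation seen here. Other variants (missing digits, different prefixes, transposed parts) would each need their own handling, which is exactly why the full set of mutations is so hard to anticipate.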

So, after some detective work we’ve identified several possible relations:

Subject               Predicate           Object
FJ788436.1            isSequenceOf        BMNH:2006.12.6.40-41
BMNH:2006.12.6.4041   isSameAs            BMNH:2006.12.6.40-41
BMNH:2006.12.6.40-41  isParatypeOf        MNHN JNC1848-D1
BMNH:2006.12.6.40-41  hasHolotype         MNHN JNC1848-D1
BMNH:2006.12.6.40-41  isSpecimenOf        Holorchis castex
BMNH:2006.12.6.40-41  isCircumscribedBy   doi: 10.11646/zootaxa.1426.1.3
BMNH:2006.12.6.40-41  hasTreatment
Darwin Core           isRepresentationOf
XML                   isRepresentationOf
RDF                   isRepresentationOf

Returning to the Scholix inspiration, each of these can be encoded as a separate link package.
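As a rough illustration of what such a package might hold, here is a minimal Python sketch. The class name LinkPackage and its fields (created_by, method, created_at) are assumptions made for illustration; they are not taken from the Scholix schema or from any DiSSCo specification.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _now() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class LinkPackage:
    """One subject-predicate-object statement plus its provenance (hypothetical)."""
    subject: str
    predicate: str
    obj: str
    created_by: str   # who or what built the package
    method: str       # e.g. "manual" or "automated"
    created_at: str = field(default_factory=_now)

    def to_json(self) -> str:
        # A serialised package can be handed from one application to another.
        return json.dumps(asdict(self), sort_keys=True)


# Three of the relations identified by the detective work described earlier.
relations = [
    ("FJ788436.1", "isSequenceOf", "BMNH:2006.12.6.40-41"),
    ("BMNH:2006.12.6.4041", "isSameAs", "BMNH:2006.12.6.40-41"),
    ("BMNH:2006.12.6.40-41", "isSpecimenOf", "Holorchis castex"),
]

packages = [
    LinkPackage(s, p, o, created_by="manual detective work", method="manual")
    for s, p, o in relations
]

for package in packages:
    print(package.to_json())
```

Making the package immutable (frozen=True) is a design choice in this sketch: a correction would then be issued as a new package rather than by editing an existing one, which keeps the provenance trail intact.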

Link relations: What are the possible relation types? Once a potential link has been discovered we must characterise it in terms of the relationship it represents between the two objects to be linked, as should be clear from the example above.

Whereas one object, A, will (almost) always be a specimen in a collection, the relation to object B will depend on the nature of B. B can be one of several possible types:

  • Another specimen;
  • Literature (article, general);
  • Literature (article, circumscription);
  • Treatment (taxonomic);
  • Tissue sample (in a biobank);
  • Genome sequence;
  • Image/model (most likely, specialised image/model types beyond those associated with normal digitisation processes; such as 3D printable models);
  • Gathering / collection event;
  • Others?

Relation types are many and related to the nature of objects A and B. Here is a small list:

  • isSpecimenOf indicates that B represents (taxon) concept A;
  • isDerivativeOf indicates that B is a derivative of specimen A;
  • isPartOf indicates that specimen B is a part of A in the sense that they belong together; for example a bone, B isPartOf a skeleton A when B is separately identified and curated, or a plant A isPartOf a gathering B;
  • isSampleOf indicates that B is a small part, cut or sampling of A;
  • isDistributionOf – B is a distribution of, or part of the set of A, in the sense of exsiccata, especially in botany where it refers to widely distributed sets of specimens that are used as standards for comparison;
  • isParatypeOf;
  • hasHolotype;
  • isCircumscriptionOf;
  • isTreatmentOf;
  • isDescriptionOf;
  • isSequenceOf;
  • isReferencedBy;
  • isRepresentationOf;
  • isSameAs;
  • … … etc.

All these relations describe in some way the connection between two objects, A and B in terms of how B is a part of a greater whole, A. Already the reader can see there is some overlap and confusion in my list so refinement is needed to ensure the defined relations are the ones commonly understood and in everyday use by curators and scientists. The Access to Biological Collections Data (ABCD) schema might be one source to look at to obtain better understanding of that.

Another aspect is that while most of these relation types are ‘isA’ relations, there are one or two ‘has’ relations e.g., hasHolotype. In ontology engineering ‘isA’ and ‘has’ relations are considered the inverse of one another and you can normally infer one if you know the other. Sometimes though, especially for humans and to optimise machine processing, it can be helpful to explicitly define inverse relations even though it might not be formally necessary to do so. This idea of inverse slots (to use the formal term) is explained in section 5 of this Ontology 101.
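A small Python sketch shows how one direction of a relation could be inferred from the other. The inverse names used here (isHolotypeOf, hasPart) are invented for illustration and do not appear in the list above; treating isSameAs as its own inverse is likewise an assumption.

```python
from typing import Dict, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Hypothetical relation pairs; the second name in each pair is invented here.
_PAIRS = [("hasHolotype", "isHolotypeOf"), ("isPartOf", "hasPart")]

INVERSE_OF: Dict[str, str] = {}
for forward, backward in _PAIRS:
    INVERSE_OF[forward] = backward
    INVERSE_OF[backward] = forward
INVERSE_OF["isSameAs"] = "isSameAs"  # assumed symmetric


def invert(triple: Triple) -> Optional[Triple]:
    """Return the inverse statement, or None if no inverse is defined."""
    subject, relation, obj = triple
    inverse = INVERSE_OF.get(relation)
    if inverse is None:
        return None
    return (obj, inverse, subject)


print(invert(("BMNH:2006.12.6.40-41", "hasHolotype", "MNHN JNC1848-D1")))
```

Storing only one direction and deriving the other on demand avoids the two getting out of step; storing both explicitly trades that safety for simpler lookups.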

There are other relation types that have a different kind of meaning than constituting or making up a whole. These have meaning based upon some kind of scientific premise or interpretation based on knowledge, such as the following (which is surely not an exhaustive list):

  • isParasiteOf
  • isSymbiontOf
  • isSoundOf
  • isBehaviourOf
  • … … etc.

What’s important in this list is that the relation is STILL an actual relation between two specimens and not a relation between a specimen and the concept it represents i.e., the parasite B was really found on specimen A, or the birdsong recording B is really a recording from bird A before it was captured.

Links with potential: How do we know our potential link is a real link? Is it a definite link or only a possible link? How might we resolve that? On a fuzzy descriptive scale, the statuses a potential link can have are:

  • Definite link
  • Probable link
  • Possible link
  • Probable no link
  • Definite no link

These statuses reflect the uncertainty surrounding a potential link when discovered and identified by computer algorithms. Fuzzy logic allows us to assign a numeric value between 0 and 1 to reflect this and we must find ways in the algorithms Tahana uses to reduce uncertainty.
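As a sketch of how a numeric value might map onto the descriptive scale, consider the following Python function. The cut-off values (0.95, 0.70, 0.30, 0.05) are arbitrary assumptions chosen only to make the example concrete; real thresholds would need to be tuned against validated links.

```python
def link_status(score: float) -> str:
    """Map a confidence score in [0, 1] to one of the five link statuses.

    The thresholds are illustrative assumptions, not calibrated values.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    if score >= 0.95:
        return "Definite link"
    if score >= 0.70:
        return "Probable link"
    if score >= 0.30:
        return "Possible link"
    if score > 0.05:
        return "Probable no link"
    return "Definite no link"


for example_score in (1.0, 0.8, 0.5, 0.1, 0.0):
    print(example_score, link_status(example_score))
```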

For the three probable and possible statuses, mechanisms are needed to properly resolve the status one way or the other and to make the link (or not). Such mechanisms probably include an element of expert human intervention, and the challenge is to find ways to resolve statuses that don’t lead to overwhelming numbers of probable or possible links that a human must look at.

What does automated link building consist of then? Identifying and making links probably consists of five stages, and it’s likely that a variety of classification/machine learning techniques will be appropriate:

  1. Foraging, which includes activities that extract information from the sources of information about specimens, and clean and organise that ready for analysis;
  2. Analysis, which tries to find associations between pieces of information in the different corpuses of data extracted from the sources. Various strategies will be needed here to uncover potential associations;
  3. Suggestion, which takes the results of analysis and tries to classify (using the fuzzy descriptive scale mentioned above) associations into definite, probable, possible links and to suggest these to the user. Definite links can be moved directly to the validation stage but probable and possible links will need some human intervention, especially in the early use of the link builder as it learns;
  4. Validation, which consists of two activities: auto-validation, where a machine algorithm can validate the suggestion, and assisted validation, which is used for the difficult cases where a human must assess and make a judgement about the validity of the suggested link;
  5. Packaging, where link packages are formed, together with their accompanying provenance and other details and deposited in a temporary store for further use.
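The hand-off between the last three stages can be sketched in a few lines of Python. Everything here is a simplification under stated assumptions: the names Suggestion, validate and package are hypothetical, the statuses attached to the example suggestions are invented for illustration, and foraging and analysis are assumed to have already happened.

```python
from typing import List, NamedTuple, Tuple


class Suggestion(NamedTuple):
    subject: str
    relation: str
    obj: str
    status: str  # one of the five statuses on the fuzzy descriptive scale


NEEDS_REVIEW = {"Probable link", "Possible link", "Probable no link"}


def validate(suggestions: List[Suggestion]) -> Tuple[List[Suggestion], List[Suggestion]]:
    """Return (accepted, review_queue); 'Definite no link' items are dropped."""
    accepted: List[Suggestion] = []
    review_queue: List[Suggestion] = []
    for suggestion in suggestions:
        if suggestion.status == "Definite link":
            accepted.append(suggestion)        # auto-validation
        elif suggestion.status in NEEDS_REVIEW:
            review_queue.append(suggestion)    # assisted validation by a human
    return accepted, review_queue


def package(accepted: List[Suggestion]) -> List[dict]:
    """Packaging: turn validated suggestions into simple link packages."""
    return [
        {
            "subject": s.subject,
            "relation": s.relation,
            "object": s.obj,
            "provenance": {"validation": "auto"},
        }
        for s in accepted
    ]


# Statuses below are made up for the example, not results of a real analysis.
suggestions = [
    Suggestion("FJ788436.1", "isSequenceOf", "BMNH:2006.12.6.40-41", "Definite link"),
    Suggestion("BMNH:2006.12.6.4041", "isSameAs", "BMNH:2006.12.6.40-41", "Probable link"),
    Suggestion("FJ788436.1", "isSequenceOf", "MNHN JNC1848-D1", "Definite no link"),
]

accepted, review_queue = validate(suggestions)
print(package(accepted))
print(len(review_queue), "suggestion(s) waiting for a human")
```

Keeping the review queue separate from the accepted list is what lets the size of the human workload be measured and, ideally, reduced as the link builder learns.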

How is this all useful? Using and storing link packages as the basis for constructing the knowledge graph is a more general mechanism than adapting the Digital Specimen schema to add new sources of third-party information. Not only that, but using link packages decouples the provenance information surrounding link building from the provenance information of the Digital Specimens themselves. Link packages offer another layer over which exploratory and analytical tools can operate. Just as Crossref Event Data opens up the possibility of analysing communication events related to the publication of journal articles, link packages offer the opportunity to explore, analyse and promote the use of specimens.

That’s it for now. There are other design inspirations I’m still exploring but these are initial ideas to stimulate feedback to develop further. Please feel free to send me your comments.

Update, 16th September 2019

I stumbled across this article on learning to complete knowledge graphs with deep sequential models (doi: 10.1162/dint_a_00016) which gave me some inspiration. Next, I need an appropriate training dataset with a few thousand specimens and their relations in it. So that’s the first job – build the training dataset!

[1] Bray, R. A. and Justine, J.-L. (2007). Holorchis castex n. sp. (Digenea : Lepocreadiidae) from the painted sweet-lips Diagramma pictum (Thunberg, 1792) (Perciformes : Haemulidae) from New Caledonia. Zootaxa. 1426, 51-56. doi: 10.11646/zootaxa.1426.1.3

[2] Bray, R. A., Waeschenbach, A., Cribb, T. H., Weedall, G. D., Dyal, P. and Littlewood, D. T. J. (2009). The phylogeny of the Lepocreadiidae (Platyhelminthes: Digenea) inferred from nuclear and mitochondrial genes: implications for their systematics and evolution. Acta Parasitol. 54, 310-329. doi: 10.2478/s11686-009-0045-z
