Widening access to European natural science collections with Digital Specimens and Natural Science Identifiers (NSId)

Digital transformation of physical natural science collections has been underway for some time, but the inclusion of the Distributed System of Scientific Collections (DiSSCo) in the ESFRI Roadmap 2018 provides a new stimulus to unify a fragmented landscape into a coherent and responsive research infrastructure, in which ‘Digital Specimens’ and ‘Natural Science Identifiers’ can play an important role in improving access for scientists, policymakers and the public.

For more than 300 years, scientists have collected and studied plants, animals, rocks, minerals, and fossils from our planet. Constituting 55% of the global asset base, and representing 80% of the world’s species, more than 1.5 billion physical specimens are housed, organised and catalogued as collections in many hundreds of institutes and museums across Europe. Together, these represent an unparalleled resource, a scientific infrastructure for knowledge and discovery about the world’s biodiversity: its past, present and future, and its influence on global challenges in environment and society. In June 2018 the European Strategy Forum on Research Infrastructures (ESFRI) accepted the importance of this resource and included the Distributed System of Scientific Collections (DiSSCo) in the ESFRI Roadmap 2018 as a priority research infrastructure to commence operations in 2025.

Similar initiatives are underway in other parts of the world. The iDigBio project in the USA coordinates the national digitisation effort and is building a national infrastructure to serve as a permanent repository of digitized information from all U.S. biological collections. The National Specimen Information Infrastructure (NSII) of China has, since 2003, been building a system for organising and sharing information related to natural science specimens, including names, literature and images. NRCA Digital is Australia’s program for digitising the National Research Collections of Australia. And there are many others.

With lifespans of more than 30 years, infrastructures such as DiSSCo, iDigBio and NSII aim at transforming today’s slow, expensive, inefficient and limited systems. The need to physically visit collections and the absence of linkages to other relevant information represent significant impediments. By inserting a ‘Digital Specimen’ object layer between the data infrastructure of natural science collections and the user applications wanting to process and interact with digital information about specimens and collections, it is possible to seamlessly organise access to digital information spanning multiple institutions and countries. Such virtual collections, where the details of physical specimens are digitally represented in cyberspace, offer the possibility of wider, more flexible, ‘FAIR’ (Findable, Accessible, Interoperable, Reusable) access for a varied range of biodiversity science and policy applications. This includes directly arranging access to specimens via loans and visits, annotating specimens with the latest taxonomic treatments, gaining a better understanding of geographical variations, and working with the associated tissue samples and DNA sequences of the specimen. Such an approach is expected to lead to faster insights at lower cost on many fronts.

Digital Specimens

As well as acting to seamlessly organise a virtual collection, Digital Specimens act as digital representations in computer systems. Corresponding to and acting as surrogates for the physical specimens in collections, they typically organise information about what the specimen is, where and when it was collected and by whom, and (through the inclusion of images) what it looks like.

Formally speaking, Digital Specimens are a specific form of the more general-purpose Digital Object (DO), which is defined by the Digital Object Architecture (DOA) as ‘a sequence of bits, or a set of sequences of bits, incorporating a work or portion of a work or other information in which a party has rights or interests, or in which there is value, each of the sequences being structured in a way that is interpretable by one or more of the computational facilities, and having as an essential element an associated unique persistent identifier’.

As a specific type of Digital Object, Digital Specimens can support wide-ranging general-purpose information management and research/teaching needs in natural sciences beyond just conveying information from one location to another in the Internet domain.

Digital Specimens contain the history of a specimen as it is collected, identified, assigned to a specific taxon and named. They record changes to that understanding over time as annotations that can be recalled for inspection. Digital Specimens can also be thought of as a dynamic ‘filing box’ where traceable, directed links to much of the core information known about and associated with the specimen can be gathered and organised in a single place (see Figure 1).

Figure 1: Digital Specimens can be thought of as a dynamic ‘filing box’. PID = persistent identifier

This information can include references to relevant scientific literature, tissue samples, genetic sequences and trait data, as well as agricultural, toxicology and ecosystem data and more. Conceptual relations, such as ‘isParentOf’, ‘isEvolvedFrom’ or ‘isSymbioticWith’, can be established between specimens. DiSSCo places Digital Specimens at the heart of an interconnected graph of diverse and dispersed data classes, equipping them for many research and teaching purposes that might not otherwise be possible.
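To make the idea of directed, named relations between specimens concrete, here is a minimal sketch in Python. It is only an illustration: the class name, the method names and the placeholder identifiers are assumptions made for this example and are not part of any DiSSCo design. Only the relation names come from the text above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class SpecimenLinks:
    """Hypothetical store of directed, named links between identifiers."""

    def __init__(self) -> None:
        # subject identifier -> list of (relation name, target identifier)
        self._links: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def link(self, subject: str, relation: str, target: str) -> None:
        """Record that `subject` stands in `relation` to `target`."""
        self._links[subject].append((relation, target))

    def related(self, subject: str, relation: str) -> List[str]:
        """Return all targets linked from `subject` by `relation`."""
        return [t for r, t in self._links.get(subject, []) if r == relation]


# Placeholder identifiers in the 'prefix/suffix' form; not real NSIds.
links = SpecimenLinks()
links.link("pp.nnnnn/aaa", "isParentOf", "pp.nnnnn/bbb")
links.link("pp.nnnnn/aaa", "isSymbioticWith", "pp.mmmmm/ccc")
print(links.related("pp.nnnnn/aaa", "isParentOf"))
```

Because links are stored per subject and are directed, following ‘isParentOf’ from one specimen does not imply a link in the opposite direction; a real system would have to decide whether to record inverse relations as well.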

Here are some things you can do with Digital Specimens:

Find related literature and genetic sequences: By following the links filed with the details about the specimen, you can find related literature in, for example, the Biodiversity Heritage Library and DNA sequences in GenBank.

Study trait variation in a species: Without having to physically visit all collections or borrow all examples of a species, you can study variation in morphological traits by examining high resolution images and 3D models of specimens.

Train machine learning algorithms for automated identification of species: Deep learning methods can be used to classify herbarium specimens, for example.

3D printing from 3D models: Teraoka Gensyou, a toy designer and comic artist from Japan, downloaded a 3D model of a Woolly Mammoth from the Smithsonian Institution, and created a new version with posable joints that can be 3D printed in resin and assembled. He created a manga (comic) about the whole process too. In a similar vein, 3D models can be used to create 3D holographic projection models for display purposes.

Natural Science Identifiers

Preserved museum specimens have a lifespan that is always many decades and is often expected to be several hundred years. Specimens must be unambiguously identifiable and traceable in the face of changes in physical location, changes in the organisation of the collection to which they belong, and changes in taxonomic classification. In the context of digitizing museum collections, a clear link must be established between the physical specimen itself and the information digitally representing that specimen in cyberspace. When digitizing natural science collections, the new idea of a Natural Science Identifier (NSId) as a neutral, unique, universal and stable long-term persistent identifier of a Digital Specimen can be seen as central to museums’ ambitions for widening access to natural science collections. A Natural Science Identifier allows unambiguous, easy identification and referencing of specific Digital Specimens, regardless of location, owner or user. As well as providing a digital doorway to physical specimens through which services for arranging loans and visits can be accessed, an NSId opens the door to innovative services for manipulating specimen information directly, for work reliant upon discovery of related third-party information, and for demanding 3D modelling and visualization of specimens. Because the work takes place within e-infrastructures in cyberspace, new possibilities open up for working with hundreds of thousands of specimens simultaneously, for example by exploiting large-scale cloud computing capacity and deep mining, machine learning and AI techniques.

Adopting an identifier mechanism for NSIds

There are several established identifier mechanisms that could serve as a basis for NSIds, including HTTP URIs, PURLs, ARKs, and Handles. Aside from being stable and sustained over time, an essential requirement of a global identifier mechanism is that it must be independent of any of the museums and other institutions likely to be assigning identifiers to Digital Specimens. NSIds should be opaque insofar as no information can or should be inferred solely by inspecting the identifier. The reason for this is that over the long term, stakeholders change, collections move, and organisations evolve and even merge or disappear. In short, ownership and location of objects can change. Information about an object should only be revealed when the identifier is ‘resolved’, i.e., when it is looked up in some neutral index.

The ICEDIG design refinement project is presently discussing whether to recommend to DiSSCo to adopt the Handle System as the basis for NSIds and to develop a specific NSId application and registry to support them.

The Handle System is a reliable and mature system for creating (minting) and resolving persistent identifiers, backed by multiple non-commercial and commercial stake-holding organisations. Its best-known application is Digital Object Identifiers (DOI) for unambiguously identifying and locating academic journal articles. More than 100 million DOIs exist today, making it possible to clearly cite and cross-reference different works, as well as providing a simple means of communicating works from one person to another. Other, less well-known applications of the Handle System include financial derivatives, the entertainment (film and TV) industry, international shipping and the UK civil construction sector.

DataCite is a global provider of DOIs for research data while the European Persistent Identifier Consortium (ePIC) proposes handles for artefacts of eResearch/eScience, principally datasets, workflows, software and research objects. Somewhat like the Domain Name System (DNS) for managing Internet domains, the role of the Handle System is to manage Digital Objects i.e., to provide a simple, fast and efficient mechanism for finding things, even after they have physically moved from one place to another.

NSIds based on Handles would have a similar form to the more familiar Digital Object Identifiers (DOI) used for journal articles. The form is: ‘prefix slash suffix’ (e.g., pp.nnnnn/ssssss). That is, they consist of a prefix part (pp.nnnnn before the /) that identifies the institution responsible for creating (minting) the NSId, and a suffix part (ssssss after the /) identifying the specific object or specimen. Note that the precise structure and allocating authorities for prefixes and suffixes still must be agreed if the NSId concept is adopted.
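As a small illustration of the ‘prefix slash suffix’ form, the following Python sketch splits an identifier into its two parts. It assumes, as in the Handle System, that the prefix contains no slash while the suffix may, so the split is made at the first slash only. The names `NSId` and `parse_nsid` are invented for this example; the actual structure of NSIds is still to be agreed.

```python
from typing import NamedTuple


class NSId(NamedTuple):
    prefix: str  # part before the first "/": who minted the identifier
    suffix: str  # part after the first "/": the specific object or specimen


def parse_nsid(text: str) -> NSId:
    """Split a 'prefix/suffix' identifier at the first slash.

    Raises ValueError if either part is missing.
    """
    prefix, separator, suffix = text.strip().partition("/")
    if not separator or not prefix or not suffix:
        raise ValueError(f"not of the form prefix/suffix: {text!r}")
    return NSId(prefix, suffix)


# Using the placeholder from the text, not a real identifier.
nsid = parse_nsid("pp.nnnnn/ssssss")
print(nsid.prefix, nsid.suffix)
```

The parser deliberately does nothing beyond splitting: since NSIds are meant to be opaque, software should not try to read meaning into either part.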

Like DOIs for journal articles, Handle-based NSIds are permanently associated with a Digital Specimen. They are resolvable links to information about that specimen, no matter where that information is located on the Internet. Again like DOIs, NSIds can be resolved within the Handle System to values of one or more data types. These can include, for example:

  1. A World-Wide Web landing page for the specimen displaying basic information;
  2. A URL leading to further information and to the sources of that information;
  3. Collection and physical specimen identifiers, allowing the specimen to be located and seen in a museum;
  4. Machine-readable versions of any of the above;
  5. High and/or low resolution images of the specimen;
  6. A service entry point capable of processing some or all parts of the information contained in the object.
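The idea that one identifier resolves to several typed values can be sketched as a simple data structure. The Python below is an illustration under stated assumptions: the record contents, the example.org URLs, the catalogue number and the type labels other than "URL" are all hypothetical, because the value types for NSIds have not yet been defined.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TypedValue:
    index: int  # position of the value within the record
    type: str   # label saying what kind of value this is
    data: str   # the value itself


# Hypothetical record for a single specimen; every entry is made up.
EXAMPLE_RECORD = [
    TypedValue(1, "URL", "https://example.org/specimens/12345"),
    TypedValue(2, "CATALOGUE_NUMBER", "BOT-0012345"),
    TypedValue(3, "IMAGE_URL", "https://example.org/images/12345-low.jpg"),
    TypedValue(4, "IMAGE_URL", "https://example.org/images/12345-high.tif"),
]


def values_of_type(record: List[TypedValue], wanted_type: str) -> List[str]:
    """Return the data of every value in the record with the given type."""
    return [value.data for value in record if value.type == wanted_type]


print(values_of_type(EXAMPLE_RECORD, "URL"))
print(values_of_type(EXAMPLE_RECORD, "IMAGE_URL"))
```

A client interested only in images, or only in the landing page, can ask for the type it needs and ignore the rest, which is what allows a single identifier to serve both people and software.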

Information about and/or inside the Digital Specimen object can change over time, but the NSId, once assigned, never changes and, because of the nature of natural science specimens, more or less never becomes obsolete. Even if the physical specimen decays or is destroyed, the information available about it digitally is never deleted. NSIds and the Digital Specimens they refer to can remain valid for several hundred years.

Benefits from universal, Handle-based NSIds

Universal, Handle-based NSIds possess several characteristics that make them especially suitable for the digital natural sciences domain. The first characteristic, already mentioned, is that of being opaque, meaning that a well-formed NSId does not, and should not, allow any specific information or value to be inferred about the specimen to which it refers. This makes the identifier neutral to changing interpretations and circumstances, yet persistent in the face of such change, which over long periods of time is a certainty. Opaqueness helps to alleviate ‘link decay’ caused by events such as organisational restructuring and name changes. The top-level, global Handle resolution mechanism itself is robust and stable, being strongly backed by commerce, industry, non-governmental agencies and public sector stakeholders. The specific mechanism for creating and maintaining Digital Specimens and minting/resolving their NSIds will be backed initially by the 115 natural science institutions across 21 European countries forming the DiSSCo partnership; in the longer term this is expected to expand to other institutions in Europe and further afield around the world.

The second characteristic is ease of resolution. Unlike some other identifier schemes (and like DNS), it is not necessary to know precisely where to find the correct local service needed for fully resolving an NSId. Resolving an NSId through any well-known Handle server, such as doi.org or hdl.handle.net, results in the user being redirected via the correct local handle service to the information associated with the identifier. This is valuable from a branding and usability perspective because it means NSIds can be integrated invisibly with other types of handle, especially DOIs, in composite applications while at the same time retaining their own independence and brand. Many scientists already know how easy it is to find and communicate journal article information just by copying, pasting or noting a DOI, and know the third-party services, such as Cited-by and Crossmark, that DOIs enable. The same should be true for Digital Specimens. An NSId makes it easy to identify, unpack or communicate the content of the ‘filing box’ containing much of the information relevant to a specific specimen, including links to its physical location and to services for requesting loans and visits.

How NSIds work alongside existing specimen identifiers

As already mentioned, NSIds are needed to deliver the seamless virtual European collection of natural science specimens that is a principal aim of DiSSCo. They are also needed to support the emergence of widely usable domain tools that enable work with and on specimen information from any collection. NSIds must also work alongside existing heterogeneous and collection/institution-specific identifiers of physical specimens.

Digital transformation of collections is already underway, thus predating the construction and operation of DiSSCo infrastructure. Institutions have their existing systems for managing their collections. This is often a computerised database catalogue of the collection holdings – the Collections Management System (CMS) – that, among many pieces of information, uses a specimen-level identifier, such as a catalogue number, to track specimens and make links between the database records and the physical objects themselves. Such identification and cataloguing schemes are usually proprietary to individual institutions. Even across departments within a single institution different schemes can be in use and, of course, these can change over time. Many institutions are working to phase out such differences, replacing them, for example, with a unique barcode assigned when an object enters a collection or when it is first ‘digitised’. Unique digital identifiers might also be created when the data already in a CMS is made public through a data portal.

An established community effort to harmonise specimen identifiers at the collections level is that of CETAF Stable Identifiers. CETAF is the Consortium of European Taxonomic Facilities, a membership organisation for all kinds of natural science institutions and collections that exists to promote training, research and understanding in systematic biology and to facilitate access to collections. It is one of the principal stakeholders in the DiSSCo initiative.

CETAF Stable Identifiers are an identification system based on using HTTP-URIs to identify individual collection specimens as well as their associated information resources (e.g., multimedia, RDF, webpages). Typically chosen and maintained by the institution owning the specimen, these identifiers are normally composed of the institution’s domain name, a meaningful subdomain, a path to classes of similar objects, and local specimen identifiers such as the specimen’s barcode. As CETAF points out, physical specimens cannot be transferred via the Internet, so users trying to access a specimen via its CETAF Stable Identifier using a web browser will be redirected to a human-readable representation of the specimen, typically an HTML ‘landing page’. Software systems requiring machine-processable representations of the specimen information are redirected to an RDF-encoded metadata record.
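The difference between the browser and software cases can be sketched from the client side. The Python below assumes that the choice of representation is driven by the HTTP Accept header (content negotiation), which the description above implies but does not spell out. The example URI and the function name are hypothetical, and the request is only constructed, not sent.

```python
from urllib.request import Request

HTML = "text/html"            # what a web browser typically asks for
RDF_XML = "application/rdf+xml"  # one common RDF serialisation


def build_request(identifier_uri: str, machine_readable: bool) -> Request:
    """Prepare a GET request that states the preferred representation."""
    accept = RDF_XML if machine_readable else HTML
    return Request(identifier_uri, headers={"Accept": accept})


# Hypothetical identifier on a reserved example domain, not a real specimen.
EXAMPLE_URI = "https://example.org/object/B100000001"

for_people = build_request(EXAMPLE_URI, machine_readable=False)
for_software = build_request(EXAMPLE_URI, machine_readable=True)
print(for_people.get_header("Accept"))
print(for_software.get_header("Accept"))
```

The identifier itself is the same in both cases; only the stated preference differs, which is why a single stable identifier can serve both a person looking at a landing page and a program harvesting metadata.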

An NSId can also be expressed as an HTTP URI. When an NSId is appended to the URL of a resolution service it becomes an HTTP-URI. For example, appending the NSId nn.xxxx/suffix to the URL of the handle.net resolution service yields the HTTP-URI https://hdl.handle.net/nn.xxxx/suffix. This makes NSIds compatible with other kinds of HTTP-URI, such as CETAF Stable Identifiers.
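Appending an identifier to a resolver URL is simple enough to show in a few lines of Python. This is a sketch only: the function name is invented for the example, and the percent-encoding step is a precaution added here for suffixes that might contain characters that are not safe in a URL path.

```python
from urllib.parse import quote

# Resolution service named in the text.
HANDLE_RESOLVER = "https://hdl.handle.net/"


def nsid_to_uri(nsid: str, resolver: str = HANDLE_RESOLVER) -> str:
    """Append an identifier to a resolver URL to form an HTTP-URI.

    The "/" between prefix and suffix is kept as-is; other characters
    that are unsafe in a URL path are percent-encoded.
    """
    return resolver.rstrip("/") + "/" + quote(nsid, safe="/")


# Placeholder identifier from the text, not a real NSId.
print(nsid_to_uri("nn.xxxx/suffix"))
# prints "https://hdl.handle.net/nn.xxxx/suffix"
```

Because the resolver is a parameter, the same identifier can be turned into an HTTP-URI at a different resolution service without the identifier itself changing.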

NSIds and CETAF Stable Identifiers can co-exist, but they perform distinctly different roles. Just as individual journal publishers need to be able to publicise, track and manage the articles they publish, so individual natural science collections need to publicise, track and manage their own holdings. Proprietary identification systems and CETAF Stable Identifiers are well suited for that at the level of a specific institution. One institution does not have to consider the method used by another institution. On the other hand, users are well served when they do not need to know the specifics of how each individual publisher or institution designs and uses identifiers. NSIds are necessary for presenting and accessing a Europe-wide DiSSCo virtual collection through a Common European Collection Objects Index/Catalogue. They can also be used beyond Europe, at the global level. But NSIds do not replace or preclude the continuing use of local mechanisms. They offer additional value. As with DOIs, new services valuable to users can be offered in association with NSIds that are just not possible with local mechanisms. As NSIds become established, we can expect such services to emerge. Already we foresee new services such as ELViS (European Loans and Visits System), UCAS (Unified Curation and Annotation System) and CDD (Collections Digitisation Dashboard). Together, such services can usher in a new era of working with natural science collections.

How NSIds will work in practice

Many details of how NSIds will work and be managed in practice still must be worked out and this is ongoing work in Work Package 6 of the ICEDIG project until mid-2019. Details still to be addressed include, for example:

  • Deciding who will run local handle services for DiSSCo, which will probably involve running several primary servers (for load sharing), with secondary servers at multiple sites for resilience (replication and backup). This configuration must be designed.
  • Saying how institutions will become involved. What do they have to do? How do they become a registrar of Digital Specimens?
  • Designing phase 1 of a ‘DSPublisher’ component to construct/mint NSIds from information provided by institutions.

Nevertheless, the need for NSIds to facilitate digital transformation of the natural sciences is beyond doubt.
