Recently I had the opportunity to give a presentation about global research infrastructures for biodiversity and ecosystem science to staff of the Institute of Biodiversity and Ecosystems Dynamics (IBED) at the University of Amsterdam. I’ve been working with some of the people there for most of the last 9 years. It’s been one of the most satisfying periods of my career.
The talk I gave (slides here on slideshare.net) introduces the idea of research infrastructures and sets out the landscape in Europe and at the global level in biodiversity science and ecology. I gave some examples and also explained the important role that virtual laboratories play in research infrastructures. My talk highlighted that, in setting off down the road of adopting Essential Biodiversity Variables (EBVs), we have found a unifying global use case that can drive forward work to resolve interoperability problems between the different research infrastructures.
First, I set the scene by reminding everyone that we’re in the era of data-intensive science now.
Whether satellite tagging of wildlife; genomic sequencing of sea water samples for Ocean Sampling Day; quantifying biodiversity and ecosystem service (ES) responses to the combined effects of climate and land use changes in tropical South America; better understanding of Honey bee (Apis mellifera) and parasitic mite (Varroa destructor) microbiomes and the relation between them; or monitoring forest micro-climates, the common denominator is large amounts of complex data. Such complex data is not always "Big Data", but it sometimes can be.
Research infrastructures help us with such data intensive science. We’re familiar with the idea of infrastructure in other contexts, transport for example. Research can have infrastructure too. Put simply, it’s anything semi-permanent that researchers need to do their daily jobs. That includes physical equipment such as laboratories, instruments, survey sites, ships, etc.; computers, whether they be desktop, mobile, HPC, or e-Infrastructure; and people, i.e., capacity, competencies, training, help and support.
Research Infrastructures support the scholarly cycle. On the one hand, they help us organise and carry out our work to create new knowledge. On the other they help us to teach that knowledge to a new generation of students and others wanting to learn it. Of course, the scholarly cycle of research and teaching involves data, which also has a lifecycle of its own.
I should mention here that I use the term data in its widest possible sense. I mean not only data collected through observations or from experiments but also all of the associated data arising from the management and processing of that primary data. This includes things like software, workflows, documentation, and provenance information. So when I talk about data, keep that widest interpretation in mind.
Thus, research infrastructures also support the data lifecycle. In fact, they are optimised towards it through identification of the many different kinds of functions associated with or acting on data. At the simplest level we see five major functional segments that we could expect to find in a typical research infrastructure. These are data acquisition, data curation, data discovery and access, data processing and (surrounding them all) community support. You can see how these map to each of the steps of the data lifecycle. Of course, there are many more levels of decomposition and detail than shown here, but this illustrates the idea. These details are being practically explored in the wider context of research infrastructures in Europe for environmental sciences.
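As a rough illustration, the five functional segments can be thought of as stages in a pipeline acting on a dataset, with community support surrounding them all. The following sketch is purely illustrative; the stage names and data are my own invention, not any real infrastructure's design:

```python
# Illustrative sketch only: the five functional segments of a research
# infrastructure as stages acting on a dataset. Names and records are
# hypothetical, not taken from any real RI's architecture.

def acquire():
    """Data acquisition: collect raw observations (here, hard-coded)."""
    return [{"species": "Apis mellifera", "count": 12},
            {"species": "apis mellifera", "count": 3},
            {"species": "Varroa destructor", "count": 7}]

def curate(records):
    """Data curation: clean and standardise, e.g. normalise species names."""
    for r in records:
        r["species"] = r["species"].capitalize()
    return records

def discover(records, species):
    """Data discovery and access: find the records a user asks for."""
    return [r for r in records if r["species"] == species]

def process(records):
    """Data processing: derive a result, here a simple total count."""
    return sum(r["count"] for r in records)

# Community support surrounds every stage: documentation, training, help.
data = curate(acquire())
print(process(discover(data, "Apis mellifera")))  # prints 15
```

The point of the sketch is only that each segment has a distinct responsibility, and that a dataset typically passes through all of them on its way from collection to result.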
Returning specifically to infrastructures supporting research on the biosphere, we find LifeWatch, ANAEE, and ELIXIR for example. How do these relate to each other? And how do they relate to similar research infrastructures in other parts of the world?
Let’s divide the landscape on a spectrum. We can see fundamental reference data; data collected from observations and through monitoring; experimental data; and analysis and modelling with data. That’s a spectrum from data mobilization to knowledge production into which many projects and initiatives can be situated. There are many players, many stakeholders. The ESFRI infrastructures – LifeWatch, ANAEE, ELIXIR, etc. – are shown in yellow. You can see there are also some from the marine quadrant, as well as many past and recent European-funded projects. I’m not listing them all here, but most are easy to track down with a search engine.
I will give just a few examples though. Actually, they’re all component parts of infrastructure because (as I remarked when I gave the talk) the way that infrastructure emerges is bottom-up from the community, and not usually top-down imposed according to some grand scheme. True, there can be guiding principles and plans but often it’s the result of work by multiple stakeholders and players that causes islands of infrastructure to emerge, grow and fuse. I first wrote about this when I was carrying out the preparatory phase work for the planning of the LifeWatch infrastructure.
Back to the examples. DEIMS is the ecological information management system of the long-term ecological research sites network in Europe, LTER-Europe. There you can find an increasing number of datasets being made available from more than 800 LTER sites. Scratchpads, with 650+ individual sites and more than 6,000 users, provides the tools that allow individual scientists and groups of scientists to mobilise and link their own specialist biodiversity data. Catalogue of Life is an authoritative index of the World’s species. Currently it indexes 1.6 million taxa, representing approximately 84% of those known and classified. Biodiversity Catalogue is a specialist catalogue of Web service front-ends to a wide range of reputable computerised data and analytical or processing resources for biodiversity and ecological research. These services can be accessed from workflows, R scripts, Python programs and other software applications as desired.
These examples are just a few of the multiple elements that go to make up a typical research infrastructure, which can comprise sensor resources, data repository and database resources, processing and analytical tool resources, and mechanisms for integrating them in meaningful ways. The latter is often described as composition – into workflows and other software programs and applications. In a community-driven e-infrastructure like LifeWatch, where all these elements are brought together, groups of scientist users can create their own e-laboratories or e-services within the common architecture of the infrastructure. They share their data and algorithms with others, while controlling access. A community-driven infrastructure promotes innovation. I’ll come back to this community ownership aspect below when talking about the emergence of open science commons.
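To make the idea of composition a little more concrete, here is a minimal sketch of chaining independent services into a workflow. The function names, services and data are invented for illustration (in practice each stand-in would be a call to a remote web service found in a catalogue such as Biodiversity Catalogue), but the pattern of composing independent resources into a pipeline is the one described above:

```python
# Hypothetical workflow composition: each function stands in for a call
# to an independent web service. Names, data and logic are illustrative.

def resolve_name(raw_name):
    """Stand-in for a taxonomic name resolution service."""
    return raw_name.strip().capitalize()

def fetch_occurrences(species):
    """Stand-in for an occurrence data service; returns (lat, lon) pairs."""
    fake_db = {"Apis mellifera": [(52.4, 4.9), (48.9, 2.4), (51.5, -0.1)]}
    return fake_db.get(species, [])

def count_by_hemisphere(points):
    """Stand-in for an analytical service: tally records per hemisphere."""
    north = sum(1 for lat, lon in points if lat >= 0)
    return {"north": north, "south": len(points) - north}

# Composing the three services into a workflow:
result = count_by_hemisphere(fetch_occurrences(resolve_name("  apis mellifera ")))
print(result)  # prints {'north': 3, 'south': 0}
```

A workflow engine or virtual laboratory does essentially this, but with remote services, provenance recording, and access control layered on top.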
I want to dwell for a minute on the role that physical laboratories play in science. We’re all familiar with those. Chemistry laboratories, for example, are equipped with everything you need – glassware, instruments, chemicals – to carry out whatever kind of chemistry is required. Some labs are general purpose. Others are much more specific: either thematically oriented, or equipped for a very specific research goal.
Just as physical laboratories are an essential element of physical research infrastructures, so too are virtual laboratories an essential ingredient of research infrastructures having a strong e-component. Virtual labs, existing in cyberspace on the Worldwide Web / Internet, are the equivalent idea for working with data – the virtual chemicals, so to speak. We already have several examples, including: the alien species showcase lab in Italy; the Swedish analysis portal; the UvA BiTS lab for bird movement modelling; the iPlant Collaborative in USA; the VRE being established for the marine research community by VLIZ and partners; and my own favourite, Biodiversity Virtual Laboratory (BioVeL). You can find that one at https://portal.biovel.eu/ and I encourage you to explore it and try it out. It’s a general purpose virtual lab supporting several different kinds of biodiversity and ecological research. You can get support by emailing email@example.com. Virtual labs are a very obvious manifestation of the new kinds of research infrastructure we are talking about. I’ll stick my neck out and say that these are the future environments in which modern scientists will increasingly find themselves.
As well as the European picture of research infrastructures, we can add another dimension with the global picture. There are RIs in other parts of the world. Examples include: the Global Biodiversity Information Facility (GBIF), DataONE in the USA, the Atlas of Living Australia, SpeciesLink and SiBBr in Brazil, SANBI Integrated Biodiversity Information System (SIBIS) in South Africa, the World Federation for Culture Collections (WFCC) and GEO BON. We in Europe have been cooperating with these during the last 4 years in the CReATIVE-B project. How do they relate to one another and to the European activities?
Note that there are not many facilities yet focussing on data processing and analysis for knowledge production. There are emergent possibilities though, and I’m thinking particularly of the synthesis centres like NCEAS (USA), ACEAS (Australia), sDIV (Germany), CESAB (France), and EOS (UK), or our own virtual labs that I mentioned before and which we are establishing here in Europe.
“Flock together”, the CReATIVE-B roadmap, outlines five priorities for the next decade, agreed by the major RIs around the World. These are: i) increased global coordination and common technical interoperability between RIs; ii) priority for discovery and access; iii) data and tools for research, management and conservation; iv) effective governance for legal interoperability; and v) acting as a broker for scientists, policy and citizens. Based on the work of the CReATIVE-B project in cooperation with all the RIs around the World, there is a recommended joint action plan on Essential Biodiversity Variables (EBVs) which we will be pursuing in the recently started (1st June) GLOBIS-B project. For the first time we have found a unifying use case that can drive interoperability work at the global level to eventually provide a globally integrated “International Virtual Environment for Biodiversity Science and Ecology”.
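To give a flavour of what calculating an EBV involves computationally, here is a toy sketch of one candidate EBV-type measure, species richness per grid cell, computed from occurrence records. The ten-degree grid and the records are my own illustrative assumptions; real EBV workflows operate on harmonised data flowing from many infrastructures, which is precisely why interoperability matters:

```python
# Toy sketch: species richness per grid cell, one candidate EBV-type
# measure, from occurrence records. The 10-degree grid and the records
# are illustrative assumptions, not a real EBV product.

def grid_cell(lat, lon, size=10.0):
    """Snap a coordinate to the lower-left corner of its grid cell."""
    return (size * (lat // size), size * (lon // size))

def species_richness(records, size=10.0):
    """Count distinct species observed in each grid cell."""
    cells = {}
    for species, lat, lon in records:
        cells.setdefault(grid_cell(lat, lon, size), set()).add(species)
    return {cell: len(species_set) for cell, species_set in cells.items()}

records = [
    ("Apis mellifera",     52.4,   4.9),   # Amsterdam area
    ("Varroa destructor",  52.4,   4.9),
    ("Apis mellifera",    -23.5, -46.6),   # Sao Paulo area
]
print(species_richness(records))
# prints {(50.0, 0.0): 2, (-30.0, -50.0): 1}
```

Trivial on three records; the challenge at global scale is that the records live in many repositories, in many formats, under many access regimes, which is the interoperability problem the joint action plan targets.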
However, there is much to do still, and one of the major challenges is that no single person or organisation is responsible for infrastructure. As I highlighted above, infrastructure emerges simultaneously in multiple places. As it grows there’s a need to coordinate it. Particularly, this involves technical coordination through standards and other rules that everyone adheres to. It can also involve administrative / political coordination by governance. But implementation, and sustaining that implementation, takes place at a local level. Thus, we’re all (or at least many of us are) responsible for infrastructure, collectively.
The Internet / World-Wide Web is a good example. No-one owns the Internet but there’s a set of rules and standards that govern it and various coordinating bodies to apply those rules. Organisations and individuals wanting to be part of the Internet invest their own resources in routers, mail and web servers, and the skilled personnel to maintain them. Hardware is a general-purpose commodity item. The standards, and the software implementing those standards, are often freely and openly available – open source for anyone to use. It is in fact a “commons”, available to all.
We should be increasingly aware of initiatives towards an open science commons.
“Open science” is all about opening the process of knowledge creation and dissemination to a multitude of stakeholders, including society in general. It’s a movement that’s gained real momentum in the last 2-3 years as a result of, for example, work by the Royal Society on science as an open enterprise, and as part of the European Commission’s open access agenda. Similar agendas are being pursued around the world now.
“Commons” are the (for example, cultural and natural) resources that are available to all in society without discrimination. Wikipedia is another good example. Often, commons are community-governed mechanisms. In the context of open science commons, we are often talking about research infrastructures with limited resources. Governance reinforces the need for sharing in a way that allows non-discriminatory access, whilst also ensuring adequate controls to avoid congestion or depletion when capacity is limited. These commons are a backbone of federated services for computing, storage, data communication, knowledge and expertise, and the related community-specific capabilities, e.g., for biodiversity and ecosystems science.
This fits well with our vision for LifeWatch in Europe where we aim to deploy a spatial data infrastructure, well suited to the specific needs of our community on top of what I call the foundational distributed computing infrastructures as represented by EGI.eu, EUDAT, PRACE, as well as commercial providers. This approach is intended to eliminate many of the technical barriers that we see today that stand in the way of scientists using e-Infrastructures for their research work.
If you want to read more about what’s needed in the specific domain of informatics and e-infrastructure to support biodiversity and ecology research, then please look at my work with collaborators on a decadal view of biodiversity informatics: challenges and priorities, published in BMC Ecology in 2013 and the follow-up article by Peterson et al., including my right of reply response published in the same journal just last week.
So, what are the take-home messages concerning research infrastructures from this article?
Firstly, I hope I have illustrated, with sufficient examples and particularly with reference to virtual laboratories, that research infrastructures are here to help the pursuit of data-intensive science and to support collaboration between scientists.
Secondly, I think that through the work of the CReATIVE-B project and its roadmap we have agreed globally that measuring and calculating EBVs on a global scale can act as a unifying use case to drive forward work on interoperability across Research Infrastructures. We will continue that in the GLOBIS-B project.
Third and last, I’ve suggested that no-one is responsible for research infrastructure. We’re all collectively responsible for research infrastructure, and many of us have good motivation to contribute towards its emergence and growth.
Finally, I want to present you with current thoughts concerning my understanding of how research infrastructures are evolving.
Research infrastructures are no longer just about physical assets like instruments, labs and computers. This is the traditional and simple way that most of us like to think about infrastructure. I first saw a more complex thinking expressed in the context of meteorology and climate modelling (see the book “A Vast Machine” by Paul Edwards). I see it also expressed through the philosophies of virtual observatories for astronomy and in the opening up of government public data. Most recently I see it expressed through virtual and genomic observatories in ecology. It is this: accessible information – particularly databases in machine-readable as well as human-readable formats – coupled with techniques (e.g., KDD, Knowledge Discovery in Databases) that allow us to understand what information is contained within the data so that we can extract it, creates new opportunities for new research. Actors other than the data owners can exploit these resources. They can create and re-shape the tools that people use to find and leverage this data. A notion is taking hold that the content of interconnected and interoperable repositories, together with data streams from distributed sensors and networks of sensors, is itself the infrastructure. When we think like this, perhaps we can see more clearly how to proceed?
I gave a bit of a talk on that in the context of data accessibility and the role of informatics in predicting the biosphere. There I drew attention to the journey we’re embarking on to deal with our community’s grand challenge of being able to model and predict the biosphere, highlighting some of the data issues, computational challenges and legal concerns we may encounter on the way.