Data to Knowledge in Agriculture and Biodiversity – D2KAB
Data to Knowledge in Agronomy and Biodiversity (D2KAB)
D2KAB implements processes to extract and formalize knowledge – semantically rich, interoperable, open – from agronomy/agriculture and biodiversity/ecology data (data to knowledge). The project also studies scientific methods and tools to exploit and disseminate this knowledge in different scenarios in agriculture or biodiversity.
Use of Semantic Web and linked data technologies to “transform” data on the major challenges of agronomy and biodiversity into reusable and actionable knowledge.
Agronomy and biodiversity research communities are facing several major societal, economic, and environmental challenges. However, data are being produced in such large volumes and at such a high pace that our ability to transform them into actionable and reusable knowledge is in question. In D2KAB, we adopt an interdisciplinary approach combining data science and semantics to provide the means (ontologies, knowledge graphs) to produce and exploit FAIR data (Findable, Accessible, Interoperable, and Re-usable). To do so, we develop original methods and algorithms to address the specificities of our domains of interest, but also rely on existing tools and methods in the Semantic Web area. D2KAB brings together a multidisciplinary (and international) consortium of three computer science laboratories (UM-LIRMM, CNRS-I3S, STANFORD-BMIR), four applied informatics labs in agronomy or agriculture (INRAE-URGI, INRAE-MaIAGE, INRAE-IATE, INRAE-TSCF), two labs in ecology and ecosystems (CNRS-CEFE, INRAE-URFM), INRAE’s open science department (INRAE-DipSO) and one association of agriculture stakeholders (ACTA). IRD is also a collaborator, as well as the SME Elzeard and the French Wine and Vine Institute (IFV). The consortium’s informatics expertise ranges from ontologies and metadata, semantic Web, linked data, ontology alignment, knowledge reasoning and extraction, natural language processing to bioinformatics. Our application scenarios are related to food packaging, wheat phenotyping data integration, semantic exploitation of Plant Health Bulletins, the management of ecosystem data and the analysis of plant trait/environment relationships.
The project is structured with three work-packages of research and development in informatics and two work-packages of driving scenarios.
WP1 focuses on ontologies/vocabularies and develops AgroPortal to make it an international reference platform for sharing and serving semantic resources in agri-food. We also exploit natural language processing methods.
WP2 focuses on the critical issue of ontology alignment and linking of semantic resources driven by the project use cases.
WP3, starting from the heterogeneous data provided by the scenarios, develops the methods and deploys the means necessary for the construction of a distributed and federated knowledge graph for agronomy and biodiversity and its exploitation by innovative modes of visualization, navigation and research.
WP4 includes four driving scenarios in agronomy/agriculture. For example, a first development concerns the design of an ontology-based decision support system to either formulate a bio-sourced composite biodegradable packaging or select the most appropriate food packaging for a given use. Another example concerns the development of an augmented semantic browser for Plant Health Bulletins, capable of searching a set of bulletins while displaying additional sources of information (weather archive, etc.); it focuses on cereals, vines (in partnership with IFV) and market gardening (in partnership with Elzeard). We also participate in the development of a unique scientific knowledge base for wheat phenotypes, which is used by the international wheat information system WheatIS.
WP5 develops semantic resources allowing the annotation of data for experimentation on ecosystems on the one hand and for observations in functional biogeography on the other. An example combining data sources relating to community ecology, plant traits and environmental factors is underway to understand the effects of climate change on vegetation (especially olive) in the Mediterranean Basin.
The D2KAB project has enabled significant progress in transforming data into knowledge in the fields of agronomy and biodiversity. The AgroPortal platform, at the heart of this initiative, has been enriched with new functionalities (SKOS management, instances, multilingualism, visualizations, etc.), while maintaining a catalog of 200+ semantic resources, including some specifically designed in the project (e.g., PPDO, ANAEEONTO, BAGO, etc.) and some existing resources (e.g., CROPUSAGE, ANAEE Thesaurus, TOP Thesaurus, etc.) that were enriched and made FAIR during D2KAB. These efforts were supported by an innovative method for FAIRness assessment of semantic resources, called O’FAIRe, implemented in AgroPortal and later generalized to other ontology repositories within the OntoPortal Alliance, thus contributing to similar challenges in other scientific communities. The project also experimented with the SSSOM model to represent alignments between ontologies and participated in discussions to evolve it.
D2KAB also produced knowledge representation models and multiple RDF knowledge graphs, integrating data from the project scenarios: plant health bulletins, meteorological observations, ecosystem data, scientific annotations on wheat (genes, traits, phenotypes) and agro-industrial itinerary data. An experimental federation of these graphs was implemented via distributed SPARQL access points, allowing complex queries on several sources. These graphs and their interconnected data are made available to scientific communities offering interoperability and simplified data exploitation for applications in agronomy and biodiversity. The project also invested in automatic language processing to structure and extract knowledge from textual corpora. Integrated pipelines were developed to annotate Plant Health Bulletins (species, phenological stages, weather) and a scientific corpus on soft wheat (varieties, genes, traits). In parallel, work on data linking in AgroLD and the hybridization between semantic methods and machine learning (with Elzeard) have opened new perspectives for exploiting knowledge graphs.
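To illustrate what a complex query over several sources can look like, here is a minimal Python sketch that assembles a SPARQL 1.1 federated query with one SERVICE block per endpoint. The endpoint URLs, the `ex:` vocabulary and the function name `build_federated_query` are hypothetical placeholders chosen for the example; they are not the project's actual access points or models.

```python
from textwrap import indent


def build_federated_query(select_vars, service_patterns, prefixes=None):
    """Assemble a SPARQL 1.1 SELECT query with one SERVICE block per endpoint.

    select_vars: variable names without the leading '?'.
    service_patterns: mapping of endpoint URL -> graph pattern (SPARQL text).
    prefixes: optional mapping of prefix -> namespace IRI.
    """
    lines = []
    for prefix, iri in (prefixes or {}).items():
        lines.append(f"PREFIX {prefix}: <{iri}>")
    lines.append("SELECT " + " ".join(f"?{v}" for v in select_vars))
    lines.append("WHERE {")
    for endpoint, pattern in service_patterns.items():
        # Each SERVICE block is evaluated by the named remote endpoint;
        # shared variables (here ?date) join the partial results.
        lines.append(f"  SERVICE <{endpoint}> {{")
        lines.append(indent(pattern.strip(), "    "))
        lines.append("  }")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical endpoints and vocabulary, for illustration only.
QUERY = build_federated_query(
    ["bulletin", "taxon", "temperature"],
    {
        "https://example.org/bulletins/sparql":
            "?bulletin ex:mentionsTaxon ?taxon ;\n  ex:issuedOn ?date .",
        "https://example.org/weather/sparql":
            "?obs ex:observedOn ?date ;\n  ex:meanTemperature ?temperature .",
    },
    prefixes={"ex": "https://example.org/vocab#"},
)
print(QUERY)
```

The resulting string can be sent to any SPARQL 1.1 endpoint that supports federation. Joining on a shared variable such as `?date` is one simple way to relate bulletins to weather observations; real graphs would more likely join on shared identifiers for places, taxa or time intervals.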
These results demonstrate the direct impact of D2KAB for the structuring of data, their integration and their provision in the form of actionable and interoperable knowledge, contributing to agronomic and environmental research and the digital transition of life sciences.
The D2KAB project has established solid foundations for data management, integration and exploitation in the fields of agronomy and biodiversity. Several development axes are emerging to capitalize on these achievements. The AgroPortal platform, the central engine of the project, will continue to evolve with enhanced functionalities for ontology curation, metadata harmonization and semantic alignment. These advances, initiated in D2KAB, are already and will continue to be transferred to other communities, notably via the EOSC FAIR-IMPACT project. In addition, the extension of the @Web platform, used in the ANR EVAGRAIN, demonstrates the concrete integration of D2KAB tools, for example for the quality control of wheat data and the development of predictive agri-food models. The RDF knowledge graphs developed in D2KAB illustrate the importance of structuring and integrating complex data. They pave the way for solutions such as indexing, interactive visualization and federated query on distributed SPARQL points. While AgroPortal is now the reference for publishing ontologies/semantic resources in agriculture, sustainable solutions for storing and sharing knowledge graphs in the long term still need to be defined.
In terms of interoperability, the evolution of the AgroPortal/OntoPortal alignment model towards full compatibility with SSSOM is a major focus. This includes the development of services to share and document these alignments, while making them FAIR thanks to rich metadata (provenance, justifications) and interoperable with other repositories. This work is based on initiatives such as the RDA “FAIR mappings” group and the European EOSC FAIR-IMPACT and FAIRCORE4EOSC projects.
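As an illustration of what an SSSOM-style alignment with provenance and justification looks like, the following Python sketch serializes two mappings as SSSOM/TSV, using the four core columns of the SSSOM model and a commented metadata header. The `EXA` and `EXB` prefixes, the term identifiers and the helper name `to_sssom_tsv` are hypothetical, and the output is not validated against an SSSOM toolkit.

```python
import csv
import io

# Core columns of an SSSOM mapping record.
COLUMNS = ["subject_id", "predicate_id", "object_id", "mapping_justification"]


def to_sssom_tsv(mappings, curie_map):
    """Serialize mappings as SSSOM/TSV with an embedded, '#'-prefixed header."""
    buf = io.StringIO()
    buf.write("#curie_map:\n")
    for prefix, iri in curie_map.items():
        buf.write(f"#  {prefix}: {iri}\n")
    writer = csv.DictWriter(
        buf, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(mappings)
    return buf.getvalue()


# Hypothetical ontologies EXA and EXB, for illustration only.
TSV = to_sssom_tsv(
    [
        {
            "subject_id": "EXA:0001",
            "predicate_id": "skos:exactMatch",
            "object_id": "EXB:wheat",
            "mapping_justification": "semapv:ManualMappingCuration",
        },
        {
            "subject_id": "EXA:0002",
            "predicate_id": "skos:closeMatch",
            "object_id": "EXB:durum_wheat",
            "mapping_justification": "semapv:LexicalMatching",
        },
    ],
    curie_map={
        "EXA": "https://example.org/ontoA/",
        "EXB": "https://example.org/ontoB/",
        "skos": "http://www.w3.org/2004/02/skos/core#",
        "semapv": "https://w3id.org/semapv/vocab/",
    },
)
print(TSV)
```

Recording a justification for each mapping is what makes it possible to tell a manually curated alignment from an automatically computed one when mappings are shared across repositories.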
A key challenge will be the integration of automatic language processing workflows within AgroPortal, making it possible to couple tools such as AlvisNLP with semanticization services. While the project has not revisited some historical tools, such as AgroPortal's Annotator, in light of recent advances in automatic language processing (e.g., large language models), these technologies offer promising perspectives to improve the automation and accuracy of annotations.
The concrete applications of the project, such as the transformation of plant health bulletin data or the annotation of experimental data on wheat, demonstrate the potential of D2KAB to convert data into knowledge. These results will serve as a basis for other areas, such as problem identification in AgroLD, used in the DACE-DL and DIG-AI projects.
Finally, D2KAB contributes to open science through the promotion of FAIR data and semantic resources, and participates in the development of researchers’ skills on these subjects through training on the tools and standards developed.
D2KAB has produced around thirty scientific publications, a dozen semantic resources, several datasets in RDF or other standard formats and numerous components or new open source software. More details at www.d2kab.org
D2KAB is involved and associated with multiple actions and dissemination/communication/training events where we use our scenarios as demonstrators of the potential of semantic technologies in agronomy and biodiversity.
Agronomy and biodiversity must address several major societal, economic, and environmental challenges. However, data are being produced in such large volumes and at such a high pace that our ability to transform them into knowledge is in question, as is our ability to enable, for instance, translational agriculture, i.e., rapidly and efficiently transferring results from agronomy research to farms (“bench to farmside”). Semantic interoperability enables data integration and fosters new scientific discoveries by exploiting various data acquired from different perspectives and domains.
D2KAB’s primary objective is to create a framework to turn agronomy and biodiversity data into knowledge that is semantically described, interoperable, actionable and open, along with investigating scientific methods and tools to exploit this knowledge for applications in science and agriculture. We will adopt an interdisciplinary semantic data science approach that will provide the means (ontologies and linked open data) to produce and exploit FAIR (Findable, Accessible, Interoperable, and Re-usable) data. To do so, we will develop original approaches and algorithms to address the specificities of our domains of interest, but also rely on existing tools and methods.
D2KAB involves a multidisciplinary (and international) research consortium of three computer science labs (UM-LIRMM, CNRS-I3S, STANFORD-BMIR), four bioinformatics, biology, agronomy and agriculture labs (INRA-URGI, INRA-MaIAGE, INRA-IATE, IRSTEA-TSCF), two ecology and ecosystems labs (CNRS-CEFE, INRA-URFM), one scientific & technical information unit (INRA-DIST), and one association of agriculture stakeholders (ACTA). The consortium’s expertise ranges from ontologies and metadata, semantic Web, linked data, ontology alignment, knowledge reasoning and extraction, natural language processing to bioinformatics, agronomy, food science, ecosystems, biodiversity and agriculture.
The project is structured with three work-packages of research and development in informatics and two work-packages of driving scenarios. WP1 will focus on ontologies/vocabularies and turn the AgroPortal prototype into a reference platform that addresses the community needs and reaches a high level of quality regarding both content and services offered, e.g., SKOS compliance, semantic search over linked data, text annotation, interoperability with other repositories. WP2 will focus on the critical issue of ontology alignment and develop new functionalities and state-of-the-art algorithms in AgroPortal using background knowledge methods validated in agronomy and biodiversity. WP3 will design the methods and tools to reconcile the scenarios' heterogeneous agronomy and biodiversity data sources and turn them into linked data within the D2KAB distributed knowledge graph. It will also investigate exploitation of this graph through novel visualization, navigation and search methods.
WP4 includes four interdisciplinary research driving scenarios implementing translational agriculture. For instance, we will build an ontology-driven decision support system to select the most appropriate food packaging and an augmented semantic reader for Plant Health Bulletins. We will provide a unique scientific knowledge base for wheat phenotypes and offer the first agricultural data resource empowered by linked open data. WP5 will develop semantic resources for the annotation of ecosystem experiment data and functional biogeography observations. A study of plant trait/environment relationships will be conducted to understand the impacts of climate change on the vegetation of the Mediterranean Basin.
Within a dedicated work-package, we will focus on maximizing the impact of our research. Each of the project driving scenarios will produce concrete outcomes for ag & biodiv scientific communities and stakeholders in agriculture. We have planned multiple dissemination actions and events where we will use our driving scenarios as demonstrators of the potential of semantic technologies in agronomy and biodiversity.
Project coordination
Clement Jonquet (Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier)
The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.
Partnership
INRA-URFM Ecologie des Forêts Méditerranéennes
UM-LIRMM Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier
STANFORD-BMIR Stanford University / Stanford Center for Biomedical Informatics Research
CNRS-I3S Laboratoire informatique, signaux systèmes de Sophia Antipolis
IRSTEA-TSCF Technologies et Systèmes d'Information pour les Agrosystèmes
CNRS-CEFE Centre d'Ecologie Fonctionnelle et Evolutive
ACTA ASSOCIATION COORDINATION TECHNIQUE AGRICOLE
INRA-DIST Délégation Information Scientifique et Technique
INRA-MaIAGE Mathématiques et Informatique Appliquée du Génome à l'Environnement Unité de recherche
INRA-URGI Unité de Recherche Génomique-Info
INRA-IATE Ingénierie des Agropolymères et Technologies Emergentes
ANR funding: 971,180 euros
Beginning and duration of the scientific project: May 2019, 48 months