Knowledge Graphs for Data Integration - UHasselt 2023
Network event summary
In June 2023 I attended a workshop in Hasselt, Belgium, organized by a research network on Knowledge Graphs and Data Integration. Linked Data-based systems and declarative SPARQL queries seem to do all the work efficiently, almost by magic, but how? For many people, presentations about using Linked Data and SPARQL already seem very technical. But this event went further: it covered the fundamental algorithms that power Linked Data systems! In this post I do my best to briefly summarize some of the talks and highlight the added value of this fundamental, often behind-the-scenes work.
The FWO-funded research network Knowledge Graphs and Data Integration organized a workshop at the Data Science Institute of Hasselt University in Belgium. As part of this workshop, Marcelo Arenas from the Pontificia Universidad Católica de Chile gave a keynote about a logical approach to model interpretability. Afterwards, several papers were presented in three sessions: foundations, experience and exemplars, and processing and mining. I will group my summaries into these categories as well.
My overall takeaway is that there is still plenty of research to be done on Knowledge Graphs! Not only on how to use them or how to encode knowledge in them, but also on how to actually implement Knowledge Graph software based on formally sound techniques.
Before I focus on the keynote itself I would like to start with an example. Recently, in a show by the Flemish comedian Lieven Scheire, I heard about a Machine Learning (ML) algorithm that was trained to distinguish images of wolves from images of huskies. The algorithm seemed to perform very well, but not always. Later, by trying out a few things, it turned out the algorithm had apparently “learned” that the distinguishing factor is a white background (snow). So if there was a white background, the animal in the image was classified as a husky by the ML model.
This is just an example to illustrate that we don’t actually know what an ML algorithm/model has learned, and thus why a certain output is generated. That’s why such models are usually called black boxes. But in a world in which more and more ML enters our daily lives, it is absolutely fundamental that we are able to explain why a model behaves in a certain way.
Explaining predictions of ML models
In simple terms, the keynote of Marcelo Arenas presented early-stage fundamental research that tries to specify a declarative query language (think of SQL or SPARQL) for ML models. The keynote lecture had some very nice examples to illustrate the graph algorithms. But why graphs, actually?
Machine Learning models are usually based on so-called “neural networks” that are inspired by neurons in the human brain. These neural networks, which may consist of several layers of neurons, “learn” patterns by adjusting how information flows through the connections between the nodes. This is not a Knowledge Graph in the Linked Data sense, in which you explicitly encode information that you can query later. But it is a graph, and hence one can run graph algorithms on it, including some sort of queries.
A logic-based query language for ML models
In his talk, Marcelo explained how they define theoretical logics for operations on such graphs. Their overarching goal is to develop a declarative query language, such as SPARQL, but for neural networks. This is not purely theoretical: they also implemented their “FOIL” logic using a special logic programming language that differs from languages such as Python or Java that you may know. This way, the theories can be tested with real-life data, and it can be observed how effective they are with specific data (besides the theoretical complexity). If you are interested in more, you can find a preprint of an earlier paper here: https://arxiv.org/abs/2110.02376
Foundations
From this session, I will focus on three of the talks: two about effective computations and one about reliable query languages.
The work of Jeroen Bollen is motivated by the fact that memory is often a bottleneck for so-called Graph Neural Networks. There are ways to compress the graphs, but so far there haven’t been any formal proofs showing that the accuracy of the algorithm remains stable across differently compressed graphs.
In his talk, Jeroen presented his work where he showed that they could minimize computational and memory requirements while keeping a similar accuracy. Very interesting work that demonstrates that fundamental research can make a difference!
Other more formal work, inspired by a real-world use case, was presented by Robin De Vogelaere. You have probably used glue (an adhesive) at least once in your life. Selecting the right adhesive for the right situation is not trivial: it depends on the material and on the conditions under which the product will be used (heat, shock, etc.).
Robin presented the FO(.) first-order logic that they use in their IDP-Z3 Knowledge Base system to perform reasoning (on adhesives). In his presentation he highlighted why the Knowledge Graph community should care (see image below). A perfect example of the knowledge transfer the workshop aimed for!
Reliable query languages
Staying in the Linked Data RDF world: you might have used SPARQL already to query something from Wikidata via the Wikidata Query Service. But using built-in SPARQL 1.1 functionality (the SERVICE keyword), you can actually write a federated SPARQL query that fetches different parts of the data from different servers.
But what if one crucial FILTER part of the query relies on data from a server that is currently offline? Should the whole query fail, or would you accept data from the other servers that could not be filtered (and which may be wrong for your use case)?
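To make this concrete, here is a small Python sketch that builds such a federated query as plain text using the SPARQL 1.1 SERVICE keyword. The second endpoint and its `ex:` vocabulary are hypothetical; SERVICE SILENT illustrates exactly the dilemma above: the query no longer fails when that server is offline, but its FILTER then silently never runs.

```python
def service_clause(endpoint, pattern, silent=False):
    # SERVICE delegates a graph pattern to a remote SPARQL endpoint;
    # with SILENT, an unreachable endpoint yields no bindings instead
    # of failing the whole query.
    keyword = "SERVICE SILENT" if silent else "SERVICE"
    return keyword + " <" + endpoint + "> { " + pattern + " }"

wikidata = service_clause(
    "https://query.wikidata.org/sparql",
    "?city wdt:P1082 ?population .")  # P1082 = population
local = service_clause(
    "https://example.org/sparql",     # hypothetical second endpoint
    "?city ex:rating ?r . FILTER(?r > 3)",
    silent=True)

query = "SELECT ?city ?population WHERE {\n  " + wikidata + "\n  " + local + "\n}"
print(query)
```

If the hypothetical second endpoint is down, the SILENT clause simply contributes nothing, so the `?r > 3` filter is never applied: the partial-result situation described above.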
Tim Baccaert presented some early work on exactly these problems. He is looking into different ways to make query languages more reliable, for example by providing better machine-understandable error messages so that clients can interpret partial results better.
This is one way of dealing better with the reality that servers are sometimes down. Another way is Linked Data Fragments (LDF): instead of processing the whole, possibly complex, SPARQL query on the server, the LDF server only accepts simple patterns, and the computation of the result is shared between the server and the client.
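A minimal in-memory sketch of that idea (the real LDF/Triple Pattern Fragments protocol works over HTTP with paging, which I omit here): the "server" only answers single triple patterns, and the client combines the fragments itself.

```python
# Toy "server" data; None in a pattern position acts as a variable.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("bob",   "knows", "carol"),
    ("alice", "livesIn", "hasselt"),
    ("bob",   "livesIn", "hasselt"),
]

def fragment(s=None, p=None, o=None):
    """What an LDF server offers: matches for one simple triple pattern."""
    return [t for t in TRIPLES
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# The client performs the join itself, e.g. "whom does someone living
# in hasselt know?" -- a computation the server never has to do.
results = [(person, friend)
           for person, _, _ in fragment(p="livesIn", o="hasselt")
           for _, _, friend in fragment(s=person, p="knows")]
print(results)  # [('alice', 'bob'), ('bob', 'carol')]
```

Because the server only ever evaluates trivial patterns, it stays cheap to host and hard to overload, at the cost of more requests and more work on the client.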
Experience and exemplars
This session is about experiences in setting up Knowledge Graphs and in using the data.
Understanding relationships between attributes of relational data is important to integrate data in a meaningful way. Marcel Parciak presented work on detecting (approximate) functional dependencies in data.
By performing a literature review before their experiments, they identified 12 measures to detect possible dependencies. They then evaluated the different measures on benchmark datasets. Before doing that, they also had to annotate the benchmark data, because there were no labels indicating when a functional dependency holds and when not.
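To give an idea of what such a measure can look like, here is a sketch of a g3-style error for a candidate dependency: the fraction of rows that must be removed for the dependency to hold exactly. Whether this particular measure was among the 12 they evaluated is my assumption; the data below is invented.

```python
from collections import defaultdict

def fd_error(rows, lhs, rhs):
    """g3-style error of the functional dependency lhs -> rhs:
    the fraction of rows to remove so that every lhs value maps to a
    single rhs value. 0.0 means the dependency holds exactly."""
    groups = defaultdict(lambda: defaultdict(int))
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    # Per lhs value, keep the most frequent rhs value and drop the rest.
    keep = sum(max(counts.values()) for counts in groups.values())
    return 1 - keep / len(rows)

rows = [
    {"zip": "3500", "city": "Hasselt"},
    {"zip": "3500", "city": "Hasselt"},
    {"zip": "3500", "city": "Hasselt "},  # a stray space breaks the exact FD
    {"zip": "1000", "city": "Brussels"},
]
print(fd_error(rows, "zip", "city"))  # 0.25: approximately a dependency
```

A low but non-zero error like this is exactly why *approximate* dependency detection matters: real data contains typos and noise that make exact dependencies rare.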
Larissa C. Shimomura is researching another kind of dependency: Graph Generating Dependencies. Such dependencies are helpful to understand the data and possible correlations between different attributes. In particular, Larissa investigates reasoning on property graphs.
Declarative solutions following common standards are the goal when working with Knowledge Graphs in RDF. However, especially in the beginning of a project the requirements are often vague and different stakeholders need to build a common understanding.
In his talk, Christophe Debruyne presented work about the TOXIN Knowledge Graph. The aim of this Knowledge Graph is to describe existing safety data about cosmetic ingredients. It therefore contributes to non-animal systemic toxicity assessments.
In the beginning phase of the project many things were set up with Python. This helped in getting a shared understanding between the stakeholders before setting up a complete infrastructure. In later stages, R2RML and other declarative tools such as SHACL for quality assessments were used.
A Knowledge Graph of technology stacks
Speaking of infrastructure: the technology stacks of cloud platforms are quite complex. This means that switching between providers can become really cumbersome.
In his presentation, Johannes Härtel talked about how they used RDF to describe technology stacks. Furthermore, they have implemented a tool that harvests data about the software stacks from GitHub repositories.
For example, if your repository contains a Docker Compose file, the script detects this and adds to the Knowledge Graph that you at least use YAML files and that you have at least one Docker setup.
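A sketch of how such harvesting could work; the detection rules, the `ex:` vocabulary, and the file-to-triple mapping below are all hypothetical illustrations, not the actual tool from the talk.

```python
import os

# Hypothetical mapping from well-known file names to derived triples.
RULES = {
    "docker-compose.yml": [("ex:usesFileFormat", "ex:YAML"),
                           ("ex:hasDockerSetup", "true")],
    "Dockerfile":         [("ex:hasDockerSetup", "true")],
    "pom.xml":            [("ex:usesBuildTool", "ex:Maven")],
}

def harvest(repo_files, repo_iri="ex:myRepo"):
    """Derive Knowledge Graph triples from the files in a repository."""
    triples = set()
    for name in repo_files:
        for predicate, obj in RULES.get(os.path.basename(name), []):
            triples.add((repo_iri, predicate, obj))
    return triples

print(harvest(["src/app.py", "docker-compose.yml"]))
```

Once such triples sit in an RDF store, finding repositories with a comparable stack becomes a plain graph query instead of ad-hoc repository scraping.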
The goal of their work is to allow for a smoother switch between technology stacks. For example, to identify Open Source technologies that fit the current (proprietary) technology stack you are using. At the core, they used Apache Jena to store RDF. Johannes listed four reasons for why to use Knowledge Graphs (see image below).
Law as code
The transparency of legislation and the accountability of governments are very important for a democracy. In his presentation, Tom De Nies presented the tools Themis and Kaleidos, which are used to support the legislative process in Flanders, Belgium, with Linked Data.
The best part? No data conversion is needed: their system works with RDF at its core! Data is stored, accessed, and published from an RDF triple store. Input forms and other website components are created using an EmberJS frontend that relies on a rich microservice infrastructure in the backend.
Processing and mining
Last but not least, this session focused on mining information and efficient query execution.
Mining software engineering patterns
Yunior Pacheco Correa presented work on mining software engineering patterns from code repositories. This can help support developers with auto-completion when writing code, similar to GitHub Copilot. Another possibility would be to query the mined patterns, for example to look for security vulnerabilities.
More efficient SPARQL queries
Three presentations from the IDLab KNoWS group focused on the behind-the-scenes of querying with SPARQL. SPARQL is a declarative query language: the user specifies what to retrieve, not how. Internally, the SPARQL query engine processes the query and builds a query plan; for example, it determines in which order to join the results of triple patterns.
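As a toy illustration of one such planning decision, the sketch below orders triple patterns by estimated cardinality, so that the most selective pattern is evaluated first and intermediate results stay small. The cardinality numbers are invented; real engines derive them from statistics.

```python
# Toy query plan choice: in which order to evaluate triple patterns.
# Keys are SPARQL-style patterns, values are invented match estimates.
patterns = {
    "?film wdt:P57 ?director":      90_000,
    "?film wdt:P31 wd:Q11424":     250_000,
    "?director wdt:P19 wd:Q1022":       12,
}

# Greedy heuristic: evaluate the most selective pattern first.
plan = sorted(patterns, key=patterns.get)
print(plan[0])  # the 12-match pattern is evaluated first
```

Real planners weigh far more than raw counts (join variables, access paths, data locality), but the principle of starting from the most selective pattern carries over.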
Bryan-Elliott Tam presented work related to guided link traversal query processing: basically, how an RDF query interface can already provide hints about its data via the TREE hypermedia specification, so that the query engine can make better choices in query planning. Similar research was presented by Jonni Hansi.
Bryan’s research has shown that with SPARQL FILTER expressions and the TREE specification, they were able to prune links to Linked Data fragments that will certainly not contribute to the query result. This means the query will be faster!
Ruben H. Eschauzier presented early work on an ML model to learn effective join planning. As a modular part of the Comunica query engine, the presented model predicts the execution time of a join and tries to select the most efficient one. Currently this model only outperforms Comunica’s existing approach on 7 out of 18 query templates, so more research is still needed!
More on query optimization
The research of Wilco van Leeuwen centers around cardinality estimation: a technique used to estimate the number of items returned by a query or subquery. He is interested in obtaining the most accurate answers using the least amount of resources (time and space).
Compared to relational data, graph data introduces additional challenges for cardinality estimation. His research helps in understanding different cardinality estimation techniques and their trade-offs. This could, for example, be applied to improve query planning on graph data, such as SPARQL query planning.
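To give an idea of what cardinality estimation involves, here is a deliberately naive sketch that estimates a two-pattern graph query from per-predicate statistics under an independence assumption. All numbers and the formula are invented simplifications for illustration, not his technique; the trade-off between such cheap statistics and accuracy is precisely what this research examines.

```python
# Invented statistics about a hypothetical graph.
TOTAL_TRIPLES = 1_000_000
predicate_counts = {"knows": 200_000, "livesIn": 50_000}
distinct_objects = {"livesIn": 5_000}

def estimate(pattern):
    """Estimate matches for one triple pattern (s, p, o); None = variable."""
    s, p, o = pattern
    est = predicate_counts.get(p, TOTAL_TRIPLES)
    if o is not None:  # bound object: assume values are uniformly spread
        est /= distinct_objects.get(p, 1)
    return est

# Naive join estimate: treat each pattern's selectivity (est / total)
# as independent and multiply.
card = (estimate((None, "livesIn", "hasselt"))
        * estimate((None, "knows", None)) / TOTAL_TRIPLES)
print(card)  # 2.0 expected bindings
```

On graphs, the independence assumption is particularly shaky (connections are correlated by design), which is one reason graph data makes cardinality estimation harder than the relational case.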
Thanks to the organizers for planning this event and to the researchers for presenting their work! I think this has been a fruitful exchange of knowledge, and I am looking forward to future events.
Don’t forget to share this post on social media and in your networks!
Please also consider subscribing to my bi-weekly newsletter FAIR Data Digest to receive more interesting content every other Tuesday!