A Deep Dive Into Knowledge Graph Enhanced Pre-trained Language Models
Techniques to integrate Knowledge Graphs into Language Models
Introduction
Both knowledge graphs (KGs) and pre-trained language models (PLMs) have gained popularity due to their ability to capture world knowledge and their broad applicability. KGs are instrumental in applications like search engines, as evident from Google’s Knowledge Graph. On the other hand, popular PLMs like BERT and GPT excel in a variety of natural language tasks. More recently, large language models (LLMs), a type of PLM, have taken the world by storm. PLMs are models pre-trained on large text corpora, and LLMs are essentially PLMs distinguished by their much larger size and complexity. For example, OpenAI’s GPT-3 is an LLM boasting 175 billion parameters. Throughout this article, we will use LLMs and PLMs interchangeably to refer to these models.
In our previous article titled “Automated Knowledge Graph Construction with Large Language Models,” we discussed the complementary relationship between KGs and PLMs and how to harness the strengths of both. KGs explicitly capture information about entities and their relationships, while PLMs contain implicit knowledge learnt from their vast and diverse training datasets. Combining them into a single model gives us knowledge graph enhanced pre-trained language models (KGPLMs).
KGPLMs can be categorised into three types: before-training enhancement, during-training enhancement, and post-training enhancement models. These are distinguished by the stage at which KGs are integrated into the pre-training pipeline of PLMs. We will delve deeper into each of these three categories in the following sections.
Before-training Enhancement Models
These models convert text data and KG triples into the same format for input into PLMs. There are two challenges in doing so: the heterogeneous embedding space and knowledge noise. The former refers to the incompatible formats of text and KG data, while the latter involves inserting unrelated knowledge that changes the intended meaning of the text. For domains with limited training data, these methods are effective in supporting PLMs, at the cost of increased computational resources, longer pre-training time, and the possible introduction of knowledge noise. Nonetheless, they can improve the reasoning capabilities of PLMs without increasing model size or otherwise altering the PLM’s architecture. Below, we discuss four techniques utilised by before-training enhancement models.
1. Expand Input Structures
This technique converts text input into a graph structure which can be merged with an existing KG. The merged KG can subsequently be converted into text for PLM input or encoded in another way. An example is K-BERT, which converts input sentences into sentence trees by injecting relevant knowledge from a KG through its knowledge layer. The sentence tree is encoded into tokens, and a visibility matrix controls knowledge noise by preventing the influence of irrelevant neighbouring tokens. By swapping the supporting KG in the knowledge layer, K-BERT performed well in specific domains like medicine.
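To make the idea concrete, here is a minimal Python sketch of K-BERT-style injection, assuming a toy dictionary (`toy_kg`) stands in for a real KG; the actual K-BERT implementation differs in its tokenisation and soft-position handling. Injected triples hang off the entity that matched them, and the visibility matrix only lets those tokens attend to their own branch.

```python
# Minimal sketch of K-BERT-style knowledge injection (not the original implementation).
# A toy KG maps an entity to (relation, object) pairs; injected tokens are only
# "visible" to the branch they belong to, which limits knowledge noise.

toy_kg = {"Apple": [("is_a", "company")], "Cook": [("ceo_of", "Apple")]}

def build_sentence_tree(tokens, kg):
    """Return a flat token list plus, for each token, the index of the entity
    it hangs off (None for ordinary sentence tokens)."""
    out_tokens, anchors = [], []
    for tok in tokens:
        idx = len(out_tokens)
        out_tokens.append(tok)
        anchors.append(None)                    # sentence tokens are globally visible
        for rel, obj in kg.get(tok, []):
            for injected in (rel, obj):
                out_tokens.append(injected)
                anchors.append(idx)             # injected tokens hang off this entity
    return out_tokens, anchors

def visibility_matrix(anchors):
    """M[i][j] = 1 if token i may attend to token j."""
    n = len(anchors)
    M = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            both_sentence = anchors[i] is None and anchors[j] is None
            same_branch = anchors[i] is not None and anchors[i] == anchors[j]
            anchor_link = anchors[i] == j or anchors[j] == i
            if i == j or both_sentence or same_branch or anchor_link:
                M[i][j] = 1
    return M

tokens, anchors = build_sentence_tree(["Cook", "visited", "Apple"], toy_kg)
print(tokens)  # ['Cook', 'ceo_of', 'Apple', 'visited', 'Apple', 'is_a', 'company']
print(visibility_matrix(anchors)[1])  # 'ceo_of' sees only its own branch and its anchor 'Cook'
```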
2. Enrich Input Information
In this method, entity embeddings are combined with text embeddings in the model. The example shown above is LUKE (Language Understanding with Knowledge-based Embeddings), a transformer model that extends BERT. BERT is a masked language model that randomly masks tokens by replacing words with the [MASK] token and encodes the tokens and positions of the given sentence. Since BERT only encodes words, but entities like “Los Angeles” can span multiple words, LUKE was trained to mask and predict entities as well. In this case, the input information was enriched to contain entities in addition to words.
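As a simplified sketch of this idea (not LUKE’s actual architecture, which also uses entity-aware self-attention), the PyTorch snippet below gives words and entities separate embedding tables and concatenates both into a single input sequence; the vocabulary sizes, dimensions and IDs are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# Simplified sketch of enriching the input with entity embeddings (LUKE-style idea,
# not the actual LUKE architecture). Words and entity mentions get separate
# embedding tables, and their embeddings form a single input sequence.

class WordEntityEmbedder(nn.Module):
    def __init__(self, vocab_size=30522, n_entities=10000, dim=768, max_pos=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.entity_emb = nn.Embedding(n_entities, dim)
        self.pos_emb = nn.Embedding(max_pos, dim)

    def forward(self, word_ids, word_pos, entity_ids, entity_pos):
        # In this sketch an entity shares position information with the first word
        # of its mention, so "Los Angeles" at positions 1-2 appears once at position 1.
        words = self.word_emb(word_ids) + self.pos_emb(word_pos)
        entities = self.entity_emb(entity_ids) + self.pos_emb(entity_pos)
        return torch.cat([words, entities], dim=1)  # one sequence for the transformer

embedder = WordEntityEmbedder()
word_ids = torch.tensor([[101, 2050, 3561, 102]])  # placeholder token ids
word_pos = torch.arange(4).unsqueeze(0)
entity_ids = torch.tensor([[42]])                  # e.g. the entity "Los Angeles"
entity_pos = torch.tensor([[1]])                   # aligned with its first word
print(embedder(word_ids, word_pos, entity_ids, entity_pos).shape)  # torch.Size([1, 5, 768])
```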
3. Generate New Data
Here, KG knowledge is used to generate synthetic text which is then injected into a PLM. For example, ATOMIC (ATlas Of MachIne Commonsense) is a dataset that focuses on capturing three types of if-then relations: If-Event-Then-Mental-State, If-Event-Then-Event, and If-Event-Then-Persona. A PLM was trained to generate a target sequence given an event phrase and an inference dimension based on the KG. For instance, when prompted with “PersonX wins the title” and “As a result, X wants to”, the PLM could generate inferences about actions or consequences such as “celebrate”, “brag” and “congratulate themselves”.
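The snippet below is an illustrative sketch of how such if-then triples could be turned into (source, target) pairs for training a generator; the `DIMENSION_TEMPLATES` phrasings and `toy_triples` are made up for illustration rather than taken from the ATOMIC release.

```python
# Illustrative sketch: turning if-then triples from a commonsense KG into
# (source, target) text pairs for a seq2seq PLM. The templates and triples
# below are invented examples, not ATOMIC's exact format.

DIMENSION_TEMPLATES = {
    "xWant":   "{event}. As a result, PersonX wants to",
    "xReact":  "{event}. As a result, PersonX feels",
    "xEffect": "{event}. The effect on PersonX is that",
}

toy_triples = [
    ("PersonX wins the title", "xWant", "celebrate"),
    ("PersonX wins the title", "xWant", "brag"),
    ("PersonX wins the title", "xReact", "proud"),
]

def make_training_pairs(triples, templates):
    """Yield (source_prompt, target_text) pairs for fine-tuning a generator."""
    for event, dimension, tail in triples:
        template = templates.get(dimension)
        if template is None:
            continue  # skip dimensions we have no template for
        yield template.format(event=event), tail

for src, tgt in make_training_pairs(toy_triples, DIMENSION_TEMPLATES):
    print(f"{src!r} -> {tgt!r}")
# 'PersonX wins the title. As a result, PersonX wants to' -> 'celebrate' ...
```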
4. Optimise Word Masks
Masked language models like BERT randomly mask words during training. However, there may be correlations between consecutive words that are lost if only one of them is masked. For example, if “Angeles” in “Los Angeles” is masked to give “Los [MASK]”, the model might fail to understand the semantics and treat “Los” and “Angeles” separately despite their obvious correlation. ERNIE (Enhanced Language Representation with Informative Entities) avoids this by identifying entities in text and aligning them with corresponding entities in KGs. The model was trained by randomly masking entities instead of single words and asked to choose appropriate entities from KGs to fill in the masked positions.
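A simplified sketch of entity-level masking is shown below, assuming entity spans have already been identified by an entity linker; ERNIE’s actual pre-processing and alignment to KG entities are more involved.

```python
import random

# Simplified sketch of entity-level masking (not ERNIE's actual pre-processing).
# Instead of masking single tokens, a whole entity span is masked together, so the
# model must recover "Bob Dylan" as a unit rather than guessing "Dylan" in isolation.

def mask_entities(tokens, entity_spans, mask_prob=0.15, mask_token="[MASK]"):
    """entity_spans: list of (start, end) index pairs, end exclusive."""
    tokens = list(tokens)
    for start, end in entity_spans:
        if random.random() < mask_prob:
            tokens[start:end] = [mask_token] * (end - start)
    return tokens

sentence = ["Bob", "Dylan", "wrote", "Blowin", "in", "the", "Wind"]
spans = [(0, 2), (3, 7)]  # "Bob Dylan" and "Blowin' in the Wind", e.g. from an entity linker
print(mask_entities(sentence, spans, mask_prob=1.0))  # every entity masked, for demonstration
# ['[MASK]', '[MASK]', 'wrote', '[MASK]', '[MASK]', '[MASK]', '[MASK]']
```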
During-training Enhancement Models
While before-training enhancement models focus on incorporating KG data into text data, during-training enhancement models inject knowledge into PLMs during training by modifying their encoders or training tasks. This allows PLMs to learn from text and KGs simultaneously. Specialised information or modules can be added for specific domains or tasks. However, these models usually have larger architectures and hence require longer training times. They are also limited to the scope of the training data and are prone to overfitting due to their larger and more complex architectures. During-training enhancement models are most beneficial when multiple complex tasks are required or for knowledge-grounded tasks. There are four possible approaches to achieving this.
1. Incorporate Knowledge Encoders
Models may incorporate various encoders for text, KGs, or a hybrid of these data formats to obtain embeddings. For example, the previously mentioned ERNIE model uses a textual encoder (T-Encoder) and a knowledgeable encoder (K-Encoder). First, the T-Encoder captures lexical information from input tokens by encoding text similarly to BERT, using the token, segment and positional embeddings. The graph structure of the KG is encoded with algorithms like TransE. Both textual and graph encodings are then given to the K-Encoder, which fuses the information into a unified representation.
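For a feel of the TransE idea mentioned above, the toy PyTorch snippet below models a relation as a translation in embedding space and applies a margin ranking loss against a corrupted triple; the dimensions and single negative sample are arbitrary, and this is not ERNIE’s actual knowledge encoder.

```python
import torch

# Toy sketch of the TransE idea: a relation is modelled as a translation in
# embedding space, so for a true triple (head, relation, tail) we want
# head + relation ≈ tail. This shows a single illustrative training step.

dim, margin = 50, 1.0
head = torch.randn(dim, requires_grad=True)
relation = torch.randn(dim, requires_grad=True)
tail = torch.randn(dim, requires_grad=True)
corrupt_tail = torch.randn(dim, requires_grad=True)  # negative sample

def distance(h, r, t):
    return torch.norm(h + r - t, p=2)

# Margin ranking loss: push the true triple closer than the corrupted one.
loss = torch.clamp(margin + distance(head, relation, tail)
                   - distance(head, relation, corrupt_tail), min=0.0)
loss.backward()
print(float(loss))
```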
2. Insert Knowledge Encoding Layers
This method involves inserting layers into PLMs that inject knowledge from KGs. KnowBERT by Peters et al. (2019) achieves this by inserting a Knowledge Attention and Recontextualisation (KAR) mechanism between two layers in the middle of the BERT architecture. This additional layer takes in the contextual representations from the previous BERT layer, uses an entity linker to obtain relevant entity embeddings from KGs, and outputs knowledge-enhanced embeddings to the next BERT layer.
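The sketch below captures the general flavour of such a knowledge-injection layer, with token representations attending over candidate entity embeddings and a residual connection back into the transformer; it is a simplification rather than the actual KAR implementation, and the `KnowledgeInjectionLayer` name and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified sketch of a knowledge-injection layer inserted between transformer
# layers (inspired by KnowBERT's KAR, but not its actual implementation).
# Token representations attend over candidate entity embeddings supplied by an
# entity linker, and the attended knowledge is added back residually.

class KnowledgeInjectionLayer(nn.Module):
    def __init__(self, hidden_dim=768, entity_dim=300):
        super().__init__()
        self.query = nn.Linear(hidden_dim, entity_dim)
        self.back = nn.Linear(entity_dim, hidden_dim)

    def forward(self, token_states, entity_embs):
        # token_states: (batch, seq_len, hidden_dim) from the previous BERT layer
        # entity_embs:  (batch, n_candidates, entity_dim) from an entity linker
        q = self.query(token_states)                                # project to entity space
        attn = F.softmax(q @ entity_embs.transpose(1, 2), dim=-1)   # (batch, seq, n_candidates)
        knowledge = attn @ entity_embs                              # weighted entity embeddings
        return token_states + self.back(knowledge)                  # knowledge-enhanced output

layer = KnowledgeInjectionLayer()
tokens = torch.randn(1, 6, 768)        # contextual representations from a previous layer
entities = torch.randn(1, 4, 300)      # candidate entity embeddings from a KG
print(layer(tokens, entities).shape)   # torch.Size([1, 6, 768])
```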
3. Add Independent Adapters
Whilst other methods require PLM parameter updates, the K-ADAPTER framework seeks to preserve the PLM’s parameters to support the injection of multiple types of knowledge without entangling the resulting representations. This allows the effects of each knowledge type to be studied independently. These adapters, which are compact neural models, accept hidden states from the PLM’s intermediate layers as input, as shown in the diagram above. In K-ADAPTER, the authors aligned Wikipedia text to Wikidata triplets and injected this knowledge using such an adapter.
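Below is a generic adapter sketch of the kind of bottleneck module this describes; note that K-ADAPTER’s adapters also contain their own transformer layers, so this is a simplification with arbitrary dimensions.

```python
import torch
import torch.nn as nn

# Generic adapter sketch: a small bottleneck network that takes hidden states
# from an intermediate PLM layer and adds a knowledge-specific correction,
# while the PLM's own parameters stay frozen. This simplifies K-Adapter, whose
# adapters also include their own transformer layers.

class Adapter(nn.Module):
    def __init__(self, hidden_dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_dim)

    def forward(self, hidden_states):
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))

# Only the adapter is trained; the PLM backbone stays frozen, so different
# knowledge types (e.g. factual vs. linguistic) can be injected independently.
adapter = Adapter()
hidden = torch.randn(1, 10, 768)  # hidden states from a frozen PLM layer
print(adapter(hidden).shape)      # torch.Size([1, 10, 768])
```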
4. Modify the Pre-training Task
Rather than random masking, masked entity modelling may be adopted. The previously discussed ERNIE masks entities rather than individual words to capture entities that span multiple words. Other methods integrate knowledge representation learning to concurrently update knowledge representations and PLM parameters. For instance, the KEPLER (Knowledge Embedding and Pre-trained LanguagE Representation) model jointly optimises the knowledge embedding and masked language modelling objectives: it follows existing PLM approaches for the latter, while encoding entity descriptions as entity embeddings to represent the relational facts in knowledge graphs for the former.
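As a schematic sketch (not KEPLER’s actual code), the snippet below simply sums a masked language modelling loss with a TransE-style knowledge embedding loss computed over encoded entity descriptions; the random tensors stand in for one training step’s encoder outputs.

```python
import torch
import torch.nn as nn

# Schematic sketch of a KEPLER-style joint objective (not the actual KEPLER code):
# the same text encoder would produce (a) token predictions for masked language
# modelling and (b) entity embeddings from entity descriptions, here scored with
# a TransE-style margin loss. The two losses are simply summed.

def knowledge_embedding_loss(h, r, t, margin=1.0):
    # h and t are encodings of the head/tail entity *descriptions*;
    # r is a learned relation embedding.
    return torch.clamp(margin + torch.norm(h + r - t), min=0.0)

mlm_logits = torch.randn(4, 30522, requires_grad=True)  # logits for 4 masked positions
mlm_labels = torch.tensor([2050, 104, 999, 13])          # their true token ids
head_desc_enc = torch.randn(768, requires_grad=True)     # encoded head-entity description
tail_desc_enc = torch.randn(768, requires_grad=True)     # encoded tail-entity description
relation_emb = torch.randn(768, requires_grad=True)

loss = nn.functional.cross_entropy(mlm_logits, mlm_labels) \
       + knowledge_embedding_loss(head_desc_enc, relation_emb, tail_desc_enc)
loss.backward()
print(float(loss))
```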
Post-training Enhancement Models
Post-training enhancement is achieved by fine-tuning PLMs on additional data or tasks, which is a relatively low-cost and straightforward way of tailoring PLM outputs to domain-specific tasks. However, the data needs to be labelled, and crafting prompts requires prior knowledge and external resources. Furthermore, prompts are designed to steer the PLM’s generations, but may limit its flexibility during text generation. There are two methods used for post-training enhancement.
1. Fine-tune PLMs with Knowledge
In the example of KagNet shown in the above image, an initial question-answer pair is encoded by a PLM such as BERT. Additional knowledge is included in the form of a schema graph containing knowledge relevant to the given question and answer, retrieved from external knowledge graphs like ConceptNet. The graph is encoded by a knowledge-aware graph network module. Finally, KagNet generates a plausibility score for each question-answer pair, allowing the model to pick the answer with the highest score as the most likely answer.
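A skeleton of this answer-selection step might look like the following, with random tensors standing in for the PLM and graph-network encodings; the `PlausibilityScorer` head and its dimensions are assumptions for illustration, not KagNet’s actual architecture.

```python
import torch
import torch.nn as nn

# Skeleton of KagNet-style answer selection (placeholders, not the real model):
# each (question, answer) pair gets a statement encoding from a PLM and a graph
# encoding from a graph network over a retrieved schema graph; a small head maps
# their concatenation to a plausibility score, and the best-scoring answer wins.

class PlausibilityScorer(nn.Module):
    def __init__(self, text_dim=768, graph_dim=128):
        super().__init__()
        self.head = nn.Linear(text_dim + graph_dim, 1)

    def forward(self, text_enc, graph_enc):
        return self.head(torch.cat([text_enc, graph_enc], dim=-1)).squeeze(-1)

scorer = PlausibilityScorer()
answers = ["bank", "library", "casino"]
# Placeholder encodings; a real system would run BERT and a graph encoder here.
text_encs = torch.randn(len(answers), 768)
graph_encs = torch.randn(len(answers), 128)
scores = scorer(text_encs, graph_encs)
print(answers[int(scores.argmax())], scores.tolist())
```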
2. Generate Knowledge-based Prompts
An example of this approach is found in the knowledge-to-text framework proposed by Bian et al. (2021). Given a question and a series of possible answers, the first knowledge retrieval step obtains relevant facts from a KG. These facts are transformed into text in the second knowledge-to-text transformation step via template-based, paraphrasing-based, and retrieval-based algorithms. Finally, the obtained text is fed into a PLM for it to predict the answer.
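Here is a small illustrative sketch of the template-based variant, where KG triples are verbalised and prepended to the question before being handed to a PLM; the templates, facts and prompt format are made up rather than taken from Bian et al.

```python
# Illustrative sketch of template-based knowledge-to-text transformation.
# The facts and templates below are invented examples, not those of Bian et al.

toy_facts = [
    ("guitar", "IsA", "musical instrument"),
    ("guitar", "UsedFor", "playing music"),
]

TEMPLATES = {
    "IsA": "{head} is a {tail}.",
    "UsedFor": "{head} is used for {tail}.",
}

def facts_to_text(facts, templates):
    """Verbalise KG triples into plain sentences using relation templates."""
    sentences = [templates[rel].format(head=h, tail=t)
                 for h, rel, t in facts if rel in templates]
    return " ".join(sentences)

question = "What would you use to play a song at a campfire?"
choices = ["guitar", "calculator", "pillow"]
prompt = (f"{facts_to_text(toy_facts, TEMPLATES)}\n"
          f"Question: {question}\nChoices: {', '.join(choices)}\nAnswer:")
print(prompt)  # this prompt would then be fed to a PLM to predict the answer
```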
Conclusion
In summary, this article explored three types of knowledge graph enhanced pre-trained language models (KGPLMs): before-training enhancement, during-training enhancement, and post-training enhancement models. Several methods were described for each type, as listed in the table above. We discussed the advantages and limitations of each type of KGPLM, which determine the situations they are suited to. However, these methods are by no means mutually exclusive. For instance, the ERNIE model combined word mask optimisation, knowledge encoders, and a modified pre-training task. Although we only focused on a few models to illustrate these methods, numerous models utilise these techniques in diverse ways, as seen in the table above. Newer models not covered in this article continue to expand the capabilities of KGPLMs, and will keep doing so.
References
- Yang, L., Chen, H., Li, Z., Ding, X., & Wu, X. (2023). Give Us the Facts: Enhancing Large Language Models with Knowledge Graphs for Fact-aware Language Modeling (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2306.11489
- Liu, W., Zhou, P., Zhao, Z., Wang, Z., Ju, Q., Deng, H., & Wang, P. (2020). K-BERT: Enabling Language Representation with Knowledge Graph. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, Issue 03, pp. 2901–2908). Association for the Advancement of Artificial Intelligence (AAAI). https://doi.org/10.1609/aaai.v34i03.5681
- Yamada, I., Asai, A., Shindo, H., Takeda, H., & Matsumoto, Y. (2020). LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2010.01057
- Sap, M., Le Bras, R., Allaway, E., Bhagavatula, C., Lourie, N., Rashkin, H., Roof, B., Smith, N. A., & Choi, Y. (2019). ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, Issue 01, pp. 3027–3035). Association for the Advancement of Artificial Intelligence (AAAI). https://doi.org/10.1609/aaai.v33i01.33013027
- Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M., & Liu, Q. (2019). ERNIE: Enhanced Language Representation with Informative Entities (Version 3). arXiv. https://doi.org/10.48550/ARXIV.1905.07129
- Peters, M. E., Neumann, M., Logan, R. L., Schwartz, R., Joshi, V., Singh, S., & Smith, N. A. (2019). Knowledge Enhanced Contextual Word Representations (Version 2). arXiv. https://doi.org/10.48550/ARXIV.1909.04164
- Wang, R., Tang, D., Duan, N., Wei, Z., Huang, X., Ji, J., Cao, G., Jiang, D., & Zhou, M. (2020). K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters (Version 5). arXiv. https://doi.org/10.48550/ARXIV.2002.01808
- Wang, X., Gao, T., Zhu, Z., Zhang, Z., Liu, Z., Li, J., & Tang, J. (2019). KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation (Version 3). arXiv. https://doi.org/10.48550/ARXIV.1911.06136
- Lin, B. Y., Chen, X., Chen, J., & Ren, X. (2019). KagNet: Knowledge-Aware Graph Networks for Commonsense Reasoning (Version 1). arXiv. https://doi.org/10.48550/ARXIV.1909.02151
- Bian, N., Han, X., Chen, B., & Sun, L. (2021). Benchmarking Knowledge-Enhanced Commonsense Question Answering via Knowledge-to-Text Transformation (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2101.00760