A Deep Dive Into Knowledge Graph Enhanced Pre-trained Language Models
Techniques to integrate Knowledge Graphs into Language Models
Introduction
Both knowledge graphs (KGs) and pre-trained language models (PLMs) have gained popularity due to their ability to capture world knowledge and their broad applicability. KGs are instrumental in applications like search engines, as evident from Google’s Knowledge Graph. On the other hand, popular PLMs like BERT and GPT excel in a variety of natural language tasks. More recently, large language models (LLMs), a type of PLM, have taken the world by storm. PLMs are models pre-trained on large text corpora, and LLMs are essentially PLMs distinguished by their much larger size and complexity. For example, OpenAI’s GPT-3 is an LLM boasting 175 billion parameters. Throughout this article, we will use LLMs and PLMs interchangeably to refer to these models.
In our previous article titled “Automated Knowledge Graph Construction with Large Language Models,” we discussed the complementary relationship between KGs and PLMs and how to harness the strengths of both. KGs explicitly capture information about entities and their relationships, while PLMs contain implicit knowledge learnt from their vast and diverse training datasets. Combining them into a single model gives us knowledge graph enhanced pre-trained language models (KGPLMs).
KGPLMs can be categorised into three types: before-training enhancement, during-training enhancement, and post-training enhancement models. These are distinguished by the stage at which KGs are integrated into the pre-training pipeline of PLMs. We will delve deeper into each of these three categories in the following sections.
Before-training Enhancement Models
These models convert text data and KG triples into the same format for input into PLMs. There are two challenges in doing so: the heterogeneous embedding space and knowledge noise. The former refers to the incompatible formats of text and KG data, while the latter involves inserting unrelated knowledge that changes the intended meaning of the text. For domains with limited training data, these methods are effective in supporting PLMs, at the cost of increased computational resources, longer pre-training time, and the possible introduction of knowledge noise. Nonetheless, they can improve the reasoning capabilities of PLMs without increasing model size or otherwise altering the PLM’s architecture. Below, we discuss four techniques utilised by before-training enhancement models.
1. Expand Input Structures
This technique converts text input into a graph structure which can be merged with an existing KG. The merged KG can subsequently be converted into text for PLM input or encoded in another way. An example is K-BERT, which converts input sentences into sentence trees by injecting relevant knowledge from a KG through its knowledge layer. The sentence tree is encoded into tokens, and a visibility matrix controls knowledge noise by preventing the influence of irrelevant neighbouring tokens. By swapping the supporting KG in the knowledge layer, K-BERT performed well in specific domains like medicine.
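To make the idea concrete, here is a minimal Python sketch of K-BERT-style injection, assuming a toy dictionary (`toy_kg`) stands in for a real KG; the actual K-BERT implementation differs in its tokenisation and soft-position handling. Injected triples hang off the entity that matched them, and the visibility matrix only lets those tokens attend to their own branch.

```python
# Minimal sketch of K-BERT-style knowledge injection (not the original implementation).
# A toy KG maps an entity to (relation, object) pairs; injected tokens are only
# "visible" to the branch they belong to, which limits knowledge noise.

toy_kg = {"Apple": [("is_a", "company")], "Cook": [("ceo_of", "Apple")]}

def build_sentence_tree(tokens, kg):
    """Return a flat token list plus, for each token, the index of the entity
    it hangs off (None for ordinary sentence tokens)."""
    out_tokens, anchors = [], []
    for tok in tokens:
        idx = len(out_tokens)
        out_tokens.append(tok)
        anchors.append(None)                    # sentence tokens are globally visible
        for rel, obj in kg.get(tok, []):
            for injected in (rel, obj):
                out_tokens.append(injected)
                anchors.append(idx)             # injected tokens hang off this entity
    return out_tokens, anchors

def visibility_matrix(anchors):
    """M[i][j] = 1 if token i may attend to token j."""
    n = len(anchors)
    M = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            both_sentence = anchors[i] is None and anchors[j] is None
            same_branch = anchors[i] is not None and anchors[i] == anchors[j]
            anchor_link = anchors[i] == j or anchors[j] == i
            if i == j or both_sentence or same_branch or anchor_link:
                M[i][j] = 1
    return M

tokens, anchors = build_sentence_tree(["Cook", "visited", "Apple"], toy_kg)
print(tokens)  # ['Cook', 'ceo_of', 'Apple', 'visited', 'Apple', 'is_a', 'company']
print(visibility_matrix(anchors)[1])  # 'ceo_of' sees only its own branch and its anchor 'Cook'
```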
2. Enrich Input Information
In this method, entity embeddings are combined with text embeddings in the model. The example shown above is LUKE (Language Understanding with Knowledge-based Embeddings), a transformer model that extends BERT. BERT is a masked language model that randomly masks tokens by replacing words with the [MASK] token and encodes the tokens and positions of the given sentence. Since BERT only encodes words, but entities like “Los Angeles” can span multiple words, LUKE was trained to mask and predict entities as well. In this case, the input information was enriched to contain entities in addition to words.
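As a simplified sketch of this idea (not LUKE’s actual architecture, which also uses entity-aware self-attention), the PyTorch snippet below gives words and entities separate embedding tables and concatenates both into a single input sequence; the vocabulary sizes, dimensions and IDs are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# Simplified sketch of enriching the input with entity embeddings (LUKE-style idea,
# not the actual LUKE architecture). Words and entity mentions get separate
# embedding tables, and their embeddings form a single input sequence.

class WordEntityEmbedder(nn.Module):
    def __init__(self, vocab_size=30522, n_entities=10000, dim=768, max_pos=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.entity_emb = nn.Embedding(n_entities, dim)
        self.pos_emb = nn.Embedding(max_pos, dim)

    def forward(self, word_ids, word_pos, entity_ids, entity_pos):
        # In this sketch an entity shares position information with the first word
        # of its mention, so "Los Angeles" at positions 1-2 appears once at position 1.
        words = self.word_emb(word_ids) + self.pos_emb(word_pos)
        entities = self.entity_emb(entity_ids) + self.pos_emb(entity_pos)
        return torch.cat([words, entities], dim=1)  # one sequence for the transformer

embedder = WordEntityEmbedder()
word_ids = torch.tensor([[101, 2050, 3561, 102]])  # placeholder token ids
word_pos = torch.arange(4).unsqueeze(0)
entity_ids = torch.tensor([[42]])                  # e.g. the entity "Los Angeles"
entity_pos = torch.tensor([[1]])                   # aligned with its first word
print(embedder(word_ids, word_pos, entity_ids, entity_pos).shape)  # torch.Size([1, 5, 768])
```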
3. Generate New Data
Here, KG knowledge is used to generate synthetic text which is then injected into a PLM. For example, ATOMIC (ATlas Of MachIne Commonsense) is a dataset that focuses on capturing three types of if-then relations: If-Event-Then-Mental-State, If-Event-Then-Event, and If-Event-Then-Persona. A PLM was trained to generate a target sequence given an event phrase and an inference dimension based on the KG. For instance, when prompted with “PersonX wins the title” and “As a result, X wants to”, the PLM could generate inferences about actions or consequences such as “celebrate”, “brag” and “congratulate themselves”.
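The snippet below is an illustrative sketch of how such if-then triples could be turned into (source, target) pairs for training a generator; the `DIMENSION_TEMPLATES` phrasings and `toy_triples` are made up for illustration rather than taken from the ATOMIC release.

```python
# Illustrative sketch: turning if-then triples from a commonsense KG into
# (source, target) text pairs for a seq2seq PLM. The templates and triples
# below are invented examples, not ATOMIC's exact format.

DIMENSION_TEMPLATES = {
    "xWant":   "{event}. As a result, PersonX wants to",
    "xReact":  "{event}. As a result, PersonX feels",
    "xEffect": "{event}. The effect on PersonX is that",
}

toy_triples = [
    ("PersonX wins the title", "xWant", "celebrate"),
    ("PersonX wins the title", "xWant", "brag"),
    ("PersonX wins the title", "xReact", "proud"),
]

def make_training_pairs(triples, templates):
    """Yield (source_prompt, target_text) pairs for fine-tuning a generator."""
    for event, dimension, tail in triples:
        template = templates.get(dimension)
        if template is None:
            continue  # skip dimensions we have no template for
        yield template.format(event=event), tail

for src, tgt in make_training_pairs(toy_triples, DIMENSION_TEMPLATES):
    print(f"{src!r} -> {tgt!r}")
# 'PersonX wins the title. As a result, PersonX wants to' -> 'celebrate' ...
```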
4. Optimise Word Masks
Masked language models like BERT randomly mask words during training. However, there may be correlations between consecutive words that are lost if only one of them is masked. For example, if “Angeles” in “Los Angeles” is masked to give “Los [MASK]”, the model might fail to understand the semantics and treat “Los” and “Angeles” separately despite their obvious correlation. ERNIE (Enhanced Language Representation with Informative Entities) avoids this by identifying entities in text and aligning them with corresponding entities in KGs. The model was trained by randomly masking entities instead of single words and asked to choose appropriate entities from KGs to fill in the masked positions.
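A simplified sketch of entity-level masking is shown below, assuming entity spans have already been identified by an entity linker; ERNIE’s actual pre-processing and alignment to KG entities are more involved.

```python
import random

# Simplified sketch of entity-level masking (not ERNIE's actual pre-processing).
# Instead of masking single tokens, a whole entity span is masked together, so the
# model must recover "Bob Dylan" as a unit rather than guessing "Dylan" in isolation.

def mask_entities(tokens, entity_spans, mask_prob=0.15, mask_token="[MASK]"):
    """entity_spans: list of (start, end) index pairs, end exclusive."""
    tokens = list(tokens)
    for start, end in entity_spans:
        if random.random() < mask_prob:
            tokens[start:end] = [mask_token] * (end - start)
    return tokens

sentence = ["Bob", "Dylan", "wrote", "Blowin", "in", "the", "Wind"]
spans = [(0, 2), (3, 7)]  # "Bob Dylan" and "Blowin' in the Wind", e.g. from an entity linker
print(mask_entities(sentence, spans, mask_prob=1.0))  # every entity masked, for demonstration
# ['[MASK]', '[MASK]', 'wrote', '[MASK]', '[MASK]', '[MASK]', '[MASK]']
```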
During-training Enhancement Models
While before-training enhancement models focus on incorporating KG data into text data, during-training enhancement models inject knowledge into PLMs during training by modifying their encoders or training tasks. This allows PLMs to learn from text and KGs simultaneously. Specialised information or modules can be added for specific domains or tasks. However, these models usually have larger architectures and hence require longer training times. They are also limited to the scope of the training data and are prone to overfitting due to their larger and more complex architectures. During-training enhancement models are most beneficial when multiple complex tasks are required or for knowledge-grounded tasks. There are four possible approaches to achieving this.
1. Incorporate Knowledge Encoders
Models may incorporate various encoders for text, KGs, or a hybrid of these data formats to obtain embeddings. For example, the previously mentioned ERNIE model uses a textual encoder (T-Encoder) and a knowledgeable encoder (K-Encoder). First, the T-Encoder captures lexical information from input tokens by encoding text similarly to BERT, using the token, segment and positional embeddings. The graph structure of the KG is encoded with algorithms like TransE. Both textual and graph encodings are then given to the K-Encoder, which fuses the information into a unified representation.
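For a feel of the TransE idea mentioned above, the toy PyTorch snippet below models a relation as a translation in embedding space and applies a margin ranking loss against a corrupted triple; the dimensions and single negative sample are arbitrary, and this is not ERNIE’s actual knowledge encoder.

```python
import torch

# Toy sketch of the TransE idea: a relation is modelled as a translation in
# embedding space, so for a true triple (head, relation, tail) we want
# head + relation ≈ tail. This shows a single illustrative training step.

dim, margin = 50, 1.0
head = torch.randn(dim, requires_grad=True)
relation = torch.randn(dim, requires_grad=True)
tail = torch.randn(dim, requires_grad=True)
corrupt_tail = torch.randn(dim, requires_grad=True)  # negative sample

def distance(h, r, t):
    return torch.norm(h + r - t, p=2)

# Margin ranking loss: push the true triple closer than the corrupted one.
loss = torch.clamp(margin + distance(head, relation, tail)
                   - distance(head, relation, corrupt_tail), min=0.0)
loss.backward()
print(float(loss))
```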
2. Insert Knowledge Encoding Layers
This method involves inserting layers into PLMs that inject knowledge from KGs. KnowBERT by Peters et al. (2019) achieves this by inserting a Knowledge Attention and Recontextualisation (KAR) mechanism between two layers in the middle of the BERT architecture. This additional layer takes in the contextual representations from the previous BERT layer, uses an entity linker to obtain relevant entity embeddings from KGs, and outputs knowledge-enhanced embeddings to the next BERT layer.
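The sketch below captures the general flavour of such a knowledge-injection layer, with token representations attending over candidate entity embeddings and a residual connection back into the transformer; it is a simplification rather than the actual KAR implementation, and the `KnowledgeInjectionLayer` name and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified sketch of a knowledge-injection layer inserted between transformer
# layers (inspired by KnowBERT's KAR, but not its actual implementation).
# Token representations attend over candidate entity embeddings supplied by an
# entity linker, and the attended knowledge is added back residually.

class KnowledgeInjectionLayer(nn.Module):
    def __init__(self, hidden_dim=768, entity_dim=300):
        super().__init__()
        self.query = nn.Linear(hidden_dim, entity_dim)
        self.back = nn.Linear(entity_dim, hidden_dim)

    def forward(self, token_states, entity_embs):
        # token_states: (batch, seq_len, hidden_dim) from the previous BERT layer
        # entity_embs:  (batch, n_candidates, entity_dim) from an entity linker
        q = self.query(token_states)                                # project to entity space
        attn = F.softmax(q @ entity_embs.transpose(1, 2), dim=-1)   # (batch, seq, n_candidates)
        knowledge = attn @ entity_embs                              # weighted entity embeddings
        return token_states + self.back(knowledge)                  # knowledge-enhanced output

layer = KnowledgeInjectionLayer()
tokens = torch.randn(1, 6, 768)        # contextual representations from a previous layer
entities = torch.randn(1, 4, 300)      # candidate entity embeddings from a KG
print(layer(tokens, entities).shape)   # torch.Size([1, 6, 768])
```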
3. Add Independent Adapters
Whilst other methods require PLM parameter updates, the K-ADAPTER framework seeks to preserve the PLM’s parameters to support the injection of multiple types of knowledge without entangling the resulting representations. This allows the effects of each knowledge type to be studied independently. These adapters, which are compact neural models, accept hidden states from the PLM’s intermediate layers as input, as shown in the diagram above. In K-ADAPTER, the authors aligned Wikipedia text to Wikidata triplets and injected this knowledge using such an adapter.
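Below is a generic adapter sketch of the kind of bottleneck module this describes; note that K-ADAPTER’s adapters also contain their own transformer layers, so this is a simplification with arbitrary dimensions.

```python
import torch
import torch.nn as nn

# Generic adapter sketch: a small bottleneck network that takes hidden states
# from an intermediate PLM layer and adds a knowledge-specific correction,
# while the PLM's own parameters stay frozen. This simplifies K-Adapter, whose
# adapters also include their own transformer layers.

class Adapter(nn.Module):
    def __init__(self, hidden_dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_dim)

    def forward(self, hidden_states):
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))

# Only the adapter is trained; the PLM backbone stays frozen, so different
# knowledge types (e.g. factual vs. linguistic) can be injected independently.
adapter = Adapter()
hidden = torch.randn(1, 10, 768)  # hidden states from a frozen PLM layer
print(adapter(hidden).shape)      # torch.Size([1, 10, 768])
```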
4. Modify the Pre-training Task
Rather than random masking, masked entity modelling may be adopted. The previously discussed ERNIE masks entities rather than individual words to capture entities that span multiple words. Other methods integrate knowledge representation learning to concurrently update knowledge representations and PLM parameters. For instance, the KEPLER (Knowledge Embedding and Pre-trained LanguagE Representation) model jointly optimises the knowledge embedding and masked language modelling objectives: it follows existing PLM approaches for the latter, while encoding entity descriptions as entity embeddings to represent the relational facts in knowledge graphs for the former.
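As a schematic sketch (not KEPLER’s actual code), the snippet below simply sums a masked language modelling loss with a TransE-style knowledge embedding loss computed over encoded entity descriptions; the random tensors stand in for one training step’s encoder outputs.

```python
import torch
import torch.nn as nn

# Schematic sketch of a KEPLER-style joint objective (not the actual KEPLER code):
# the same text encoder would produce (a) token predictions for masked language
# modelling and (b) entity embeddings from entity descriptions, here scored with
# a TransE-style margin loss. The two losses are simply summed.

def knowledge_embedding_loss(h, r, t, margin=1.0):
    # h and t are encodings of the head/tail entity *descriptions*;
    # r is a learned relation embedding.
    return torch.clamp(margin + torch.norm(h + r - t), min=0.0)

mlm_logits = torch.randn(4, 30522, requires_grad=True)  # logits for 4 masked positions
mlm_labels = torch.tensor([2050, 104, 999, 13])          # their true token ids
head_desc_enc = torch.randn(768, requires_grad=True)     # encoded head-entity description
tail_desc_enc = torch.randn(768, requires_grad=True)     # encoded tail-entity description
relation_emb = torch.randn(768, requires_grad=True)

loss = nn.functional.cross_entropy(mlm_logits, mlm_labels) \
       + knowledge_embedding_loss(head_desc_enc, relation_emb, tail_desc_enc)
loss.backward()
print(float(loss))
```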
Post-training Enhancement Models
Post-training enhancement is achieved by fine-tuning PLMs on additional data or tasks, which is a relatively low-cost and straightforward way of tailoring PLM outputs to domain-specific tasks. However, the data needs to be labelled, and crafting prompts requires prior knowledge and external resources. Furthermore, prompts are designed to steer the PLM’s generations, but may limit its flexibility during text generation. There are two methods used for post-training enhancement.
1. Fine-tune PLMs with Knowledge
In the example of KagNet shown in the above image, an initial question-answer pair is encoded by a PLM such as BERT. Additional knowledge is included in the form of a schema graph containing knowledge relevant to the given question and answer, retrieved from external knowledge graphs like ConceptNet. The graph is encoded by a knowledge-aware graph network module. Finally, KagNet generates a plausibility score for each question-answer pair, allowing the model to pick the answer with the highest score as the most likely answer.
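A skeleton of this answer-selection step might look like the following, with random tensors standing in for the PLM and graph-network encodings; the `PlausibilityScorer` head and its dimensions are assumptions for illustration, not KagNet’s actual architecture.

```python
import torch
import torch.nn as nn

# Skeleton of KagNet-style answer selection (placeholders, not the real model):
# each (question, answer) pair gets a statement encoding from a PLM and a graph
# encoding from a graph network over a retrieved schema graph; a small head maps
# their concatenation to a plausibility score, and the best-scoring answer wins.

class PlausibilityScorer(nn.Module):
    def __init__(self, text_dim=768, graph_dim=128):
        super().__init__()
        self.head = nn.Linear(text_dim + graph_dim, 1)

    def forward(self, text_enc, graph_enc):
        return self.head(torch.cat([text_enc, graph_enc], dim=-1)).squeeze(-1)

scorer = PlausibilityScorer()
answers = ["bank", "library", "casino"]
# Placeholder encodings; a real system would run BERT and a graph encoder here.
text_encs = torch.randn(len(answers), 768)
graph_encs = torch.randn(len(answers), 128)
scores = scorer(text_encs, graph_encs)
print(answers[int(scores.argmax())], scores.tolist())
```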
2. Generate Knowledge-based Prompts
An example of this approach is found in the knowledge-to-text framework proposed by Bian et al. (2021). Given a question and a series of possible answers, the first knowledge retrieval step obtains relevant facts from a KG. These facts are transformed into text in the second knowledge-to-text transformation step via template-based, paraphrasing-based, and retrieval-based algorithms. Finally, the obtained text is fed into a PLM for it to predict the answer.
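Here is a small illustrative sketch of the template-based variant, where KG triples are verbalised and prepended to the question before being handed to a PLM; the templates, facts and prompt format are made up rather than taken from Bian et al.

```python
# Illustrative sketch of template-based knowledge-to-text transformation.
# The facts and templates below are invented examples, not those of Bian et al.

toy_facts = [
    ("guitar", "IsA", "musical instrument"),
    ("guitar", "UsedFor", "playing music"),
]

TEMPLATES = {
    "IsA": "{head} is a {tail}.",
    "UsedFor": "{head} is used for {tail}.",
}

def facts_to_text(facts, templates):
    """Verbalise KG triples into plain sentences using relation templates."""
    sentences = [templates[rel].format(head=h, tail=t)
                 for h, rel, t in facts if rel in templates]
    return " ".join(sentences)

question = "What would you use to play a song at a campfire?"
choices = ["guitar", "calculator", "pillow"]
prompt = (f"{facts_to_text(toy_facts, TEMPLATES)}\n"
          f"Question: {question}\nChoices: {', '.join(choices)}\nAnswer:")
print(prompt)  # this prompt would then be fed to a PLM to predict the answer
```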
Conclusion
In summary, this article explored three types of knowledge graph enhanced pre-trained language models (KGPLMs): before-training enhancement, during-training enhancement, and post-training enhancement models. Several methods were described for each type, as listed in the table above. We discussed the advantages and limitations of each type of KGPLM, which determine the situations they are suited to. However, these methods are by no means mutually exclusive. For instance, the ERNIE model combined word mask optimisation, knowledge encoders, and a modified pre-training task. Although we only focused on a few models to illustrate these methods, numerous models utilise these techniques in diverse ways, as seen in the table above. Newer models not covered in this article continue to expand the capabilities of KGPLMs, and will keep doing so.
References
- Yang, L., Chen, H., Li, Z., Ding, X., & Wu, X. (2023). Give Us the Facts: Enhancing Large Language Models with Knowledge Graphs for Fact-aware Language Modeling (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2306.11489
- Liu, W., Zhou, P., Zhao, Z., Wang, Z., Ju, Q., Deng, H., & Wang, P. (2020). K-BERT: Enabling Language Representation with Knowledge Graph. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, Issue 03, pp. 2901–2908). Association for the Advancement of Artificial Intelligence (AAAI). https://doi.org/10.1609/aaai.v34i03.5681
- Yamada, I., Asai, A., Shindo, H., Takeda, H., & Matsumoto, Y. (2020). LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2010.01057
- Sap, M., Le Bras, R., Allaway, E., Bhagavatula, C., Lourie, N., Rashkin, H., Roof, B., Smith, N. A., & Choi, Y. (2019). ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, Issue 01, pp. 3027–3035). Association for the Advancement of Artificial Intelligence (AAAI). https://doi.org/10.1609/aaai.v33i01.33013027
- Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M., & Liu, Q. (2019). ERNIE: Enhanced Language Representation with Informative Entities (Version 3). arXiv. https://doi.org/10.48550/ARXIV.1905.07129
- Peters, M. E., Neumann, M., Logan, R. L., Schwartz, R., Joshi, V., Singh, S., & Smith, N. A. (2019). Knowledge Enhanced Contextual Word Representations (Version 2). arXiv. https://doi.org/10.48550/ARXIV.1909.04164
- Wang, R., Tang, D., Duan, N., Wei, Z., Huang, X., Ji, J., Cao, G., Jiang, D., & Zhou, M. (2020). K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters (Version 5). arXiv. https://doi.org/10.48550/ARXIV.2002.01808
- Wang, X., Gao, T., Zhu, Z., Zhang, Z., Liu, Z., Li, J., & Tang, J. (2019). KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation (Version 3). arXiv. https://doi.org/10.48550/ARXIV.1911.06136
- Lin, B. Y., Chen, X., Chen, J., & Ren, X. (2019). KagNet: Knowledge-Aware Graph Networks for Commonsense Reasoning (Version 1). arXiv. https://doi.org/10.48550/ARXIV.1909.02151
- Bian, N., Han, X., Chen, B., & Sun, L. (2021). Benchmarking Knowledge-Enhanced Commonsense Question Answering via Knowledge-to-Text Transformation (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2101.00760