Tuning Vision-Language Models and Generative Models with Knowledge Graph
Bridging Human Perception and AI’s Future: The Convergence of Visual Understanding and Semantic Networks
Introduction
The fusion of Vision-Language Models (VLMs), Generative Models, and Knowledge Graphs (KGs) is reshaping how artificial intelligence (AI) understands and interacts with the world. One example is automatic image description for visually impaired users, where AI generates accurate, detailed descriptions of images so that visually impaired users can understand their content.
VLMs integrate visual and textual information, enabling tasks like image captioning, while Generative Models create new, diverse content such as generating text, video, and images. KGs, with their structured representation of real-world entities and relationships, enhance these models by providing deep, contextual insights. This combination unlocks more accurate, relevant, and contextually rich AI capabilities, from improved search engines to creative content generation, making technology more intuitive and closer to human-like understanding and creativity.
Knowledge Graphs
A knowledge graph (KG) is a structured representation of real-world entities and their interrelationships, encapsulating complex information in a graph-structured form where nodes represent entities and edges denote relationships. For instance, ‘Leonardo DiCaprio’ and ‘Inception’ are nodes linked by an edge representing his role in the film.
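As a minimal illustration, a KG can be stored as a set of (subject, relation, object) triples. The sketch below builds a tiny graph around the example above using the networkx library; the extra entities and relation names are purely illustrative.

```python
import networkx as nx

# Each edge is a (subject, object) pair labelled with a relation,
# i.e. the triple (subject, relation, object).
triples = [
    ("Leonardo DiCaprio", "acted_in", "Inception"),
    ("Christopher Nolan", "directed", "Inception"),
    ("Inception", "has_genre", "Science Fiction"),
]

kg = nx.MultiDiGraph()
for subject, relation, obj in triples:
    kg.add_edge(subject, obj, relation=relation)

# Query the graph: what do we know about "Inception"?
for subj, obj, data in kg.edges(data=True):
    if "Inception" in (subj, obj):
        print(f"{subj} --{data['relation']}--> {obj}")
```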
Vision-language Models
Vision-language models (VLMs) combine what they see in images with written language to both understand and create new content. They excel at tasks such as captioning images and answering questions about what an image contains. DALL-E is a prime example: it generates images from textual descriptions, bridging computer vision and natural language processing. For instance, given the prompt “A futuristic city at sunset,” it produces an image matching that description.
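As a rough sketch of how this looks in practice, the snippet below calls a hosted text-to-image model through the OpenAI Python client; the model identifier, client version, and response fields are assumptions, so check the current API documentation before relying on them.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Ask a text-to-image model for a single image matching the prompt.
response = client.images.generate(
    model="dall-e-3",          # assumed model identifier
    prompt="A futuristic city at sunset",
    n=1,
    size="1024x1024",
)

print(response.data[0].url)    # URL of the generated image
```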
Generative Models
Generative models, often part of the broader VLM framework, are capable of creating new content, such as images or text, by learning the underlying distribution of training data, enabling applications like text-to-image synthesis (as in the above instance) and style transfer.
How knowledge graphs work with VLMs and generative models
Tuning Vision-Language Models With Dual Knowledge Graph
Previous techniques for tuning vision-language models, such as CLIP-Adapter, TaskRes, and Tip-Adapter, have two main issues. First, they adapt models using knowledge from a single modality (either text or images), which means they do not fully utilise the relationship between the two modalities.
For example, consider an image of a cat and a dog playing together. Focusing solely on the image might lead the model to identify individual objects (cat, dog) but miss the interaction between them. Likewise, focusing solely on the text description might not capture the visual details of the scene.
Second, they do not fully leverage the structured knowledge of relationships between different concepts, particularly in scenarios with limited data. This can lead to several issues, such as suboptimal solutions, bias towards partial attributes, and inefficient transfer and generalisation.
For example, if the VLM encounters an image of a chef holding a knife, it might struggle to understand the specific action (chopping) without additional knowledge about the relationship between chefs, knives, and food preparation.
The new method of using Dual Knowledge Graphs addresses the limitations of current VLM tuning methods, particularly under scenarios with limited data. The core innovation is the GraphAdapter, a strategy that leverages dual KGs — separate but interconnected graphs for textual and visual knowledge — to enrich the model’s understanding and generation capabilities.
By creating two interconnected KGs — one for text and another for visual information — the GraphAdapter enables VLMs to draw on a richer set of relationships and semantic understandings. This dual-graph approach allows the model to better capture the nuances of how objects and concepts are related across visual and textual data, leading to more accurate and context-aware outputs.
In the above image, while the output classifications might look similar at a glance, the essence of GraphAdapter’s innovation is not just in the accuracy of classification but in how it achieves this result. By leveraging dual Knowledge Graphs, GraphAdapter can potentially offer richer contextual understanding and generalisation capabilities, especially in “low-data regimes” or when faced with nuanced, fine-grained distinctions between classes. This approach marks a significant shift from previous methods, aiming to deeply integrate cross-modal knowledge and structured relationships into the adaptation process for Vision-Language Models.
A direct comparison of the results across different methods, including Zero-shot CLIP, CLIP-Adapter, TaskRes, and their proposed GraphAdapter, has been shown in the image below. It shows how GraphAdapter consistently outperforms the baseline methods across different numbers of shots, underscoring the effectiveness of integrating dual Knowledge Graphs for structured knowledge exploitation.
Working
- Starting point (Input Images and Texts): The process starts with images and their corresponding texts. These could be things like photos of animals along with descriptions or labels.
- Transformation (Text and Visual Encoders): The text descriptions are processed by a Text Encoder, which transforms the text into vector embeddings (conceptually similar to word embeddings such as word2vec) that the model can work with efficiently. Similarly, the Visual Encoder processes the images, turning them into feature vectors.
- Mapping relationships (Dual Knowledge Graphs Creation): For text, a textual sub-graph is created. It’s like making a map that shows how different words or phrases are related based on their meanings. For visuals, a visual sub-graph captures and models the relationships between different visual elements and concepts within the images. This involves understanding what the image contains (e.g., objects, scenes, actions) and how these elements relate to each other visually and contextually.
- Refining Connections (Graph Convolutional Networks (GCNs)): These are tools that blend and refine the information in the text and visual maps (the subgraphs), making sure the connections and relationships are as accurate and helpful as possible. In the paper’s notation, K is the number of classes and d is the dimension of the textual/visual features.
- Fusion and Adjustment (GraphAdapter): This is the heart of the process. It takes the refined maps of texts and visuals and combines them, ensuring that the final output makes sense both visually and textually. It’s like making sure the description “a big red apple” matches with pictures of apples, not bananas.
- Final Output: The final step produces adjusted or enhanced text and image features, ensuring that images match their descriptions accurately and vice versa. This can be used for better image recognition, more accurate image descriptions, and so on (a simplified sketch of this pipeline follows the list).
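The sketch below is a highly simplified, PyTorch-based illustration of this pipeline under stated assumptions: random tensors stand in for CLIP text and image features, the sub-graphs are built from cosine similarity between class features, and a single graph-convolution step refines each sub-graph before fusion. It is not the authors’ implementation, only a rough picture of the data flow.

```python
import torch
import torch.nn.functional as F

K, d = 10, 512                                          # K classes, d-dimensional features (assumed sizes)
text_feats = F.normalize(torch.randn(K, d), dim=-1)     # stand-in for CLIP text features
image_feats = F.normalize(torch.randn(K, d), dim=-1)    # stand-in for per-class visual prototypes
query = F.normalize(torch.randn(1, d), dim=-1)          # feature of the image to classify

def build_subgraph(feats):
    """Adjacency from pairwise cosine similarity between class features."""
    adj = feats @ feats.T                               # (K, K) similarity matrix
    return F.softmax(adj, dim=-1)                       # row-normalised edge weights

def gcn_step(adj, feats, weight):
    """One graph-convolution step: aggregate neighbours, then project."""
    return F.relu(adj @ feats @ weight)

w_text = torch.randn(d, d) * 0.02
w_visual = torch.randn(d, d) * 0.02

# Refine each sub-graph separately, then fuse the two modalities.
text_refined = gcn_step(build_subgraph(text_feats), text_feats, w_text)
visual_refined = gcn_step(build_subgraph(image_feats), image_feats, w_visual)
fused = F.normalize(text_refined + visual_refined, dim=-1)

# Classify the query image by similarity to the fused class features.
logits = 100.0 * query @ fused.T
print("predicted class:", logits.argmax(dim=-1).item())
```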
Performance Comparison
The authors evaluate GraphAdapter on few-shot learning tasks using 11 datasets and observe significant improvements over existing methods, particularly for tasks with limited examples (1- to 16-shot settings). They also examine how well GraphAdapter generalises to unseen data by testing it across four diverse datasets. The findings reveal that GraphAdapter not only excels in adapting to new tasks with few examples but also demonstrates strong generalisation capabilities, outperforming other state-of-the-art methods in most scenarios.
Turning Knowledge Graph Embeddings (KGEs) into generative models
Rather than designing new scoring functions from scratch, researchers propose a way to turn existing KGE models such as ComplEx, CP, RESCAL, and TuckER into generative models that can produce new relationships between concepts. They achieve this by:
- Transforming existing KGE models into circuits: structured computational graphs whose sum and product operations make it tractable to compute and combine probabilities over many possible triples.
- Adjusting the calculations within the circuits (for example, restricting them to non-negative values or squaring the outputs): this ensures the final output is always a valid probability between 0 and 1 (a simplified sketch of this conversion follows the list).
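As a rough, NumPy-based illustration (not the paper’s implementation), the sketch below scores every possible triple with a CP-style factorisation, squares the scores so they are non-negative, and normalises them into a valid probability distribution; the embedding sizes and random parameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_relations, rank = 5, 3, 8        # assumed toy sizes

# CP-style factorisation: one embedding per subject, relation, and object.
subj_emb = rng.normal(size=(n_entities, rank))
rel_emb = rng.normal(size=(n_relations, rank))
obj_emb = rng.normal(size=(n_entities, rank))

# Score every (subject, relation, object) triple: sum over the rank dimension.
scores = np.einsum("sr,pr,or->spo", subj_emb, rel_emb, obj_emb)

# Squaring makes every score non-negative; normalising turns them into probabilities.
probs = scores ** 2
probs /= probs.sum()

print(probs.shape)        # (5, 3, 5): one probability per possible triple
print(probs.sum())        # 1.0: a valid probability distribution
```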
Working
The approach converts KGE models such as ComplEx, CP, RESCAL, and TuckER into generative models by reinterpreting them as circuits, i.e. structured computational graphs. This reinterpretation allows for efficient operations like marginalisation, which is crucial for understanding the distribution of certain variables within a larger set. Marginalisation is a fundamental concept in probability theory and statistics: when you have a probability distribution over multiple variables, you sum (or integrate, in the case of continuous variables) over all possible values of the variables you are not interested in, to obtain the distribution of the variables you are interested in.
To make these models generative, their outputs are modified through non-negative restriction or squaring, ensuring the outputs can represent probabilities. These adapted circuits, named Generative KGE Circuits (GeKCs), can then generate new triples for knowledge graphs by efficiently sampling from their modeled probability distributions. Moreover, GeKCs are designed to integrate logical constraints directly, ensuring that all generated or predicted triples are logically consistent, such as adhering to rules that specify how entities can or cannot relate to each other. This approach not only retains the models’ link prediction capabilities but also enhances their applicability by enabling them to handle large graphs efficiently and generate new, plausible triples that respect predetermined logical constraints.
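Continuing the toy setup from the previous sketch, the snippet below shows the two operations that make such a circuit useful as a generative model: marginalising out a variable (here, the object) and sampling a new triple from the modelled distribution. Everything here is illustrative, not the paper’s code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution over (subject, relation, object), built as in the previous sketch.
scores = rng.normal(size=(5, 3, 5))
probs = scores ** 2
probs /= probs.sum()

# Marginalisation: sum out the object axis to obtain P(subject, relation).
subject_relation_probs = probs.sum(axis=2)
print(subject_relation_probs.shape)    # (5, 3)

# Generation: sample a new triple from the modelled joint distribution.
flat_index = rng.choice(probs.size, p=probs.ravel())
s, p, o = np.unravel_index(flat_index, probs.shape)
print(f"sampled triple: (entity {s}, relation {p}, entity {o})")
```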
Consider the following scenario as an example:
Scenario: Imagine a knowledge graph containing information about people, movies, and their genres. We want to use a GeKC Model to predict missing information.
Logical Constraint: One logical constraint could be: “A person cannot act in a movie that is released before their date of birth.”
Existing Triples:
- (Tom Hanks, acted_in, Forrest Gump)
- (Tom Hanks, date_of_birth, 1956)
Prediction Task: Predict the release date of “Forrest Gump”.
Without Logical Constraints: The model might simply predict any release date based on statistical patterns in the data. This could lead to illogical predictions like “Forrest Gump” being released in 1954, which would contradict Tom Hanks’ date of birth.
With Logical Constraints: The model with logical constraints would consider the “date_of_birth” information and the constraint mentioned earlier. This would eliminate the possible release date predictions of a date before 1956 (Tom Hanks’ date of birth).
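A tiny sketch of how such a constraint could be enforced in practice: mask out any candidate release year that violates the rule before normalising the scores. The candidate years and scores below are made up purely for illustration.

```python
import numpy as np

candidate_years = np.array([1954, 1960, 1975, 1994, 2001])
scores = np.array([2.1, 1.3, 0.7, 3.5, 0.4])        # made-up model scores per candidate

date_of_birth = 1956
valid = candidate_years >= date_of_birth             # constraint: no release before the birth year

constrained = np.where(valid, scores ** 2, 0.0)      # zero out impossible candidates
constrained /= constrained.sum()                     # renormalise into probabilities

best = candidate_years[constrained.argmax()]
print(f"predicted release year: {best}")             # 1994, never 1954
```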
Evaluation of the model
The empirical evaluation demonstrates that GeKCs are competitive with traditional Knowledge Graph Embeddings (KGEs) for link prediction tasks, showing that the generative approach does not compromise on accuracy. Furthermore, incorporating domain constraints directly into the model significantly enhances predictions by ensuring they adhere to logical rules, improving reliability and relevance. Lastly, the quality of triples generated by GeKCs is evaluated through a novel metric, revealing that GeKCs can efficiently produce high-quality, new triples that are consistent with the knowledge graph’s existing information, thereby enriching the graph with plausible connections.
Potential Applications
Improved Question Answering Systems: This new method could allow virtual assistants to not only find existing answers within the knowledge graph but also generate new, logical relationships between concepts. This could lead to more comprehensive and informative answers, even for complex or open-ended questions.
Drug Discovery and Material Science: Researchers could use these enhanced KGEs to explore potential relationships between different chemicals or biological entities. By generating new, plausible connections based on existing knowledge, the system could help identify promising candidates for new drugs or materials with desired properties.
Recommendation Systems and Market Analysis: KGEs are already used by some recommendation systems to understand user preferences and suggest relevant products or services. This new approach could allow the system to go beyond simply recommending existing items. It could potentially generate new product ideas or identify previously unexplored market connections based on the knowledge graph.
Conclusion
The integration of Knowledge Graph Embeddings (KGEs) with Vision-Language Models (VLMs) and Generative Models, particularly through the innovative GraphAdapter and generative KGE circuits (GeKCs), represents a significant leap forward in AI’s ability to understand and generate complex content. This approach not only enhances the models’ predictive accuracy and contextual understanding but also ensures logical consistency in generated content. Moreover, it promises scalability and efficiency, making it a robust solution for enriching and expanding knowledge graphs in various applications.
References
- Niepert, M., Garcia-Duran, A., & Onoro-Rubio, D. (2023). How to turn your knowledge graph embeddings into generative models via probabilistic circuits. arXiv. https://arxiv.org/abs/2305.15944
- Li, X., Lian, D., Lu, Z., Bai, J., Chen, Z., & Wang, X. (2023). GraphAdapter: Tuning vision-language models with dual knowledge graph. arXiv. https://arxiv.org/abs/2309.13625