Introduction

Generative Artificial Intelligence (GAI) has demonstrated impressive performance on tasks such as text generation and text-to-image generation. Recent advancements in Multimodal Large Language Models (MLLMs), models that strive to mimic human-like perception by integrating multiple senses (multimodal) such as vision and hearing, have further improved the models’ ability to handle multi-format information (e.g. images, videos), opening up possibilities for developing general-purpose learners.

However, these models have their limitations, including a tendency to generate hallucinations (perceiving patterns or objects that are nonexistent or imperceptible), difficulty with arithmetic tasks, and a lack of interpretability.

To address these issues, a research direction has emerged that focuses on retrieving multimodal knowledge to augment generative models, known as augmented generation. By enabling models to interact with the external world and acquire knowledge in diverse formats and modalities, this approach can improve the factuality and rationality of the generated content, making it a promising solution to the problems above.

In this post, we introduce recent advances in retrieving multimodal information for augmented generation, giving a short introduction to three different models and their use cases.

RA-CM3

Retrieval-Augmented CM3 (RA-CM3) is the first retrieval-augmented multimodal model that can retrieve and generate both text and images. The model mainly consists of two parts: the retriever and the generator. The retriever is responsible for fetching relevant text and images from an external memory (e.g. documents collected from the web), while the generator uses the retrieved information to generate text or images (e.g. captions for images, or images generated from a text description).

RA-CM3’s multimodal retriever is built upon CLIP, a pretrained model with a mixed-modal encoder that can encode combinations of text and images. Its retrieval-augmented generator is built on CM3, which follows the Transformer architecture underlying many state-of-the-art models in Natural Language Processing. This is the first time existing techniques (CLIP and CM3) have been unified into a performant retrieval-augmented model through extensive analysis of design choices.
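To make the retrieval step concrete, here is a minimal sketch of CLIP-based dense retrieval feeding a generator, using a public CLIP checkpoint from the `transformers` library as a stand-in for the paper’s mixed-modal encoder. The toy memory, query, and top-k scoring below are illustrative, not RA-CM3’s exact setup.

```python
# Sketch: CLIP-style dense retrieval, whose results would be prepended to the
# generator's input (the CM3 generator itself is omitted here).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_texts(texts):
    """Encode text snippets into normalised CLIP embeddings."""
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Toy "memory" of candidate documents. In RA-CM3 these are mixed text+image
# documents; we use text captions only to keep the sketch short.
memory = [
    "A photo of the Eiffel Tower at night",
    "A recipe for sourdough bread",
    "The Golden Gate Bridge covered in fog",
]
memory_emb = embed_texts(memory)

# Retrieve the top-k documents most similar to the query; the generator would
# then condition on them when producing text or images.
query_emb = embed_texts(["a famous bridge in San Francisco"])
scores = (query_emb @ memory_emb.T).squeeze(0)
top_k = scores.topk(k=2).indices.tolist()
print([memory[i] for i in top_k])
```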

Illustration of RA-CM3 workflow (M Yasunaga et al., 2022) Link: https://arxiv.org/abs/2211.12561

RA-CM3 significantly improves both image and text generation quality while requiring less computational resources than existing models such as DALL-E and CM3. It can also create more realistic images and learn from demonstrations, showing exciting potential for future applications.

MuRAG

Multimodal information can also be used to solve QA problems. For example, when a user asks “What is the name of the landmark in the image?”, a text-only model cannot generate the correct answer. A multimodal QA system that takes both text and images as input, however, can combine textual and visual information to accurately answer the user’s question.

Multimodal Retrieval-Augmented Transformer (MuRAG) is proposed as the first model in this category that can be applied to open question answering over images and text. It is built on top of a “backbone” model that is pre-trained to encode image-text pairs so that the resulting representations are suitable for both retrieval and answer generation.
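The overall flow can be sketched as: encode a memory of image-text entries, encode the question into the same space, retrieve the nearest entries, and let the generator answer conditioned on question plus retrieved entries. The stand-in encoder and toy memory below are purely illustrative; MuRAG’s actual backbone is a pretrained vision-and-text Transformer.

```python
# Schematic sketch of MuRAG-style retrieval-augmented QA (all names illustrative).
import numpy as np

def encode(item: str) -> np.ndarray:
    """Stand-in encoder mapping an image-text entry or a question to a unit vector.
    In MuRAG, the pretrained backbone produces this shared embedding space."""
    rng = np.random.default_rng(abs(hash(item)) % (2**32))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

# Non-parametric memory of encoded image-text entries.
memory_items = [
    "(image of Eiffel Tower, 'landmark in Paris built in 1889')",
    "(image of Statue of Liberty, 'gift from France, located in New York')",
]
memory = np.stack([encode(m) for m in memory_items])

def answer(question: str, k: int = 1) -> str:
    q = encode(question)                              # 1) encode the question
    scores = memory @ q                               # 2) maximum inner-product search
    retrieved = [memory_items[i] for i in np.argsort(-scores)[:k]]
    # 3) the generator would attend over question + retrieved entries;
    #    here we only show the augmented input it would receive.
    return f"GENERATOR INPUT: {question} [RETRIEVED] {retrieved}"

print(answer("What is the name of the landmark in the image?"))
```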

Illustration of MuRAG workflow (W Chen et al., 2022) Link: https://arxiv.org/abs/2210.02928

Experiments show that, compared with other models, MuRAG performs better on datasets with questions about images and text (the WebQA and MultimodalQA datasets). In human analysis, MuRAG’s answers also matched the expected answers more closely than those of other models.

RACE

Apart from image-text generation and QA, multimodal information can also be used for retrieval-based code summarisation. For example, a model can automatically summarise updated code for developers by drawing on multimodal information such as previous summary-code pairs.

RACE, a Retrieval-Augmented Commit Message Generation model, was recently proposed to apply exactly this idea to code summarisation. It leverages relevant code diffs and their associated commit messages to enhance commit message generation.
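The core idea can be sketched in a few lines: find the most similar past diff, then feed the new diff together with the retrieved diff and its commit message to a seq2seq generator. The similarity metric and input format below are illustrative assumptions, not the paper’s exact design.

```python
# Sketch of retrieval-augmented commit message generation in the spirit of RACE.
import difflib

# Toy corpus of previous (diff, commit message) pairs.
corpus = [
    ("- return a + b\n+ return a - b", "fix: correct subtraction operator"),
    ("+ def parse_config(path):\n+     ...", "feat: add config file parsing"),
]

def retrieve(new_diff: str):
    """Return the (diff, message) pair whose diff is most similar to new_diff."""
    return max(
        corpus,
        key=lambda pair: difflib.SequenceMatcher(None, new_diff, pair[0]).ratio(),
    )

new_diff = "- return x + y\n+ return x - y"
sim_diff, sim_msg = retrieve(new_diff)

# A seq2seq generator (e.g. a Transformer encoder-decoder) would consume this
# augmented input and produce the new commit message.
model_input = f"<diff> {new_diff} <retrieved_diff> {sim_diff} <retrieved_msg> {sim_msg}"
print(model_input)
```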

Illustration of RACE workflow (E Shi et al., 2022) Link: https://arxiv.org/abs/2203.02700

Results show that RACE outperforms all baseline models on four different metrics when evaluated on a large-scale dataset called MCMD. It can also boost the performance of existing Seq2Seq models in commit message generation.

Conclusion

The field of Generative AI has witnessed significant advancements, particularly with the introduction of Multimodal Large Language Models (MLLMs) capable of integrating diverse types of information. However, limitations such as factual inaccuracies and lack of interpretability persist. Augmented generation, a promising research direction, addresses these issues by allowing models to interact with the external world and acquire knowledge in various formats, opening exciting possibilities for enhancing the factuality, rationality, and overall capabilities of generative models.

In this post, we covered three models that reflect recent advances in retrieving multimodal information for augmented generation, spanning text/image generation, question answering, and code summarisation. Future directions for some of these models, such as reducing the training time of RACE, remain open areas of research.

References

  • Yasunaga, M., Aghajanyan, A., Shi, W., James, R., Leskovec, J., Liang, P., Lewis, M., Zettlemoyer, L., & Yih, W. (2022). Retrieval-Augmented Multimodal Language Modeling. arXiv. https://export.arxiv.org/abs/2211.12561v2
  • Chen, W., Hu, H., Chen, X., Verga, P., & Cohen, W. W. (2022). MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2210.02928
  • Shi, E., Wang, Y., Tao, W., Du, L., Zhang, H., Han, S., Zhang, D., & Sun, H. (2022). RACE: Retrieval-Augmented Commit Message Generation (Version 3). arXiv. https://doi.org/10.48550/ARXIV.2203.02700
