How Much Can Vision Language Models Really “See”?
Exploring the potential and limitations of Vision Language Models
Introduction
The human brain is more extraordinary than any machine we could build. From an early age, many of us gain the ability to comprehend what our eyes tell us and articulate it. Furthermore, we combine evidence from all our senses to reason. Machines, on the other hand, have a long way to go. Much research has gone into enabling machines to eventually emulate this human ability through multimodal models, which are capable of taking in various information formats such as images, text, or audio.
One specific research focus is the creation of vision language models (VLMs), which can both ‘see’ and understand language. A recent model demonstrating this remarkable ability is GPT-4 with Vision (GPT-4V), which lets users supply an image and ask questions about it.
Despite significant progress in recent years, VLMs still have several shortcomings, which this article reviews. The structure of VLMs is briefly introduced along with some notable variants, followed by example applications in healthcare and in assisting people who are blind or have low vision. To conclude, the article reviews failure cases of VLMs and future directions.
Vision Language Models
VLMs are a fusion of computer vision models and large language models (LLMs), which respectively capture the intricacies of images and language. While computer vision models excel at processing image data, they often struggle to understand the meaning behind objects depicted in images. Conversely, language models exclusively operate on textual data, so they are adept at navigating the semantics and ambiguities of language but lack the capability to interpret visual cues. Researchers aim to combine the advantages of both model types, such that models can interpret images as well as the textual instructions provided to them and return well-crafted responses.
Generally, VLMs consist of three components:
- an image encoder,
- a textual encoder, and
- a module to connect image and textual encodings.
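To make this three-part structure concrete, the sketch below wires together stand-ins for each component in PyTorch. It is a purely illustrative toy, not any published architecture: real VLMs use large pretrained encoders and far more careful fusion, but the data flow is the same, with image features projected into the language model’s embedding space and processed alongside the text tokens.

```python
import torch
from torch import nn

class ToyVLM(nn.Module):
    """Illustrative only: each component is a small stub for a large pretrained model."""

    def __init__(self, vision_dim: int = 768, llm_dim: int = 1024, vocab_size: int = 32000):
        super().__init__()
        # 1. Image encoder (stand-in for a vision transformer such as CLIP's image tower)
        self.image_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(vision_dim))
        # 2. Connector that projects visual features into the LLM's embedding space
        self.connector = nn.Linear(vision_dim, llm_dim)
        # 3. "LLM" stand-in that attends over the visual token and the text tokens together
        self.text_embedding = nn.Embedding(vocab_size, llm_dim)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, pixel_values: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        visual = self.connector(self.image_encoder(pixel_values)).unsqueeze(1)  # (B, 1, llm_dim)
        textual = self.text_embedding(input_ids)                                # (B, T, llm_dim)
        # The language model sees the projected image as an extra "token" alongside the text
        return self.llm(torch.cat([visual, textual], dim=1))

vlm = ToyVLM()
out = vlm(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 8)))
print(out.shape)  # torch.Size([1, 9, 1024])
```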
An example is BLIP-2, which stands for Bootstrapping Language-Image Pre-training (Li et al., 2023). BLIP-2 contains all three components of a VLM: an image encoder, an LLM serving as the textual component, and the Q-Former (Querying Transformer) module bridging the two. The image encoder and LLM are off-the-shelf and kept frozen, meaning they were designed and trained separately by others and are reused as-is. The ‘bootstrapping’ in the name refers to building vision-language pre-training on top of these frozen models rather than training everything from scratch: only the Q-Former is trained, learning to extract the visual information most relevant to the accompanying text and to pass it on to the LLM, which then generates the answer. The Q-Former therefore sits between the image encoder and the LLM like a connecting bridge. Comprehensive overviews of VLMs in general can be found on HuggingFace and in other dedicated articles.
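BLIP-2 is available through the HuggingFace transformers library, which makes this division of labour easy to see in practice: the processor prepares the image and prompt, while the frozen encoders and the Q-Former sit inside the model. A minimal visual question answering sketch, assuming the Salesforce/blip2-opt-2.7b checkpoint and a GPU (roughly following the usage documented on HuggingFace):

```python
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Frozen image encoder + Q-Former + frozen OPT language model, bundled in one checkpoint
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

# Any RGB image works; this COCO photo is just an example
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

prompt = "Question: how many cats are in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```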
Modifications to Vision Language Models
The modular format of VLMs allows researchers to swap out individual components to improve their models. Modifications proposed in prior work include:
- Using a different image encoder: In KRISP by Marino et al. (2020), information from the image is linked into a knowledge graph: objects detected in the image are connected, via the graph’s relationships, to external symbolic knowledge that helps the model answer knowledge-based questions. This symbolic signal is combined with the implicit knowledge captured by a BERT-based language model.
- Employing different connecting modules: BLIP-2, described above, introduced the Q-Former as its bridge between the image encoder and the LLM; other models instead use simpler linear or MLP projection layers.
- Training the components in a novel manner: Liu et al. (2023) proposed LLaVA, which pairs a CLIP (Contrastive Language-Image Pre-training) image encoder with the Vicuna LLM. To make the model better at a variety of tasks, the authors used GPT-4 to generate instruction-following data and fine-tuned the model on it, a process known as visual instruction tuning; a brief usage sketch follows this list.
- Adopting a retrieval-augmented approach: In the REVEAL (Retrieval-Augmented Visual-Language) model from Hu et al. (2022), knowledge from multimodal sources, including text and image data, is encoded into an external memory. When queried, the model retrieves relevant entries from this memory to ground its generated response in factual information and improve accuracy.
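As a concrete example of the LLaVA line of models mentioned above, the transformers library also ships a LLaVA-1.5 integration. A minimal sketch, assuming the community llava-hf/llava-1.5-7b-hf checkpoint and a recent transformers version (the prompt template follows that checkpoint’s conventions):

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # CLIP vision tower + projector + Vicuna-7B
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The <image> placeholder marks where the projected visual tokens are spliced in
prompt = "USER: <image>\nWhat is unusual about this picture? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

output_ids = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```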
What are Vision Language Models Good For?
VLMs serve as valuable tools to assist humans in applications such as image captioning, visual summarisation and visual question answering. These can be extended to applications with broader impacts, such as:
1. Visual Question Answering in Medical Imagery
Although VLMs are far from ready for routine use in the healthcare industry, they show potential to support medical practitioners in making diagnoses. When presented with a medical image, a VLM can answer questions about it (Bazi et al., 2023). Because such models are trained on vast amounts of past data, medical professionals could draw on a breadth of prior cases that would be too much for any one person to review.
2. Be My Eyes
In March 2023, OpenAI collaborated with Be My Eyes to develop Be My AI, which incorporates GPT-4V. This technology allows people who are blind or have low vision to capture pictures using their smartphones and receive descriptions of the images from Be My AI, thereby allowing them greater ease in navigating the visual world. Users are currently cautioned against relying solely on the application, as GPT-4V could potentially hallucinate when unsure about what it perceives. With further development, users may eventually be able to use applications such as this to obtain additional assistance in tasks such as reading prescriptions or navigating environments.
Where Vision Language Models Fail
Even state-of-the-art models like GPT-4V can fail in scenarios that humans find trivial, such as judging the orientation of an object or counting a small number of items (Tong et al., 2024).
Despite how remarkable VLMs appear to be, their performance often falls well short of what humans achieve. Tong et al. (2024) found that most of the models they tested performed even worse than random guessing on their benchmark. The benchmark is built from so-called CLIP-blind image pairs: pairs of images that CLIP encodes as very similar even though they differ in clearly visible ways. Each image is paired with a two-option question; for example, an image of a dog lying down might come with the question, “Where is the yellow animal’s head lying in this image?”, and the model must choose between “(a) Floor” and “(b) Carpet”.
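For intuition, scoring such two-option questions is straightforward to sketch. In the snippet below, ask_vlm is a hypothetical placeholder for any real model call (for instance the BLIP-2 or LLaVA snippets above), the benchmark item is modelled on the example from the text, and its ground-truth label is assumed purely for demonstration:

```python
import random

def ask_vlm(image_path: str, question: str, options: tuple[str, str]) -> str:
    # A real implementation would build a prompt such as
    # f"{question} Choose one: (a) {options[0]} (b) {options[1]}."
    # and run it through a VLM together with the image at image_path.
    return random.choice(options)  # stub: guesses at random

# Illustrative item only; the "Floor" label is assumed for demonstration.
benchmark = [
    ("dog_lying_down.jpg",
     "Where is the yellow animal's head lying in this image?",
     ("Floor", "Carpet"),
     "Floor"),
]

correct = sum(ask_vlm(img, q, opts) == label for img, q, opts, label in benchmark)
print(f"Accuracy: {correct / len(benchmark):.0%} (random guessing averages 50% per question)")
```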
Moreover, Tong et al. (2024) found that the problems were even greater for models using CLIP as the image encoder. This is a significant issue given CLIP’s popularity, including in domains such as the medical imaging applications mentioned earlier. That popularity stems from the fact that CLIP is trained to align image and text embeddings in a shared space, making it a convenient off-the-shelf bridge between the two modalities.
Another significant finding was that merely scaling up the model did not make it better at picking up these visual details. It proved more effective to use a vision-only encoder such as DINOv2 in place of a vision-language one such as CLIP, although this came at the cost of some of the model’s ability to follow instructions.
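The contrast between the two encoder types can be probed directly: a CLIP-blind pair is, roughly, a pair of images whose CLIP embeddings are very similar while a vision-only encoder such as DINOv2 still tells them apart. Below is a hedged sketch of that comparison using HuggingFace checkpoints; the specific model sizes and the image file names are illustrative choices, not necessarily those used by Tong et al. (2024).

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
dino_model = AutoModel.from_pretrained("facebook/dinov2-base")
dino_processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")

@torch.no_grad()
def pair_similarities(image_a: Image.Image, image_b: Image.Image) -> tuple[float, float]:
    """Cosine similarity of an image pair under CLIP and under DINOv2."""
    clip_inputs = clip_processor(images=[image_a, image_b], return_tensors="pt")
    clip_feats = clip_model.get_image_features(**clip_inputs)

    dino_inputs = dino_processor(images=[image_a, image_b], return_tensors="pt")
    dino_feats = dino_model(**dino_inputs).last_hidden_state[:, 0]  # CLS token embeddings

    cos = torch.nn.functional.cosine_similarity
    return cos(clip_feats[:1], clip_feats[1:]).item(), cos(dino_feats[:1], dino_feats[1:]).item()

# A pair with high CLIP similarity but low DINOv2 similarity is a candidate
# "CLIP-blind" pair; the file names below are placeholders for your own images.
clip_sim, dino_sim = pair_similarities(Image.open("dog_a.jpg"), Image.open("dog_b.jpg"))
print(f"CLIP similarity: {clip_sim:.3f}, DINOv2 similarity: {dino_sim:.3f}")
```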
Conclusion
Vision language models have undergone rapid improvement and will continue to do so in the coming years. Their modular architecture permits various modifications discussed in this article, and they display significant potential for practical applications like in healthcare or helping the visually impaired. However, they still cannot perform at the level that humans do. Further research is necessary to address their shortcomings before these models can be utilised reliably without human oversight.
References
- Li, J., Li, D., Savarese, S., & Hoi, S. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (Version 3). arXiv. https://doi.org/10.48550/ARXIV.2301.12597
- Marino, K., Chen, X., Parikh, D., Gupta, A., & Rohrbach, M. (2020). KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2012.11014
- Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual Instruction Tuning (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2304.08485
- Hu, Z., Iscen, A., Sun, C., Wang, Z., Chang, K.-W., Sun, Y., Schmid, C., Ross, D. A., & Fathi, A. (2022). REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2212.05221
- Bazi, Y., Rahhal, M. M. A., Bashmal, L., & Zuair, M. (2023). Vision–Language Model for Visual Question Answering in Medical Imagery. In Bioengineering (Vol. 10, Issue 3, p. 380). MDPI AG. https://doi.org/10.3390/bioengineering10030380
- OpenAI (2023). GPT-4V(ision) System Card. https://openai.com/research/gpt-4v-system-card
- Tong, S., Liu, Z., Zhai, Y., Ma, Y., LeCun, Y., & Xie, S. (2024). Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2401.06209