Introduction

Misinformation is false or inaccurate information, often spread in a deliberate attempt to make people believe something that is not true. It has a significantly negative impact on public health, political stability, and social trust and harmony. To address these challenges, large language models (LLMs) are being employed to identify inconsistencies and verify the veracity of statements. While traditional LLMs excel at text-based tasks, they often struggle to understand and process other types of data. To bridge this gap, multimodal LLMs capable of processing and generating data in multiple modalities have been developed. By integrating text, images, audio and video, these models can understand and analyse different forms of data comprehensively.

Multimodal LLMs have a range of applications, including digitising notes by extracting text from images, interpreting complex visual content such as parking signs, deciphering ancient handwriting, and analysing speech files for summarisation and transcription. Researchers are actively exploring methods to verify multimodal content and interpret images more effectively, with the aim of using multimodal LLMs to identify and interpret multimodal misinformation in a cost-effective and accessible manner.

This article introduces two innovative methods that utilise multimodal LLMs, and explains how they address an important use case: the analysis of misinformation.

Retrieval-Augmented Generation (RAG)-based advanced reasoning techniques (RAGAR)

Retrieval-Augmented Generation (RAG) is an advanced method that optimises the output of LLMs by incorporating information retrieval into the generation process. For a more detailed introduction to RAG, see the article RAG: The next big thing after LLMs?

Retrieval-Augmented Generation (RAG)-based advanced reasoning techniques (RAGAR) employ multimodal LLMs that process both textual and visual information. The two specific RAGAR approaches are (1) Chain of RAG (CoRAG) and (2) Tree of RAG (ToRAG). Both methods incorporate multimodal LLMs and utilise multimodal RAG coupled with reasoning in the fact-checking domain.

1. Chain of RAG (CoRAG)

CoRAG enhances the standard RAG approach by employing a sequential strategy to pose questions derived from a claim and its associated image context. An LLM formulates an initial question addressing a particular aspect of the claim, which is then answered using a multimodal RAG system. After each response, a follow-up check assesses whether the information obtained so far is sufficient. If further clarification is needed, the process iteratively generates informed follow-up questions, leveraging the accumulated answers. This cycle is limited to a maximum of six questions, or terminates early once enough evidence has been gathered to decisively evaluate the claim’s veracity.

Chain of RAG (CoRAG) pipeline. Source: https://doi.org/10.48550/ARXIV.2404.12065
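To make the control flow concrete, here is a minimal sketch of the CoRAG loop in Python. It is not the authors’ implementation: the callables passed in (question generation, the multimodal RAG answerer, the sufficiency check and the follow-up generator) are placeholders for the LLM prompts and retriever described in the paper.

```python
from typing import Callable

QAPair = tuple[str, str]

def corag_fact_check(
    claim: str,
    image_context: str,
    generate_question: Callable[[str, str], str],           # LLM prompt: initial question
    rag_answer: Callable[[str, str], str],                  # multimodal RAG: retrieve and answer
    enough_evidence: Callable[[str, list[QAPair]], bool],   # follow-up sufficiency check
    generate_followup: Callable[[str, list[QAPair]], str],  # LLM prompt: informed follow-up question
    max_questions: int = 6,                                 # CoRAG asks at most six questions
) -> list[QAPair]:
    """Sequentially ask questions about the claim until the evidence suffices."""
    evidence: list[QAPair] = []
    question = generate_question(claim, image_context)
    for _ in range(max_questions):
        answer = rag_answer(question, image_context)
        evidence.append((question, answer))
        if enough_evidence(claim, evidence):      # stop early once the claim can be judged
            break
        question = generate_followup(claim, evidence)
    return evidence  # handed to the veracity prediction step
```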

2. Tree of RAG (ToRAG)

ToRAG expands on CoRAG by branching out the questioning process at each reasoning step. Given a claim and image context from multimodal claim generation, ToRAG generates three unique questions to fact-check the claim and employs an evidence retriever to fetch and generate answers. These answers are passed to an elimination prompt, where a selection process chooses the best question-answer pair based on criteria such as relevance, detail and answer confidence; this pair is retained as candidate evidence. The candidate evidence then informs the next set of three follow-up questions, continuing the cycle. With each iteration, only the most informative question-answer pair is kept, creating a cumulative list of candidate evidence. The ToRAG cycle terminates when it has either accumulated a sufficient evidence base to assess the claim’s veracity or reached the limit of six question-answer pairs. The final selected evidence informs the veracity prediction module, which then produces a final verdict and explanation for the claim.

Tree of RAG (ToRAG) pipeline. Source: https://doi.org/10.48550/ARXIV.2404.12065
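The branching-and-elimination logic can be sketched in the same style; again, the callables are stand-ins for the prompts and retriever described in the paper, not the released code.

```python
from typing import Callable

QAPair = tuple[str, str]

def torag_fact_check(
    claim: str,
    image_context: str,
    generate_questions: Callable[[str, str, list[QAPair]], list[str]],  # three branch questions per step
    retrieve_answer: Callable[[str, str], str],                         # evidence retrieval + answer generation
    eliminate: Callable[[str, list[QAPair]], QAPair],                   # elimination prompt: keep the best pair
    enough_evidence: Callable[[str, list[QAPair]], bool],
    max_pairs: int = 6,                                                 # at most six retained QA pairs
) -> list[QAPair]:
    """Branch into three questions per step and keep only the most informative QA pair."""
    candidate_evidence: list[QAPair] = []
    while len(candidate_evidence) < max_pairs:
        questions = generate_questions(claim, image_context, candidate_evidence)[:3]
        answered = [(q, retrieve_answer(q, image_context)) for q in questions]
        best = eliminate(claim, answered)          # judged on relevance, detail and answer confidence
        candidate_evidence.append(best)
        if enough_evidence(claim, candidate_evidence):
            break
    return candidate_evidence  # informs the veracity prediction module
```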

Example of RAGAR in action

The image below compares three different methods of fact-checking the same political claim: “A politician claimed that when he was governor, they banned sanctuary cities and implemented electronic verification”.

  1. Baseline: Sub-Question Generation (no RAGAR)

This approach generates subsidiary questions to validate the claim and concludes with a ‘failed’ verdict prediction, indicating that the evidence does not support the claim.

2. RAGAR CoRAG

CoRAG uses a sequence of questions, each building on the last, to explore the claim in depth. It concludes that the actions mentioned in the claim are ‘supported’ by the evidence.

3. RAGAR ToRAG

ToRAG employs a branching mechanism to generate multiple lines of inquiry from the initial claim. It filters through the evidence to select the most relevant question-answer pairs, concluding the claim as ‘supported’.

An overview of the fact-checking pipeline contrasting the baseline Sub-Question Generation approach with the RAGAR: Chain of RAG and RAGAR: Tree of RAG approaches. Source: https://doi.org/10.48550/ARXIV.2404.12065

Multimodal Misinformation Interpretation and Distillation Reasoning (MMIDR)

The Multimodal Misinformation Interpretation and Distillation Reasoning (MMIDR) framework is designed to detect multimodal misinformation and to enhance the ability of large language models (LLMs) to interpret it. The framework aims to provide fluent, high-quality textual explanations for the models’ decision-making process. As shown in the figure below, the framework first enhances the multimodal misinformation data through data augmentation so that it fits an instruction-based format. A labelling template is then used to prompt a teacher LLM to extract rationales. Finally, knowledge distillation transfers the teacher LLM’s capability to explain multimodal misinformation into a student LLM, which is trained on the original multimodal information merged with these distilled rationales.

Model Architecture Overview of MMIDR. Source: https://doi.org/10.48550/arXiv.2403.14171
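Read end to end, the framework can be summarised as three stages composed together. The sketch below is purely organisational; each stage, and the assumptions behind it, is detailed in the sections that follow.

```python
from typing import Callable

def mmidr_pipeline(
    posts: list[dict],                                    # raw image-text posts with ground-truth labels
    augment: Callable[[dict], dict],                      # data augmentation -> instruction-format instance
    elicit_rationale: Callable[[dict], str],              # labelling prompt sent to the teacher LLM
    distill: Callable[[list[dict], list[str]], object],   # LoRA fine-tuning of the student LLM
):
    """High-level MMIDR flow: augment, elicit teacher rationales, distil into the student."""
    instances = [augment(p) for p in posts]
    rationales = [elicit_rationale(inst) for inst in instances]
    return distill(instances, rationales)                 # returns the trained student LLM
```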

Data augmentation

This process converts image-text pairs into an appropriate instruction-following format. It consists of two steps:

  1. Visual Information Processing

Given the visual content, optical character recognition (OCR) is used to detect the characters in the image and convert them into editable text. At the same time, image captioning is used to generate a description of the image content.

2. Evidence Retrieval

This step retrieves evidence from the internet, so that the model is equipped with global knowledge. Google reverse image search produces a selection of images similar to the query image, from which textual evidence such as titles and descriptions can be retrieved. Visual evidence is likewise extracted from these retrieved images using the visual information processing techniques described above.

Through data augmentation, we obtain a list of instances, each comprising the textual content of the post, the OCR-recognised text, a caption generated from the visual content, textual evidence and visual evidence.
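A hedged sketch of how one such instance might be assembled is shown below. Here `run_ocr`, `caption_image` and `reverse_image_search` are hypothetical placeholders for the OCR engine, captioning model and Google reverse image search described above, and the field names are illustrative rather than the paper’s exact schema.

```python
from typing import Callable

def build_instance(
    post_text: str,
    image_path: str,
    run_ocr: Callable[[str], str],                       # OCR: text rendered inside the image
    caption_image: Callable[[str], str],                 # image captioning model
    reverse_image_search: Callable[[str], list[dict]],   # assumed to return hits with "title", "description", "image"
) -> dict:
    """Assemble one instruction-format instance from an image-text post."""
    hits = reverse_image_search(image_path)
    return {
        "post_text": post_text,
        "ocr_text": run_ocr(image_path),
        "image_caption": caption_image(image_path),
        "textual_evidence": [f'{h["title"]} {h["description"]}' for h in hits],
        # visual evidence: apply the same visual information processing to the retrieved images
        "visual_evidence": [caption_image(h["image"]) for h in hits],
    }
```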

Rationales Elicitation

Rationales elicitation refers to generating explanations for the given multimodal information using a teacher LLM. A simple labelling prompt template is used to prompt the teacher LLM to generate rationales. The output therefore encompasses both the rationale and the ground-truth label, which can subsequently be used to train the student LLM.

Labelling Prompt Template. Source: https://doi.org/10.48550/arXiv.2403.14171
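For concreteness, a labelling prompt of this kind might be assembled as follows. The wording is purely illustrative (the paper’s exact template is shown in the figure above), and the call to the teacher LLM is left as a placeholder.

```python
def build_labelling_prompt(instance: dict, label: str) -> str:
    """Illustrative labelling prompt: ask the teacher LLM to justify the ground-truth label."""
    return (
        "You are a fact-checking assistant.\n"
        f"Post text: {instance['post_text']}\n"
        f"OCR text: {instance['ocr_text']}\n"
        f"Image caption: {instance['image_caption']}\n"
        f"Textual evidence: {'; '.join(instance['textual_evidence'])}\n"
        f"Visual evidence: {'; '.join(instance['visual_evidence'])}\n"
        f"The ground-truth label for this post is: {label}.\n"
        "Explain step by step why the evidence supports this label."
    )

# rationale = teacher_llm(build_labelling_prompt(instance, label))  # teacher API call is a placeholder
```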

Knowledge Distillation

Knowledge distillation involves fine-tuning the student LLM to maximise the alignment of its output sequences with those of the teacher LLM, thereby aligning the student’s predictions with the teacher’s. It is analogous to a teacher passing knowledge on to their students. Low-Rank Adaptation (LoRA) is used to make this tuning efficient: optimisation is applied only to the low-rank decomposition components added to the Transformer layers. The process uses an instruction template, derived from the labelling prompt template by removing the label-related content.
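As a rough illustration of this step, the sketch below fine-tunes a stand-in student model on a single (instruction, rationale) pair using the Hugging Face transformers and peft libraries. It is a generic LoRA supervised fine-tuning setup under stated assumptions, not the authors’ code: the model, target modules, hyperparameters and training strings are placeholders, and GPT-2 is used only so the snippet runs end to end.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# GPT-2 is a tiny stand-in for the actual student LLM; MMIDR would use a larger open model.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA: only the added low-rank decomposition matrices inside the Transformer layers are trained.
lora_config = LoraConfig(
    r=8,                           # rank of the decomposition (illustrative value)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],     # GPT-2 attention projection; LLaMA-style models use q_proj/v_proj
    task_type="CAUSAL_LM",
)
student = get_peft_model(base_model, lora_config)

# One training example: the instruction template (label-related content removed) plus the
# rationale elicited from the teacher LLM; both strings here are illustrative.
prompt = "Post text: ...\nEvidence: ...\nExplain whether this post is misinformation.\n"
target = "The post is misleading because the image predates the event it claims to show."

inputs = tokenizer(prompt + target, return_tensors="pt")
labels = inputs["input_ids"].clone()      # next-token targets: align the student with the teacher's sequence
loss = student(**inputs, labels=labels).loss
loss.backward()                           # one gradient step over the LoRA parameters (optimiser omitted)
```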

Conclusion

This article has presented two innovative methodologies for applying multimodal large language models (LLMs) to misinformation detection and interpretation, demonstrating their potential for combating misinformation in a multimodal digital world. However, these methods still have limitations. For the RAGAR approaches (CoRAG and ToRAG), the experiments show that issuing the same query with different promptings multiple times can return varying search results. This is problematic because it affects the interpretation, and it is difficult to choose a final answer from among the differing results; ToRAG mitigates this by analysing answers to slightly varied questions and selecting the question-answer pair that provides the most information. For the MMIDR method, a performance gap remains between the distilled student model and the teacher model. Future challenges include improving the stability of RAGAR results and enhancing the knowledge distillation process in MMIDR. Addressing these issues will be crucial for realising the full potential of multimodal LLMs in tackling misinformation effectively.

References

  • Khaliq, M. A., Chang, P., Ma, M., Pflugfelder, B., & Miletić, F. (2024). RAGAR, Your Falsehood RADAR: RAG-Augmented Reasoning for Political Fact-Checking using Multimodal Large Language Models (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2404.12065
  • Wang, L., Xu, X., Zhang, L., Lu, J., Xu, Y., Xu, H., Tang, M., & Zhang, C. (2024). MMIDR: Teaching Large Language Model to Interpret Multimodal Misinformation via Knowledge Distillation (Version 3). arXiv. https://doi.org/10.48550/ARXIV.2403.14171
