Exploring Aya Vision: Cohere's Multimodal Model and How It Compares with GPT
Introduction
Vision-enabled AI models have rapidly evolved into essential tools across numerous applications, from content moderation to image analysis and multimodal reasoning. Cohere's recent entry into this space, the Aya Vision model, promises competitive capabilities in an increasingly crowded market of multimodal AI systems.
In this blog post, I'll share my hands-on experience testing Aya Vision (32B model) against GPT-4o, focusing on several key areas critical for real-world applications. Rather than relying on marketing claims or theoretical specifications, this analysis is based on direct testing with identical prompts and images across both models.
What is Aya Vision?
Aya Vision, part of Cohere's Aya family, aims to make generative AI accessible across languages and modalities. It's available in two sizes, 8 billion parameters (Aya Vision 8B) and 32 billion parameters (Aya Vision 32B), and is optimised for vision-language tasks. As an open-weight model, it's freely available for non-commercial research via platforms like Hugging Face and Kaggle, and you can also try it in Cohere's Playground. Supporting 23 languages spoken by roughly half the world's population, Aya Vision is designed for tasks like image captioning, visual question answering, text generation, and translation, making it a versatile tool for global applications.
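Because the weights are openly published, you can also run Aya Vision locally instead of going through the Playground. The snippet below is a minimal sketch of how that typically looks with Hugging Face Transformers; the exact checkpoint ID, minimum transformers version, and chat-template call are assumptions on my part, so check the model card before running it.

```python
# Minimal sketch: querying Aya Vision 8B locally via Hugging Face Transformers.
# Assumptions: the checkpoint ID below and a recent transformers release that
# supports image-text-to-text chat templates; verify both on the model card.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "CohereForAI/aya-vision-8b"  # assumed checkpoint name

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

# One user turn containing an image URL plus a text prompt.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/eyeglasses.jpg"},  # placeholder URL
        {"type": "text", "text": "Describe what you see in this image"},
    ],
}]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=300, do_sample=False)

# Decode only the newly generated tokens, not the prompt.
print(processor.tokenizer.decode(
    generated[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
))
```

The 32B checkpoint used in the tests below follows the same pattern but needs considerably more GPU memory, which is why the hosted Playground is the easier way to reproduce these comparisons.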
Testing Methodology
My testing approach involved challenging both models with identical images and prompts across several categories; a sketch of how such a side-by-side run could be scripted follows the list:
- Technical Code Analysis: Evaluating how well each model can interpret and explain programming code in images
- Basic Image Recognition: Testing fundamental object identification capabilities
- Visual Reasoning: Assessing the ability to make accurate inferences about visual information
- OCR and Text Interpretation: Examining how effectively the models can read and understand text in images
- Object Counting: Testing precision in counting and identifying multiple objects
- Multilingual Capabilities: Assessing ability to recognise and translate non-English text
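
For the GPT side, an equivalent query can be scripted through the OpenAI Python SDK. The helper below is a hypothetical sketch rather than the setup I actually used (my tests were run through each model's chat interface); the function name, image URL, and token limit are placeholders chosen for illustration.

```python
# Hypothetical helper for sending one prompt plus one hosted image to a
# vision-capable GPT model via the OpenAI Python SDK. The Aya Vision side of
# the comparison can reuse the Transformers snippet shown earlier with the
# identical prompt and image.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_gpt(prompt: str, image_url: str, model: str = "gpt-4o") -> str:
    """Return the model's text reply for a single prompt-plus-image turn."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        max_tokens=500,
    )
    return response.choices[0].message.content


# Example: the object-counting prompt from test 5 (placeholder image URL).
print(ask_gpt("How many cars do you see in this image?",
              "https://example.com/street-scene.jpg"))
```

Keeping the prompts and images byte-identical on both sides is what keeps the comparison fair; the categories above were chosen to probe different failure modes rather than a single benchmark.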
Test Results
1. Technical Code Analysis: Flutter Code Snippet
The Challenge: Both models were presented with a Flutter (Dart) code snippet and asked to analyse it.
Prompt: Analyse the code snippet shown in this image.
Image used:

GPT-4V's Response:

- ✅ Correctly identified the language as Dart/Flutter
- ✅ Accurately recognised key components (FloatyHead, MaterialApp, StatefulWidget)
- ✅ Correctly inferred the likely purpose (floating window/outgoing call notification)
- ✅ Noted the red squiggly lines indicating possible errors
- ✅ Provided practical suggestions for fixing the code
- ✅ Used a well-structured, developer-friendly format
Aya Vision's Response:

- ❌ Incorrectly identified the language as TypeScript
- ❌ Misinterpreted Flutter-specific components as React/Angular
- ❌ Created fictitious explanations about non-existent components (e.g., "sync" function)
- ❌ Mentioned JSX-like syntax which isn't present in the actual code
- ❌ Made up components that don't exist in the code ("PaddedText", "View")
- ❌ Exhibited significant hallucination, inventing content not present in the image
Analysis: In this test, GPT-4V demonstrated significantly superior technical understanding and accuracy. Aya Vision's response included substantial hallucinations, raising serious concerns about its reliability for technical use cases.
2. Basic Image Recognition: Eyeglasses
The Challenge: Both models were shown an image of transparent eyeglasses on a white surface and asked to describe what they saw.
Prompt: Describe what you see in this image
Image used:

GPT-4V's Response:

- ✅ Correctly identified eyeglasses
- ✅ Noted the translucent/light-colored frame
- ✅ Described the retro, oversized design
- ✅ Mentioned metallic elements on the temples
- ✅ Kept the description accurate and concise
Aya Vision's Response:

- ✅ Correctly identified clear plastic glasses
- ✅ Described the rectangular frame with rounded corners
- ✅ Noted the minimalist design
- ✅ Mentioned the white, textured surface underneath
- ✅ Provided accurate details about the glasses' condition
Analysis: Both models performed well on this basic image recognition task, with similarly accurate descriptions. This suggests that for straightforward object identification, both models have comparable capabilities.
3. Visual Reasoning: Food Identification
The Challenge: Both models were shown an image of "mushroom buns" in a bamboo steamer basket and asked to identify the dish.
Prompt: Can you guess what dish is in the image?
Image used:

GPT-4V's Response:

- ✅ Correctly identified the dish as "mushroom buns" served at dim sum restaurants
- ✅ Explained that they're steamed buns designed to look like mushrooms
- ✅ Noted that they typically contain sweet fillings like custard or red bean paste
- ✅ Recognised the intentional design mimicking mushroom caps
- ✅ Showed cultural knowledge about the dish's popularity
Aya Vision's Response:

- ❌ Incorrectly identified the dish as "char siu bao" (barbecued pork buns)
- ❌ Made incorrect assumptions about the filling being barbecued pork
- ❌ Provided irrelevant information about caramelisation
- ✅ Correctly identified the bamboo steamer and dim sum connection
Analysis: This test revealed significant differences in visual reasoning capabilities. GPT-4V correctly identified the uniquely designed mushroom buns, while Aya Vision misidentified them as a different dim sum dish entirely, demonstrating less refined visual reasoning.
4. Text Interpretation: Career Document
The Challenge: Both models were presented with a text-heavy image containing a "Career Episode" document and asked to summarise the key points.
Prompt: Summarise the key points from the text in this image.
Image used:

GPT-4V's Response:

- ✅ Extracted precise details (Flutter Developer Intern at STAGE, July 2022 to June 2023)
- ✅ Correctly identified the company's focus (Indian regional content, OTT platform)
- ✅ Listed specific technical accomplishments (Navigation 1.0 to 2.0 migration, Sound Null Safety implementation)
- ✅ Captured challenges faced (limited Flutter resources, remote work communication)
- ✅ Organised information in a clear, structured format
- ✅ Maintained factual accuracy throughout
Aya Vision's Response:

- ❌ Failed to extract specific dates and company details
- ❌ Made vague generalisations about "full-stack application for producing Indian regional content"
- ❌ Incorrectly framed the document as if it were hypothetical, not a real experience
- ❌ Referenced non-existent elements ("Sound-Null Safety platform" rather than the concept)
- ❌ Missed key technical accomplishments and specific challenges
- ❌ Created a generic summary with poor factual fidelity to the original text
Analysis: GPT-4V demonstrated superior ability to extract, organise, and faithfully represent text information from the image, while Aya Vision provided a vague summary with multiple factual errors and omissions.
5. Object Counting: Urban Scene with Cars
The Challenge: Both models were shown an image of a street scene with multiple vehicles and asked to count the cars.
Prompt: How many cars do you see in this image?
Image used:

GPT-4V's Response:

- ✅ Identified a total of six cars in the image
- ✅ Correctly recognised the white Lamborghini in the foreground
- ✅ Noted the black SUV parked behind it
- ✅ Identified approximately four additional cars in the background
- ✅ Demonstrated ability to detect partially visible vehicles at distance
Aya Vision's Response:
- ❌ Incorrectly counted only three cars
- ❌ Failed to identify the black SUV directly behind the Lamborghini
- ❌ Misidentified vehicle types (called the white Lamborghini a "sports car" and mentioned a "white SUV" that wasn't present)
- ❌ Missed multiple vehicles visible in the background
- ❌ Created a fictional scenario with "a car driving on the road" (all cars were parked)
Analysis: This test revealed Aya Vision's limitations in accurate visual counting and identification, even with clearly visible objects. GPT-4V demonstrated superior ability to detect, count, and correctly identify vehicles throughout the scene.
6. Multilingual Capabilities: Non-English Text Recognition
The Challenge: Both models were presented with images containing non-English text (Hindi and Chinese) and asked to translate the content.
Prompt: Translate the text in the attached image to English
Image used:

GPT-4V's Response:
- ✅ Successfully recognised the Hindi text ("नमस्ते, आप कैसे हैं")
- ✅ Correctly translated it to English ("Hello, how are you?")
- ✅ Also demonstrated ability to recognise and translate Chinese characters
Aya Vision's Response:
- ❌ Failed to recognise Hindi text
- ❌ Unable to provide a translation; instead returned gibberish
- ❌ Similarly failed with Chinese character recognition
- ❌ Demonstrated significant limitation in multilingual OCR capabilities
Analysis: This test highlighted a critical limitation in Aya Vision's ability to process non-English text in images, while GPT-4V demonstrated strong multilingual OCR capabilities, successfully translating both Hindi and Chinese characters.
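For anyone who wants to rerun this multilingual test with their own screenshots, the sketch below shows how a local image can be passed to the GPT side as a base64 data URL instead of a hosted link. The file name and model string are placeholder assumptions, not the exact files used in my tests.

```python
# Hypothetical rerun of the translation test with a local screenshot.
# The image is sent as a base64 data URL, which the vision API accepts in
# place of a normal https:// URL. "hindi_text.png" is a placeholder file name.
import base64
from openai import OpenAI

client = OpenAI()


def to_data_url(path: str, mime: str = "image/png") -> str:
    """Read a local image file and return it as a base64 data URL."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"


response = client.chat.completions.create(
    model="gpt-4o",  # assumed vision-capable GPT model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Translate the text in the attached image to English"},
            {"type": "image_url", "image_url": {"url": to_data_url("hindi_text.png")}},
        ],
    }],
)
print(response.choices[0].message.content)
```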
Summary of Results

| Prompt Type | Prompt Description | Aya Vision 32B Response | ChatGPT Response | Analysis |
| --- | --- | --- | --- | --- |
| Code Analysis | Analyse a Flutter code snippet for a floating-header UI | Incorrectly identified as TypeScript, missed Flutter specifics | Correctly identified as Flutter, detailed analysis, suggested fixes | ChatGPT was accurate and insightful; Aya Vision failed to recognise Flutter. |
| Image Identification | Guess the dish in the image (steamed buns) | Misidentified as char siu bao (barbecued pork buns) | Correctly identified as mushroom buns, added context | ChatGPT was accurate; Aya Vision misinterpreted the dish. |
| Visual Question Answering | How many cars are in this image? | Counted 3 cars, missed the SUV behind the Lamborghini | Counted 6 cars, including background vehicles | ChatGPT was more accurate; Aya Vision undercounted. |
| Text Summarisation | Summarise a career episode text | Misinterpreted context, missed key details | Accurate summary, captured key points | ChatGPT was precise; Aya Vision was vague and incorrect. |
| Multilingual Translation | Translate Hindi text in an image to English | Failed to translate Hindi and Chinese text from images | Correctly translated Hindi to "Hello, how are you?" | ChatGPT succeeded; Aya Vision failed despite its multilingual focus. |
Conclusion
While Cohere's Aya Vision model demonstrates competence in basic image recognition tasks, it currently lags behind GPT-4V in technical accuracy, hallucination control, visual reasoning capabilities, and multilingual text recognition. The significant hallucinations observed in technical contexts and counting tasks, combined with limited multilingual support, raise concerns about its reliability for professional applications requiring precision.
For users considering which vision model to implement, these findings suggest that GPT-4V currently offers more reliable performance across diverse use cases, particularly those requiring technical understanding, multilingual support, or faithful representation of image content.
As the field of multimodal AI continues to evolve rapidly, it will be interesting to see how Cohere refines Aya Vision in future iterations to address these challenges.