Exploring Grok 3: xAI’s homebrew Artificial Intelligence

Introduction

In February 2025, xAI unveiled Grok 3, a highly anticipated AI model touted as the “smartest AI in the world” (xAI). Developed to advance scientific discovery and deepen our understanding of the universe, Grok 3 promises superior reasoning, real-time data processing, and versatile applications. This literature review synthesises insights from 10 articles and blog posts to explore what Grok 3 is, its key features, performance, use cases, and user experiences, offering a balanced perspective on its capabilities and limitations.

What is Grok 3?

Grok 3 is xAI’s third-generation AI model, launched on February 17, 2025, as a successor to Grok 2. According to xAI, it was trained on the Colossus supercluster with 10 times the computational power of previous models, utilizing 200,000 NVIDIA H100 GPUs (Anthem Creation). Elon Musk, xAI’s founder, claims Grok 3 is “an order of magnitude more powerful” than its predecessor, designed to outperform competitors like OpenAI’s GPT-4o and Google’s Gemini (CNET). Its training dataset reportedly includes 280 billion multilingual text tokens, 14 million hours of annotated educational videos, and legal archives from 12 countries, though exact details remain undisclosed (Anthem Creation).

Grok 3 is accessible via X Premium+ subscriptions ($40/month), a standalone SuperGrok subscription ($30/month), or through web and mobile apps (TechCrunch). It also powers features on X, xAI’s social platform, and supports an API for developers, priced at $3 per million input tokens (TechCrunch API).
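
To make the API pricing concrete, the arithmetic can be sketched in a few lines of Python. The function name `estimate_input_cost` is hypothetical, and the sketch covers only the $3 per million input tokens figure quoted above; output-token pricing is not given in the sources reviewed here, so it is left out.

```python
# Price quoted above: $3 per million input tokens (TechCrunch API).
INPUT_PRICE_PER_MILLION_USD = 3.0


def estimate_input_cost(input_tokens: int,
                        price_per_million: float = INPUT_PRICE_PER_MILLION_USD) -> float:
    """Return the estimated input-token cost in US dollars.

    Hypothetical helper for illustration only; it ignores output tokens,
    whose price is not stated in this review.
    """
    if input_tokens < 0:
        raise ValueError("input_tokens must be non-negative")
    return input_tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    for tokens in (10_000, 250_000, 1_000_000):
        print(f"{tokens:>9,} input tokens -> ${estimate_input_cost(tokens):.2f}")
```

At this rate, a single prompt of one million input tokens would cost about $3 before any output tokens are counted.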

Key Features and Capabilities

Grok 3 stands out for its advanced reasoning, large context window, and innovative tools. Below are its primary features:

Reasoning Capabilities

Grok 3 employs large-scale reinforcement learning to refine its reasoning, allowing it to “think” for seconds to minutes, correct errors, and explore alternative solutions (xAI). It offers two operational modes:

Think Mode: Executes 48 cognitive steps with real-time cross-validation for accurate responses.
Big Brain Mode: Uses 12.8 TB of VRAM per request, reportedly achieving 99.4% accuracy on the GPQA benchmark (Anthem Creation).

Performance on Benchmarks

Grok 3 demonstrates strong performance across academic benchmarks, as shown in the recreated table below, based on data from Anthem Creation:

Benchmark	Grok 3	GPT-4.5	Gemini Ultra
AIME (Maths)	92.1%	89.7%	88.3%
GPQA (Science)	94.6%	91.2%	89.8%
Codex (Python)	89.3%	87.1%	85.6%
MMLU (General)	93.8%	94.1%	92.4%
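
As a quick way to read the table, the sketch below finds the leading model on each benchmark and the unweighted average per model. The scores are copied from the table above; the helper names `leaders` and `averages` are illustrative, and a simple mean across four unrelated benchmarks is only a rough summary, not a standard metric.

```python
from statistics import mean

# Scores (percent) copied from the table above (data from Anthem Creation).
SCORES = {
    "AIME (Maths)":   {"Grok 3": 92.1, "GPT-4.5": 89.7, "Gemini Ultra": 88.3},
    "GPQA (Science)": {"Grok 3": 94.6, "GPT-4.5": 91.2, "Gemini Ultra": 89.8},
    "Codex (Python)": {"Grok 3": 89.3, "GPT-4.5": 87.1, "Gemini Ultra": 85.6},
    "MMLU (General)": {"Grok 3": 93.8, "GPT-4.5": 94.1, "Gemini Ultra": 92.4},
}


def leaders(scores):
    """Map each benchmark to the model with the highest score on it."""
    return {bench: max(row, key=row.get) for bench, row in scores.items()}


def averages(scores):
    """Map each model to its unweighted mean score across all benchmarks."""
    models = next(iter(scores.values())).keys()
    return {m: mean(row[m] for row in scores.values()) for m in models}


if __name__ == "__main__":
    for bench, model in leaders(SCORES).items():
        print(f"{bench:<15} leader: {model}")
    for model, avg in averages(SCORES).items():
        print(f"{model:<13} mean: {avg:.2f}%")
```

On these figures, Grok 3 leads on AIME, GPQA, and Codex, while GPT-4.5 is ahead on MMLU by 0.3 percentage points.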

Grok 3 also achieved an Elo score of 1402 in the Chatbot Arena, indicating strong user preference (xAI). It also reportedly shows 37% lower performance variance across domains, which suggests consistent reliability.
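
To give a feel for what an Elo score means, the sketch below converts a rating gap into an expected head-to-head preference rate. It assumes the classic Elo formula with a 400-point scale; Chatbot Arena's exact rating method may differ, and the opponent rating of 1302 is a hypothetical value chosen only for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected win rate of A over B under the classic Elo formula.

    Assumes the standard 400-point logistic scale; this is an
    illustration, not a description of Chatbot Arena's own method.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


if __name__ == "__main__":
    grok3_rating = 1402           # figure reported by xAI
    hypothetical_opponent = 1302  # assumed value, not from any source
    p = expected_score(grok3_rating, hypothetical_opponent)
    print(f"Expected preference rate: {p:.1%}")
```

Under these assumptions, a 100-point lead corresponds to being preferred in roughly 64% of pairwise comparisons, so rating gaps of a few dozen points imply fairly close matchups.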

Additional Performance Comparison

To compare Grok 3 and ChatGPT directly, I gave both models the same test prompt:

Generate Python code using matplotlib to create a bar chart with the following population data: {'China': 1444216107, 'India': 1393409038, 'United States': 332915073, 'Indonesia': 276361783, 'Pakistan': 225199937}.

This is the result Grok produced:

*This image was generated by xAI’s Grok 3.*

And this is the result ChatGPT produced:

*This image was generated by OpenAI’s ChatGPT 4o.*
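
For reference, a minimal script that satisfies the prompt above might look like the following. This is an illustrative sketch written for this review, not the output of either model; the function names, the sorting step, the billions scale on the y-axis, and the output file name `population.png` are all assumptions.

```python
# Population data taken verbatim from the test prompt above.
POPULATION = {
    "China": 1444216107,
    "India": 1393409038,
    "United States": 332915073,
    "Indonesia": 276361783,
    "Pakistan": 225199937,
}


def sorted_population(data):
    """Return (country, population) pairs, largest population first."""
    return sorted(data.items(), key=lambda item: item[1], reverse=True)


def plot_population(data, output_path="population.png"):
    """Draw a bar chart of the data and save it to output_path."""
    # Imported here so the data helper above works without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend; writes to a file
    import matplotlib.pyplot as plt

    countries, counts = zip(*sorted_population(data))
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.bar(countries, [c / 1e9 for c in counts])  # scale to billions
    ax.set_xlabel("Country")
    ax.set_ylabel("Population (billions)")
    ax.set_title("Population by country")
    fig.tight_layout()
    fig.savefig(output_path)
    plt.close(fig)


if __name__ == "__main__":
    plot_population(POPULATION)
```

A baseline like this makes it easier to judge what each model added beyond the bare request, such as value labels, colour choices, or axis formatting.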

DeepSearch and Multimodality

Grok 3’s DeepSearch tool enables real-time web crawling and information synthesis, analysing millions of pages with 82% accuracy compared to Gemini’s 67% (Anthem Creation). Its cognitive multimodality allows it to process text and images and to perform symbolic reasoning, including solving handwritten equations and generating physical simulations (Anthem Creation).

Grok 3 Mini

xAI also released Grok 3 Mini, a cost-efficient version for STEM tasks, which scores 95.8% on AIME’24 when using Think Mode (xAI).

Use Cases and Applications

Grok 3’s versatility supports a range of applications:

Scientific Research: DeepSearch aids in synthesising real-time scientific data, useful for researchers (xAI).
Coding and Development: Users report success in code debugging and Python script generation (Latenode).
Business Analysis: Grok 3 provides nuanced analyses, such as evaluating AI regulations (Latenode).
Creative Tasks: It can write in specific styles (e.g., Hemingway’s voice) and generate images, though creativity is sometimes predictable (Latenode).
Government Use: The U.S. Department of Government Efficiency reportedly used Grok 3 for policy work (Wikipedia).

Potential future applications include brain-machine interfacing, combining local and cloud-based models for cognitive enhancement (Life Architect).

Comparing Grok 3 to Other Models

Grok 3 is positioned as a rival to GPT-4o, Claude 3.5 Sonnet, DeepSeek, and Gemini. While xAI claims superiority, reviews offer mixed perspectives:

Helicone notes Grok 3’s 1-million-token context window matches Gemini 2.5 and GPT-4.1, but its real-world performance varies (Helicone).
Decrypt found Grok 3 excels in reasoning but doesn’t consistently outperform competitors across all tasks (Decrypt).
Lifehacker argues that Grok 3’s improvements are incremental, not revolutionary, and its price hike may not be justified (Lifehacker).

Others’ Experiences with Grok 3

User reviews highlight both strengths and weaknesses:

Latenode praises Grok 3’s “absurdly fast” responses and 89.7% sarcasm detection accuracy but notes predictable creative outputs and cautious moral guardrails (Latenode).
Writesonic tested over 100 prompts, confirming Grok 3’s contextual depth but identifying quirks like occasional inaccuracies (Writesonic).
Tom’s Guide appreciated Grok 3’s nuanced responses to complex prompts, such as economic analyses, but criticised its use of assumptions over definitive data (Tom’s Guide).
Lifehacker found Grok 3 prone to hallucinations, similar to other models, and questioned its value given the cost (Lifehacker).
Unite.AI lauded Grok 3’s near-human reasoning but noted its reliance on a costly infrastructure (Unite.AI).

Ethical and Technical Challenges

Some sources raise concerns about Grok 3’s transparency and ethical alignment. The model’s estimated 1.8 trillion parameters and the undisclosed share of synthetic data in its training set raise reproducibility concerns (Anthem Creation). Additionally, its adaptive personalisation, which adjusts responses based on user interaction history, raises privacy questions (Anthem Creation).

Conclusion

Grok 3 represents a significant step forward for xAI, with its advanced reasoning, real-time data access, and strong benchmark performance positioning it as a formidable competitor in the AI landscape. However, mixed user reviews and concerns about cost and transparency suggest it may not fully meet the hype for all users. For researchers, developers, and enthusiasts, Grok 3 offers exciting possibilities, but its value depends on specific use cases and budget considerations. As xAI continues to refine the model, it will be fascinating to see how Grok 3 shapes the future of AI.