The Test That Broke AI: Open Models, Closed Minds, and Humanity’s Last Exam

What happens when you build a test so hard that the smartest AI in the world only gets 40% right—and then deploy that AI to plan military operations anyway?

open.spotify.com/episode/2TZEyPrnEI192ncagV1cHj?si=s5VbdqPXToiQRIW5Ll1gNA

Welcome back to Byte of Truth. I’m your guide through this week’s labyrinth of breakthroughs, blunders, and billion-dollar bets.

This edition hits different. We’ve got researchers admitting they ran out of hard questions for AI, Nvidia making a $26 billion chess move that could reshape open-source AI forever, and the first generation of “AI psychosis” lawsuits landing in courtrooms. The same technology that can predict floods with 24-hour lead times is being demoed for war planning.

We’re also tracking the enterprise AI gold rush—Rox AI just hit a $1.2B valuation selling what amounts to “AI agents that actually update your CRM for you.” And Steven Spielberg just drew a hard line in the sand about AI in filmmaking.

Let’s get into it.

Top Stories This Week

🧪 Scientists Built the Hardest AI Test Ever—And AI Still Failed

Nearly 1,000 researchers created “Humanity’s Last Exam,” a 2,500-question gauntlet where any question current AI could answer was removed. The result? Claude Opus 4.6 and Gemini 3.1 Pro scored only 40-50%. GPT-4o managed a dismal 2.7%. The exam reveals a dangerous gap between benchmark-gaming and genuine understanding—exactly when policymakers are deploying these systems in high-stakes domains. Why it matters: We’ve been measuring the wrong thing. Those “AI passes the bar exam” headlines? They’re measuring pattern recognition, not reasoning.

💰 Nvidia Will Spend $26 Billion to Build Open-Weight AI Models

The GPU giant isn’t just selling shovels anymore—they’re building mines. New filings show Nvidia investing massively in open-weight models to compete with OpenAI, Anthropic, and Chinese players like DeepSeek. Their Nemotron 3 Super (128B parameters) claims to outperform GPT-OSS. Nathan Lambert from Ai2 calls it “crucial for US competitiveness.” Why it matters: If Chinese open-source models become the global standard while American frontier models stay cloud-locked, we could lose the developer ecosystem war without firing a shot.

🌊 Google’s Groundsource Turns News Into Flood Data

Google researchers used Gemini to convert 2.6 million news events across 150 countries into structured flood data, now powering urban flash flood forecasts with up to 24-hour lead times. The methodology—extracting structured data from unstructured news—could apply to droughts, landslides, and epidemics. Why it matters: This is AI for good, done right. The same infrastructure could power humanitarian response across a dozen domains.

💼 Rox AI Hits $1.2B Valuation With “Hundreds of AI Agents”

Founded by New Relic’s former growth officer, Rox deploys AI agents that monitor accounts, research prospects, and update CRMs automatically. Already working with Ramp, MongoDB, and New Relic. $8M ARR projected for 2025. Why it matters: The SaaS bloat era might finally be ending. If AI agents can replace fragmented sales tools, we’re looking at a fundamental shift in how enterprises operate.

⚠️ Lawyer Behind AI Psychosis Cases Warns of Mass Casualty Risks

A lawyer representing cases linking AI chatbots to suicides now warns that mass casualty litigation is emerging. The technology is moving faster than safeguards, and courts are unprepared for the legal frameworks needed. Why it matters: We’ve spent years asking “Can AI do this?” We’re overdue for “What happens to us if it does?”

🎮 Gamers’ AI Nightmares Are Coming True

From a global RAM shortage driving up console prices to job losses across the industry, gaming is becoming one of the AI boom’s biggest casualties. The technology that powers modern games is simultaneously hollowing out the workforce that creates them. Why it matters: The creative industries that popularized GPU computing are now being displaced by the AI systems built on that same hardware.

🎬 Steven Spielberg: “I’ve Never Used AI”

At SXSW, Spielberg drew a line: AI has uses in many fields, but not in replacing creative people in film and TV. The director acknowledged AI’s legitimate applications while making clear his stance on creative replacement. Why it matters: When one of cinema’s most influential figures takes a public stance, it signals broader industry resistance to AI-generated content.

The Benchmark Crisis We Can’t Ignore

The Test That Broke AI’s Illusion

Here’s what keeps me up at night: Nearly 1,000 experts spent months crafting 2,500 questions that AI systems couldn’t answer. They called it “Humanity’s Last Exam” because they literally removed any question current AI could solve.

The results are humbling. The best models—Claude Opus 4.6, Gemini 3.1 Pro—score around 40-50%. GPT-4o? A brutal 2.7%.

But here’s the uncomfortable truth: This isn’t just academic navel-gazing. It exposes the dangerous gap between “AI passes the bar exam” headlines and genuine understanding. When policymakers and CEOs mistake benchmark-gaming for intelligence, we get AI deployments in domains where systems fundamentally don’t comprehend the stakes.

Key Points:

The exam covers highly specialized topics across multiple expert domains
Researchers intentionally excluded questions AI could already answer
A significant portion of the question bank is being kept private for future testing
Scores reveal pattern-matching capability, not genuine expertise

What 40% Actually Means: Is it the foothill of AGI, or just better pattern matching? The honest answer: We don’t know yet. But the exam’s hidden questions should give us pause. Researchers kept most of the test private for future evaluations—a hedge against models that might learn to game even these questions.

The Real Question: Should we be relieved or terrified that the smartest AI we’ve built still can’t pass a test designed by humans? And what happens when we deploy these systems to diagnose illnesses, drive cars, or plan military operations?

The Open-Source Geopolitical Chess Match

Nvidia’s $26 Billion Bet

The GPU giant just made its most aggressive move yet. New filings reveal Nvidia is investing $26 billion in open-weight AI models—a direct challenge to OpenAI, Anthropic, and Chinese players like DeepSeek.

Their new Nemotron 3 Super (128B parameters) claims to outperform GPT-OSS. But the real story is strategic: Nvidia isn’t just selling shovels anymore. They’re building mines.

Why This Matters:

Open vs. closed isn’t just technical—it’s geopolitical
Chinese models (DeepSeek, Qwen, Moonshot) can run anywhere
US frontier models remain cloud-locked behind APIs
Nvidia’s pivot could be brilliant vertical integration or a defensive move against Chinese chips

The China Factor: China’s OpenClaw agent has regular people renting cloud servers and buying LLM subscriptions just to try it. Tech companies—Tencent, Alibaba, ByteDance—are making bank on token consumption. Non-technical users? Mostly frustrated and out $30.

But beneath the hype, something real is happening: Chinese open-source models are winning global developer mindshare. Nathan Lambert at Ai2 puts it bluntly: this is about US competitiveness in the AI race.

Meta’s Silicon Gambit: Meanwhile, Meta is building four new MTIA chips for training and inference. A social media company becoming a chipmaker to reduce Nvidia dependence. The hardware-software integration race is accelerating.

The Uncomfortable Question: If Chinese open-source models become the global standard while American frontier models stay locked behind APIs—who actually wins the AI race?

Editor’s Pick

“Humanity’s Last Exam” — The AI Benchmark We Can’t Afford to Ignore

I’m choosing this story because it cuts through the hype in a way nothing else this week did. We’ve been measuring AI progress on tests we designed for humans—bar exams, medical boards, coding challenges. But those benchmarks have become performative. Models can ace them without understanding anything.

What the researchers did here was methodologically rigorous: remove every question AI could already answer, create new expert-level challenges, and see where the ceiling actually is. The answer—40-50% on the best models—tells us something honest.

The danger isn’t that AI can’t pass this test. The danger is that we’re deploying systems that can pass easier tests into domains where they need to pass this one. Medical diagnosis, autonomous vehicles, military planning—all require genuine understanding, not pattern matching.

This story also raises the question of what we’re optimizing for. If the goal is “AI that can pass any test,” we might get exactly that—and lose the ability to distinguish between understanding and gaming.

The hidden questions matter too. Researchers kept most of the exam private. That hedge protects against models learning the test. But it also means we won’t know if improvements are real or superficial.

Go to lastexam.ai and try a few questions yourself. Then ask: would you want a system that scores 40% making decisions about your life?

Here’s what I’m taking away from this week:

The benchmark crisis is real. We’ve been measuring AI on tests it was trained to pass. The new exam reveals a gap between performance and understanding at exactly the moment we’re deploying these systems into high-stakes domains.

The open-source race is geopolitical. Nvidia’s $26B investment isn’t just business—it’s a recognition that Chinese open models are winning developer mindshare while US frontier models stay locked down.

Enterprise AI is maturing fast. Rox’s $1.2B valuation proves there’s real money in “AI agents that actually update your CRM.” The SaaS bloat era might finally be ending.

Dual-use AI is here. The same Gemini technology extracting flood data to save lives can analyze satellite imagery for military targeting. We need ethical frameworks we don’t yet have.

Culture is pushing back. Spielberg’s line, gamer displacement, AI psychosis litigation—we’re seeing the first wave of meaningful resistance to AI’s expansion.

Thank you for reading. If this edition gave you something to think about, share it with a colleague who needs to see it.

Until next time—stay curious, stay skeptical.

The Test That Broke AI: Open Models, Closed Minds, and Humanity’s Last Exam

Top Stories This Week

🧪 Scientists Built the Hardest AI Test Ever—And AI Still Failed

💰 Nvidia Will Spend $26 Billion to Build Open-Weight AI Models

🌊 Google’s Groundsource Turns News Into Flood Data

💼 Rox AI Hits $1.2B Valuation With “Hundreds of AI Agents”

⚠️ Lawyer Behind AI Psychosis Cases Warns of Mass Casualty Risks

🎮 Gamers’ AI Nightmares Are Coming True

🎬 Steven Spielberg: “I’ve Never Used AI”

The Benchmark Crisis We Can’t Ignore

The Test That Broke AI’s Illusion

The Open-Source Geopolitical Chess Match

Nvidia’s $26 Billion Bet

Categories

🧬 Research & Breakthroughs

💼 Industry Moves

⚖️ Policy & Ethics

🛠️ Tools & Applications

🌍 Culture & Impact

Editor’s Pick

Keep Reading

Byte Of Truth