Anthropic used Pokémon to benchmark its newest AI model

Anthropic used Pokémon to benchmark its newest AI model. Yes, really.

In a blog post published Monday, Anthropic said that it tested its latest model, Claude 3.7 Sonnet, on the Game Boy classic Pokémon Red. The company equipped the model with basic memory, screen pixel input, and function calls to press buttons and navigate around the screen, allowing it to play Pokémon continuously.
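To make that setup concrete, here is a minimal sketch of such a play loop: read the screen, pick a button, press it, remember the choice. It is an illustration only, not Anthropic's implementation. `FakeEmulator`, `run_agent`, and `toy_policy` are hypothetical names, the 5x5 grid stands in for real screen pixels, and the hard-coded policy stands in for a call to the model.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, List

BUTTONS = ("up", "down", "left", "right", "a", "b", "start", "select")


@dataclass
class FakeEmulator:
    """Hypothetical stand-in for a Game Boy emulator (not a real library)."""
    x: int = 0
    y: int = 0
    presses: List[str] = field(default_factory=list)

    def screen(self) -> List[List[int]]:
        # A tiny 5x5 "pixel" grid with a 1 where the player stands.
        grid = [[0] * 5 for _ in range(5)]
        grid[self.y][self.x] = 1
        return grid

    def press(self, button: str) -> None:
        # The "function call" the agent uses to act on the game.
        if button not in BUTTONS:
            raise ValueError(f"unknown button: {button}")
        self.presses.append(button)
        if button == "right":
            self.x = min(self.x + 1, 4)
        elif button == "left":
            self.x = max(self.x - 1, 0)
        elif button == "down":
            self.y = min(self.y + 1, 4)
        elif button == "up":
            self.y = max(self.y - 1, 0)


def run_agent(
    emulator: FakeEmulator,
    choose_button: Callable[[List[List[int]], List[str]], str],
    steps: int,
    memory_size: int = 8,
) -> List[str]:
    """Observe the screen, choose a button, press it, and record the choice."""
    memory: deque = deque(maxlen=memory_size)  # basic memory: recent actions only
    for _ in range(steps):
        pixels = emulator.screen()
        button = choose_button(pixels, list(memory))
        emulator.press(button)
        memory.append(button)
    return list(memory)


def toy_policy(pixels: List[List[int]], memory: List[str]) -> str:
    """Placeholder for the model: walk right to the edge, then walk down."""
    row = next(r for r in pixels if 1 in r)
    return "right" if row.index(1) < 4 else "down"


if __name__ == "__main__":
    emu = FakeEmulator()
    recent = run_agent(emu, toy_policy, steps=6, memory_size=3)
    print(emu.x, emu.y, recent)
```

In a real harness, the policy would presumably send the screen and the memory contents to the model and parse a button name from its reply. The bounded `deque` is just one simple way to keep "basic memory" from growing without limit over tens of thousands of actions.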

A notable feature of Claude 3.7 Sonnet is its ability to engage in “extended thinking.” Like OpenAI’s o3-mini and DeepSeek’s R1, Claude 3.7 Sonnet can “reason” through challenging problems by applying more computing and taking more time.

That came in handy in Pokémon Red, apparently.

A previous version of Claude, Claude 3.0 Sonnet, failed to even leave the house in Pallet Town where the story begins. Claude 3.7 Sonnet, by contrast, successfully battled three Pokémon gym leaders and won their badges.

[Image: Anthropic Pokémon Red. Image Credits: Anthropic]

Now, it’s not clear how much computing Claude 3.7 Sonnet required to reach those milestones, or how long each took. Anthropic said only that the model performed 35,000 actions to reach the third gym leader, Lt. Surge.

It surely won’t be long before some enterprising developer finds out.

Pokémon Red is more of a toy benchmark than anything. However, there is a long history of games being used for AI benchmarking purposes. In the past few months alone, a number of new apps and platforms have cropped up to test models’ game-playing abilities on titles ranging from Street Fighter to Pictionary.
