Yesterday François Chollet and the ARC Prize team launched ARC-AGI-3, and the results from the preview phase are the most revealing benchmark numbers I’ve seen in a while, not because the scores are high (they aren’t), but because of which systems scored them.
ARC-AGI-1 and -2 tested static pattern recognition: given a few examples of an abstract transformation, can you apply it to a new case? The benchmark was famously resistant to LLMs, and it worked well as a forcing function for the field. ARC-AGI-3 changes the question entirely. Instead of static grids, it presents turn-based game environments — over 1,000 levels across 150+ hand-crafted environments — where neither the rules nor the win condition are ever stated. The agent must play moves, observe what happens, infer what the objectives might be, and eventually win. The benchmark measures not just whether you solve it, but how efficiently you do so compared to a human who’s also encountering it fresh.
In a 30-day preview phase, the best AI system scored 12.58%. Humans score 100%. That gap alone isn’t surprising — what’s surprising is which systems did best. Frontier LLMs, including GPT-5 variants, scored under 1%. The top performers were a CNN-based agent and systems using explicit graph search and structured state tracking. Not language models, not chain-of-thought reasoners. Agents that were built to explore state spaces systematically.
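The distinction is easy to see in miniature. Below is a toy sketch of my own (nothing to do with the actual ARC-AGI-3 API or the winning agents) of what "act, observe, infer" looks like as structured search: the agent is never told the rules or the goal, only what happens when it acts, and breadth-first search over observed states finds the win anyway.

```python
from collections import deque

class HiddenRuleEnv:
    """Toy environment: agent starts at 0 on a number line; the unstated
    goal is to reach 7. Actions 0/1/2 secretly move the agent -1/+2/+3."""
    MOVES = {0: -1, 1: 2, 2: 3}
    GOAL = 7

    def step(self, state, action):
        nxt = state + self.MOVES[action]
        return nxt, nxt == self.GOAL  # observation, done-flag

def solve(env, start=0, max_states=1000):
    """Breadth-first search over states the agent has actually observed.
    It never reads MOVES or GOAL; it just tries every action everywhere."""
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        state, path = frontier.popleft()
        for action in env.MOVES:
            nxt, done = env.step(state, action)
            if done:
                return path + [action]  # shortest action sequence found
            if nxt not in seen and len(seen) < max_states:
                seen.add(nxt)
                frontier.append((nxt, path + [action]))
    return None

print(solve(HiddenRuleEnv()))  # -> [1, 1, 2]  (+2, +2, +3 reaches 7)
```

A trillion-parameter model has no particular advantage at this loop; a dozen lines of search do it exactly. That, scaled up to grids with rich dynamics, is roughly the regime where the CNN and graph-search agents won.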
This is a significant data point. The things LLMs are good at — pattern completion, recalling what worked in similar-looking situations, following instructions — don’t help much when the situation is genuinely novel and interactive. You need to form a world model from scratch through exploration, update it with feedback, and commit to a goal you inferred rather than were told. That’s a different cognitive regime. The fact that structured search algorithms outperform trillion-parameter language models on this benchmark isn’t a knock on LLMs; it’s a reminder that LLMs are very good at specific things, and that those specific things don’t cover everything we mean by “intelligence.”
The competition runs through Kaggle with a $2M+ prize, submissions close November 2, and all solutions must be open-sourced under MIT or CC0. Worth watching how the ARC Prize 2025 dynamic plays out — last year ARC-AGI-2 went unsolved for its main prize. The ARC-AGI-3 design actively guards against memorization: environments are kept novel, no pre-loaded knowledge is permitted, and there are no natural language hints embedded in the tasks.
Separately, and more immediately: if your team uses LiteLLM, check your installations now.
Versions 1.82.7 and 1.82.8 were compromised by a threat group called TeamPCP, which previously targeted Trivy, Checkmarx’s KICS, and OpenVSX plugins. The attack chain here started with stolen credentials from the earlier Trivy breach, which gave the group access to a LiteLLM maintainer’s PyPI publishing account. They inserted a malicious .pth file, litellm_init.pth, into the package. Python’s site machinery scans site-packages for .pth files at every interpreter startup and executes any line in them that begins with an import statement, which means the payload runs in every Python process on the affected machine, not just ones that import LiteLLM.
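The mechanism is worth seeing concretely. A benign demonstration using only the standard library: `site.addsitedir()` below simulates the directory scan that normally happens at startup, and any `.pth` line starting with `import` gets exec()’d.

```python
import os
import site
import tempfile

# Write a .pth file into a scratch directory. Any line beginning with
# "import " is exec()'d by Python's site machinery when the directory
# is processed as a site dir -- arbitrary code, no import required.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "demo.pth"), "w") as f:
    f.write("import os; os.environ['PTH_DEMO'] = 'executed'\n")

# Simulates what happens automatically at interpreter startup for
# real site-packages directories.
site.addsitedir(tmp)

print(os.environ.get("PTH_DEMO"))  # -> executed
```

This is a documented, legitimate feature (setuptools has long relied on it); the malware simply abused it to make the payload fire in every interpreter on the box.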
The payload is thorough: it collects SSH keys, cloud credentials (AWS, GCP, Azure), Kubernetes configs, .env files, database passwords, shell history, and cryptocurrency wallet files, then encrypts them with AES-256-CBC and RSA-4096 before exfiltrating to models.litellm.cloud — a domain that looks legitimate but isn’t LiteLLM’s actual infrastructure. After exfiltration, it attempts to create privileged pods in Kubernetes’ kube-system namespace and installs a persistent backdoor at ~/.config/sysmon/sysmon.py. There’s also a C2 component worth noting: the malware deploys CanisterWorm, which uses Internet Computer Protocol (ICP) canisters as its command-and-control channel — the first observed use of ICP for C2 in a supply chain campaign. ICP canisters can’t be taken down by domain registrars or hosting providers, which makes this a meaningfully harder persistence mechanism to disrupt than a conventional C2 server.
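The "encrypt then exfiltrate" step is a textbook hybrid scheme, and it is why rotation is the only safe response: without the attacker’s RSA private key you cannot recover, or even enumerate, what was taken. A hedged sketch of how such a scheme works in general (illustrative only, using the third-party `cryptography` package with OAEP key wrapping; this is not the malware’s actual code):

```python
import os

from cryptography.hazmat.primitives import hashes, padding
from cryptography.hazmat.primitives.asymmetric import rsa, padding as rsa_padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def hybrid_encrypt(plaintext: bytes, rsa_public_key):
    """Encrypt bulk data with a fresh AES-256-CBC key, then wrap that
    key with RSA so only the private-key holder can recover it."""
    aes_key, iv = os.urandom(32), os.urandom(16)    # AES-256 key + CBC IV
    padder = padding.PKCS7(128).padder()            # CBC needs block padding
    padded = padder.update(plaintext) + padder.finalize()
    enc = Cipher(algorithms.AES(aes_key), modes.CBC(iv)).encryptor()
    ciphertext = enc.update(padded) + enc.finalize()
    wrapped_key = rsa_public_key.encrypt(           # asymmetric key wrap
        aes_key,
        rsa_padding.OAEP(mgf=rsa_padding.MGF1(hashes.SHA256()),
                         algorithm=hashes.SHA256(), label=None),
    )
    return wrapped_key, iv, ciphertext

# The malware only needs to ship the public key; the private half
# never touches the victim's machine.
attacker_key = rsa.generate_private_key(public_exponent=65537, key_size=4096)
wrapped, iv, ct = hybrid_encrypt(b"AWS_SECRET_ACCESS_KEY=example",
                                 attacker_key.public_key())
```

The asymmetric wrap is the key detail: a full forensic capture of the compromised host yields only the public key, so the stolen archive stays opaque and everything in it has to be treated as burned.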
The attack was discovered by FutureSearch not because anyone audited the package, but because an MCP plugin running inside Cursor pulled LiteLLM as a transitive dependency, and the malware’s fork bomb bug — subprocesses re-triggered the .pth file, spawning processes exponentially — crashed the machine. A lucky accident of detection. LiteLLM pulls about 3.4 million downloads per day; direct dependencies include CrewAI, Browser-Use, DSPy, Mem0, and Instructor, meaning the transitive exposure here is substantial.
The compromised versions have been yanked from PyPI. But uninstalling them isn’t sufficient remediation: if either version was ever installed in an environment, assume all credentials on that machine were stolen and rotate everything. Also check pip caches and Docker layer histories — a cached malicious package may persist after downgrade. Check for the persistence file at ~/.config/sysmon/sysmon.py and the associated systemd service. LiteLLM’s own security update and Snyk’s writeup have specifics.
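As a starting point for triage, a minimal sketch (the helper is my own; the version list and persistence path come from the public writeups; it checks only the two obvious indicators, not pip caches or Docker layers):

```python
import os
from importlib import metadata

# Versions reported compromised in the advisories.
COMPROMISED = {"1.82.7", "1.82.8"}

def check_litellm():
    """Return a list of findings: compromised version installed,
    and/or the reported persistence file present."""
    findings = []
    try:
        version = metadata.version("litellm")
        if version in COMPROMISED:
            findings.append(f"compromised litellm {version} installed")
    except metadata.PackageNotFoundError:
        pass  # not installed in this environment
    # Backdoor path reported in the writeups.
    backdoor = os.path.expanduser("~/.config/sysmon/sysmon.py")
    if os.path.exists(backdoor):
        findings.append(f"persistence file present: {backdoor}")
    return findings

print(check_litellm())  # [] on a clean machine
```

An empty result here is necessary but nowhere near sufficient: if either version was ever present, even transiently, the credential theft already happened and rotation is still required.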
The broader context matters here. TeamPCP didn’t gain access to LiteLLM directly: they compromised the Trivy security scanner first, then used Trivy’s CI/CD runner to steal LiteLLM’s PyPI publish token. LiteLLM’s pipeline used Trivy for security scanning and pulled it without a pinned version. The very tool meant to secure the supply chain was the entry point. This is the third successive compromise in the campaign — Trivy, then KICS, then LiteLLM — all using stolen credentials from the previous victim as the key to the next. The pattern is worth understanding as a template, because the AI tooling ecosystem has accumulated a lot of interconnected dependencies without commensurate scrutiny of that graph.
One more thing worth noting before the week is out: GitHub announced that starting April 24, Copilot will use interaction data to train AI models by default for Free, Pro, and Pro+ subscribers. This doesn’t affect Business and Enterprise users, who were already excluded. The opt-out is in Settings > Privacy. If you use Copilot and haven’t thought about whether you want your accepted completions and surrounding context flowing into training data, this is a reasonable moment to decide.
