In August 2024, Sakana Labs released a system called The AI Scientist. It could generate research ideas, write code, run experiments, produce figures, draft a full paper, and review its own work. The whole pipeline cost about $15 per paper.

The reaction was predictable. Some called it the end of science as we know it. Others pointed out the papers were terrible. Both groups were wrong in interesting ways.

The $15 paper and its problems

The AI Scientist v1 worked in a narrow domain: machine learning research on topics like diffusion models and grokking. It followed a template: take a human-authored code scaffold, generate a hypothesis, modify the code, run experiments, plot results, write a LaTeX paper, then use an LLM-based reviewer to score it.

The claimed result was that some generated papers exceeded the acceptance threshold of a top ML conference, as judged by its own automated reviewer. This is like grading your own exam.

An independent evaluation in February 2025 told a different story. 42% of experiments failed due to coding errors. The literature review was simplistic keyword search, not genuine synthesis. Well-established concepts like micro-batching for SGD were incorrectly flagged as novel. In one case, the system couldn't correctly compare the magnitudes of two numbers.

These aren't minor issues. They're fundamental. A scientist who can't tell whether their result is new, can't get their code to run half the time, and can't compare two numbers isn't really a scientist.

Then v2 passed peer review

In April 2025, Sakana released The AI Scientist v2. The architecture changed significantly. Instead of following a human-authored template, v2 used agentic tree search: an experiment manager agent explores multiple research directions in parallel, pruning dead ends and doubling down on promising paths. A vision-language model provided iterative feedback on figure quality. The system could generalize across ML domains, not just the specific subfields v1 was scaffolded for.

Sakana submitted three AI-generated papers to an ICLR 2025 workshop. One received reviewer scores of 6, 7, and 6, placing it in roughly the top 45% of submissions. It was accepted. This was the first fully AI-generated paper to pass actual peer review at a major conference.

But context matters. The workshop track was "I Can't Believe It's Not Better: Challenges in Applied Deep Learning," designed to have lower requirements than standard tracks. The authors themselves stated that none of the three papers met their internal bar for an ICLR conference-track paper. Passing a workshop is not the same as passing the conference. Still, the gap between v1 (fails its own review) and v2 (passes external review) closed remarkably fast.

Everyone else showed up

What happened next was more interesting than The AI Scientist itself. By early 2026, automated scientific discovery had become a crowded field.

Google released AI Co-Scientist in March 2025, a multi-agent system powered by Gemini 2.0 that positioned itself not as a replacement for scientists but as a collaborator. The results were striking because they were validated in the real world. It proposed a novel gene transfer mechanism for antimicrobial resistance that was independently confirmed by researchers at Imperial College London, who had spent years studying the same question. It identified drug candidates for liver fibrosis: three novel epigenetic modifiers, two of which showed confirmed anti-fibrotic activity in hepatic organoids without toxicity, validated at Stanford. It proposed drug repurposing candidates for acute myeloid leukemia that inhibited tumor viability at clinically relevant concentrations.

These aren't paper results. These are drugs that work in lab tests. That's a different category entirely.

AI-Researcher, from the University of Hong Kong, earned a NeurIPS 2025 Spotlight. It runs a three-stage pipeline: literature review and idea generation, algorithm design and implementation, and manuscript preparation. It was evaluated against 22 benchmark papers, and its contributions were described as "frequently approaching human-level quality."

The Allen Institute for AI launched AutoDiscovery in February 2026, built on their Asta open-science platform. AutoDiscovery takes a fundamentally different approach: instead of starting with a hypothesis, it starts with data. It generates hypotheses in natural language, writes and executes Python code to test them, interprets the results, and uses those findings to generate new hypotheses. It searches across 108 million academic abstracts and 12 million full-text papers. The output is a reproducible list of research directions, each with code and traceable evidence paths.

The math got solved too

While automated paper-writing was getting the headlines, automated mathematical reasoning was making arguably more profound progress.

DeepMind's AlphaProof, combining reinforcement learning with Lean 4 formal proofs, solved 4 of 6 problems at the 2024 International Mathematical Olympiad, earning silver medal status (28/42 points). One of those was the competition's hardest problem, solved by only five human contestants.

One year later, Gemini Deep Think solved 5 of 6 IMO 2025 problems, scoring 35/42 for gold medal standard. It worked entirely in natural language within the 4.5-hour time limit. IMO coordinators graded the solutions using the same criteria as for students and described them as "clear, precise and most of them easy to follow." Silver to gold in one year.

FunSearch, published in Nature in December 2023, discovered genuinely new mathematical constructions for the cap set problem, going beyond the best-known human results. It found these by searching in program space rather than solution space, which made the outputs interpretable and verifiable.

AI reviewing AI

One underappreciated piece of all this is the peer review side. If AI systems are writing papers, we need to know whether AI systems can evaluate them.

Stanford's Agentic Reviewer was tested on 297 ICLR 2025 submissions. The AI-human Spearman correlation was 0.42. The human-human Spearman correlation was 0.41. That is, the AI reviewer's agreement with humans is essentially identical to humans' agreement with each other. The AI's acceptance prediction AUC was 0.75 versus 0.84 for a single human reviewer, so it's not quite there on binary accept/reject, but the gap is smaller than most people assume.

ICLR 2025 itself deployed AI review tools across 10,000+ submissions. The tools suggest ways reviewers could be more specific and constructive. The known limitation: LLMs are significantly less likely than humans to comment on novelty. They're biased toward checking technical validity, which is easier to verify from the text alone.

There's also an adversarial risk. Researchers have shown that hidden instructions can be embedded in manuscripts to game AI reviewers. This is the same prompt injection problem that plagues every LLM system, just in academic clothing.

What actually changed

The honest summary of where things stand in early 2026:

AI can produce papers that pass workshop-level peer review, but not conference-level. It can identify drug candidates that work in organoids, but not in clinical trials. It can earn IMO gold medals. It can review papers about as well as a single human reviewer.

The limitations that persist are more interesting than the capabilities. Novelty detection remains fundamentally weak. AI systems still struggle to determine whether an idea is genuinely new versus a rediscovery of known results. Hallucination risk means all systems require external verifiers; unsupervised discovery is unreliable. Current systems work in ML and CS; extending to wet-lab sciences requires physical automation that doesn't exist yet. And every serious project, from Google to Ai2, explicitly positions as "co-scientist," not replacement.

But the trajectory matters more than the snapshot. The AI Scientist v1 was a demo with a 42% experiment failure rate. Eighteen months later, we have validated drug candidates, IMO gold medals, and peer-reviewed publications. The question isn't whether AI will do real science. It's how long until the gap between "workshop paper" and "Nature paper" closes the way the gap between "self-graded demo" and "peer-reviewed publication" already did.


References