
The Scaling Laws That Predict Everything

From mouse hearts to neural networks — the same math, the same line, the same question.

A mouse heart beats 600 times a minute. An elephant's, 30. The mouse lives two years. The elephant, seventy. But here's the thing that should stop you cold: both get about one billion heartbeats total. So does a cat. So does a horse. So does a whale. This isn't a coincidence. It's a law. And the same law — the exact same mathematical pattern — shows up in cities, in companies, and now in artificial intelligence. If you take the pattern seriously, it tells you something specific about where AI is going.

I.

The Mouse and the Elephant

Start with a simple question: why do big animals live longer?

The naive answer is that bigger bodies are somehow "tougher." But that's not it. A whale doesn't have stronger cells than a mouse. A whale cell and a mouse cell are roughly the same size, made of the same stuff, running the same chemistry.

The real answer is about rate. Small animals burn fast. Their hearts race, their metabolisms run hot, they wear out quickly. Big animals burn slow. Their hearts plod, their cells sip energy, they last.

But the product is eerily consistent. When you multiply heart rate by lifespan for any mammal, you get approximately the same number: around one billion beats. It's as if every mammal is issued the same metabolic budget at birth and allowed to spend it at whatever pace it chooses.

The Billion Heartbeats

Drag the slider from mouse to elephant. Watch the heart rate change but the total beats stay roughly constant.

This pattern was noticed in the 1930s by a biologist named Max Kleiber. He wasn't looking at heartbeats — he was looking at metabolic rate. How much energy does an animal burn per day? And what he found was one of the most robust quantitative laws in all of biology.

II.

Kleiber's Law

Here's what Kleiber found. If you plot metabolic rate against body mass for different animals, you get a messy scatter. Big animals burn more energy — obviously. But how much more?

If metabolism were proportional to mass, you'd expect a straight line with slope 1. A 1000x heavier animal would burn 1000x more energy. But it doesn't. A 1000x heavier animal burns only about 180x more energy.

The relationship is: metabolic rate ∝ mass^0.75

That three-quarter exponent is the key. Not 1. Not 2/3 (which you'd get from surface area). Three-quarters. And the beautiful thing is, when you plot this on a log-log scale — taking the logarithm of both mass and metabolic rate — the messy scatter becomes a perfectly straight line.
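The fit itself takes only a few lines. Here's a minimal sketch on synthetic data (the masses and the normalization constant are invented for illustration): generate metabolic rates from a 0.75 power law with noise, then recover the exponent as the slope of a straight-line fit in log-log space.

```python
import numpy as np

# Illustrative (mass, metabolic rate) pairs generated from
# rate = c * mass^0.75 with multiplicative noise. All numbers
# are made up for the sketch; only the recovered slope matters.
rng = np.random.default_rng(0)
mass_kg = np.array([0.02, 0.3, 4.0, 70.0, 600.0, 4000.0])  # mouse .. elephant
rate_watts = 3.4 * mass_kg**0.75 * rng.lognormal(0, 0.05, mass_kg.size)

# A power law y = c * x^k is a straight line in log-log coordinates:
# log y = log c + k * log x. An ordinary least-squares line fit
# therefore recovers the exponent k as its slope.
slope, intercept = np.polyfit(np.log10(mass_kg), np.log10(rate_watts), 1)
print(f"fitted exponent: {slope:.2f}")  # close to 0.75
```

The same recipe, fit a line to logarithms, is how every exponent in this essay is measured in practice.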

Kleiber's Law — Metabolic Rate vs Body Mass

Click "Linear Scale" to see the mess. Click "Log Scale" to see the law. The dashed line has slope 0.75.

That straight line on a log-log plot is the signature of a power law. And once you know what to look for, you see it everywhere.

III.

Power Laws Are Everywhere

Geoffrey West, the physicist who spent decades studying scaling, calls power laws "the most fundamental regularities in nature." They show up in:

Cities. When a city doubles in population, its infrastructure — roads, gas stations, power lines — doesn't double. It grows by about 85%. That's a power law with exponent 0.85. But wages, patents, and GDP per capita grow by about 115%. Exponent 1.15. Cities are sublinear in stuff and superlinear in ideas.[1]

This was shown rigorously by Bettencourt, Lobo, Helbing, Kühnert, and West in a 2007 paper analyzing data from hundreds of cities across multiple countries. The exponents are remarkably consistent across cultures.

Companies. Revenue scales with employee count, but sublinearly. Companies get less efficient as they grow — the opposite of cities. The typical half-life of a publicly traded company is about 10.5 years.[2]

West and colleagues showed that companies, unlike cities, have bounded growth. They follow sigmoidal curves and eventually die. Cities almost never die.

Languages. The most common word in English ("the") appears about twice as often as the second most common ("of"), three times as often as the third, and so on. Zipf's law. Power law.

Earthquakes. The Gutenberg-Richter law: for every magnitude-5 earthquake, there are about ten magnitude-4 earthquakes. Power law.

The pattern is the same every time: a straight line on a log-log plot. The only thing that changes is the slope. And that slope — the exponent — tells you something deep about how the system works.

The exponent is the story. An exponent less than 1 (sublinear) means economies of scale — you need less per unit as you grow. Greater than 1 (superlinear) means increasing returns — each unit produces more. Exactly 1 means boring linear growth. The exponent tells you whether the system rewards scale or punishes it.
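The arithmetic is worth doing once. A toy sketch (the quantities and the starting value of 100 are arbitrary) of what the city exponents above imply when population doubles:

```python
# Doubling a city's population under the exponents quoted above:
# 0.85 for infrastructure (sublinear), 1.15 for socioeconomic
# output (superlinear). Starting quantity of 100 is arbitrary.
def scaled(quantity: float, pop_ratio: float, exponent: float) -> float:
    """Power-law scaling: quantity * ratio**exponent."""
    return quantity * pop_ratio**exponent

roads = scaled(100.0, 2.0, 0.85)   # sublinear: about 180, not 200
wages = scaled(100.0, 2.0, 1.15)   # superlinear: about 222, not 200
print(round(roads), round(wages))
```

Double the city, and you need only ~80% more roads but get ~122% more output. That gap is why cities keep growing.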

IV.

Enter the Neural Network

In January 2020, a team at OpenAI published a paper titled "Scaling Laws for Neural Language Models." It should have been front-page news. It wasn't.

What Kaplan, McCandlish, Henighan, and colleagues showed was this: the performance of a language model — measured as the cross-entropy loss on held-out text — follows a power law with respect to three variables:

1. The number of parameters in the model
2. The size of the training dataset
3. The amount of compute used for training

Each of these, plotted on a log-log scale against loss, gives a straight line. Just like Kleiber's animals. The model doesn't care whether you give it more parameters or more data or more compute — as long as you give it more of something, performance improves along a predictable power law.

Neural Scaling — Loss vs Compute

Drag the slider to add more compute. The loss drops along a clean power law. Each labeled point is a real model scale.

This was a breakthrough not because it was surprising — people had noticed the trend — but because it was precise. You could write an equation, plug in a compute budget, and predict the loss before training the model. The curves didn't just fit the data. They predicted the data.

And the exponent? For compute vs. loss, it's about -0.05. That means every 10x increase in compute multiplies the loss by about 0.89, an 11% reduction. Not huge, but relentless. And it keeps going.
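A two-line sketch of what that exponent means in practice. The normalization constant is arbitrary, since only the ratio between two compute levels matters:

```python
# Loss under a power law L(C) = a * C**(-0.05). A 10x jump in
# compute multiplies the loss by 10**(-0.05), regardless of where
# on the curve you start. The constant 'a' is a placeholder.
def loss(compute: float, a: float = 1.0, exponent: float = -0.05) -> float:
    return a * compute**exponent

ratio = loss(10.0) / loss(1.0)  # 10**(-0.05), about 0.891
print(f"per 10x compute, loss shrinks to {ratio:.3f} of itself")
```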

V.

The Chinchilla Insight

There was a problem with the Kaplan scaling laws. They told you that more compute = lower loss, but they didn't tell you the best way to spend that compute. Should you make the model bigger? Or train it on more data?

In 2022, a team at DeepMind answered this question decisively. Their paper, "Training Compute-Optimal Large Language Models," showed that for a given compute budget, there's an optimal allocation between model size and training data. And most models were getting it wrong.

GPT-3 had 175 billion parameters, trained on 300 billion tokens. The Chinchilla analysis said that was badly undertrained: compute-optimal training wants roughly 20 training tokens per parameter. DeepMind's own Chinchilla, a 70 billion parameter model trained on 1.4 trillion tokens, outperformed much larger models, including GPT-3. Smaller model, way more data.

Compute Budget Allocator

Drag the slider to allocate your compute budget. The curve shows predicted loss. The sweet spot isn't in the middle — and GPT-3 missed it.

The lesson: scaling laws aren't just about "throw more compute at it." They have structure. There's an optimal frontier, and knowing where it is lets you train better models for less money. Every major lab now uses some version of this analysis before committing to a training run.
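A back-of-envelope version of that analysis can be sketched with two widely quoted approximations, not the paper's fitted constants: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and a compute-optimal ratio of roughly 20 tokens per parameter.

```python
import math

# Rough Chinchilla-style allocator. Uses two common approximations
# (not the paper's fitted loss surface): C = 6*N*D and D = 20*N.
# Substituting gives C = 120*N**2, so N = sqrt(C/120).
def optimal_allocation(compute_flops: float) -> tuple[float, float]:
    """Return (parameters N, tokens D) for a given FLOP budget."""
    n = math.sqrt(compute_flops / 120)
    return n, 20 * n

# GPT-3's approximate budget: 6 * 175e9 params * 300e9 tokens
n, d = optimal_allocation(6 * 175e9 * 300e9)
print(f"~{n / 1e9:.0f}B params on ~{d / 1e12:.1f}T tokens")
```

Because the constants here are cruder than the paper's fitted ones, the output won't exactly reproduce published model configurations, but the direction is the same: for GPT-3's budget, the rule of thumb wants a much smaller model trained on several times more tokens.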

VI.

Emergent Abilities

Here's where it gets weird. The scaling laws predict a smooth, continuous improvement in loss. But when you measure actual capabilities — can the model do arithmetic? can it write code? can it reason about analogies? — the improvement isn't smooth at all.

It's a step function.

Below a certain scale, the model can't do the task. Performance is essentially random. Then at some threshold — boom. The ability appears, almost fully formed. Wei et al. called these "emergent abilities" in a 2022 paper, and the word "emergent" was chosen carefully. These abilities weren't trained explicitly. They just... showed up, once the model was big enough.

Emergent Abilities — Accuracy vs Model Scale

Drag the slider to scale up the model. Watch different abilities "turn on" at different thresholds. Below the threshold: nothing. Above: capability.

This is both exciting and unsettling. Exciting because it means bigger models aren't just incrementally better — they can do qualitatively new things. Unsettling because you can't always predict which new abilities will emerge, or when.

A caveat. Some researchers argue that emergent abilities are partly an artifact of how we measure. If you use a continuous metric instead of a binary pass/fail, the transition looks smoother. This is a genuine debate. But even the skeptics agree that capabilities do improve with scale — the disagreement is about whether it's a phase transition or a steep sigmoid.
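A toy model makes the skeptics' point concrete. Suppose per-token accuracy improves as a smooth sigmoid in log-compute (an arbitrary assumption, chosen purely for illustration). Scoring a ten-token answer as all-or-nothing exact match turns that smooth curve into something that looks like a sudden jump:

```python
import math

# Toy illustration of the measurement caveat. Per-token accuracy p
# improves smoothly with log10(FLOPs) via a plain sigmoid centered
# at 10^22 (an arbitrary choice). Exact-match on a 10-token answer
# scores p**10, which looks like abrupt "emergence" even though
# nothing discontinuous happened underneath.
def per_token_accuracy(log_scale: float) -> float:
    return 1 / (1 + math.exp(-(log_scale - 22)))

for log_c in (20, 22, 24, 26):
    p = per_token_accuracy(log_c)
    exact_match = p**10
    print(f"10^{log_c} FLOPs: per-token {p:.2f}, exact-match {exact_match:.3f}")
```

The per-token column climbs gradually; the exact-match column sits near zero and then shoots up. Same underlying model, two very different-looking curves.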

VII.

The Compute Trend

So: performance scales with compute according to a power law. Fine. But how fast is compute actually growing?

The answer is: absurdly fast. Epoch AI, a research group that tracks these things, estimates that training compute for frontier models has grown at roughly 4-5x per year since 2010. That's faster than Moore's Law. It comes from three sources: better chips (Moore's Law), more chips (bigger clusters), and more money (AI investment).

To put this in perspective: AlexNet (2012) used about 10^17 FLOPs. GPT-3 (2020) used about 10^23.5. GPT-4 (2023) reportedly used around 10^25. That's a hundred-million-fold increase in a decade.
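That figure is easy to sanity-check from the two endpoints just quoted:

```python
# Implied annual growth rate between the two endpoints in the text:
# AlexNet (~1e17 FLOPs, 2012) and GPT-4 (~1e25 FLOPs, 2023).
start_flops, end_flops = 1e17, 1e25
years = 2023 - 2012
growth_per_year = (end_flops / start_flops) ** (1 / years)
print(f"implied growth: {growth_per_year:.1f}x per year")
```

Eight orders of magnitude over eleven years works out to a bit over 5x per year, consistent with (slightly above) the 4-5x trend estimate, which averages over many models rather than just two frontier endpoints.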

Training Compute Over Time

Historical data points in solid circles. Adjust the growth rate to project forward. The dashed line is what hasn't happened yet.

The question everyone wants to answer is: will this trend continue? There are reasons it might slow down (power constraints, chip shortages, diminishing returns on investment). And reasons it might not (nation-state level investment, new architectures, algorithmic improvements that are equivalent to more compute).

VIII.

What the Curves Predict

Now we can chain the two laws together. We have:

1. Loss as a function of compute (the scaling law)
2. Compute as a function of time (the trend)

Combine them, and you get: loss as a function of time. And if you're willing to map loss to capabilities — which is hand-wavy but useful — you get a rough timeline for when AI systems reach various milestones.
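The chaining itself is mechanical. A sketch with placeholder constants (none of these values are fitted to real models; they only show the shape of the composition):

```python
# Compose the two curves: compute as a function of time (the trend),
# then loss as a function of compute (the scaling law). Every
# constant below is an illustrative placeholder.
def compute_at(year: float, c0: float = 1e23, year0: int = 2020,
               growth: float = 4.0) -> float:
    """Exponential compute trend: c0 FLOPs at year0, growing 'growth'x/yr."""
    return c0 * growth ** (year - year0)

def loss_at(compute: float, a: float = 100.0, exponent: float = -0.05) -> float:
    """Power-law scaling: loss falls as compute**exponent."""
    return a * compute**exponent

for year in (2020, 2024, 2028):
    print(year, f"{loss_at(compute_at(year)):.3f}")
```

An exponential trend fed through a power law gives an exponential decay in loss over time: slow per year, but with no plateau built into the math. Any plateau has to come from the curves themselves bending.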

This is, to be clear, an extrapolation. Extrapolations can break. The curves might bend. New bottlenecks might appear. But the striking thing is: so far, they haven't bent. Every time someone predicted the scaling laws would hit a wall, the next generation of models landed right on the curve.

When Does AI Match...?

Adjust the scaling exponent to see how the timeline shifts. A steeper exponent (more negative) means faster progress. The threshold lines are illustrative, not precise.

The honest answer is: nobody knows exactly when these thresholds will be crossed, or if the curves will hold. But the scaling laws give you something you didn't have before — a quantitative framework for thinking about the question. Not vibes. Not hype. Math.

IX.

What If Intelligence Is Just Scale?

Here's the thing that should keep you up at night.

No fundamental architectural breakthrough was needed to go from GPT-2 to GPT-4. Same basic architecture — a transformer. Same training objective — predict the next token. The difference? More compute. More data. More parameters. Scale.

GPT-2 couldn't reliably do arithmetic. GPT-4 can pass the bar exam. The same architecture. Just bigger.

If the scaling laws hold — if intelligence really is, in some meaningful sense, a function of scale — then the trajectory is set. We know how fast compute is growing. We know the exponents. The rest is arithmetic.

But "if" is doing a lot of work in that sentence. There are serious reasons the scaling laws might not hold indefinitely:

Data. We may be running out of high-quality training data. The internet is large but finite, and the best data has already been scraped. Synthetic data might help, but nobody knows if it's a full substitute.

Diminishing returns on loss. Lower loss doesn't always mean proportionally better capabilities. There might be a region where loss improvements stop translating into useful new abilities.

The wrong metric. Cross-entropy loss on next-token prediction might not capture what we actually care about. A model can have excellent loss and still be unreliable, inconsistent, or unable to do genuinely novel reasoning.

Energy and economics. Training frontier models already costs hundreds of millions of dollars. At 4x/year growth, costs hit national-GDP levels within a decade. Something has to give — either the economics or the trend.

These are real constraints. But here's what's remarkable: people have been pointing out these constraints for years, and the curves keep going. Every predicted wall has been a speed bump so far. Maybe the next one is real. Maybe not.

The remarkable thing about scaling laws is they give you a prediction you can check. They don't ask you to take anything on faith. They say: if you spend X compute, you'll get Y loss. You can verify this. And so far, it checks out. Whether it keeps checking out is the most important empirical question in AI.

The mouse and the elephant are governed by the same law. The question is whether the neural network is too.

Written by Danish Mohd.
AI product builder. Previously VP Engineering at Pixis AI.