Discussion about this post

David J. Friedman:

I really appreciate this attempt to ground the AGI conversation in verifiable gates. It’s refreshing to see someone focus on receipts rather than vibes. Your framing of Types 4–6 especially resonates.

The shift from “does useful tasks” → “improves itself” → “proposes its own ends” feels like the right axis.

Something I’ve been working on in my own conceptual designs is the idea that capability isn’t just about what an agent can do, but also about how it stabilizes itself while doing it. You hint at this with corrigibility, drift checks, and charter boundaries, and I think that’s the piece worth expanding.

In my experiments I’ve found huge differences between:

• systems that only learn from new data,

• systems that learn from their own internal representations,

• and systems that learn from persistent identity anchors (values, sensory associations, stable self-models).

Those last two tiers behave very differently under pressure, especially when goals extend over multiple days. You could call it something like “value inertia” or “identity coherence,” and it seems relevant for Types 4–6.

Not arguing, just adding a perspective: capability ramps tend to depend as much on memory architecture + self-alignment as on raw intelligence.

Your post is a great way to get the conversation out of prophecy mode and into engineering mode. Thanks for putting it out there.

Pulp Weaver:

A valuable scale, for sure. We are past the initial "Bigger, Better" stage. We need quantifiable criteria to ensure the right model is used for the right use case.

2 more comments...
