Before You Write the Check for a Million Tokens

The frontier model companies aren't your only option. The economics of self-hosted AI are already better than most people realize.

At GTC 2026, Jensen Huang said something that got a lot of attention. He proposed that every engineer at Nvidia should receive a token budget worth half their base salary — on top of their pay. For a $500,000 engineer, that's $250,000 a year in tokens. If they only spent $5,000, he said he'd "go ape."

It was a great soundbite. It made headlines. And if you're an enterprise leader trying to figure out your AI strategy, it probably made you wonder whether you're spending enough.

Before you write the check, I want to paint a different picture. Not because Jensen is wrong about the value of AI — he's right that these tools can amplify what engineers do. But because the framing assumes a world where the only way to access this capability is through frontier model APIs, where every token is metered, and where your entire AI strategy flows through someone else's infrastructure.

Subtler developments are reshaping how organizations will access AI over the long term, and they're worth understanding before you commit.

Gemma 4 Hints at a Different Future

On April 2, 2026, Google DeepMind released Gemma 4 — and it's a significant moment for anyone thinking about self-hosted AI.

Gemma 4 ships in four sizes: a 31B dense model, a 26B Mixture-of-Experts model (which only activates roughly 3.8B parameters per token, making it remarkably efficient to run), and two edge-optimized models — E4B and E2B — designed to run on phones, laptops, and devices like a Raspberry Pi.

There are several factors that make this release paint a different future:

The performance is genuinely competitive. The 31B model ranks #3 globally among open models on Arena AI. On math benchmarks, it jumped from 20.8% to 89.2% compared to Gemma 3. On coding benchmarks, 29.1% to 80.0%. These models are capable reasoners with native function calling, multi-step planning, structured output, and configurable chain-of-thought.

The models handle text, images, and audio natively. Variable aspect ratio image understanding, document parsing, chart comprehension, OCR, video understanding up to 60 seconds — all built in, not bolted on.

The license is Apache 2.0. No monthly active user caps. No acceptable-use restrictions. Full commercial freedom. You can fine-tune them, deploy them, build products on them, and owe Google nothing.

And they're designed to run on hardware you can buy. The E2B model runs with under 2GB of RAM when quantized for edge deployment. The 26B MoE model, despite its 25.2B total parameters, runs closer to a 4B model because of how the expert routing works. The 31B model runs comfortably on a machine with 48GB of unified memory — which, as it happens, is exactly what a Mac Mini M4 Pro provides.
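Those memory figures follow from simple arithmetic: a model's footprint is roughly its parameter count times the bytes stored per weight, plus runtime overhead. A minimal sketch, assuming 4-bit quantization and a ~20% overhead factor (both illustrative assumptions, not vendor specifications):

```python
def model_memory_gb(params_billion: float, bits_per_weight: float,
                    overhead: float = 1.2) -> float:
    """Rough memory estimate: parameters x bytes per weight, plus ~20%
    overhead for KV cache, activations, and runtime buffers."""
    weight_bytes = params_billion * 1e9 * (bits_per_weight / 8)
    return weight_bytes * overhead / 1e9  # decimal GB

# A ~2B-parameter edge model quantized to 4 bits fits in well under 2 GB,
# and even the 31B dense model at 4 bits sits comfortably inside 48 GB
# of unified memory.
print(round(model_memory_gb(2, 4), 1))
print(round(model_memory_gb(31, 4), 1))
```

The same arithmetic explains the MoE efficiency claim: memory is set by total parameters, but per-token compute tracks only the ~3.8B active ones.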

This isn't the first capable open model. Llama, Qwen, DeepSeek — the ecosystem has been building for a while. But Gemma 4 feels like an inflection point because it combines frontier-competitive reasoning, multimodal capability, permissive licensing, and edge deployment in a single release. It's the kind of model that makes self-hosting not just feasible, but practical.

The Tools Are Approachable, Even for Technical Beginners

Capable models on their own are not enough. You need a way to run them without a PhD in machine learning. That's where tools like LM Studio come in.

LM Studio is a desktop application that lets you download, run, and chat with open-weight language models through a polished graphical interface — no command line required. It handles model discovery (browsing thousands of models on Hugging Face), hardware detection (automatically using your GPU), and configuration. You search for a model, click download, click load, and start chatting. It also exposes an OpenAI-compatible API on your local machine, which means any tool that works with the OpenAI API also works with your local model. Point it at localhost:1234, and your applications don't know or care that the model is running on your desk instead of in someone's cloud.
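Because the local server speaks the OpenAI chat-completions format, swapping in a local model is mostly a matter of changing the base URL. A minimal sketch using only the standard library — the model name "gemma-4-26b" is a placeholder, and the request is built but not sent, since it assumes LM Studio's server is running on the default port:

```python
import json
import urllib.request

# LM Studio's local server exposes an OpenAI-compatible endpoint,
# so any client that can change its base URL can target it.
BASE_URL = "http://localhost:1234/v1"

def chat_request(prompt: str, model: str = "gemma-4-26b") -> urllib.request.Request:
    """Build a chat-completions request in the OpenAI wire format."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = chat_request("Summarize this design doc in three bullets.")
# With the local server running, urllib.request.urlopen(req) returns the
# completion. No cloud account, no API key, no metered tokens.
print(req.full_url)
```

The point of the compatibility layer is exactly this: tools built against the OpenAI API need no code changes beyond the URL.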

But the feature that tells the bigger story is LM Link, released in February 2026 in partnership with Tailscale. LM Link lets you connect to models running on other machines you own — end-to-end encrypted, never exposed to the public internet. Install LM Studio on a powerful desktop at the office. Run the models there. Connect from your laptop at home, at a coffee shop, or on the road. The models show up as if they're local. Your chats stay on your device. The heavy processing happens on the powerful machine.

This is the piece that shifts the mental model. You're not just running AI on your laptop. You're building a private AI infrastructure that your people can access from anywhere, on hardware you control, with models you choose, and data that never leaves your network.

The Economics Are More Interesting Than a Token Budget

A Mac Mini M4 Pro with 48GB of unified memory costs approximately $1,600 to $2,000. It runs 30B+ parameter models comfortably, and can handle quantized models significantly larger than that, though you'll start trading speed for size as you push past 30B. It draws about 30–40 watts under AI workload, which comes to under $50 a year in electricity for 24/7 operation. It's silent. It's the size of a hardcover book. The software to run models on it, LM Studio and Ollama, is free.

Now consider the alternative. Enterprise AI seats vary widely. Team-tier plans from Anthropic and OpenAI run $25 to $30 per user per month. Full enterprise seats with SSO, compliance, and admin controls range from $45 to $75 per user per month. And for individual power users who need maximum capacity — plans like Claude Max or ChatGPT Pro — you're looking at $100 to $200 per month per person. For a team of ten on enterprise plans, that's $5,400 to $9,000 annually. For a hundred engineers, $54,000 to $90,000. Power users at the top tier push those numbers significantly higher. (Figures as of April 7, 2026. These are estimates; some plans may cost a bit less, and they don't account for the bulk-usage discounts that are standard in scaled agreements. But the economics are close.)

And then there's Jensen's vision. Not every engineer draws a $500k salary, but take the extreme case as a starting point: $250,000 per engineer per year in token consumption. For a team of fifty, that's $12.5 million. Annually.

Meanwhile, five Mac Minis with 48GB each — running Gemma 4, Qwen, Llama, or whatever model fits the task — cost about $10,000 upfront. Total. With no recurring per-token cost. No metering. No vendor deciding to change their pricing next quarter. You've bought the capability, not rented it.
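A back-of-envelope comparison makes the gap concrete. All figures here are the illustrative estimates from above, not quotes, and the $60/seat rate is an assumed midpoint of the enterprise range:

```python
# One-time and recurring costs for the self-hosted option (estimates).
HARDWARE_COST = 10_000           # five Mac Minis, one-time
ELECTRICITY_PER_YEAR = 5 * 50    # ~$50/machine/year, rounded up

def subscription_cost(engineers: int, per_seat_monthly: float, years: float) -> float:
    """Total spend on per-seat enterprise plans over a period."""
    return engineers * per_seat_monthly * 12 * years

def self_hosted_cost(years: float) -> float:
    """Hardware amortized as a one-time buy, plus electricity."""
    return HARDWARE_COST + ELECTRICITY_PER_YEAR * years

# Fifty engineers at an assumed $60/seat/month vs. five shared machines:
print(subscription_cost(50, 60, 1))  # recurs every year
print(self_hosted_cost(1))           # year one; later years cost only power
```

Even before counting top-tier power-user plans, the self-hosted line crosses below the subscription line within the first year, and the gap widens every year after.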

I'm not suggesting you can replace frontier models entirely. There are tasks where GPT-5, Claude Opus, or Gemini 3 are genuinely superior — complex multi-step reasoning over massive context windows, specialized domain tasks that benefit from the largest models. The frontier has a frontier for a reason.

But I am suggesting that a significant portion of the AI workload in most organizations—code completion, document summarization, data transformation, internal chat assistants, draft generation, research synthesis, classification tasks—can run on models you own, on hardware you own, at a fraction of the recurring cost.

The question isn't frontier versus local. It's knowing which work belongs where.

What This Looks Like In Practice

If you put these pieces together, the story that emerges is more nuanced and more interesting than "buy tokens from a frontier provider."

Imagine your organization runs a model router — a lightweight layer that knows which tasks can run locally and which need to escalate to a cloud-hosted frontier model. Routine code completions, documentation generation, data formatting, internal Q&A — these run on local hardware using Gemma 4 or equivalent open models. Zero marginal cost per query. Data never leaves your infrastructure.

Complex reasoning tasks, large-context analysis, or specialized capabilities that only the largest models handle well — those route to the frontier API. You're paying for tokens where the tokens genuinely matter, not burning them on work that a local 32B model handles perfectly well.
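The router itself can be almost trivially simple. A minimal sketch, where the task categories, endpoints, model names, and the 32k-token threshold are all illustrative assumptions rather than any particular product's API:

```python
from dataclasses import dataclass

# Task types we assume a local ~30B model handles well (illustrative).
LOCAL_TASKS = {"completion", "summarize", "format", "internal_qa", "classify"}

@dataclass
class Route:
    endpoint: str
    model: str

def route(task_type: str, context_tokens: int) -> Route:
    """Send routine, small-context work to the local server; escalate
    long-context or unfamiliar task types to a frontier API."""
    if task_type in LOCAL_TASKS and context_tokens <= 32_000:
        return Route("http://localhost:1234/v1", "gemma-4-26b")
    return Route("https://api.frontier.example/v1", "frontier-large")

print(route("summarize", 4_000).model)         # stays local, zero marginal cost
print(route("deep_reasoning", 150_000).model)  # escalates to the cloud
```

Because both endpoints speak the same OpenAI-compatible wire format, the caller doesn't change; only the base URL and model name do.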

LM Studio's LM Link already enables this topology. Your team connects to the local model infrastructure through their usual tools (Cursor, Claude Code, or any OpenAI-compatible client), and routing happens transparently. Cloud when you need it. Local when you don't.

People in the OpenClaw community are already running this pattern: local Ollama instances handle heartbeat tasks and routine operations, while cloud models handle complex reasoning when the task demands it. The economics are dramatically different when you stop sending every request to the most expensive model available.

History Repeats Itself

We've seen this pattern before. Many times.

We went from monolithic single-core CPUs to multi-core processors — breaking work into smaller pieces distributed across specialized units. We went from monolithic applications to microservices and the cloud — distributing workloads across infrastructure matched to the demand. In every case, the evolution moved in the same direction: from centralized, expensive, one-size-fits-all toward distributed, specialized, and matched to the task.

The world is already covered with data centers. The compute infrastructure exists. What's changing is that AI inference is following the same path that general compute followed — moving from centralized cloud services toward a hybrid model where local hardware handles the bulk of routine work and cloud services handle the peaks and the specialized tasks.

The hardware is already shipping with dedicated AI capabilities. Apple Silicon's Neural Engine, Qualcomm's AI accelerators, NVIDIA's consumer GPU lineup — every new device that lands on someone's desk is more capable of running models locally. In a few years, the question won't be whether to run models locally. It'll be why you're still sending all your queries to someone else's data center.

The Call to Action

If you haven't started exploring self-hosted models, now is a good time to begin. Not because you need to replace your frontier model subscriptions tomorrow, but because the landscape of options is broader than the marketing from the major AI companies would suggest.

Start by downloading LM Studio and running a model on your laptop. Gemma 4's 26B MoE variant is a good first choice — it runs like a 4B model while performing like something much larger. See what it can do. See what it can't. Develop an intuition for which tasks need a frontier model and which don't.

Think about what a local model infrastructure could look like for your organization. A few dedicated machines running models that your team accesses through LM Link. A routing layer that sends routine work locally and complex work to the cloud. A cost model that turns unpredictable per-token spending into a fixed hardware investment with zero marginal cost.

We're heading toward a world where intelligent model routers know when work can run locally versus when it needs to go to the cloud. Where hardware ships with dedicated capabilities to support local AI workloads as a standard feature. Where the frontier model companies compete not just on model quality, but on the tasks that genuinely require their scale, while everything else runs on infrastructure you own.

Jensen Huang is right that AI will amplify what engineers do. He's right that token budgets will become a standard part of how companies invest in their people. But he's selling a version of that future where every token flows through the data center buildout he's investing a trillion dollars in.

The other version — the one where capable models run on your desk, on your network, on your terms — is already here. It's just not being marketed as aggressively.

Explore both. Make deliberate choices. And don't write the check for a million tokens until you know which of those tokens actually need to come from someone else.

Cheers, ~ John
