Companies are adopting AI faster than they can manage their token economics
In April 2026, Uber's engineering organization ran out of money. Not the company, just the AI budget: adoption of a single coding assistant had grown from roughly a third of the engineering team to more than 80%, draining the full year's allocation in four months. That same spring, Microsoft pulled Claude Code licenses from many of its developers months after handing them out. At Priceline, a routine contract renewal reportedly came back four to five times more expensive than the year before.
None of this was driven by rising prices. Per-token prices for GPT-4-level performance have dropped an estimated 98% since late 2022. Enterprise AI bills climbed anyway, up an estimated 320% over the same period, with the average budget going from about $1.2 million in 2024 to roughly $7 million in 2026. Cheaper tokens, much larger bills.
These are not isolated incidents. The FinOps Foundation, which tracks practices for managing variable technology spend, found that the share of practitioners handling AI costs jumped from 31% in 2024 to 98% in 2026. In early 2026, the Linux Foundation announced plans for a Tokenomics Foundation to build open standards for AI token cost and efficiency, bringing to tokens the same discipline FinOps once brought to cloud spend. When an industry establishes a standards body, the problem ceases to be a quirk and becomes a category.
Although uncomfortable, the pattern is simple to state: companies adopted AI faster than they learned to manage its economics (if they ever did). Most teams still treat the token as an invisible detail of an API call, a figure buried in a billing dashboard nobody opens. It is not. A token (the basic unit of text a language model processes, roughly three to four characters or about three-quarters of a word) is now a unit of operational cost, latency, privacy exposure, and workflow design. Every choice about how an agent reads, remembers, and responds is also a spending decision, whether or not anyone treats it as one.
Our last two posts looked at the security dimension: first, what recent research reveals about trusting AI pentesting agents, then how to test and govern the agentic systems that security teams are beginning to rely on. This post turns to a quieter question that arises when those agents move from pilot to daily use. Not whether they work, but whether you can afford to run them at scale, and whether the shortcuts that make them affordable introduce risks of their own.
Why the old cost model broke
For most of the software era, the cost of a tool scaled with the number of people using it. You bought seats: ten engineers cost ten times what one did, and finance could forecast the line item a year out. However, billing by the token quietly dismantled that assumption, and agentic systems finished the job.
A single-turn LLM interaction is short and bounded: you send a prompt, you get a response, and it ends. An agentic workflow is different: an LLM orchestrates a broader system of tools, planning a course of action, calling external services, reading results, and looping across dozens of steps, carrying its growing history forward at every turn. That history is the cost. In a systematic study of agentic coding tasks, researchers found that these tasks consume on the order of 1,000 times more tokens than ordinary code chat or reasoning, with the bill driven mostly by input tokens—the context fed back into the model on each step—rather than by what the model writes out.
Once an agent opens a file or runs a command, the results remain in its working context until the task ends, so every new step pays for everything again. One analysis of a popular coding agent found that on a single busy day, 99% of the tokens it processed were input tokens accumulated in the agent's trajectory, and only 1% were generated by the model. The visible work, the code or the answer, is the cheap part. The invisible scrollback (the growing log of everything the agent has seen and done) is where the money goes.
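A minimal sketch of why the scrollback dominates the bill, assuming (hypothetically) that every step adds a fixed number of tokens to the context and that the full context is resent on each call. The function name and all the numbers are illustrative and are not taken from the analysis cited above.

```python
def cumulative_input_tokens(steps: int, system_tokens: int, tokens_added_per_step: int) -> int:
    """Total input tokens billed over a run when the whole history is resent each step."""
    total = 0
    context = system_tokens
    for _ in range(steps):
        total += context                   # each call pays for everything seen so far
        context += tokens_added_per_step   # tool output and model reply join the history
    return total


# Illustrative numbers only: a 2,000-token system prompt, 1,500 tokens added per step.
short_run = cumulative_input_tokens(steps=10, system_tokens=2_000, tokens_added_per_step=1_500)
long_run = cumulative_input_tokens(steps=40, system_tokens=2_000, tokens_added_per_step=1_500)
print(short_run, long_run)  # 87500 1250000
```

Under these assumptions, a run with four times as many steps is billed for roughly fourteen times as many input tokens, because the total grows with the square of the step count rather than in proportion to it.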
Two further factors make this hard to budget for. The first is variance: that same study found a single task, run twice, can differ by up to 30 times in total tokens, because an agent might solve it cleanly on one run and thrash through dead ends on the next. The second is diminishing returns: a system-level analysis from KAIST found that agents grow more accurate as you let them spend more compute, but the gains taper off quickly while cost and latency keep climbing, which the authors describe as a looming sustainability problem, not a free lever. More tokens do not reliably buy better results.
Put those together, and you have exactly the combination that seat-based budgeting cannot model: spend that is variable, nonlinear, and only loosely tied to value. It is why per-developer costs at some large firms were reportedly running from a few hundred to a couple of thousand dollars a month, and why Uber's CTO described his team as back at the drawing board. The old model did not merely underestimate the number; it was measuring the wrong thing.
What you cannot see, you cannot manage
Most teams have no idea where their tokens go. The billing dashboard reports a monthly total, not which task, agent, or step drove it. That gap has to close first, because the savings live in the details the total hides.
A recent study reveals what a monthly total cannot. In "Tokenomics," researchers at Concordia University traced 30 end-to-end software development tasks through a multi-agent framework and mapped every token to a stage of the development lifecycle. The result is counterintuitive. The expensive part was not writing the code. The iterative code-review stage, in which one agent critiques and another revises, accounted for an average of 59.4% of all tokens. The primary cost of agentic software work sits in refinement and verification, the loops that are easy to leave running and hard to spot on an invoice.
You cannot act on any of this without measuring it, and the metrics are simple to define: token counts per task, cost per completed task, cache hit rate, latency, and retry counts. Each of them points at a different leak. A low cache hit rate (the share of calls where the provider served a stored result rather than recomputing the prompt) means you are paying full price for context you could reuse. A high retry count means an agent is thrashing. A climbing input-to-output ratio means history is piling up faster than the work justifies. A token bill you cannot break down by task or step is not a budget; it is a surprise waiting to arrive. The industry is converging on shared definitions for these measures, but no team needs to wait for a standard to start counting.
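As a rough illustration, the metrics above can be computed from per-call usage records. The sketch below assumes a hypothetical `CallRecord` shape and hypothetical prices per million tokens. It also measures the cache hit rate by tokens rather than by calls, and it treats every task in the records as completed. Both are simplifications of this example, not a standard definition.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    task_id: str
    input_tokens: int          # all input tokens billed for the call
    cached_input_tokens: int   # subset of input_tokens served from the cache
    output_tokens: int
    is_retry: bool


def summarize(records, price_in, price_cached, price_out):
    """Aggregate per-call records; prices are hypothetical USD per million tokens."""
    if not records:
        raise ValueError("no call records to summarize")
    total_in = sum(r.input_tokens for r in records)
    cached = sum(r.cached_input_tokens for r in records)
    total_out = sum(r.output_tokens for r in records)
    tasks = {r.task_id for r in records}
    cost = ((total_in - cached) * price_in
            + cached * price_cached
            + total_out * price_out) / 1_000_000
    return {
        "cost_per_task": cost / len(tasks),
        "cache_hit_rate": cached / total_in if total_in else 0.0,
        "input_output_ratio": total_in / total_out if total_out else float("inf"),
        "retries": sum(r.is_retry for r in records),
    }
```

Grouping the same records by task or by step instead of summing them all is what turns a monthly total into something a team can act on.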
Spending less on every call
Once you can see the spend, the first lever is to stop paying for the same thing repeatedly. Two techniques do most of the work:
The first is prompt caching, where the provider stores the processed form of a stable prompt prefix so it does not need to be recomputed on the next call. Every major provider supports it, with published guidance from OpenAI, Anthropic, and Google. How much it helps across long, multi-step agent runs was the open question, and Lumer and colleagues answered it in "Don't Break the Cache." Across more than 500 agent sessions on three providers, caching cut API costs by 41% to 80% and reduced the delay before the response begins (time to first token) by 13% to 31%.
The paper's title is itself the warning. Caching only the stable parts, such as a reusable system prompt holding the methodology and tool instructions, gave consistent gains because the prompt prefix stays identical across calls. Caching the full context indiscriminately, including dynamic tool results that change on every run, breaks the match: the cache is never reused, but the system still pays the overhead of checking for it, which can make latency worse than not caching at all. Cache what is stable, and keep what changes at the end of the prompt.
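A small sketch of the "stable first, dynamic last" layout. It does not call any provider API. It only shows that one layout keeps the leading characters identical across calls while the other does not, using a hash of the prefix as a stand-in for the provider's prefix match. The prompt text and function names are hypothetical.

```python
import hashlib

# Stable material: identical on every call, so it is a candidate for caching.
STABLE_PREFIX = (
    "You are a security testing assistant.\n"
    "Methodology: follow the agreed testing scope.\n"
    "Tool instructions: report findings as short bullet points.\n"
)


def build_prompt(dynamic_tool_result: str) -> str:
    """Stable prefix first, changing content last."""
    return STABLE_PREFIX + "\n# Latest tool result\n" + dynamic_tool_result


def build_prompt_dynamic_first(dynamic_tool_result: str) -> str:
    """Anti-pattern: changing content at the front alters the prefix on every call."""
    return "# Latest tool result\n" + dynamic_tool_result + "\n" + STABLE_PREFIX


def prefix_fingerprint(prompt: str, prefix_len: int = len(STABLE_PREFIX)) -> str:
    """Hash of the leading characters, standing in for a provider's prefix match."""
    return hashlib.sha256(prompt[:prefix_len].encode("utf-8")).hexdigest()


same = prefix_fingerprint(build_prompt("scan A")) == prefix_fingerprint(build_prompt("scan B"))
broken = (prefix_fingerprint(build_prompt_dynamic_first("scan A"))
          == prefix_fingerprint(build_prompt_dynamic_first("scan B")))
print(same, broken)  # True False
```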
The second lever is context discipline: sending the model less to begin with. Long context is the quietest cost driver in any agentic system, because the agent sends its full accumulated history to the model on every step, and the API charges for all of it each time. Prompt compression attacks this directly. The original LLMLingua work from Microsoft showed that a long prompt could be shrunk by up to 20 times with only a small drop in task performance, by having a small model strip low-information tokens before the expensive model sees them. It is not free, though.
A large-scale 2026 study, "Prompt Compression in the Wild," found that the time compression adds upfront must be outweighed by the savings it generates downstream, and that does not always happen. The more dependable habit sits upstream of any algorithm: retrieve only the files, logs, or passages a task needs, summarize evidence rather than pasting it whole, and resist handing the model everything "just in case." The cheapest token is the one you never send.
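One way to picture retrieval discipline is a selector that admits only relevant material and stops at a token budget. The sketch below is deliberately naive: it estimates tokens at about four characters each and scores relevance by shared words. Both are assumptions made for illustration, and a real system would likely use a proper tokenizer and a better retriever.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: about four characters per token."""
    return max(1, len(text) // 4)


def select_context(query: str, documents: dict, token_budget: int) -> list:
    """Return names of documents that share words with the query and fit the budget."""
    query_terms = set(query.lower().split())

    def score(text: str) -> int:
        return len(query_terms & set(text.lower().split()))

    # Most relevant documents are considered first.
    ranked = sorted(documents, key=lambda name: score(documents[name]), reverse=True)
    chosen, used = [], 0
    for name in ranked:
        cost = estimate_tokens(documents[name])
        if score(documents[name]) == 0 or used + cost > token_budget:
            continue  # irrelevant, or too large for the remaining budget
        chosen.append(name)
        used += cost
    return chosen
```

With a budget of 50 tokens, a short relevant source file is kept while a multi-thousand-token log that merely mentions the same keyword is left out.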
Spending on the right model at the right time
The last lever is matching the work to the model, and the timing to the need. A useful framing comes from the survey "Token Economics for LLM Agents," which treats model choice as budget-constrained substitution: you want the cheapest input that still does the job, not the strongest model by reflex.
In practice, that means a tiering strategy. Tasks like formatting, refactoring, routine classification, and summarizing a clean document run perfectly well on a smaller, cheaper model, with escalation to a frontier model reserved for genuinely hard reasoning: an architectural decision, a subtle vulnerability, an ambiguous finding that needs judgment. Defaulting every task to the most capable model is the most common way teams overspend, and it is exactly the behavior that turned modest pilots into runaway bills.
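A tiering rule can be as small as a lookup plus an escalation flag. In the sketch below, the task labels and the model names `small-model` and `frontier-model` are placeholders rather than real product names, and the routing rule itself is only one possible policy.

```python
# Task types assumed safe for the cheaper tier (illustrative list).
ROUTINE_TASKS = {"formatting", "refactoring", "classification", "summarization"}


def choose_model(task_type: str, needs_judgment: bool = False) -> str:
    """Send routine work to the small model; escalate everything else."""
    if needs_judgment or task_type not in ROUTINE_TASKS:
        return "frontier-model"
    return "small-model"


print(choose_model("formatting"))                              # small-model
print(choose_model("architecture_decision"))                   # frontier-model
print(choose_model("classification", needs_judgment=True))     # frontier-model
```

The point of the design is that escalation is an explicit decision with a reason attached, not the default.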
Timing is the other half of the strategy. A large share of business workflows do not need a real-time answer at all. Offline evaluation, dataset classification, bulk document processing, and overnight report generation can run asynchronously, and providers price that flexibility accordingly. OpenAI's cost guidance, for instance, points to its Batch API for consolidating many requests into a single asynchronous job and to flex processing, which trades slower, occasionally interrupted responses for substantially lower costs on non-urgent tasks. Reserving real-time, frontier-model calls for the moments that need them and batching the rest tends to cut costs with no visible drop in quality.
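The timing decision can be sketched the same way: split work by how long the requester can wait. The `Job` type and the 24-hour turnaround below are assumptions for illustration; the actual turnaround and pricing depend on the provider and should be checked against its documentation.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    max_wait_hours: float  # how long the requester can wait for a result


def split_by_urgency(jobs, batch_turnaround_hours: float = 24.0):
    """Return (realtime, batch) job names; anything that can wait goes to the batch queue."""
    realtime = [j.name for j in jobs if j.max_wait_hours < batch_turnaround_hours]
    batch = [j.name for j in jobs if j.max_wait_hours >= batch_turnaround_hours]
    return realtime, batch
```

For example, an interactive triage request with a wait of a few minutes stays in the real-time path, while an overnight report or a bulk classification run is queued for asynchronous processing.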
The security side of token economics
Here is where this stops being a finance story and becomes a security one, the part most writing on token economics skips. Every lever in the previous sections changes the system's risk profile, not only its bill, and a security team has to read the two together.
Take prompt caching, the technique that, as noted above, cut costs by as much as 80%. It works by processing a shared prefix once and reusing the result, so a cached prompt comes back faster than an uncached one. That speed difference is observable, and observable timing differences are a classic side channel (a way of extracting information not from the content of a system but from how it behaves).
In "Auditing Prompt Caching in Language Model APIs," Gu and colleagues at Stanford showed that when a provider maintains a single cache shared across all users, that speed difference becomes a signal. An attacker can send candidate prompts and measure how long each takes to respond. A fast response signals a cache hit, meaning that a prefix was recently sent by someone else. By systematically testing candidates and noting which ones produce hits, an attacker can infer what other users are asking, without reading their messages directly.
Auditing 17 providers, the team detected caching in eight and global cross-user cache sharing in seven, including OpenAI. The same technique even surfaced an architectural secret: one of OpenAI's embedding models responded in a way consistent only with a decoder-only design, a structural detail the company had not disclosed. The authors connect this directly to Meltdown and Spectre, the hardware attacks that exploited the same principle: the system's behavior leaks information its designers never intended to expose, now resurfacing at the LLM API layer.
The lesson is not to abandon caching. It is that the cheap default and the safe default are not always the same configuration. Per-user caching closes the cross-user leak while keeping most of the savings, but someone has to actively choose it. The same tension runs through the other levers. Context discipline saves money by trimming what you send, yet the opposite temptation, pasting a whole repository or a raw customer log into a prompt to skip a round trip, quietly widens your data-exposure surface and pushes sensitive records into a third-party model. Dropping to a smaller, cheaper model saves tokens, but model and framework choices are not safety-neutral. As our earlier post on governing agentic systems showed, refusal rates for the same malicious requests swung from 30.8% to 52.3% depending only on the agent framework. Each cost optimization is also a security decision, and pretending otherwise is how a cheaper system becomes a more exposed one.
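To make the difference between a global cache and a per-user cache concrete, here is a toy model of how a cache key might be scoped. It is not any provider's implementation. The function and the `user:` and `global` scope labels are hypothetical.

```python
import hashlib
from typing import Optional


def cache_key(prompt_prefix: str, user_id: Optional[str] = None) -> str:
    """Derive a cache key; including the user ID scopes the entry to that user."""
    scope = "global" if user_id is None else "user:" + user_id
    material = (scope + "\x00" + prompt_prefix).encode("utf-8")
    return hashlib.sha256(material).hexdigest()


prefix = "Summarize the incident report for ACME"

# Global scope: two different users map to the same entry, so one user's
# request can produce a cache hit (and a faster response) for another.
shared = cache_key(prefix) == cache_key(prefix)

# Per-user scope: the same prefix yields different keys for different users,
# so a hit only ever reflects the caller's own earlier requests.
isolated = cache_key(prefix, "alice") != cache_key(prefix, "bob")
print(shared, isolated)  # True True
```

The saving is smaller with per-user scoping, since users no longer warm the cache for one another, but each user still reuses their own stable prefixes.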
There is a subtler trap as well. Aggressive cost-cutting can erase the very evidence a security team depends on to investigate incidents, verify that an agent behaved as intended, and demonstrate compliance. Compressing context, summarizing tool output, and trimming history all reduce what lands in the logs. Without a complete record, there is no way to reconstruct what an agent decided, which tool it called, what data it accessed, or why it reached a given conclusion. As we argued in that same work, agent logs must preserve prompts, tool calls, permissions, and the reasoning behind a conclusion, not just the final answer. Optimize that record away, and you save a handful of tokens at the cost of your audit trail. Sound token economics treats observability and auditability as constraints, not as more fat to cut.
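One way to reconcile the two goals is to compress what the model sees while logging the uncompressed record out of band. The sketch below assumes a hypothetical `AgentStepRecord` with illustrative field names. It shows the shape of such a record, not a prescribed schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class AgentStepRecord:
    step: int
    full_prompt: str         # what the agent had before any compression
    prompt_sent: str         # the trimmed or summarized version the model received
    tool_name: str
    tool_arguments: dict
    permissions: list        # permissions in force when the tool was called
    rationale: str           # the reasoning behind the step, not only the outcome


def to_log_line(record: AgentStepRecord) -> str:
    """Serialize one step as a single JSON line suitable for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)
```

Storage for a log line is typically far cheaper than resending the same text to a model on every step, which is why trimming the prompt and trimming the log are separate decisions.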
Why this belongs in a policy, not a habit
All of these ideas raise a question every team adopting agents will face: Should the economic use of AI agents be governed by an explicit internal policy, as cloud usage, data handling, and code review already are? The honest answer is that the alternative—leaving it to individual habit—is what produced the budget blowouts in the first place.
A policy here does not mean a thick document nobody reads. It means agreeing on a handful of defaults before the bills arrive. Which model is the standard choice for routine work, and what justifies escalating to a frontier one. What context an agent may receive, and which categories of data (customer records, secrets, full logs) must never be pasted into a prompt, regardless of convenience. Which prompts are cached, and at what level of sharing. What gets logged, and what a log must retain to stay useful for an audit. When a real-time call is warranted and when work should be batched. None of these demands specialized expertise. They are decisions every team already makes informally, every time they deploy an agent. A policy simply writes them down once, so they are applied consistently and can be revisited when circumstances change.
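Written down, such a policy can be short enough to live in code. The sketch below is one hypothetical encoding of the defaults listed above. Every field name, value, and check is illustrative, and a real policy would be agreed by the team rather than copied from here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentPolicy:
    default_model: str = "small-model"          # placeholder model names
    escalation_model: str = "frontier-model"
    forbidden_data: frozenset = frozenset({"customer_records", "secrets", "full_logs"})
    cache_scope: str = "per_user"
    batch_non_urgent: bool = True


def violations(policy: AgentPolicy, model: str, data_categories, escalation_justified: bool) -> list:
    """Return human-readable policy violations for one planned agent call."""
    problems = []
    blocked = policy.forbidden_data & set(data_categories)
    if blocked:
        problems.append("forbidden data in prompt: " + ", ".join(sorted(blocked)))
    if model == policy.escalation_model and not escalation_justified:
        problems.append("frontier model used without justification")
    return problems
```

A check like this can run before each call, so the defaults are applied consistently instead of being re-decided by whoever happens to deploy the agent.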
This is where the cost story and the security story finally meet. The survey "Token Economics for LLM Agents" devotes an entire section to the security dimension and lands on a striking claim: governance functions as economic infrastructure, because the rules you set about how tokens are spent are also the rules that bind your exposure. The emerging standards work, from the FinOps community's expansion into AI to the new Tokenomics Foundation, is the industry trying to write that policy at scale. For a company in application security, or any IT organization, treating agent economics as a governed practice rather than a personal preference is the difference between adopting AI and controlling it.
What this looks like in practice
None of this is merely theoretical. Below, we show how these levers apply in four workplace scenarios in the tech industry.
In security testing, the stable parts are exactly the parts worth caching: the testing scope, the methodology, and the tool instructions an agent needs on every run. The raw material is another matter. Feeding an entire scan log or packet capture back into the model at each step is both expensive and a needless expansion of your data-exposure surface. The better practice is to distill evidence into compact findings and carry those forward. Stronger models earn their cost on the hard calls, confirming a subtle vulnerability or resolving an ambiguous result, while routine triage runs on something lighter.
In development, the win comes from retrieval discipline. A coding agent rarely needs the entire repository in context; it needs a handful of files relevant to the task, plus a cached set of repository conventions to reuse across sessions. Formatting, refactoring, and boilerplate generation run comfortably on a smaller model, while the frontier model is reserved for architectural decisions and vulnerability reasoning, where a wrong answer is costly. The review loop deserves particular attention, since, as the Concordia study showed, that is where the tokens quietly concentrate.
Design work has its own stable core. Brand guidelines and design-system rules change slowly and belong in a cached prompt, while the variable input is small by nature: the specific screen or component under review, not the entire design library. Sending only the relevant context keeps each pass cheap enough to iterate quickly, which is usually the whole reason for involving an agent in design.
Content work rewards a staged approach over a single enormous prompt. Style guides and audience personas are cached once and reused; research is distilled into notes rather than pasted wholesale at every turn; and drafts are built section by section so the context never balloons. Writing this blog post in parts, for instance, rather than as one giant request, was not only easier to steer but also cheaper to produce.
Adoption was the easy part
Every wave of computing eventually gives rise to a discipline to account for its costs. Telecom expense management followed the spread of corporate phone networks; cloud FinOps followed the migration to the major cloud providers; token economics is the same story one cycle later, which is precisely why a Tokenomics Foundation now exists. That pattern is reassuring in one sense: the problem is solvable, and the playbook is partly written.
The gap, though, is real right now. McKinsey found that 92% of companies plan to increase their AI investment over the next three years, while only 1% consider their deployments mature. MIT's NANDA initiative reported that 95% of enterprise GenAI pilots produced no measurable profit-and-loss impact, a figure that has drawn methodological criticism, yet its core finding aligns with broader evidence: the shortfall lies not in the models but in how organizations integrate them. Adoption, in other words, was the easy part. Turning it into value, without losing control of the bill or the attack surface, is the harder and more decisive work.
That is the real message behind the budget blowouts. The companies that come out ahead will not be the ones running the most agents, but those that can see what their agents spend, govern how they spend it, and secure the shortcuts they take to spend less. For a security company, those three are not separate concerns. They are the same practice. Once you treat a token as something worth accounting for, you are managing cost, enforcing governance, and reducing risk at the same time.