
Venture Capitalist at Theory Ventures

The Hungry, Hungry AI Model

When you query AI, it gathers relevant information to answer you.

But, how much information does the model need?

inverse_ratio.png

Conversations with practitioners revealed their intuition: the input was ~20x larger than the output.

But my experiments with the Gemini command-line interface tool, which outputs detailed token statistics, revealed it's much higher.

300x on average & up to 4000x.
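A quick sketch of how these ratios fall out of per-request token counts. The numbers below are illustrative stand-ins, not my actual measurements, chosen so the per-request and aggregate figures match the ranges above:

```python
# Sketch: computing input-to-output token ratios from per-request
# token counts, as reported by a CLI that prints token statistics.
# The request data here is illustrative, not real measurements.
requests = [
    {"input_tokens": 45_000, "output_tokens": 150},
    {"input_tokens": 120_000, "output_tokens": 400},
    {"input_tokens": 8_000, "output_tokens": 2},
]

# Per-request ratios and an aggregate ratio across all requests.
ratios = [r["input_tokens"] / r["output_tokens"] for r in requests]
average = sum(r["input_tokens"] for r in requests) / sum(
    r["output_tokens"] for r in requests
)

print(f"per-request ratios: {[round(x) for x in ratios]}")  # [300, 300, 4000]
print(f"aggregate ratio: {average:.0f}x")  # 313x
```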

Here’s why this high input-to-output ratio matters for anyone building with AI:


Cost Management is All About the Input. With API calls priced per token, a 300:1 ratio means costs are dictated by the context, not the answer. This pricing dynamic holds true across all major models.

On OpenAI’s pricing page, output tokens for GPT-4.1 are 4x as expensive as input tokens. But when the input is 300x more voluminous, the input costs are still 98% of the total bill.
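The arithmetic behind that 98% figure, assuming the $2-per-million input and $8-per-million output list prices consistent with the 4x multiple above:

```python
# Illustrative cost math at a 300:1 input-to-output ratio.
# Prices are assumed at $2 / 1M input tokens and $8 / 1M output
# tokens, matching the 4x output premium described above.
INPUT_PRICE = 2.00 / 1_000_000   # $ per input token
OUTPUT_PRICE = 8.00 / 1_000_000  # $ per output token

input_tokens = 300_000  # 300x the output
output_tokens = 1_000

input_cost = input_tokens * INPUT_PRICE
output_cost = output_tokens * OUTPUT_PRICE
input_share = input_cost / (input_cost + output_cost)

print(f"input:  ${input_cost:.4f}")   # $0.6000
print(f"output: ${output_cost:.4f}")  # $0.0080
print(f"input share of bill: {input_share:.1%}")  # 98.7%
```

Even with output priced at a 4x premium, the sheer volume of input tokens dominates the bill.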

Latency is a Function of Context Size. An important factor determining how long a user waits for an answer is the time it takes the model to process the input.
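A back-of-the-envelope model makes this concrete: the model must process (prefill) the entire context before the first output token appears. The throughput numbers below are assumptions for illustration, not benchmarks of any particular model:

```python
# Rough sketch: time-to-first-answer grows with input size because
# the whole context is processed (prefilled) before generation.
# Both throughput constants are assumed values, not measurements.
PREFILL_TOKENS_PER_SEC = 10_000  # assumed prefill throughput
DECODE_TOKENS_PER_SEC = 50       # assumed generation speed

def estimated_latency(input_tokens: int, output_tokens: int) -> float:
    """Seconds until the full answer arrives (prefill + decode)."""
    return (input_tokens / PREFILL_TOKENS_PER_SEC
            + output_tokens / DECODE_TOKENS_PER_SEC)

# The same 200-token answer, with very different waits:
print(f"{estimated_latency(10_000, 200):.1f}s")   # 5.0s, small context
print(f"{estimated_latency(300_000, 200):.1f}s")  # 34.0s, 300:1 context
```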

It Redefines the Engineering Challenge. This observation shows that the core challenge of building with LLMs isn't just prompting. It's context engineering.

The critical task is building efficient data retrieval and context-crafting pipelines that can find the best information and distill it into the smallest possible token footprint.

requests_vs_cache.png

Caching Becomes Mission-Critical. If 99% of tokens are in the input, building a robust caching layer for frequently retrieved documents or common query contexts moves from a “nice-to-have” to a core architectural requirement for building a cost-effective & scalable product.
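A toy version of that caching layer, using the standard library's `functools.lru_cache` as a stand-in for a production cache (Redis, or a provider's prompt-caching feature, which typically discounts repeated input tokens):

```python
# Sketch of a caching layer for retrieved context, assuming repeated
# queries hit the same documents. lru_cache stands in for a
# production cache; retrieve_context is a hypothetical pipeline.
from functools import lru_cache

@lru_cache(maxsize=1024)
def retrieve_context(query: str) -> str:
    # Stand-in for an expensive retrieval + distillation pipeline.
    print(f"cache miss: retrieving for {query!r}")
    return f"<context for {query}>"

retrieve_context("refund policy")  # miss: runs the pipeline
retrieve_context("refund policy")  # hit: skips retrieval entirely
print(retrieve_context.cache_info().hits)  # 1
```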


For developers, this means focusing on input optimization is a critical lever for controlling costs, reducing latency, and ultimately, building a successful AI-powered product.