
Every time a large language model generates a word, it consults a running log of everything it has already processed in the conversation. That log - called the key-value (KV) cache - grows with every token and sits entirely in GPU memory. For a model handling a short exchange, this is manageable. For a model reasoning over a legal brief, a codebase, or an hour-long transcript, the KV cache can consume most of the available memory on an accelerator, crowding out the capacity to run more requests in parallel. Longer context, in other words, directly means higher serving costs.
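A back-of-envelope calculation shows why the cache dominates memory at long context. The formula below is the standard KV cache sizing rule for a transformer (two tensors, keys and values, per layer per token); the layer, head, and precision figures are illustrative assumptions for an 8B-parameter-class model with grouped-query attention, not numbers from the paper.

```python
def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # Two tensors (K and V) per layer, each context_len x n_kv_heads x head_dim
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_len

# Illustrative Llama-3.1-8B-style config: 32 layers, 8 KV heads, head_dim 128, fp16
per_token = kv_cache_bytes(1, 32, 8, 128)
cache_128k = kv_cache_bytes(128_000, 32, 8, 128)
print(per_token)           # 131072 bytes = 128 KiB per token
print(cache_128k / 2**30)  # 15.625 GiB for a single 128K-token request
```

At roughly 16 GiB per long-context request, a handful of concurrent users can exhaust an 80 GB accelerator before compute is even a factor.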
TurboQuant is a vector quantization algorithm published by Google Research on March 24, 2026, that compresses the KV cache by at least 6x - reducing the precision of stored values to as few as 3 bits - without any measurable degradation in output quality across five standard long-context benchmarks.[1] On Nvidia H100 GPUs, the 4-bit version delivers up to an 8x speedup in computing attention logits compared to 32-bit unquantized keys.[2] The algorithm slots into existing inference pipelines without retraining or architectural changes - a practical threshold that separates a research curiosity from a production candidate.
The work will be presented at ICLR 2026 next month, co-authored by research scientist Amir Zandieh and VP and Google Fellow Vahab Mirrokni.[1]
The core challenge in compressing a KV cache is not compression itself - it is the bookkeeping that compression requires. Standard quantization methods shrink data by mapping precise floating-point values to coarser integers, but they also have to record, for each compressed block, exactly what scaling factor was used - otherwise the data cannot be correctly decoded later. That accounting cost, typically 1 to 2 extra bits per value, partially cancels the compression gain. At hundreds of millions of token positions, it adds up.[1] TurboQuant's central contribution is eliminating that overhead entirely, through two algorithms that work in sequence.
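To make the bookkeeping cost concrete, here is a minimal sketch of standard per-block integer quantization - the conventional approach the article describes, not TurboQuant itself. The block size of 32 and the 4-bit codes are illustrative choices; the point is the per-block minimum and scale that must be stored alongside the codes.

```python
import numpy as np

def quantize_block(x, bits=4):
    # Standard asymmetric integer quantization: map floats onto 2**bits levels.
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2**bits - 1)
    if scale == 0:
        scale = 1.0
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, lo, scale          # lo and scale are the per-block metadata

def dequantize_block(codes, lo, scale):
    return codes * scale + lo

rng = np.random.default_rng(0)
block = rng.standard_normal(32).astype(np.float32)
codes, lo, scale = quantize_block(block, bits=4)

# Metadata cost: two fp16 values (min, scale) amortized over 32 elements
overhead_bits = 2 * 16 / 32
print(overhead_bits)   # 1.0 extra bit per value on top of the 4-bit codes
```

With a 4-bit code and 1 extra bit of metadata, a quarter of the storage budget goes to bookkeeping - the overhead TurboQuant is designed to eliminate.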
PolarQuant handles the primary compression and is where most of the memory reduction happens. The key insight concerns geometry. In standard quantization, data vectors are described in a coordinate system where each dimension specifies a distance - think of it as giving directions by saying "go two miles north and three miles east." The problem is that the boundaries of the region being quantized shift for every block of data, forcing the compressor to record those boundaries explicitly. PolarQuant sidesteps this by redescribing each vector in terms of a magnitude (how large the overall signal is) and a set of angles (what direction it points). Transformer attention patterns are highly regular in angular space - they cluster predictably rather than spreading uniformly - which means the angular boundaries are stable and do not need to be stored per block. The bookkeeping overhead drops to zero. PolarQuant will be presented separately at AISTATS 2026.[1]
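The geometric idea can be illustrated with a toy variant (this is a sketch of the principle, not the published PolarQuant algorithm): split a vector into 2-D pairs, represent each pair as a radius and an angle, and quantize the angle on a fixed global grid. Because the grid covers the full circle, no per-block boundaries need to be stored.

```python
import numpy as np

def polar_encode(x, angle_bits=3):
    # Split into (x, y) pairs; each pair becomes (radius, quantized angle).
    pairs = x.reshape(-1, 2)
    radii = np.linalg.norm(pairs, axis=1)
    angles = np.arctan2(pairs[:, 1], pairs[:, 0])     # in [-pi, pi]
    # Fixed uniform grid over the circle: the grid is global,
    # so no per-block scale/offset metadata is needed.
    n_levels = 2**angle_bits
    step = 2 * np.pi / n_levels
    codes = np.round(angles / step).astype(int) % n_levels
    return radii, codes

def polar_decode(radii, codes, angle_bits=3):
    step = 2 * np.pi / 2**angle_bits
    angles = codes * step
    pairs = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
    return pairs.reshape(-1)

rng = np.random.default_rng(1)
x = rng.standard_normal(8)
radii, codes = polar_encode(x)
x_hat = polar_decode(radii, codes)
# Angular error per pair is at most half a grid step (pi/8 with 3 bits)
```

The paper's claim that attention vectors cluster in angular space is what makes a fixed angular grid viable where a fixed linear grid would not be.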
QJL (Quantized Johnson-Lindenstrauss) acts as a residual error corrector. After PolarQuant compresses the bulk of each vector, a small approximation error remains - unavoidable in any lossy compression scheme. Left uncorrected, these residuals accumulate and introduce systematic bias into the model's attention scores, which determine which parts of the input the model focuses on. QJL applies the Johnson-Lindenstrauss Transform, a classical mathematical technique for shrinking high-dimensional data while preserving the relative distances between points, to project the residual into a lower-dimensional space. It then reduces each projected value to a single bit: positive or negative. One bit per residual, zero memory overhead, and the bias is eliminated. The two stages together - PolarQuant for primary compression, QJL for error correction - constitute TurboQuant.[1]
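The one-bit trick rests on a classical identity: for a Gaussian projection vector s, the expectation of <s,q> * sign(<s,k>) equals sqrt(2/pi) * <q,k> / ||k||, so sign bits alone suffice for an unbiased inner-product estimate after rescaling. The sketch below demonstrates that identity numerically; the dimensions and vectors are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)
d, m = 16, 8192                         # residual dim, sketch dim (assumed values)

k = np.zeros(d); k[0] = 1.0             # stand-in "residual" vector, unit norm
q = np.zeros(d); q[0], q[1] = 0.6, 0.8  # query, unit norm; true <q, k> = 0.6

S = rng.standard_normal((m, d))         # Gaussian JL projection
bits = np.sign(S @ k)                   # one bit per projected coordinate

# Rescale by sqrt(pi/2) * ||k|| to undo the sqrt(2/pi) factor above,
# yielding an unbiased estimate of <q, k> from sign bits alone.
est = np.sqrt(np.pi / 2) * np.linalg.norm(k) * np.mean((S @ q) * bits)
print(round(est, 2))   # close to the true inner product 0.6
```

Because the estimate is unbiased, errors average out across attention scores rather than accumulating as systematic drift - the failure mode QJL is there to prevent.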
Google tested TurboQuant on open-source models including Gemma, Mistral, and Llama-3.1-8B-Instruct across five long-context evaluation suites: LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval. On needle-in-a-haystack retrieval tasks - which test whether a model can locate a single fact buried deep inside a very long document - TurboQuant achieved perfect scores while cutting KV cache memory by at least 6x. On LongBench, which spans question answering, code generation, and summarization, TurboQuant matched or outperformed KIVI, the previous leading KV cache quantization method, across all tasks.[1]
KIVI compresses the KV cache using standard integer quantization and represents the current practical baseline for production inference systems. Beating it while simultaneously eliminating per-block overhead is the result that matters most for practitioners evaluating whether TurboQuant is ready for real deployments.
Google also tested TurboQuant on vector search - a closely related problem, since both tasks involve comparing high-dimensional vectors quickly. Evaluated against Product Quantization (PQ) and RaBitQ on the GloVe dataset using the 1@k recall ratio, TurboQuant achieved the highest recall of any tested method. The competing baselines relied on large, dataset-specific codebooks; TurboQuant is entirely data-oblivious, meaning it requires no calibration dataset and generalizes out of the box.[1]
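For readers unfamiliar with the metric: 1@k recall measures how often a query's true nearest neighbor (computed on the exact vectors) appears among the top k candidates retrieved from the compressed index. A minimal brute-force scorer, with illustrative function names and synthetic data standing in for a quantized index:

```python
import numpy as np

def recall_1_at_k(queries, db_exact, db_compressed, k=10):
    # Fraction of queries whose exact nearest neighbor survives
    # into the top-k results under the compressed representation.
    hits = 0
    for q in queries:
        true_nn = np.argmin(np.linalg.norm(db_exact - q, axis=1))
        approx_topk = np.argsort(np.linalg.norm(db_compressed - q, axis=1))[:k]
        hits += true_nn in approx_topk
    return hits / len(queries)

rng = np.random.default_rng(3)
db = rng.standard_normal((500, 32))
noisy = db + 0.05 * rng.standard_normal(db.shape)  # stand-in for quantization error
queries = rng.standard_normal((50, 32))
print(recall_1_at_k(queries, db, noisy, k=10))
```

Production systems replace the brute-force loop with an index structure, but the metric being reported is the same.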
Before examining the market reaction, it is worth being precise about what TurboQuant actually optimizes. It compresses the KV cache, which is an inference-time structure - it only exists while a model is actively generating responses. Training, by contrast, involves entirely different memory pressures: activations, gradients, optimizer states, and weight matrices that TurboQuant does not touch. Any efficiency gain from TurboQuant accrues at the serving layer, not the training layer.
That distinction did not slow the sell-off. On March 26, SK Hynix fell 6.23% and Samsung Electronics dropped 4.71% in Seoul. In U.S. trading, Micron, Western Digital, and Sandisk posted significant declines over two sessions, with Sandisk falling roughly 6%, Western Digital around 5%, and Micron between 3% and 5%.[3] Cloudflare CEO Matthew Prince amplified the alarm on X, calling TurboQuant "Google's DeepSeek moment" - an explicit parallel to January 2025, when revelations about training efficiency sent Nvidia's stock down nearly 17% in a single session.[4] The comparison is what moved markets. It was also the wrong comparison.
DeepSeek's significance lay in challenging the premise that frontier AI capability required massive training compute - a challenge to the entire infrastructure investment thesis. TurboQuant says nothing about training compute. It says that once a model is trained and deployed, you need less GPU memory to run it over long contexts. Those are different claims with different investment implications.
Several analysts said so plainly. Ben Barringer, head of technology research at Quilter Cheviot, told CNBC: "The Google TurboQuant innovation has added to the pressure, but this is evolutionary, not revolutionary. It does not alter the industry's long-term demand picture. In a market primed to de-risk, even an incremental development can be taken as a cue to lighten up."[4] Morgan Stanley analyst Shawn Kim framed TurboQuant as a net positive for hyperscalers: lower memory cost per token reduces serving costs and "can also lead to higher product adoption demand."[3] JPMorgan and Citigroup analysts each invoked the Jevons Paradox - the 19th-century economic principle that efficiency improvements in resource use tend to increase total consumption of that resource, not decrease it.[3]
Ray Wang of SemiAnalysis offered the clearest technical reframe: the KV cache is "a key bottleneck to address to have better models and hardware performance," and removing a bottleneck rarely causes the bottlenecked system to need fewer resources.[4] When models can serve more queries per unit of memory, the economic incentive is to serve more queries - not to purchase less hardware.
The more consequential story is not the stock movement but what TurboQuant represents algorithmically. Google's researchers describe these methods as operating near theoretical lower bounds - meaning TurboQuant is not a heuristic that happens to work well on current benchmarks, but a compression method backed by mathematical proofs of optimality.[1] For production systems, that distinction is material: heuristic optimizations can fail unpredictably on out-of-distribution inputs; provably efficient methods are far more robust across deployment conditions.
The application space extends beyond inference. TurboQuant directly accelerates vector search - the backbone of semantic search, retrieval-augmented generation (RAG), and recommendation systems. RAG, for context, is the technique of retrieving relevant documents from a large database and feeding them into a language model's context window before generating a response; it is how most enterprise AI systems connect models to live data. At Google's operating scale, compressing billion-vector indices while maintaining recall accuracy is a significant infrastructure advantage. The research blog notes the algorithm builds vector indices with "minimal memory, near-zero preprocessing time, and state-of-the-art accuracy."[1]
The paper notes implications for Gemini specifically, citing the KV cache bottleneck as a problem for "models like Gemini" running at scale.[1] Whether TurboQuant has already been deployed in Google's production systems was not disclosed - a gap worth watching. A community implementation for llama.cpp appeared on Reddit within days of the announcement, suggesting the algorithm is already finding traction in the open-source inference ecosystem.[5]
TurboQuant matches unquantized output quality across all tested benchmarks while reducing key-value memory by at least 6x - with no retraining required.
Long-context AI applications - extended agent sessions, document analysis, multi-turn reasoning - have been constrained precisely because KV cache memory costs scale linearly with context length. Every doubling of context window size doubles the memory required to hold the cache. TurboQuant cuts that tax by 6x, making context lengths that were previously too expensive to deploy commercially now economically viable. That is a catalyst for expanded deployment, not a rationale for buying fewer chips.
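The economics of that claim can be made concrete with a simple serving-capacity calculation. All figures below are illustrative assumptions (an 80 GB accelerator, 16 GB of fp16 weights for an 8B-class model, roughly 16 GB of fp16 KV cache per 128K-token request), not numbers from the paper.

```python
hbm_gb = 80             # accelerator memory (illustrative H100-class figure)
weights_gb = 16         # model weights at fp16, 8B-parameter class (assumed)
kv_per_request_gb = 16  # fp16 KV cache for one ~128K-token request (assumed)

free = hbm_gb - weights_gb
baseline = free // kv_per_request_gb            # concurrent long-context requests
compressed = free // (kv_per_request_gb / 6)    # same budget at 6x compression
print(baseline, int(compressed))   # 4 vs 24 concurrent requests
```

Sixfold compression of the cache translates directly into sixfold serving concurrency at fixed hardware - which is the Jevons-style argument the analysts above were making.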
[1] Google Research Blog: "TurboQuant: Redefining AI efficiency with extreme compression" (March 24, 2026)
[2] Tom's Hardware: "Google's TurboQuant reduces AI LLM cache memory capacity requirements by at least six times" (March 25, 2026)
[3] Bloomberg via Yahoo Finance: "Chip Selloff Deepens After Google Touts Memory Breakthrough" (March 26, 2026)
[4] CNBC: "A Google AI breakthrough is pressuring memory chip stocks from Samsung to Micron" (March 26, 2026)
[5] Reddit r/LocalLLaMA: "TurboQuant for GGML - 4.57x KV Cache Compression Enabling 72K context"