At GTC 2025, NVIDIA introduced Dynamo, a groundbreaking solution for enhancing the performance of AI reasoning models, built on an innovative technical architecture. It not only significantly boosts inference performance but also dramatically reduces costs by optimizing resource utilization, giving enterprises powerful support to compete in the AI era. As AI inference becomes increasingly mainstream, improving inference efficiency and scalability while continuously reducing inference costs has become a core need for service providers.
What is NVIDIA Dynamo
Described as the operating system of an AI factory, Dynamo is a high-throughput, low-latency open-source inference framework engineered for deploying generative AI and reasoning models in multi-node, large-scale distributed environments. It focuses specifically on serving LLMs (Large Language Models) and maximizing token revenue generation while reducing costs.
This distributed inference library is essentially an open-source solution for balancing incoming request tokens against generated output tokens. It tackles the challenges of AI inference at scale, allowing enterprises to increase throughput and reduce cost when serving large language models on NVIDIA GPUs.
Why NVIDIA Dynamo is the Key to AI Inference Efficiency
As the scale and complexity of generative AI models continue to grow, traditional inference serving frameworks face significant challenges in performance, accuracy, and efficiency. Managing distributed inference systems is inherently complex, and a good user experience is critical for improving efficiency. Efficiently orchestrating and coordinating AI inference requests across a large number of GPUs is crucial for ensuring that AI factories minimize operating costs and maximize token revenue. NVIDIA Dynamo was created to address the following critical issues.
- Low GPU Utilization: Traditional monolithic inference pipelines often lead to idle GPUs due to imbalances between the prefill and decode stages.
- High KV Cache Recomputation Costs: Improper request routing causes frequent flushing and recomputation of the KV cache (the intermediate states of Transformer models).
- Memory Bottlenecks: Large-scale inference workloads require massive KV cache storage, which can quickly exhaust GPU memory capacity.
- Fluctuating Demand and Inefficient GPU Allocation: Traditional serving architectures often rely on static GPU resource allocation that cannot adapt to changing load.
- Inefficient Data Transfer: Distributed inference workloads introduce unique and highly dynamic communication patterns.
Dynamo Core Technical Advantages
As the successor to the NVIDIA Triton™ Inference Server, NVIDIA Dynamo is brand-new AI inference serving software designed to maximize token revenue for AI factories deploying reasoning AI models. It coordinates and accelerates inference communication across thousands of GPUs and uses disaggregated serving to separate the processing (prefill) and generation (decode) phases of LLMs onto different GPUs. This allows the specific needs of each phase to be optimized individually and ensures greater utilization of GPU resources.
NVIDIA Dynamo is fully open-source and supports PyTorch, SGLang, NVIDIA TensorRT™-LLM, and vLLM, enabling enterprises, startups, and researchers to develop and optimize methods for deploying AI models in disaggregated inference scenarios.
Dynamo solves the problems of traditional inference frameworks through the following key technologies.
- Separated Prefill and Decode Inference (P&D Separation): Maximizes GPU throughput while balancing throughput and latency.
- Dynamic Resource Allocation: Optimizes performance based on fluctuating demand.
- LLM-Aware Request Routing: Eliminates unnecessary KV cache recomputation.
- Accelerated Data Transfer: Reduces inference response time using NIXL.
- KV Cache Offloading: Leverages multi-tier memory hierarchy to enhance system throughput.
Dynamo Architecture
Dynamo adopts a modular design, with each component independently scalable.
- API Server: Can be deployed to adapt to specific tasks.
- Smart Router: Handles user requests and routes them to the optimal worker node.
- KV Cache-Aware Routing: Directs requests to the worker node with the highest cache hit rate.
- KV Cache Manager: Maintains a global radix tree registry and manages a multi-tier memory system.
- Data Transfer Acceleration: Uses NIXL to achieve fast and efficient data transfer.
Dynamo skillfully combines the advantages of the Rust and Python programming languages. It uses Rust for performance-sensitive modules, offering speed, memory safety, and robust concurrency, while using Python for flexibility, supporting rapid prototyping and customization. This dual-language implementation gives Dynamo both high performance and excellent extensibility.
The architectural design of Dynamo aims to tackle the key challenges in large-scale distributed inference serving. Its architecture consists of four core components.
Disaggregated Serving
Dynamo divides the LLM inference process into two separate phases: prefill and decode. Prefill is the computationally intensive phase that processes the input prompt and generates the KV cache. Decode is the latency-sensitive phase that generates new tokens one by one.
This separation enables Dynamo to allocate the most appropriate resources to each phase, dynamically adjusting the number of prefill and decode worker nodes based on actual load and thus improving overall GPU utilization.
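To make the split concrete, here is a minimal Python sketch with a toy stand-in for the model (the function names and fake key/value math are invented for illustration, not Dynamo's API): prefill processes the whole prompt in one pass and builds the KV cache, while decode reuses that cache to emit tokens one at a time.

```python
# Toy sketch of prefill vs. decode; the "model" here is a stand-in, not a real LLM.

def prefill(prompt_tokens: list[int]) -> list[tuple[int, int]]:
    """Compute-intensive phase: process the whole prompt in one pass
    and build the KV cache (here just a list of fake key/value pairs)."""
    return [(tok, tok * 2) for tok in prompt_tokens]  # pretend keys/values


def decode(kv_cache: list[tuple[int, int]], max_new_tokens: int) -> list[int]:
    """Latency-sensitive phase: generate tokens one by one, appending
    each new token's key/value entry to the cache as it goes."""
    generated = []
    for _ in range(max_new_tokens):
        # A real model would attend over kv_cache here; we just fake a next token.
        next_token = sum(k for k, _ in kv_cache) % 50_000
        generated.append(next_token)
        kv_cache.append((next_token, next_token * 2))
    return generated


if __name__ == "__main__":
    cache = prefill([101, 2009, 2003, 1037, 3231])   # runs on a prefill worker
    print(decode(cache, max_new_tokens=5))           # runs on a decode worker
```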
Disaggregating prefill and decode significantly improves performance, especially in inference scenarios involving multiple GPUs. For example, for the Llama 70B model, single-node tests show a 30% increase in GPU throughput, while a dual-node setup achieves more than a 2x performance increase thanks to better parallelization.
This separation strategy also provides valuable flexibility. The prefill phase determines time to first token (TTFT), while the decode phase determines inter-token latency (ITL); by adjusting the allocation of worker nodes, performance can be tuned to a specific service-level agreement (SLA), whether the priority is faster TTFT, lower ITL, or higher throughput.
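As a hedged illustration of the kind of policy this enables, the sketch below uses an invented rebalance_workers function and made-up SLA targets (not Dynamo's actual planner logic) to show how measured TTFT and ITL could shift GPUs between the prefill and decode pools.

```python
# Illustrative only: a naive policy for splitting a fixed GPU budget between
# prefill and decode workers based on measured latencies vs. SLA targets.

def rebalance_workers(
    prefill_workers: int,
    decode_workers: int,
    measured_ttft_ms: float,
    measured_itl_ms: float,
    ttft_slo_ms: float = 300.0,   # hypothetical SLA targets
    itl_slo_ms: float = 30.0,
) -> tuple[int, int]:
    """Shift one worker toward whichever phase is violating its SLA."""
    if measured_ttft_ms > ttft_slo_ms and decode_workers > 1:
        return prefill_workers + 1, decode_workers - 1  # TTFT too slow: add prefill
    if measured_itl_ms > itl_slo_ms and prefill_workers > 1:
        return prefill_workers - 1, decode_workers + 1  # ITL too slow: add decode
    return prefill_workers, decode_workers              # both SLAs met


if __name__ == "__main__":
    # TTFT is over target here, so one decode worker is reassigned to prefill.
    print(rebalance_workers(2, 6, measured_ttft_ms=450.0, measured_itl_ms=22.0))
```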
Smart Router
Dynamo's KV-aware routing sends user queries to the worker node with the highest KV cache hit rate, not simply to the least busy node. In tests on 100,000 real R1 user queries, KV-aware routing achieved a 3x improvement in TTFT, a 2x reduction in average request latency, and significant gains in overall throughput depending on traffic patterns.
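The following sketch illustrates the idea behind KV-aware routing with invented data structures and scoring weights (not Dynamo's router implementation): each worker is scored by how much of the incoming prompt it already holds in cache, discounted by its current load.

```python
# Conceptual KV-aware routing: names and scoring weights are illustrative only.
from dataclasses import dataclass, field


@dataclass
class WorkerState:
    name: str
    active_requests: int = 0
    cached_prefixes: list[list[int]] = field(default_factory=list)


def cached_prefix_length(prompt: list[int], worker: WorkerState) -> int:
    """Longest cached prefix of the prompt held by this worker."""
    best = 0
    for prefix in worker.cached_prefixes:
        match = 0
        for a, b in zip(prompt, prefix):
            if a != b:
                break
            match += 1
        best = max(best, match)
    return best


def route(prompt: list[int], workers: list[WorkerState]) -> WorkerState:
    """Pick the worker with the best cache hit, penalized by its load."""
    def score(w: WorkerState) -> float:
        return cached_prefix_length(prompt, w) - 0.5 * w.active_requests
    return max(workers, key=score)


if __name__ == "__main__":
    workers = [
        WorkerState("gpu-0", active_requests=4, cached_prefixes=[[1, 2, 3, 4]]),
        WorkerState("gpu-1", active_requests=1, cached_prefixes=[[9, 9]]),
    ]
    print(route([1, 2, 3, 4, 5], workers).name)  # gpu-0: cache hit outweighs load
```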
Distributed KV Cache Manager
The KV Cache Manager manages the KV cache across a multi-tier memory system: it maintains the global radix tree registry, stores and reclaims KV caches quickly, and optimizes cache placement across memory tiers (GPU HBM, system memory, SSD, and remote storage).
This design significantly reduces TTFT, improves throughput, and supports longer context lengths.
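A minimal sketch of the multi-tier idea, assuming a simple LRU-style eviction policy and made-up capacities (not Dynamo's actual cache manager): when GPU HBM fills up, the coldest KV blocks spill to host memory, and from there to SSD.

```python
# Illustrative multi-tier KV cache offloading with a simple LRU-style policy.
from collections import OrderedDict

# Hypothetical tier capacities in "blocks", smallest/fastest tier first.
TIERS = [("gpu_hbm", 2), ("host_mem", 4), ("ssd", 8)]


class TieredKVCache:
    def __init__(self):
        self.tiers = {name: OrderedDict() for name, _ in TIERS}

    def put(self, block_id: str, data: bytes) -> None:
        self._insert(0, block_id, data)

    def _insert(self, tier_idx: int, block_id: str, data: bytes) -> None:
        if tier_idx >= len(TIERS):
            return  # dropped entirely; a real system might recompute it later
        name, capacity = TIERS[tier_idx]
        tier = self.tiers[name]
        tier[block_id] = data
        tier.move_to_end(block_id)
        if len(tier) > capacity:
            evicted_id, evicted_data = tier.popitem(last=False)  # coldest block
            self._insert(tier_idx + 1, evicted_id, evicted_data)

    def locate(self, block_id: str) -> str | None:
        for name, _ in TIERS:
            if block_id in self.tiers[name]:
                return name
        return None


if __name__ == "__main__":
    cache = TieredKVCache()
    for i in range(5):
        cache.put(f"block-{i}", b"kv")
    print(cache.locate("block-0"))  # oldest block has spilled out of GPU HBM
```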
NVIDIA Inference Transfer Library (NIXL)
NIXL is optimized for inference workloads: it delivers an efficient data transfer mechanism, reduces synchronization overhead, batches transfers intelligently, supports heterogeneous memory access (remote memory or storage), and dynamically selects the optimal transport backend. This acceleration is particularly vital for disaggregated serving, ensuring minimal latency when prefill worker nodes transfer KV cache data to decode worker nodes.
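The asyncio sketch below is only a conceptual stand-in for this pattern and does not use NIXL's real API: KV cache blocks produced during prefill are batched and transferred asynchronously so the transfers overlap with ongoing computation.

```python
# Conceptual sketch of overlapping prefill with KV-cache transfer; this does
# not use NIXL's actual API, just asyncio to imitate non-blocking transfers.
import asyncio


async def transfer_kv_blocks(blocks: list[bytes], dest: str) -> None:
    """Pretend to push a batch of KV blocks to a decode worker."""
    await asyncio.sleep(0.01)  # stands in for an RDMA/NVLink transfer
    print(f"sent {len(blocks)} blocks to {dest}")


async def prefill_and_stream(prompt_chunks: list[bytes], dest: str) -> None:
    pending: list[asyncio.Task] = []
    batch: list[bytes] = []
    for chunk in prompt_chunks:
        batch.append(chunk)            # "compute" a KV block for this chunk
        if len(batch) == 2:            # batch small transfers together
            pending.append(asyncio.create_task(transfer_kv_blocks(batch, dest)))
            batch = []
    if batch:
        pending.append(asyncio.create_task(transfer_kv_blocks(batch, dest)))
    await asyncio.gather(*pending)     # transfers overlap with the loop above


if __name__ == "__main__":
    asyncio.run(prefill_and_stream([b"kv"] * 5, dest="decode-worker-0"))
```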
Conclusion
The launch of NVIDIA Dynamo offers enterprises powerful technical support for reducing costs and improving efficiency in large-scale model inference. However, for service providers that lack large-scale GPU cluster resources, effectively leveraging these advanced inference technologies remains a significant challenge.

