
Real-time LLM Inference on Standard GPUs: 3k tokens/s per request

Kog AI has unveiled an inference engine that reaches 3,000 tokens/s per request on standard datacenter GPUs, challenging the notion that dedicated hardware is necessary for ultra-fast LLM responses. The result, achieved through deep software and hardware co-design, targets real-time AI agents, where per-request latency is the critical bottleneck. Hacker News is interested in the implications for AI efficiency and the potential to unlock new agentic applications.

Score: 7
Comments: 0
Highest Rank: #4
Time on Front Page: 11h
First Seen: May 29, 10:00 AM
Last Seen: May 29, 8:00 PM
[Chart: Rank Over Time]

The Lowdown

Kog AI has launched a tech preview of its Kog Inference Engine (KIE), demonstrating real-time LLM inference at an unprecedented speed of 3,000 tokens/s per request on standard GPUs. This achievement is particularly significant for the development of autonomous AI agents, where single-request decoding speed is paramount for seamless, sequential interactions.
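To make the agent argument concrete, here is a rough back-of-the-envelope sketch. It assumes a hypothetical agent that runs 10 dependent steps of 300 output tokens each, and it compares the reported 3,000 tokens/s against an assumed baseline of 100 tokens/s. The workload, the baseline, and the helper name `chain_latency_s` are illustrative assumptions, not figures or code from Kog; prefill and tool-call time are ignored.

```python
def chain_latency_s(steps: int, tokens_per_step: int, tokens_per_s: float) -> float:
    """Decode time for a chain of dependent steps.

    Each step must wait for the previous one to finish, so per-request
    decode speed (not aggregate throughput) sets the total time.
    """
    return steps * tokens_per_step / tokens_per_s


# Hypothetical workload (assumption): 10 dependent steps, 300 output tokens each.
STEPS = 10
TOKENS_PER_STEP = 300

# 100 tokens/s is an assumed baseline; 3,000 tokens/s is the reported figure.
for rate in (100.0, 3000.0):
    seconds = chain_latency_s(STEPS, TOKENS_PER_STEP, rate)
    print(f"{rate:>6.0f} tokens/s -> {seconds:.1f} s of decoding")
```

Under these assumptions the same chain spends 30 s decoding at the baseline rate and 1 s at 3,000 tokens/s, which is why sequential workloads are sensitive to single-request speed rather than batch throughput.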

  • KIE delivers 3,000 output tokens/s on 8x AMD MI300X GPUs and 2,100 tokens/s on 8x NVIDIA H200 GPUs, using a 2B-parameter model without speculative decoding or quantization.
  • The article identifies memory bandwidth, not FLOPS, as the primary bottleneck for low-batch LLM decoding, and notes that typical GPU utilization in this regime is very low because of software overheads.
  • Standard inference stacks lose microseconds at kernel boundaries and to CPU scheduling, grid synchronization, inter-GPU communication, and unoptimized memory access.
  • Kog AI's solution involves a holistic co-design approach, integrating model architecture, runtime, and low-level GPU code to optimize for latency.
  • Key innovations include a persistent "monokernel" runtime that processes the entire decoding sequence without interruption, custom KCCL inter-GPU communication primitives, and the "Laneformer" model architecture utilizing Delayed Tensor Parallelism (DTP).
  • The team also performs deep hardware-aware optimizations, such as topology-aware memory access on chiplet-based GPUs like the AMD MI300X, to eliminate microsecond-level latencies.
  • Kog projects that their engine will scale to large third-party MoE models, expecting speeds of 1,000-5,000 tokens/s/request on future datacenter GPUs.
  • Kog is a Paris-based AI infrastructure startup, founded by Gaël Delalleau, which has raised $5M from Varsity VC and BPI France's Deep Tech Program.
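
The bandwidth and microsecond points in the list above can be sanity-checked with a simple roofline-style estimate. The sketch below assumes 16-bit weights (the summary only says "no quantization"), weights sharded evenly across 8 GPUs, every weight read once per token, and vendor peak HBM bandwidth figures of 5.3 TB/s (MI300X) and 4.8 TB/s (H200). It ignores KV-cache traffic and communication, and the 5 µs cost per kernel boundary is purely illustrative. None of these names or numbers come from Kog's article except the model size and the reported token rates.

```python
def bandwidth_bound_tokens_per_s(params: float, bytes_per_param: float,
                                 gpus: int, hbm_bytes_per_s: float) -> float:
    """Upper bound on single-request decode rate when each token requires
    reading every weight once and weights are sharded evenly across GPUs."""
    weight_bytes = params * bytes_per_param
    return gpus * hbm_bytes_per_s / weight_bytes


def per_token_budget_us(tokens_per_s: float) -> float:
    """Wall-clock time available to produce one token, in microseconds."""
    return 1e6 / tokens_per_s


PARAMS = 2e9                  # 2B-parameter model (from the summary)
BYTES_PER_PARAM = 2           # assumption: 16-bit weights
GPUS = 8
PEAK_BW = {"MI300X": 5.3e12, "H200": 4.8e12}   # vendor peak HBM bandwidth, bytes/s
REPORTED = {"MI300X": 3000.0, "H200": 2100.0}  # tokens/s per request (from the summary)
ASSUMED_BOUNDARY_COST_US = 5.0  # illustrative cost of one kernel boundary (assumption)

for gpu, bw in PEAK_BW.items():
    ceiling = bandwidth_bound_tokens_per_s(PARAMS, BYTES_PER_PARAM, GPUS, bw)
    budget = per_token_budget_us(REPORTED[gpu])
    # How many boundaries would consume the entire per-token budget on their own.
    boundaries = budget / ASSUMED_BOUNDARY_COST_US
    print(f"{gpu}: ceiling {ceiling:,.0f} tok/s, "
          f"reported is {REPORTED[gpu] / ceiling:.0%} of ceiling, "
          f"budget {budget:.0f} us/token, "
          f"{boundaries:.0f} boundaries would fill it")
```

Under these assumptions the idealized ceilings are roughly 10,600 and 9,600 tokens/s, and the per-token budget at the reported rates is only a few hundred microseconds. A few dozen microsecond-scale stalls per token would use up that budget entirely, which is consistent with the summary's emphasis on removing kernel boundaries and synchronization points.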

The demonstration raises the performance ceiling for LLM inference on existing hardware. It shows that deep software optimization can unlock capabilities previously thought exclusive to specialized or next-generation silicon, which in turn widens what real-time agentic systems can do.