Generative AI (GenAI) has rapidly become one of the most widely adopted technologies in history, quickly establishing itself as a core component of countless consumer and enterprise products and services. As the demand for GenAI output grows, an organization’s existing cloud solutions may struggle to deliver acceptable AI inferencing throughput or latency. Limited AI inference capabilities can be an especially critical risk for the increasing number of companies that rely on real-time GenAI for tasks such as live customer support, automated software debugging, streaming data analysis, adaptive advertising, and dynamic supply chain optimization. When additional factors like infrastructure costs, workflow efficiency, and service-level agreements come into play, it’s clear that a business’s choice of a cloud-based inferencing platform can directly influence both its technical trajectory and its ability to compete.

One way to improve the AI inferencing performance of your GenAI applications is to choose the right Kubernetes-based inference platform. Because standard load balancing approaches may not be sufficient for the nuanced demands of LLM inferencing workloads, a Kubernetes-based platform optimized for GenAI workloads can mean the difference between AI services that meet latency requirements and ones that frustrate end users. 

To evaluate how inference-optimized routing can affect performance in a Kubernetes-based inferencing environment, we used the Kubernetes inference-perf benchmark on the Llama 3.1-8B Instruct model to test two cloud environments head-to-head: Google Kubernetes Engine (GKE) with GKE Inference Gateway, and Amazon Elastic Kubernetes Service (EKS) using a standard HTTP application load balancer. To remove hardware as a variable in our comparison, we ran both solutions on identical hardware—eight NVIDIA A100 40GB GPUs.

We found that GKE with GKE Inference Gateway outperformed Amazon EKS on every inference performance measure we tested. Specifically, GKE with GKE Inference Gateway delivered approximately 15.7 percent higher token throughput (allowing a greater number of concurrent users), a 92.8 percent reduction in time to first token (improving users’ perception of responsiveness), and a 62.6 percent reduction in inter-token latency (enabling smoother data streaming). The GKE solution also improved latency stability and behavior under increasing requests-per-second rates, showing fewer extremely slow requests and improving consistency for end users.
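
For readers who want to reproduce this kind of comparison on their own measurements, the short sketch below shows one common way to derive the two latency metrics from per-token timestamps and to express a difference between platforms as a percentage. It is illustrative only: the function names are hypothetical, it is not the inference-perf tool's implementation, and the input numbers are made up for the example rather than taken from our results.

```python
from typing import Sequence, Tuple


def ttft_and_inter_token_latency(
    request_sent: float, token_times: Sequence[float]
) -> Tuple[float, float]:
    """Return (time to first token, mean inter-token latency) in seconds.

    request_sent: timestamp when the request was issued.
    token_times:  timestamps at which each streamed token arrived.
    """
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    ttft = token_times[0] - request_sent
    # Gaps between consecutive tokens; empty if only one token arrived.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_gap


def percent_increase(baseline: float, candidate: float) -> float:
    """Percent by which candidate exceeds baseline (higher is better, e.g. throughput)."""
    return (candidate - baseline) / baseline * 100.0


def percent_reduction(baseline: float, candidate: float) -> float:
    """Percent by which candidate is below baseline (lower is better, e.g. latency)."""
    return (baseline - candidate) / baseline * 100.0


if __name__ == "__main__":
    # Hypothetical numbers chosen only to make the arithmetic easy to follow.
    print(percent_increase(200.0, 250.0))   # 25.0  (tokens/s: 200 -> 250)
    print(percent_reduction(400.0, 100.0))  # 75.0  (latency ms: 400 -> 100)
```

Note that a throughput gain is reported relative to the slower baseline, while a latency reduction is reported relative to the higher baseline, so the two percentages are not directly comparable in magnitude.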

Together, the throughput, latency, and stability improvements that GKE with GKE Inference Gateway delivers could translate into better end-user experiences and improved efficiency for infrastructure operators—and lay a better foundation for scaling interactive GenAI workloads. For companies integrating GenAI into their production architecture, GKE with GKE Inference Gateway can provide meaningful improvements in responsiveness, capacity, and cost efficiency—even on the same GPU hardware. 

To learn more about how we compared AI inference performance on Google GKE with GKE Inference Gateway and Amazon EKS, check out the report below.