Platform Modernization
16-week engagement

Modernizing a Kubeflow-based ML platform into an enterprise inference platform

The client's inference workloads had outgrown their Kubeflow-based platform. What started as a flexible ML workflow environment had become difficult to operate at production scale: model rollouts were slow, serving paths were inconsistent, and the platform was not optimized for modern LLM and accelerator-heavy inference workloads.

We led an end-to-end modernization from the legacy Kubeflow environment to an enterprise-grade inference platform built on Ray, cloud-managed Kubernetes, vLLM, and NVIDIA Triton. The new architecture provided a unified serving fabric across GPU and TPU hardware while allowing the team to migrate workloads incrementally with no production downtime.

GPU + TPU unified serving fabric

Faster, safer model rollouts

Lower platform complexity

The Challenge

The existing platform relied heavily on Kubeflow for orchestration and ML workflow management. While Kubeflow provided a strong foundation for Kubernetes-native ML pipelines, the client's production inference needs had evolved beyond what the legacy implementation could comfortably support.

Several problems were becoming increasingly expensive to manage:

Model releases required too many manual steps and environment-specific changes
Inference services were deployed inconsistently across teams
Rollbacks were slow and risky
GPU utilization was difficult to optimize across diverse workloads
LLM serving required higher throughput, better batching, and more predictable latency
The team wanted to use TPU capacity for suitable workloads, but the existing platform did not provide a clean path for mixed GPU and TPU serving
Operational ownership was unclear because pipeline orchestration, model deployment, serving runtime behavior, and infrastructure concerns were tightly coupled

The goal was not simply to replace Kubeflow but to separate concerns: use the right tool for distributed serving, the right runtime for each model type, and the right infrastructure abstraction for accelerator-aware scheduling.

Target Architecture

We designed a unified, multi-runtime inference platform that decoupled orchestration from model serving. Instead of forcing every workload through one serving path, the new platform allowed each model to run on the runtime and accelerator best suited to its behavior.

Ray served as the distributed compute and serving control plane. It provided a flexible foundation for managing distributed workloads, scaling replicas, routing traffic, and coordinating inference services across the cluster.

vLLM was used for high-throughput LLM serving, where features like continuous batching, efficient KV-cache management, and distributed inference support are critical for serving modern language models efficiently.

NVIDIA Triton Inference Server was used for non-LLM and multi-framework workloads, including models that required standardized model repositories, dynamic batching, ensemble pipelines, or support across multiple ML frameworks.

The platform ran on cloud-managed Kubernetes with dedicated GPU and TPU node pools. Accelerator-aware scheduling ensured that workloads landed on the right hardware based on model requirements, cost profile, and runtime compatibility.
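As a rough illustration of the routing and placement idea described above, the sketch below maps a model description to a serving runtime and a node pool. It is not the client's implementation: the `ModelSpec` fields, the placement policy, and the `example.com/accelerator-pool` label key are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical description of a model's serving requirements."""
    name: str
    is_llm: bool          # assumption: LLMs are flagged explicitly
    tpu_validated: bool   # assumption: the serving path was validated on TPU
    tpu_preferred: bool   # assumption: cost/performance review favors TPU


def choose_runtime(spec: ModelSpec) -> str:
    # LLM workloads go to vLLM; everything else goes to Triton.
    return "vllm" if spec.is_llm else "triton"


def choose_pool(spec: ModelSpec) -> str:
    # Illustrative policy: use TPU only when the workload is both
    # validated for it and preferred there; otherwise default to GPU.
    if spec.tpu_validated and spec.tpu_preferred:
        return "tpu"
    return "gpu"


def placement(spec: ModelSpec) -> dict:
    # The label key below is a made-up example, not a real cloud label.
    return {
        "runtime": choose_runtime(spec),
        "nodeSelector": {"example.com/accelerator-pool": choose_pool(spec)},
    }
```

The returned `nodeSelector` mapping has the shape a Kubernetes pod spec expects, so a deployment template could embed it directly once real node-pool labels are substituted.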

Core architecture components included:

Ray as the distributed serving and compute control plane
Ray Serve for request routing, deployment management, and scalable service composition
vLLM for high-throughput LLM inference
NVIDIA Triton for multi-framework, traditional ML, and ensemble model serving
Cloud-managed Kubernetes for cluster operations, networking, node lifecycle management, and workload isolation
GPU and TPU node pools for accelerator-specific workloads
Standardized model packaging to make deployments repeatable
A model registry to track model versions, artifacts, metadata, and promotion stages
Centralized observability for latency, throughput, error rate, GPU/TPU utilization, queue depth, and rollout health
Canary, shadow, and rollback workflows to reduce migration and release risk

This gave the client a serving platform that was both flexible and controlled: teams could choose the right runtime for each model, but releases still followed a common operational standard.
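One way to picture the "common operational standard" is a registry that only permits certain promotion transitions. The sketch below is a minimal example under assumed stage names (`staging`, `canary`, `production`, `archived`); the actual registry and its stages are not specified in this case study.

```python
# Hypothetical promotion stages and the transitions allowed between them.
ALLOWED_TRANSITIONS = {
    "staging": {"canary", "archived"},
    "canary": {"production", "staging", "archived"},
    "production": {"archived"},
    "archived": set(),
}


def promote(current: str, target: str) -> str:
    """Return the new stage, or raise if the transition is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move a model version from {current!r} to {target!r}")
    return target
```

Under this sketch, a version cannot skip straight from `staging` to `production`; it has to pass through `canary` first.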

Target architecture — Ray routes each model to the right runtime (vLLM or Triton) and accelerator pool (GPU or TPU); dashed lines are deploys and telemetry.

Migration Approach

We avoided a risky "big bang" migration. Instead, we ran the new platform alongside the existing Kubeflow environment and moved workloads incrementally behind a routing layer.

The first phase was discovery. We inventoried every production model and grouped workloads by traffic volume, business criticality, hardware requirements, latency sensitivity, framework, dependency complexity, and migration risk.

Next, we established baseline behavior for each model. This included latency, throughput, error rate, cost profile, model output parity, and operational dependencies. The baseline gave us a clear definition of success before moving traffic.
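A baseline like this can be turned into an explicit pass/fail check. The following sketch assumes three of the metrics named above and uses illustrative tolerances; the field names and threshold values are assumptions, not figures from the engagement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    p95_latency_ms: float
    throughput_rps: float
    error_rate: float  # fraction of failed requests, e.g. 0.002


def regressions(
    baseline: Metrics,
    candidate: Metrics,
    latency_slack: float = 0.10,     # assumed: allow 10% higher p95 latency
    throughput_slack: float = 0.05,  # assumed: allow 5% lower throughput
    error_slack: float = 0.001,      # assumed: allow +0.1 pt error rate
) -> list:
    """Return the names of metrics where the candidate is worse than allowed."""
    problems = []
    if candidate.p95_latency_ms > baseline.p95_latency_ms * (1 + latency_slack):
        problems.append("p95_latency_ms")
    if candidate.throughput_rps < baseline.throughput_rps * (1 - throughput_slack):
        problems.append("throughput_rps")
    if candidate.error_rate > baseline.error_rate + error_slack:
        problems.append("error_rate")
    return problems
```

An empty list means the new serving path stays within the agreed bounds for that model; any entry names the metric that blocks the cutover.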

For each workload, we built the new serving path on the modern platform and validated it before production cutover. Where possible, we shadow-tested production traffic through the new service and compared outputs, latency, and resource usage against the legacy path.
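Output comparison during shadow testing can be sketched as a parity rate over paired responses. The example below assumes responses have already been captured as (legacy, candidate) pairs and that numeric outputs may differ by a small relative tolerance; both the pairing mechanism and the tolerance are assumptions made for illustration.

```python
import math


def outputs_match(legacy, candidate, rel_tol: float = 1e-3) -> bool:
    """Compare two outputs, tolerating small numeric drift."""
    numeric = (int, float)
    if isinstance(legacy, numeric) and isinstance(candidate, numeric):
        return math.isclose(legacy, candidate, rel_tol=rel_tol)
    if isinstance(legacy, (list, tuple)) and isinstance(candidate, (list, tuple)):
        return len(legacy) == len(candidate) and all(
            outputs_match(a, b, rel_tol) for a, b in zip(legacy, candidate)
        )
    # Anything else (strings, dicts, None) must match exactly.
    return legacy == candidate


def parity_rate(pairs) -> float:
    """Fraction of shadowed requests whose outputs match the legacy path."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no shadow traffic captured")
    matches = sum(1 for legacy, candidate in pairs if outputs_match(legacy, candidate))
    return matches / len(pairs)
```

Exact matching is usually too strict for generative models, so an LLM workload would likely need a task-specific comparison in place of `outputs_match`.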

Cutovers were performed model by model. Traffic was gradually shifted from the Kubeflow path to the new Ray-based serving platform using canary releases and controlled routing. Each migration had an immediate rollback path, allowing the team to revert quickly if latency, quality, or error rates moved outside acceptable bounds.
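The cutover logic can be sketched in two parts: a deterministic router that sends a fixed percentage of requests to the new path, and a ramp that steps the percentage up and drops to zero on the first failed health check. The step values, the `is_healthy` callback, and the path names are hypothetical; the real routing layer is not described in detail here.

```python
import hashlib
from typing import Callable, Sequence


def route(request_id: str, canary_percent: int) -> str:
    """Stable routing: the same request ID always lands in the same bucket."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in 0..99
    return "modern" if bucket < canary_percent else "legacy"


def run_canary(steps: Sequence[int], is_healthy: Callable[[int], bool]):
    """Ramp traffic through `steps`; roll back to 0% on the first failure.

    Returns (final_percent, history_of_applied_percentages).
    """
    if not steps:
        raise ValueError("at least one canary step is required")
    history = []
    for percent in steps:
        history.append(percent)
        if not is_healthy(percent):
            history.append(0)  # immediate rollback to the legacy path
            return 0, history
    return steps[-1], history
```

For example, `run_canary([1, 10, 50, 100], check)` would apply each percentage in turn, where `check` is whatever health gate the team uses for latency, quality, and error rate at that traffic level.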

The migration process included:

Inventorying all existing workloads and ranking them by traffic, criticality, and migration complexity
Establishing baseline latency, throughput, quality, and cost metrics for each model
Standardizing container images, model artifacts, runtime configuration, and deployment templates
Mapping each workload to the right serving runtime: vLLM for LLMs, Triton for multi-framework and ensemble workloads
Validating GPU and TPU placement requirements through Kubernetes scheduling policies
Shadow-testing new serving paths against production traffic before cutover
Using canary releases to shift traffic gradually
Keeping the legacy Kubeflow path available for rollback until each workload was fully validated
Creating operational runbooks for deployment, rollback, scaling, incident response, and model promotion

This approach allowed the business to keep running while the platform was modernized underneath it.

Zero-downtime migration — legacy and modern paths run side by side behind a routing layer; dashed lines are shadow traffic and the rollback path.

Outcome

The modernization produced a more reliable, scalable, and maintainable inference platform.

Model deployment became more repeatable because packaging, runtime configuration, promotion, and rollback workflows were standardized. Teams no longer had to manage every model release as a custom infrastructure exercise.

The platform also became more flexible. LLM workloads could run on vLLM, traditional and multi-framework models could run on Triton, and accelerator-specific workloads could be placed on GPU or TPU capacity depending on performance and cost requirements.

Operationally, the team gained clearer ownership boundaries. Kubeflow no longer had to carry responsibilities better handled by a production serving layer. Ray, vLLM, Triton, Kubernetes, and the model registry each served a specific purpose in the architecture.

The client gained:

A unified inference platform across GPU and TPU workloads
Faster and safer model rollout workflows
Runtime flexibility across LLM, traditional ML, and ensemble models
Better accelerator utilization through workload-aware placement
Reduced operational complexity from standardized deployment patterns
Zero-downtime migration through shadow testing, canary releases, and rollback controls
Clear runbooks that allowed the internal team to operate the platform independently

Technologies

Ray, Ray Serve, vLLM, NVIDIA Triton Inference Server, Kubernetes, KubeRay, GPU node pools, TPU node pools, Model registry, Canary deployment, Shadow traffic testing, Observability
