Platform Modernization · 16-week engagement

Modernizing a Kubeflow-based ML platform into an enterprise inference platform

The client's inference workloads had outgrown their Kubeflow-based platform. What started as a flexible ML workflow environment had become difficult to operate at production scale: model rollouts were slow, serving paths were inconsistent, and the platform was not optimized for modern LLM and accelerator-heavy inference workloads.

We led an end-to-end modernization from the legacy Kubeflow environment to an enterprise-grade inference platform built on Ray, cloud-managed Kubernetes, vLLM, and NVIDIA Triton. The new architecture provided a unified serving fabric across GPU and TPU hardware while allowing the team to migrate workloads incrementally with no production downtime.

  • GPU + TPU: unified serving fabric
  • Faster, safer model rollouts
  • Lower platform complexity

The Challenge

The existing platform relied heavily on Kubeflow for orchestration and ML workflow management. While Kubeflow provided a strong foundation for Kubernetes-native ML pipelines, the client's production inference needs had evolved beyond what the legacy implementation could comfortably support.

Several problems were becoming increasingly expensive to manage:

  • Model releases required too many manual steps and environment-specific changes
  • Inference services were deployed inconsistently across teams
  • Rollbacks were slow and risky
  • GPU utilization was difficult to optimize across diverse workloads
  • LLM serving required higher throughput, better batching, and more predictable latency
  • The team wanted to use TPU capacity for suitable workloads, but the existing platform did not provide a clean path for mixed GPU and TPU serving
  • Operational ownership was unclear because pipeline orchestration, model deployment, serving runtime behavior, and infrastructure concerns were tightly coupled

The goal was not simply to replace Kubeflow but to separate concerns: use the right tool for distributed serving, the right runtime for each model type, and the right infrastructure abstraction for accelerator-aware scheduling.

Target Architecture

We designed a unified, multi-runtime inference platform that decoupled orchestration from model serving. Instead of forcing every workload through one serving path, the new platform allowed each model to run on the runtime and accelerator best suited to its behavior.

Ray served as the distributed compute and serving control plane. It provided a flexible foundation for managing distributed workloads, scaling replicas, routing traffic, and coordinating inference services across the cluster.

vLLM was used for high-throughput LLM serving, where continuous batching, efficient KV-cache management, and distributed inference support are critical to serving modern language models efficiently.

NVIDIA Triton Inference Server was used for non-LLM and multi-framework workloads, including models that required standardized model repositories, dynamic batching, ensemble pipelines, or support across multiple ML frameworks.
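The runtime split described above can be sketched as a small selection function. This is a minimal illustration under stated assumptions, not the client's implementation: `ModelSpec`, its fields, and `select_runtime` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class Runtime(Enum):
    VLLM = "vllm"
    TRITON = "triton"


@dataclass(frozen=True)
class ModelSpec:
    # Hypothetical model descriptor; the field names are assumptions.
    name: str
    is_llm: bool
    framework: str = "pytorch"
    is_ensemble: bool = False


def select_runtime(spec: ModelSpec) -> Runtime:
    """Send standalone LLMs to vLLM; everything else goes to Triton."""
    if spec.is_llm and not spec.is_ensemble:
        return Runtime.VLLM
    # Non-LLM, multi-framework, and ensemble workloads use Triton.
    return Runtime.TRITON
```

Keeping the decision in one function means the mapping rule is reviewable in a single place rather than being scattered across per-team deployment scripts.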

The platform ran on cloud-managed Kubernetes with dedicated GPU and TPU node pools. Accelerator-aware scheduling ensured that workloads landed on the right hardware based on model requirements, cost profile, and runtime compatibility.
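One way to picture accelerator-aware placement is as a function that turns a model's hardware requirement into Kubernetes scheduling fields. The sketch below is an assumption-laden illustration: the node label `example.com/accelerator-pool` is hypothetical, `nvidia.com/gpu` is the resource name exposed by the NVIDIA device plugin, and `google.com/tpu` is the name used on GKE, assumed here because the source does not name the cloud provider.

```python
from dataclasses import dataclass

# Assumed extended resource names; adjust to the cluster's device plugins.
RESOURCE_NAMES = {"gpu": "nvidia.com/gpu", "tpu": "google.com/tpu"}

# Hypothetical node label used to mark each accelerator node pool.
POOL_LABEL = "example.com/accelerator-pool"


@dataclass(frozen=True)
class Placement:
    accelerator: str  # "gpu" or "tpu"
    count: int        # number of accelerator devices requested


def scheduling_fields(placement: Placement) -> dict:
    """Build the nodeSelector and container resource limits for a workload."""
    if placement.accelerator not in RESOURCE_NAMES:
        raise ValueError(f"unknown accelerator: {placement.accelerator!r}")
    if placement.count < 1:
        raise ValueError("count must be at least 1")
    resource = RESOURCE_NAMES[placement.accelerator]
    return {
        # Goes on the pod spec: restricts scheduling to the matching pool.
        "nodeSelector": {POOL_LABEL: placement.accelerator},
        # Goes on the container spec: requests the devices themselves.
        "containerResources": {"limits": {resource: placement.count}},
    }
```

For example, `scheduling_fields(Placement("gpu", 2))` yields a selector for the GPU pool and a limit of two `nvidia.com/gpu` devices.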

Core architecture components included:

  • Ray as the distributed serving and compute control plane
  • Ray Serve for request routing, deployment management, and scalable service composition
  • vLLM for high-throughput LLM inference
  • NVIDIA Triton for multi-framework, traditional ML, and ensemble model serving
  • Cloud-managed Kubernetes for cluster operations, networking, node lifecycle management, and workload isolation
  • GPU and TPU node pools for accelerator-specific workloads
  • Standardized model packaging to make deployments repeatable
  • A model registry to track model versions, artifacts, metadata, and promotion stages
  • Centralized observability for latency, throughput, error rate, GPU/TPU utilization, queue depth, and rollout health
  • Canary, shadow, and rollback workflows to reduce migration and release risk

This gave the client a serving platform that was both flexible and controlled: teams could choose the right runtime for each model, but releases still followed a common operational standard.
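The registry's version tracking and promotion stages listed above could be modeled roughly as follows. This is a sketch only: the stage names, the `ModelVersion` fields, and the `promote` and `rollback` helpers are assumptions, since the source does not describe the registry's schema.

```python
from dataclasses import dataclass, field

# Assumed promotion stages, in order; the source does not name them.
STAGES = ("registered", "staging", "canary", "production")


@dataclass
class ModelVersion:
    name: str
    version: str
    artifact_uri: str
    stage: str = "registered"
    history: list = field(default_factory=list)  # previously held stages


def promote(mv: ModelVersion) -> ModelVersion:
    """Move a model version one stage forward, recording the previous stage."""
    idx = STAGES.index(mv.stage)
    if idx == len(STAGES) - 1:
        raise ValueError(f"{mv.name}:{mv.version} is already in production")
    mv.history.append(mv.stage)
    mv.stage = STAGES[idx + 1]
    return mv


def rollback(mv: ModelVersion) -> ModelVersion:
    """Return a model version to the stage it held before the last promotion."""
    if not mv.history:
        raise ValueError(f"{mv.name}:{mv.version} has no earlier stage")
    mv.stage = mv.history.pop()
    return mv
```

Restricting promotion to one stage at a time is a design choice in this sketch; it keeps every release on the same path regardless of which runtime serves the model.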

Figure: Target architecture. Ray routes each model to the right runtime (vLLM or Triton) and accelerator pool (GPU or TPU); dashed lines are deploys and telemetry. Components shown: inference requests, Ray as the distributed serving control plane (Ray Serve for routing, replica management, and scaling), the model registry (versions, artifacts, promotion), vLLM (continuous batching, KV-cache), NVIDIA Triton (multi-framework, ensembles, dynamic batching), Kubernetes with accelerator-aware scheduling, GPU and TPU node pools, and observability.

Migration Approach

We avoided a risky "big bang" migration. Instead, we ran the new platform alongside the existing Kubeflow environment and moved workloads incrementally behind a routing layer.

The first phase was discovery. We inventoried every production model and grouped workloads by traffic volume, business criticality, hardware requirements, latency sensitivity, framework, dependency complexity, and migration risk.
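A discovery inventory like this lends itself to a simple ordering function. The sketch below assumes that lower-risk, less critical, lower-traffic workloads migrate first; the source does not state the ordering rule, and the `Workload` fields and 1-to-3 scales are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    # Hypothetical inventory record; scales are assumptions (1 = low, 3 = high).
    name: str
    requests_per_sec: float
    criticality: int
    migration_risk: int


def migration_order(workloads: list) -> list:
    """Order workloads so the lowest-risk, least critical ones move first."""
    return sorted(
        workloads,
        key=lambda w: (w.migration_risk, w.criticality, w.requests_per_sec),
    )
```

Starting with low-stakes workloads lets the team exercise the cutover and rollback procedures before they are needed for business-critical models.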

Next, we established baseline behavior for each model. This included latency, throughput, error rate, cost profile, model output parity, and operational dependencies. The baseline gave us a clear definition of success before moving traffic.

For each workload, we built the new serving path on the modern platform and validated it before production cutover. Where possible, we shadow-tested production traffic through the new service and compared outputs, latency, and resource usage against the legacy path.
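The shadow comparison can be illustrated with a small report over mirrored requests. This is a sketch under assumptions: `ShadowSample` and `shadow_report` are hypothetical names, and outputs are compared by exact match, which suits deterministic models; LLM outputs would usually need a tolerance or a task-specific quality metric instead.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class ShadowSample:
    # One mirrored request, answered by both the legacy and the new path.
    legacy_output: str
    candidate_output: str
    legacy_latency_ms: float
    candidate_latency_ms: float


def p95(values: list) -> float:
    """95th percentile using linear interpolation between data points."""
    return quantiles(values, n=100, method="inclusive")[94]


def shadow_report(samples: list) -> dict:
    """Summarize output parity and p95 latency for both serving paths."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    mismatches = sum(s.legacy_output != s.candidate_output for s in samples)
    return {
        "mismatch_rate": mismatches / len(samples),
        "legacy_p95_ms": p95([s.legacy_latency_ms for s in samples]),
        "candidate_p95_ms": p95([s.candidate_latency_ms for s in samples]),
    }
```

Because shadow responses are never returned to callers, a report like this can be gathered on real traffic without affecting users.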

Cutovers were performed model by model. Traffic was gradually shifted from the Kubeflow path to the new Ray-based serving platform using canary releases and controlled routing. Each migration had an immediate rollback path, allowing the team to revert quickly if latency, quality, or error rates moved outside acceptable bounds.
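The gradual shift with an immediate rollback path can be sketched as a loop over traffic steps with a health gate at each one. The names `Health`, `Bounds`, and `run_canary`, the step percentages, and the threshold values are all illustrative assumptions; the source does not specify the bounds used or how quality was measured.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Health:
    p95_latency_ms: float
    error_rate: float     # fraction of failed requests, 0.0 to 1.0
    quality_score: float  # higher is better; the metric is an assumption


@dataclass(frozen=True)
class Bounds:
    max_p95_latency_ms: float
    max_error_rate: float
    min_quality_score: float


def within_bounds(health: Health, bounds: Bounds) -> bool:
    """True when latency, error rate, and quality are all acceptable."""
    return (
        health.p95_latency_ms <= bounds.max_p95_latency_ms
        and health.error_rate <= bounds.max_error_rate
        and health.quality_score >= bounds.min_quality_score
    )


def run_canary(
    steps: Sequence[int],
    observe: Callable[[int], Health],
    bounds: Bounds,
) -> int:
    """Walk through traffic percentages for the new path.

    Returns the final percentage served by the new path: the last step
    on success, or 0 if any step breaches the bounds (full rollback).
    """
    current = 0
    for percent in steps:
        current = percent
        if not within_bounds(observe(percent), bounds):
            return 0  # revert all traffic to the legacy path
    return current
```

In practice `observe` would query the observability stack after a soak period at each step; here it is just a callable so the gating logic can be tested in isolation.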

The migration process included:

  • Inventorying all existing workloads and ranking them by traffic, criticality, and migration complexity
  • Establishing baseline latency, throughput, quality, and cost metrics for each model
  • Standardizing container images, model artifacts, runtime configuration, and deployment templates
  • Mapping each workload to the right serving runtime: vLLM for LLMs, Triton for multi-framework and ensemble workloads
  • Validating GPU and TPU placement requirements through Kubernetes scheduling policies
  • Shadow-testing new serving paths against production traffic before cutover
  • Using canary releases to shift traffic gradually
  • Keeping the legacy Kubeflow path available for rollback until each workload was fully validated
  • Creating operational runbooks for deployment, rollback, scaling, incident response, and model promotion

This approach allowed the business to keep running while the platform was modernized underneath it.

Figure: Zero-downtime migration. Legacy and modern paths run side by side behind a routing layer; dashed lines are shadow traffic and the rollback path. The legacy Kubeflow serving path stays available for rollback while its traffic shrinks over time, and the modern Ray + vLLM / Triton path is shadow-validated before cutover while its canary traffic grows. Stages, applied incrementally and model by model: Discovery, Baseline, Shadow test, Canary, Cutover.

Outcome

The modernization produced a more reliable, scalable, and maintainable inference platform.

Model deployment became more repeatable because packaging, runtime configuration, promotion, and rollback workflows were standardized. Teams no longer had to manage every model release as a custom infrastructure exercise.

The platform also became more flexible. LLM workloads could run on vLLM, traditional and multi-framework models could run on Triton, and accelerator-specific workloads could be placed on GPU or TPU capacity depending on performance and cost requirements.

Operationally, the team gained clearer ownership boundaries. Kubeflow no longer had to carry responsibilities better handled by a production serving layer. Ray, vLLM, Triton, Kubernetes, and the model registry each served a specific purpose in the architecture.

The client gained:

  • A unified inference platform across GPU and TPU workloads
  • Faster and safer model rollout workflows
  • Runtime flexibility across LLM, traditional ML, and ensemble models
  • Better accelerator utilization through workload-aware placement
  • Reduced operational complexity from standardized deployment patterns
  • Zero-downtime migration through shadow testing, canary releases, and rollback controls
  • Clear runbooks that allowed the internal team to operate the platform independently

Technologies

  • Ray
  • Ray Serve
  • vLLM
  • NVIDIA Triton Inference Server
  • Kubernetes
  • KubeRay
  • GPU node pools
  • TPU node pools
  • Model registry
  • Canary deployment
  • Shadow traffic testing
  • Observability
