Infrastructure & Networking · 6-week engagement

High-performance GPU networking for distributed inference

The client was serving a large model that could not fit on a single GPU. To run the model in production, the serving stack used tensor parallelism across GPUs and pipeline parallelism across nodes.

The model was functional, but performance was being limited by communication overhead. GPUs were spending too much time waiting on collective operations, cross-node transfers, and synchronization instead of doing useful compute.

We redesigned the networking, scheduling, and model placement strategy around the physical GPU interconnect topology. The goal was to make sure the most communication-heavy parts of the workload stayed on the fastest available paths, using NVLink and NVSwitch inside the node and InfiniBand with RDMA and GPUDirect across nodes.

Faster inter-GPU communication
Higher large-model throughput
Better effective GPU utilization

The Challenge

The model was too large to fit on a single GPU, so it was distributed across multiple accelerators.

Tensor parallelism was used to split individual model operations across GPUs. This helped the model fit in memory and allowed large matrix operations to be distributed, but it also introduced frequent communication between GPUs during inference.

Pipeline parallelism was used to split groups of layers across multiple GPUs and nodes. This made it possible to scale the model beyond one machine, but it introduced stage-to-stage communication and scheduling dependencies.

The issue was not simply that the model was distributed. The issue was that the distributed layout did not match the physical network topology.

The existing placement strategy treated GPUs as mostly interchangeable. In practice, they were not. Some GPUs communicated over fast NVLink or NVSwitch paths, while others had to communicate over slower PCIe or cross-node network paths. When tightly coupled tensor-parallel ranks were placed across slower links, collective communication became part of the latency bottleneck.

The platform was experiencing several problems:

  • Tensor-parallel ranks were sometimes spread across slow communication paths
  • Collective operations such as all-reduce and all-gather were consuming too much serving time
  • Pipeline stages crossed nodes without a clear strategy for inter-node bandwidth and latency
  • GPUs completed local compute and then waited for data from other ranks
  • Cross-node communication competed with other network traffic
  • Scaling to more GPUs did not produce the expected throughput gains
  • Expensive accelerators were underutilized because communication, not compute, was limiting performance

The client had enough GPU capacity on paper, but the interconnect had become the real bottleneck.

Networking Architecture

We rebuilt the distributed inference layout around the hierarchy of GPU communication.

The core principle was simple: keep the heaviest communication on the fastest links.

Inside each node, we prioritized NVLink and NVSwitch for high-bandwidth GPU-to-GPU communication. This was especially important for tensor parallelism because tensor-parallel ranks exchange intermediate results frequently during model execution.
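A rough way to see why link choice matters for tensor parallelism is the standard ring all-reduce cost model. The sketch below is illustrative only: the function name, the message size, and the bandwidth and latency figures are assumed placeholders rather than measurements from this engagement, and NCCL may select a different algorithm than a ring.

```python
def ring_allreduce_seconds(size_bytes, n_ranks, bandwidth_bytes_per_s, latency_s=0.0):
    """Estimate ring all-reduce time with the usual latency/bandwidth cost model.

    A ring all-reduce runs 2 * (n - 1) steps, and each step moves
    size / n bytes over every link in the ring.
    """
    if n_ranks < 1:
        raise ValueError("n_ranks must be >= 1")
    if n_ranks == 1:
        return 0.0  # nothing to exchange with a single rank
    steps = 2 * (n_ranks - 1)
    per_step = latency_s + (size_bytes / n_ranks) / bandwidth_bytes_per_s
    return steps * per_step


# Placeholder numbers for illustration only (assumptions, not measured values).
MESSAGE_BYTES = 64 * 1024 * 1024
ASSUMED_LINKS = {
    "fast intra-node link": {"bandwidth": 300e9, "latency": 5e-6},
    "slow cross-node link": {"bandwidth": 25e9, "latency": 20e-6},
}

if __name__ == "__main__":
    for name, link in ASSUMED_LINKS.items():
        t = ring_allreduce_seconds(MESSAGE_BYTES, 8, link["bandwidth"], link["latency"])
        print(f"{name}: {t * 1e3:.3f} ms per all-reduce")
```

Because tensor parallelism issues collectives many times per forward pass, the per-collective difference between links is multiplied across the whole request, which is why the slow path shows up directly in serving latency.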

Across nodes, we used InfiniBand with RDMA and GPUDirect to reduce latency and avoid unnecessary CPU memory copies during GPU-to-GPU transfers. This path carried pipeline-stage transfers and any other cross-node communication that could not be avoided.

The target architecture included:

  • NVLink and NVSwitch for intra-node GPU-to-GPU communication
  • InfiniBand for high-throughput, low-latency inter-node networking
  • RDMA and GPUDirect for direct GPU-to-network data movement
  • NCCL for optimized GPU collective communication
  • Tensor-parallel groups placed within fast intra-node GPU domains where possible
  • Pipeline-parallel stages placed across nodes deliberately, so stage-to-stage transfers use InfiniBand-backed paths
  • Kubernetes scheduling constraints to preserve placement guarantees
  • Node labels, affinity rules, and topology-aware placement policies
  • Benchmarking and profiling to validate actual bandwidth and latency by path

This changed the infrastructure strategy from "find available GPUs" to "place model partitions according to how they communicate."
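One way such a placement rule could be expressed is sketched below as a Kubernetes Pod manifest built in Python. This is a minimal sketch under stated assumptions, not the client's configuration: the label keys (`example.com/gpu-interconnect`, `example.com/ib-fabric`), their values, the pod naming, and the one-pod-per-pipeline-stage layout are all hypothetical. The `nvidia.com/gpu` resource name assumes the NVIDIA device plugin is installed.

```python
import json

# Hypothetical label keys; a real cluster would define its own.
INTERCONNECT_LABEL = "example.com/gpu-interconnect"
FABRIC_LABEL = "example.com/ib-fabric"


def stage_pod_manifest(stage_index, tp_size, image, fabric="fabric-a"):
    """Build a Pod manifest (as a dict) for one pipeline stage.

    Requesting all tensor-parallel GPUs in a single pod keeps the
    tensor-parallel group on one node. Node affinity restricts the pod
    to nodes labeled with the expected interconnect and fabric.
    """
    match_expressions = [
        {"key": INTERCONNECT_LABEL, "operator": "In", "values": ["nvswitch"]},
        {"key": FABRIC_LABEL, "operator": "In", "values": [fabric]},
    ]
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"inference-stage-{stage_index}",
            "labels": {"app": "inference", "pipeline-stage": str(stage_index)},
        },
        "spec": {
            "affinity": {
                "nodeAffinity": {
                    "requiredDuringSchedulingIgnoredDuringExecution": {
                        "nodeSelectorTerms": [{"matchExpressions": match_expressions}]
                    }
                }
            },
            "containers": [
                {
                    "name": "server",
                    "image": image,
                    "resources": {"limits": {"nvidia.com/gpu": tp_size}},
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = stage_pod_manifest(0, 4, "registry.example.com/inference:latest")
    print(json.dumps(manifest, indent=2))
```

Note that requesting several GPUs on one node does not by itself guarantee they share an NVLink or NVSwitch domain on every hardware layout, so the node labels need to reflect what was actually measured.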

Figure: Interconnect topology. Pipeline Stage 1 (layers 0–N) runs on Node 1 (GPUs 0–3) and Pipeline Stage 2 (layers N–M) runs on Node 2 (GPUs 4–7). Each node is one NVSwitch domain, and its tensor-parallel group communicates over NVLink/NVSwitch (intra-node, fastest). Pipeline-stage transfers between nodes use InfiniBand with RDMA/GPUDirect.

Implementation

We began by measuring the actual communication behavior of the cluster.

Instead of assuming that all GPUs and nodes had equivalent performance, we benchmarked the interconnect paths directly. We measured NCCL collective performance across GPU pairs, NVLink domains, PCIe paths, and node boundaries. The profiling focused on the communication patterns that matter in distributed inference — all-reduce, all-gather, reduce-scatter, broadcast, point-to-point GPU transfers, and cross-node pipeline-stage transfers.
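When comparing paths, raw collective timings are easier to interpret once they are converted to a bandwidth figure that is independent of the number of ranks. The sketch below applies the algorithm-bandwidth and bus-bandwidth conventions documented by the nccl-tests project, as we understand them. The function and dictionary names are illustrative assumptions, and the timings would come from whatever benchmark harness is in use.

```python
# Correction factors that turn algorithm bandwidth into bus bandwidth,
# following the conventions described in the nccl-tests documentation.
BUS_FACTORS = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
    "broadcast": lambda n: 1.0,
}


def bandwidth_gb_per_s(collective, size_bytes, elapsed_s, n_ranks):
    """Return (algorithm bandwidth, bus bandwidth) in GB/s for one measurement.

    size_bytes is assumed to be the total message size as reported by the
    benchmark, and elapsed_s the measured time for one collective call.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    if n_ranks < 1:
        raise ValueError("n_ranks must be >= 1")
    algbw = size_bytes / elapsed_s / 1e9
    busbw = algbw * BUS_FACTORS[collective](n_ranks)
    return algbw, busbw


if __name__ == "__main__":
    # Placeholder measurement for illustration only (not a real result).
    algbw, busbw = bandwidth_gb_per_s("all_reduce", 1 << 30, 0.010, 8)
    print(f"algbw={algbw:.1f} GB/s busbw={busbw:.1f} GB/s")
```

Bus bandwidth is the figure to compare against a link's nominal rate, since it stays roughly constant as the rank count changes.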

Once we understood the real bandwidth and latency characteristics of the cluster, we mapped the model's parallelism strategy onto the topology.

Tensor-parallel groups were pinned within NVLink or NVSwitch domains wherever possible. This kept the most frequent collective communication on the fastest available intra-node links.
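A minimal sketch of a rank layout that has this property follows. It assumes contiguous rank numbering per node, no data parallelism, and a tensor-parallel size that divides the GPUs per node. The function names are hypothetical and the serving framework actually used is not specified here.

```python
def build_groups(tp_size, pp_size):
    """Return (tensor-parallel groups, pipeline-parallel groups) as rank lists.

    Tensor-parallel ranks are contiguous, so each group occupies one block
    of tp_size consecutive ranks. Pipeline groups stride across the blocks.
    """
    world_size = tp_size * pp_size
    tp_groups = [list(range(s * tp_size, (s + 1) * tp_size)) for s in range(pp_size)]
    pp_groups = [list(range(t, world_size, tp_size)) for t in range(tp_size)]
    return tp_groups, pp_groups


def node_of(rank, gpus_per_node):
    """Map a global rank to its node index, assuming contiguous rank numbering."""
    return rank // gpus_per_node


if __name__ == "__main__":
    tp_groups, pp_groups = build_groups(tp_size=4, pp_size=2)
    for group in tp_groups:
        nodes = sorted({node_of(r, gpus_per_node=4) for r in group})
        print(f"TP group {group} -> nodes {nodes}")
    for group in pp_groups:
        print(f"PP group {group} crosses nodes")
```

With four GPUs per node, each tensor-parallel group lands on a single node and only the pipeline groups cross the node boundary, matching the two-node layout described above.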

Pipeline-parallel stages were placed across nodes more deliberately, using InfiniBand-backed paths for inter-node communication. This allowed the serving architecture to reserve the fastest intra-node paths for tensor-parallel synchronization while still scaling the model across multiple machines.

We also tuned NCCL behavior for the target environment. The work included validating network interface selection, rank ordering, topology discovery, and collective algorithm behavior. The goal was not to blindly modify NCCL settings, but to confirm that communication was flowing through the intended high-performance paths.
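As an illustration of the kind of validation run this involves, the sketch below assembles NCCL environment variables for a diagnostic launch. The variables are documented NCCL settings, but the helper names and the interface and adapter names (`eth0`, `mlx5_0`) are placeholder assumptions. The right values depend on the cluster, and this is not the configuration used in the engagement.

```python
import os


def nccl_diagnostic_env(socket_ifname, ib_hca, topo_dump_path=None):
    """Build environment variables for a verbose NCCL validation run.

    NCCL_DEBUG and NCCL_DEBUG_SUBSYS make NCCL log its initialization,
    network selection, and topology graph decisions. The interface and
    HCA settings constrain which devices NCCL may use.
    """
    env = {
        "NCCL_DEBUG": "INFO",
        "NCCL_DEBUG_SUBSYS": "INIT,NET,GRAPH",
        "NCCL_SOCKET_IFNAME": socket_ifname,
        "NCCL_IB_HCA": ib_hca,
    }
    if topo_dump_path:
        # Ask NCCL to write the topology it detected to a file for inspection.
        env["NCCL_TOPO_DUMP_FILE"] = topo_dump_path
    return env


def apply_env(env, target=None):
    """Set variables on target (os.environ by default) without overriding existing ones."""
    if target is None:
        target = os.environ
    for key, value in env.items():
        target.setdefault(key, value)
    return target


if __name__ == "__main__":
    # Placeholder device names; replace with the cluster's actual devices.
    env = nccl_diagnostic_env("eth0", "mlx5_0", topo_dump_path="/tmp/nccl_topo.xml")
    for key, value in sorted(env.items()):
        print(f"{key}={value}")
```

Using `setdefault` keeps any value an operator has already exported, so the helper documents the intended settings without silently overriding them.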

Key implementation work included:

  • Profiling NCCL collectives across intra-node and inter-node paths
  • Measuring real bandwidth and latency across NVLink, NVSwitch, PCIe, and InfiniBand
  • Identifying which parts of the model were most sensitive to collective communication overhead
  • Pinning tensor-parallel groups within NVLink or NVSwitch domains
  • Placing pipeline-parallel stages across InfiniBand-connected nodes
  • Configuring Kubernetes placement rules so tightly coupled ranks were not scheduled across slow links
  • Validating RDMA and GPUDirect configuration for cross-node GPU communication
  • Tuning NCCL topology behavior, rank ordering, and network interface selection
  • Running production-like load tests using realistic prompt lengths, output lengths, and concurrency
  • Comparing end-to-end latency, token throughput, GPU utilization, and collective timing before and after the redesign

The work combined networking, Kubernetes scheduling, and model-serving architecture. The placement strategy had to understand both the hardware topology and the communication behavior of tensor and pipeline parallelism.
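A placement check of this kind can be quite small. The sketch below flags tensor-parallel groups whose ranks ended up on more than one node. The function name and the rank-to-node mapping format are assumptions for illustration, and in practice the mapping would come from the scheduler or the launcher.

```python
def tp_groups_spanning_nodes(rank_to_node, tp_groups):
    """Return (group, nodes) pairs for tensor-parallel groups that span nodes.

    rank_to_node maps each global rank to a node name. An empty result
    means every tensor-parallel group sits on a single node.
    """
    violations = []
    for group in tp_groups:
        nodes = {rank_to_node[rank] for rank in group}
        if len(nodes) > 1:
            violations.append((group, sorted(nodes)))
    return violations


if __name__ == "__main__":
    tp_groups = [[0, 1, 2, 3], [4, 5, 6, 7]]
    # Hypothetical placement in which rank 3 was scheduled onto the wrong node.
    placement = {0: "node-a", 1: "node-a", 2: "node-a", 3: "node-b",
                 4: "node-b", 5: "node-b", 6: "node-b", 7: "node-b"}
    for group, nodes in tp_groups_spanning_nodes(placement, tp_groups):
        print(f"TP group {group} spans nodes {nodes}")
```

Running a check like this before serving starts turns a silent performance regression into an explicit placement failure.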

Figure: Measured bandwidth hierarchy, ordered by decreasing relative bandwidth: NVLink/NVSwitch (intra-node, fastest), InfiniBand with RDMA (inter-node), PCIe (intra-node), generic cross-node network (slowest). The heaviest collective communication is placed on the fastest available paths.

Outcome

The interconnect stopped being the dominant bottleneck in the serving path.

By aligning the model's parallelism layout with the physical GPU network topology, the platform reduced communication stalls and improved effective GPU utilization. Tensor-parallel communication stayed on faster intra-node paths, while pipeline-parallel traffic crossed nodes in a more controlled and intentional way.

The client gained:

  • A topology-aware placement strategy for distributed inference
  • More efficient tensor-parallel and pipeline-parallel execution
  • Reduced time spent in GPU collective communication
  • Better utilization of NVLink, NVSwitch, InfiniBand, RDMA, and GPUDirect
  • Lower risk of GPUs sitting idle while waiting on data movement
  • Higher large-model serving throughput on the same hardware class
  • A repeatable benchmarking process for future GPU cluster changes
  • Clear guidance on when to scale up within a node versus scale out across nodes

Most importantly, the team now had a practical framework for distributed inference performance. Model size, GPU memory, parallelism strategy, and network topology were no longer treated as separate concerns. They were treated as one system.

Technologies

NVLink, NVSwitch, InfiniBand, RDMA, GPUDirect, NCCL, Tensor parallelism, Pipeline parallelism, Kubernetes, Topology-aware scheduling, GPU node pools, Distributed inference benchmarking
