The client was serving a large model that could not fit on a single GPU. To run the model in production, the serving stack used tensor parallelism across GPUs and pipeline parallelism across nodes.
The model was functional, but its performance was limited by communication overhead. GPUs spent too much time waiting on collective operations, cross-node transfers, and synchronization instead of doing useful compute.
We redesigned the networking, scheduling, and model placement strategy around the physical GPU interconnect topology. The goal was to make sure the most communication-heavy parts of the workload stayed on the fastest available paths, using NVLink and NVSwitch inside the node and InfiniBand with RDMA and GPUDirect across nodes.
The model was too large to fit on a single GPU, so it was distributed across multiple accelerators.
Tensor parallelism was used to split individual model operations across GPUs. This helped the model fit in memory and allowed large matrix operations to be distributed, but it also introduced frequent communication between GPUs during inference.
Pipeline parallelism was used to split groups of layers across multiple GPUs and nodes. This made it possible to scale the model beyond one machine, but it introduced stage-to-stage communication and scheduling dependencies.
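As a rough illustration of the stage-to-stage communication that pipeline parallelism introduces, the sketch below estimates how many bytes of activations cross one stage boundary per forward pass. The function name, the hidden size of 8192, the batch size, and the 2-byte element width are all illustrative assumptions, not figures from this engagement, and the estimate ignores any sharding of the activation across tensor-parallel ranks.

```python
def stage_transfer_bytes(batch_size: int,
                         tokens_per_sequence: int,
                         hidden_size: int,
                         bytes_per_element: int = 2) -> int:
    """Size of the activation tensor handed from one pipeline stage to the next.

    Assumes a dense [batch, tokens, hidden] activation and a 2-byte dtype
    (for example fp16 or bf16) unless bytes_per_element says otherwise.
    """
    return batch_size * tokens_per_sequence * hidden_size * bytes_per_element


# Hypothetical shapes: a prefill pass over 2048 tokens versus a single decode step.
prefill_bytes = stage_transfer_bytes(batch_size=8, tokens_per_sequence=2048, hidden_size=8192)
decode_bytes = stage_transfer_bytes(batch_size=8, tokens_per_sequence=1, hidden_size=8192)

print(f"prefill: {prefill_bytes / 2**20:.0f} MiB per stage boundary")
print(f"decode:  {decode_bytes / 2**10:.0f} KiB per stage boundary")
```

Under these assumed shapes, prefill transfers are dominated by bandwidth while decode transfers are small enough that per-message latency matters more, which is one reason the two phases can stress the inter-node path differently.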
The issue was not simply that the model was distributed. The issue was that the distributed layout did not match the physical network topology.
The existing placement strategy treated GPUs as mostly interchangeable. In practice, they were not. Some GPUs communicated over fast NVLink or NVSwitch paths, while others had to communicate over slower PCIe or cross-node network paths. When tightly coupled tensor-parallel ranks were placed across slower links, collective communication became part of the latency bottleneck.
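To make the cost of a poorly placed tensor-parallel group concrete, here is a minimal sketch of the textbook latency-bandwidth model for a ring all-reduce. It is not taken from the client's tooling: the function name is hypothetical, and the link bandwidth and latency figures are placeholder assumptions rather than measurements from this cluster.

```python
def ring_allreduce_seconds(message_bytes: float,
                           ranks: int,
                           bandwidth_bytes_per_s: float,
                           latency_s: float) -> float:
    """Estimate ring all-reduce time with a simple latency-bandwidth model.

    A ring all-reduce runs 2 * (ranks - 1) steps (a reduce-scatter phase
    followed by an all-gather phase), and each step moves one chunk of
    message_bytes / ranks over every link.
    """
    if ranks < 1:
        raise ValueError("ranks must be at least 1")
    if ranks == 1:
        return 0.0  # nothing to exchange with a single rank
    steps = 2 * (ranks - 1)
    chunk_bytes = message_bytes / ranks
    return steps * (latency_s + chunk_bytes / bandwidth_bytes_per_s)


# Placeholder link characteristics (assumptions for illustration only).
FAST_LINK = {"bandwidth_bytes_per_s": 300e9, "latency_s": 3e-6}
SLOW_LINK = {"bandwidth_bytes_per_s": 25e9, "latency_s": 10e-6}

message = 8 * 1024 * 1024  # hypothetical 8 MiB all-reduce payload
fast = ring_allreduce_seconds(message, ranks=8, **FAST_LINK)
slow = ring_allreduce_seconds(message, ranks=8, **SLOW_LINK)
print(f"fast link: {fast * 1e6:.1f} us, slow link: {slow * 1e6:.1f} us")
```

Because a tensor-parallel group repeats this collective many times per forward pass, even a modest per-call difference between link types compounds into visible end-to-end latency.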
The platform was experiencing several problems:
The client had enough GPU capacity on paper, but the interconnect had become the real bottleneck.
We rebuilt the distributed inference layout around the hierarchy of GPU communication.
The core principle was simple: keep the heaviest communication on the fastest links.
Inside each node, we prioritized NVLink and NVSwitch for high-bandwidth GPU-to-GPU communication. This was especially important for tensor parallelism because tensor-parallel ranks exchange intermediate results frequently during model execution.
Across nodes, we used InfiniBand with RDMA and GPUDirect to reduce latency and avoid unnecessary CPU memory copies during GPU-to-GPU transfers. These paths carried pipeline-parallel stage traffic and any other cross-node communication that could not be avoided.
The target architecture included:
This changed the infrastructure strategy from "find available GPUs" to "place model partitions according to how they communicate."
We began by measuring the actual communication behavior of the cluster.
Instead of assuming that all GPUs and nodes had equivalent performance, we benchmarked the interconnect paths directly. We measured NCCL collective performance across GPU pairs, NVLink domains, PCIe paths, and node boundaries. The profiling focused on the communication patterns that matter in distributed inference: all-reduce, all-gather, reduce-scatter, broadcast, point-to-point GPU transfers, and cross-node pipeline-stage transfers.
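When comparing collective timings across paths and group sizes, it helps to normalize them. The sketch below converts a measured time into the "bus bandwidth" figure defined in the nccl-tests performance notes, which scales the raw algorithm bandwidth by a per-collective factor so results can be compared against link speed regardless of rank count. The function and dictionary names are hypothetical, and the sample timing is invented for illustration, not a result from this cluster.

```python
# Correction factors from the nccl-tests definition of bus bandwidth,
# where n is the number of ranks taking part in the collective.
BUSBW_FACTORS = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
    "broadcast": lambda n: 1.0,
}


def bus_bandwidth_gb_s(collective: str, size_bytes: float,
                       seconds: float, ranks: int) -> float:
    """Convert a measured collective time into bus bandwidth in GB/s."""
    if seconds <= 0 or ranks < 1:
        raise ValueError("seconds must be positive and ranks at least 1")
    algorithm_bw = size_bytes / seconds / 1e9  # raw bytes moved per second
    return algorithm_bw * BUSBW_FACTORS[collective](ranks)


# Invented sample: a 1 GB all-reduce across 8 ranks that took 10 ms.
sample = bus_bandwidth_gb_s("all_reduce", size_bytes=1e9, seconds=0.010, ranks=8)
print(f"all_reduce bus bandwidth: {sample:.1f} GB/s")
```

Running the same conversion for each GPU pair, NVLink domain, and node boundary gives a table of comparable numbers that makes slow paths stand out.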
Once we understood the real bandwidth and latency characteristics of the cluster, we mapped the model's parallelism strategy onto the topology.
Tensor-parallel groups were pinned within NVLink or NVSwitch domains wherever possible. This kept the most frequent collective communication on the fastest available intra-node links.
Pipeline-parallel stages were placed across nodes more deliberately, using InfiniBand-backed paths for inter-node communication. This allowed the serving architecture to reserve the fastest intra-node paths for tensor-parallel synchronization while still scaling the model across multiple machines.
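The placement rule described above can be sketched as a rank layout in which tensor-parallel ranks are contiguous and never straddle a node, while pipeline stages advance across nodes. This is a simplified illustration, not the scheduler used in the engagement: it assumes each node is a single NVLink or NVSwitch domain, that the tensor-parallel size divides the GPUs per node evenly, and that there is no data-parallel dimension. All names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    rank: int        # global rank
    node: int        # index of the node hosting this rank
    local_gpu: int   # GPU index within that node
    pp_stage: int    # pipeline stage this rank belongs to
    tp_index: int    # position inside its tensor-parallel group


def build_layout(num_nodes: int, gpus_per_node: int,
                 tp_size: int, pp_size: int) -> list[Placement]:
    """Assign ranks so every tensor-parallel group sits inside one node."""
    world_size = tp_size * pp_size
    if world_size > num_nodes * gpus_per_node:
        raise ValueError("not enough GPUs for the requested tp_size * pp_size")
    if gpus_per_node % tp_size != 0:
        raise ValueError("tp_size must divide gpus_per_node to stay inside a node")
    return [
        Placement(
            rank=rank,
            node=rank // gpus_per_node,
            local_gpu=rank % gpus_per_node,
            pp_stage=rank // tp_size,   # consecutive ranks share a stage
            tp_index=rank % tp_size,
        )
        for rank in range(world_size)
    ]


def tp_groups_stay_on_one_node(layout: list[Placement]) -> bool:
    """Check that no tensor-parallel group (one per stage here) spans nodes."""
    nodes_by_stage: dict[int, set[int]] = {}
    for placement in layout:
        nodes_by_stage.setdefault(placement.pp_stage, set()).add(placement.node)
    return all(len(nodes) == 1 for nodes in nodes_by_stage.values())


# Hypothetical cluster: 2 nodes with 8 GPUs each, TP=8 inside a node, PP=2 across nodes.
layout = build_layout(num_nodes=2, gpus_per_node=8, tp_size=8, pp_size=2)
print(tp_groups_stay_on_one_node(layout))
```

Keeping the tensor-parallel dimension innermost means the only traffic that crosses the node boundary in this layout is the stage-to-stage hand-off, which is the traffic intended for the InfiniBand path.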
We also tuned NCCL behavior for the target environment. The work included validating network interface selection, rank ordering, topology discovery, and collective algorithm behavior. The goal was not to blindly modify NCCL settings, but to confirm that communication was flowing through the intended high-performance paths.
Key implementation work included:
The work combined networking, Kubernetes scheduling, and model-serving architecture. The placement strategy had to understand both the hardware topology and the communication behavior of tensor and pipeline parallelism.
The interconnect stopped being the dominant bottleneck in the serving path.
By aligning the model's parallelism layout with the physical GPU network topology, the platform reduced communication stalls and improved effective GPU utilization. Tensor-parallel communication stayed on faster intra-node paths, while pipeline-parallel traffic crossed nodes in a more controlled and intentional way.
The client gained:
Most importantly, the team now had a practical framework for distributed inference performance. Model size, GPU memory, parallelism strategy, and network topology were no longer treated as separate concerns. They were treated as one system.