When Size Matters: The Cool Kids' Guide to High-Performance Computing in the Cloud

10:00 - 10:30, 28th of May (Wednesday) 2025 / DEV ARCHITECTURE STAGE

When facing performance issues, it’s tempting to reach for a “bigger boat” (vertical scaling). But what do you do when you’ve already hit the limits of the largest available option? At that point you need to scale horizontally, and that is rarely as straightforward as it seems.

This case study covers a project in which we needed to interconnect several P5 instances, the highest-performance GPU-based servers available. We used 32 fiber-optic cards to reach a throughput of 3.2 Tb/s between servers. Each server houses 8 NVIDIA H100 graphics cards with a combined 600 GB of GPU memory. The final cluster is relatively small (1.8 TB of GPU memory, 576 CPUs, and 6 TB of RAM in total), but it's just the beginning of what we aim to achieve.
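
As a quick sanity check on the figures above, the following sketch derives what they imply per server and per network card. It is only back-of-the-envelope arithmetic: the server count is not stated in the abstract and is inferred here from the GPU memory totals, the 3.2 Tb/s is assumed to be spread evenly across the 32 cards, and all constant and function names are made up for illustration.

```python
# Figures quoted in the abstract (decimal units: 1 TB = 1000 GB, 1 Tb/s = 1000 Gb/s).
LINK_TOTAL_GBPS = 3200         # 3.2 Tb/s between servers
NETWORK_CARDS = 32             # fiber-optic cards carrying that throughput
GPUS_PER_SERVER = 8            # NVIDIA H100 cards per server
GPU_MEM_PER_SERVER_GB = 600    # combined GPU memory per server
CLUSTER_GPU_MEM_GB = 1800      # 1.8 TB of GPU memory in total
CLUSTER_CPUS = 576
CLUSTER_RAM_TB = 6


def implied_cluster_shape():
    """Derive per-server and per-card numbers from the quoted totals.

    Assumption: the server count is inferred, not taken from the abstract.
    """
    servers = CLUSTER_GPU_MEM_GB // GPU_MEM_PER_SERVER_GB
    return {
        "servers": servers,
        "gpus_total": servers * GPUS_PER_SERVER,
        # Assumption: throughput is divided evenly across the cards.
        "gbps_per_card": LINK_TOTAL_GBPS / NETWORK_CARDS,
        "cpus_per_server": CLUSTER_CPUS // servers,
        "ram_tb_per_server": CLUSTER_RAM_TB / servers,
    }


if __name__ == "__main__":
    for name, value in implied_cluster_shape().items():
        print(f"{name}: {value}")
```

Under these assumptions the totals are consistent with a three-server cluster: 24 GPUs overall, 100 Gb/s per card, 192 CPUs and 2 TB of RAM per server.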

LEVEL:

Basic Advanced Expert

TRACK:

Cloud DevOps Software Architecture

TOPICS:

AI Cloud FutureTrends ITarchitecture Kubernetes

Jacek Marmuszewski

Let’s Go DevOps