
Overview and Historical Context
- 1969 moon landing control center had 12 kiloflops of compute power.
- Early supercomputers like CDC 6600 had 3 megaflops.
- Modern devices (smartphones, laptops) now have teraflop-level power, millions of times faster than early supercomputers.
- The Top 500 list ranks the world's fastest supercomputers; current champion "El Capitan" offers 1.7 exaflops, occupying an entire server room floor.
- Example of Nasza Klasa, a Polish company that bought its own supercomputer for business use in 2008, highlighting early commercial HPC adoption.

Project Case Study: Cloud-based Supercomputer Deployment
- Client compute demand exceeded what a single P5 instance can deliver (roughly 16 petaflops per node).
- The deployment used AWS P5 instances: eight NVIDIA H100 GPUs, 640 GB of GPU memory, 32 fiber-optic network cards, and 3.2 Tbps of aggregate network bandwidth per instance.
- Challenge: Need to cluster nodes together efficiently with separate billing and dynamic scaling across organizations.
- The project combined AWS UltraCluster-style HPC networking with Kubernetes to manage business-specific features unique to each participating organization (see the sketch after this list).
- Hardware initially only in the US; successfully shipped to Europe for deployment.
- Worked closely with AWS and NVIDIA teams for custom physical infrastructure and software support.
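
As a rough illustration of how per-organization isolation, billing tags, and scaling limits might be wired into Kubernetes, the sketch below creates a labelled namespace and a GPU ResourceQuota per tenant using the official Python client. The namespace-per-organization model, the label key, and the quota values are assumptions for illustration, not details confirmed by the talk.

```python
"""Minimal sketch, assuming a namespace-per-organization model (not
confirmed by the talk): each tenant gets a labelled namespace for
cost allocation plus a quota that bounds its GPU scaling."""
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
core = client.CoreV1Api()

def onboard_organization(org: str, gpu_limit: int) -> None:
    # Namespace labelled so billing/chargeback tooling can group usage per org.
    core.create_namespace(
        client.V1Namespace(
            metadata=client.V1ObjectMeta(name=org, labels={"billing/org": org})
        )
    )
    # Quota caps how many GPUs this organization's workloads can request.
    core.create_namespaced_resource_quota(
        namespace=org,
        body=client.V1ResourceQuota(
            metadata=client.V1ObjectMeta(name=f"{org}-gpu-quota"),
            spec=client.V1ResourceQuotaSpec(
                hard={"requests.nvidia.com/gpu": str(gpu_limit)}
            ),
        ),
    )

onboard_organization("org-alpha", gpu_limit=8)   # hypothetical tenant
```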

Technical Challenges and Solutions
1. Network Latency and Placement Groups
- Latency from multiple hops between servers affects performance dramatically.
- AWS placement groups were used to co-locate instances on the same rack/switch and minimize latency (the "cluster" strategy was selected).
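
A minimal sketch of this step with boto3, assuming placeholder values for the region, placement group name, and AMI (none of these come from the project):

```python
"""Create a "cluster" placement group and launch GPU nodes into it so
they land on the same network spine, minimizing hops between servers."""
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # assumed region

# The "cluster" strategy packs instances onto the same rack/switch.
ec2.create_placement_group(GroupName="hpc-cluster-pg", Strategy="cluster")

# Launch the nodes into that placement group.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="p5.48xlarge",
    MinCount=2,
    MaxCount=2,
    Placement={"GroupName": "hpc-cluster-pg"},
)
```

One trade-off worth noting: cluster placement groups trade capacity flexibility for locality, so launches can fail when the target rack is full and usually need retry logic.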

2. Network Bandwidth Aggregation
- Each server has 32 network cards; combined via network trunking into a single high-speed interface.
- Elastic Fabric Adapter (EFA) driver installed to unify connections, yielding test speeds above 3.2 Tbps.
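
As a rough, hedged illustration (not the project's provisioning code), the sketch below launches a p5.48xlarge with an EFA interface on each of its 32 network cards so the EFA/libfabric stack can use them together; the subnet, security group, and AMI IDs are placeholders, and the DeviceIndex convention should be checked against current AWS EFA documentation:

```python
"""Request one EFA interface per network card on a p5.48xlarge."""
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # assumed region

SUBNET = "subnet-0123456789abcdef0"   # placeholder
SG = ["sg-0123456789abcdef0"]         # placeholder

# Card 0 carries the primary (IP-capable) EFA interface; the remaining
# 31 cards get additional EFA interfaces for RDMA traffic.
interfaces = [{"NetworkCardIndex": 0, "DeviceIndex": 0,
               "InterfaceType": "efa", "SubnetId": SUBNET, "Groups": SG}]
interfaces += [{"NetworkCardIndex": card, "DeviceIndex": 1,
                "InterfaceType": "efa", "SubnetId": SUBNET, "Groups": SG}
               for card in range(1, 32)]

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",            # placeholder AMI with EFA driver preinstalled
    InstanceType="p5.48xlarge",
    MinCount=1,
    MaxCount=1,
    NetworkInterfaces=interfaces,
    Placement={"GroupName": "hpc-cluster-pg"},  # from the previous step
)
```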

3. Memory Access and RDMA
- Used Remote Direct Memory Access (RDMA) technology to combine memory of multiple GPUs, creating a unified large memory space for easier and scalable computation.
- This avoids splitting computation awkwardly across individual GPUs.
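
A common way to exercise this GPU-to-GPU path from application code is PyTorch's NCCL backend, which on AWS runs over EFA via the aws-ofi-nccl plugin. The sketch below is a generic cross-node all-reduce under that assumption, not the project's actual workload:

```python
"""Cross-node GPU collective over an RDMA-capable fabric (NCCL).
Run with a launcher that sets the usual env vars, e.g.:
torchrun --nnodes=2 --nproc-per-node=8 allreduce_sketch.py"""
import os
import torch
import torch.distributed as dist

def main():
    # One process per GPU; with RDMA, NCCL moves tensors GPU-to-GPU
    # without staging them through host memory.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

    # Each rank holds a shard; the all-reduce makes the cluster behave
    # like one large pool of GPU memory for this reduction.
    shard = torch.ones(1024, 1024, device="cuda") * dist.get_rank()
    dist.all_reduce(shard, op=dist.ReduceOp.SUM)

    if dist.get_rank() == 0:
        print("sum of ranks per element:", shard[0, 0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```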

4. Data Storage and Access
- Data already in the cloud, but needed to load it efficiently onto the cluster.
- Options considered: NVMe drives (fast but volatile), EBS volumes (complex to manage), and S3 storage (slow but scalable).
- Chose "S3 Express One Zone" for very fast single-zone S3 access, improving data preloading performance.
- Learned that SDK/language choice impacts high-performance cloud storage access—optimized accordingly.
- Implemented data sharding and multi-instance parallel access to saturate S3 bandwidth.
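
A minimal sketch of the preloading pattern, assuming a hypothetical S3 Express One Zone directory bucket, key layout, and local scratch path (none taken from the project):

```python
"""Pre-load sharded data in parallel from an S3 Express One Zone bucket."""
from concurrent.futures import ThreadPoolExecutor
import boto3

BUCKET = "hpc-dataset--euw1-az1--x-s3"   # placeholder directory-bucket name
s3 = boto3.client("s3", region_name="eu-west-1")

def fetch_shard(key: str) -> str:
    """Download one shard to local NVMe scratch space."""
    local_path = "/scratch/" + key.replace("/", "_")
    s3.download_file(BUCKET, key, local_path)
    return local_path

# Shards are spread across key prefixes so many workers (and many
# instances) can pull from S3 in parallel without hot-spotting one prefix.
keys = [f"shard-{i:05d}/data.bin" for i in range(256)]

with ThreadPoolExecutor(max_workers=32) as pool:
    paths = list(pool.map(fetch_shard, keys))
print(f"preloaded {len(paths)} shards")
```

Running one such loader per instance, each with its own subset of shards, is what lets the cluster approach S3's aggregate bandwidth rather than the bandwidth of a single connection.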

Insights and Final Thoughts
- Cloud supercomputer deployment requires close hardware and software collaboration; significant groundwork needed before easy UI deployment.
- Data transfer and network infrastructure play critical roles in HPC cloud success.
- Highlighted that cloud is still "someone else’s computer" and that hardware limits cloud performance.
- Closed with a community engagement activity: attendees were invited to share their software engineering pain points in exchange for branded t-shirts.
- Emphasized improving developer productivity by providing secure, cloud-connected laptops without traditional IT overhead (e.g., printer setup).

Actionable Items/Tasks
- Consider clustered placement groups on AWS for minimizing latency in distributed HPC workloads.
- Use network trunking and Elastic Fabric Adapter (EFA) drivers to aggregate multiple network interfaces into a high-speed pipe.
- Employ RDMA technology to treat multiple GPUs as unified large memory for scalable compute.
- Optimize cloud storage access by choosing fast single-zone S3 and appropriate SDK/language.
- Structure and shard data to maximize parallel data access without bottlenecks.
- Engage engineering teams to share backlog pain points to guide future tooling and project ideas.
- Explore secure, pre-configured laptop deployment strategies to enhance remote developer productivity.

When Size Matters: The Cool Kids' Guide to High-Performance Computing in the Cloud


10:00 - 10:30, 28th of May (Wednesday) 2025 / DEV ARCHITECTURE STAGE

When facing performance issues, it’s easy to be tempted to choose a “bigger boat” (vertical scaling). However, what do you do when you’ve already reached the limits of the largest available option? In that case, you need to consider horizontal scaling. Yet this approach may not be as straightforward as it seems.

This case study focuses on a project where we needed to connect several P5 instances, the highest-performance GPU-based servers available. We utilized 32 fiber-optic cards, establishing a robust connection with an impressive throughput of 3.2 Tb/s between servers. Each server houses 8 NVIDIA H100 graphics cards with 640 GB of GPU memory. Although the final cluster is relatively small—totaling roughly 1.9 TB of GPU memory, 576 CPUs, and 6 TB of RAM—it's just the beginning of what we aim to achieve.

LEVEL: Basic / Advanced / Expert
TRACK: Cloud / DevOps / Software Architecture
TOPICS: AI, Cloud, FutureTrends, ITarchitecture, Kubernetes

Jacek Marmuszewski

Let’s Go DevOps