
Importance of Memory in AI/ML
- Growing AI model sizes: from 175 billion parameters (GPT-3) to trillion-plus-parameter models.
- GPU architecture divides into processing cores and high-bandwidth memory (HBM).
- Memory is 40-50% of GPU cost and often a bottleneck, causing GPUs to be idle 50-60% of the time.
- Memory limits model size and speed; optimizing memory usage reduces costs and increases GPU utilization.
- Netflix's highly personalized recommendations are a real-world example of workloads that demand efficient model deployment.

Memory Usage and Model Components
- Model weights are typically stored as 32-bit floating-point numbers (4 bytes per parameter); a back-of-the-envelope example follows this list.
- Larger models therefore require large amounts of memory (gigabytes to terabytes).
- Training requires additional memory for gradients and activations during forward and backward propagation.
- Inference also consumes significant memory.
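
A quick back-of-the-envelope sketch of the weight memory alone (gradients, optimizer state, and activations come on top of this); the parameter counts below are illustrative:

```python
def weight_memory_gb(num_params: int, bytes_per_param: int = 4) -> float:
    """Memory needed just to store the model weights, in gigabytes."""
    return num_params * bytes_per_param / 1e9

# GPT-3-scale model: 175 billion parameters.
print(weight_memory_gb(175_000_000_000))     # ~700 GB at 32-bit (float32) precision
print(weight_memory_gb(175_000_000_000, 1))  # ~175 GB at 8-bit (int8) precision
```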

Benefits of Memory Optimization
- Fewer GPUs needed, saving cost.
- Faster inference times (up to 40% speed improvement).
- Lower energy consumption for sustainability.
- Enables deployment on edge devices (e.g., on-device assistants such as Apple's Siri on mobile phones).
- Enables privacy-preserving federated learning for sensitive data (e.g., medical).

Techniques for Memory Optimization

1. Quantization
- Converting 32-bit floating-point parameters to 8-bit integers reduces model size by ~4x.
- Two types: post-training quantization and quantization-aware training (QAT).
- Post-training quantization is simpler but may reduce accuracy more.
- QAT requires retraining but maintains better accuracy.
- Quantization can be uniform or non-uniform (dynamic range selection).
- Netflix uses PyTorch, where quantization requires only small code changes (a minimal example is sketched below).
- Results show up to 2.5x inference speedup with minor accuracy loss that can be fine-tuned.

Actionable Task: Apply quantization (post-training or QAT) to models to reduce size and improve inference speed, followed by fine-tuning to recover accuracy.
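
A minimal sketch of post-training dynamic quantization in PyTorch; the model and the choice of layers to quantize are illustrative assumptions, not Netflix's actual code:

```python
import torch
import torch.nn as nn

# Illustrative model: a small feed-forward network standing in for a real production model.
model = nn.Sequential(
    nn.Linear(1024, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)
model.eval()

# Post-training dynamic quantization: weights of nn.Linear layers are stored as int8
# (~4x smaller than float32); activations are quantized on the fly at inference time.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model,
    {nn.Linear},        # layer types to quantize
    dtype=torch.qint8,  # 8-bit integer weights
)

# The quantized model keeps the same inference API as the original.
x = torch.randn(1, 1024)
with torch.no_grad():
    out = quantized_model(x)
print(out.shape)
```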

2. Model Pruning
- Removing unnecessary or zero-value parameters.
- Two types: unstructured pruning (zeroing out individual weights, typically by magnitude) and structured pruning (removing entire neurons, channels, or layers).
- Structured pruning is generally preferred because it shrinks the actual matrix dimensions and therefore speeds up matrix multiplication (a minimal example is sketched below).
- Benefits include memory reduction and inference speed increase.
- Netflix results show pruning with fine-tuning achieves accuracy retention/improvement and speed gains.

Actionable Task: Implement structured pruning followed by fine-tuning to minimize model size while maintaining accuracy and speeding up inference.
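
A minimal sketch of structured pruning using PyTorch's pruning utilities; the model and the 30% pruning ratio are illustrative assumptions:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Illustrative model standing in for a real production network.
model = nn.Sequential(
    nn.Linear(1024, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)

# Structured pruning: zero out 30% of the output neurons (rows of the weight
# matrix) of the first linear layer, ranked by L2 norm.
prune.ln_structured(model[0], name="weight", amount=0.3, n=2, dim=0)

# Fold the pruning mask into the weight tensor permanently.
prune.remove(model[0], "weight")

# Note: this zeroes the rows in place; realizing the full speedup requires
# physically removing the zeroed neurons (or exporting to a runtime that does).
# Fine-tune on the original training data afterwards to recover accuracy.
```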

3. Knowledge Distillation
- Large "teacher" model trains smaller "student" model to mimic it.
- Student learns "dark knowledge": the teacher's full probability distributions, not just its final outputs (a minimal distillation loss is sketched below).
- Enables smaller models with good accuracy (retain ~95% accuracy), much faster inference (3-10x).
- Allows diverse student architectures, not necessarily same as teacher.
- Used in smaller GPT models and other state-of-the-art compressed models.

Actionable Task: Use knowledge distillation to train smaller models from large models for deployment with high accuracy and speed improvements.
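
A minimal sketch of a distillation loss in PyTorch, blending the teacher's softened probability distribution (the "dark knowledge") with the usual hard-label loss; the temperature and weighting values are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Weighted sum of soft-target (teacher) loss and hard-label cross-entropy."""
    # Soft targets: KL divergence between the student's and teacher's
    # temperature-softened distributions; scaling by T*T keeps gradients comparable.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Usage inside a training step (teacher frozen, student being trained):
# with torch.no_grad():
#     teacher_logits = teacher(batch)
# student_logits = student(batch)
# loss = distillation_loss(student_logits, teacher_logits, labels)
# loss.backward(); optimizer.step()
```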

Other Considerations
- Batch size trades off memory against training time; dynamic batch sizing adapts memory usage to the available hardware (a simple adaptive scheme is sketched after this list).
- Hardware characteristics matter: GPUs are typically used for training, while inference is often run on CPUs for cost efficiency.
- Real-world success stories at companies such as Google (Pixel), Microsoft (distilled BERT models), and Berkeley (SqueezeNet).
- Combining pruning and quantization yields best size reduction (~75%).
- Quantization best for inference speed; distillation best for accuracy retention.
- Quantization has lowest implementation complexity; distillation and pruning require more training/retraining.
- Emerging research areas include neural architecture search and sparse computation hardware to optimize pruning benefits.
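
One simple way to size batches dynamically is to back off on out-of-memory errors; a minimal sketch, where the model, starting batch size, and halving strategy are illustrative assumptions:

```python
import torch
import torch.nn as nn

def find_max_batch_size(model, input_dim, start=1024, device="cuda"):
    """Halve the batch size until a forward/backward pass fits in GPU memory."""
    batch_size = start
    while batch_size >= 1:
        try:
            x = torch.randn(batch_size, input_dim, device=device)
            model(x).sum().backward()
            model.zero_grad(set_to_none=True)
            return batch_size
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            batch_size //= 2
    raise RuntimeError("Even a batch size of 1 does not fit in memory")

# Example (assumes a CUDA device is available):
# model = nn.Linear(4096, 4096).to("cuda")
# print(find_max_batch_size(model, input_dim=4096))
```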

Summary Table for Choosing Techniques

| Priority | Recommended technique |
| --- | --- |
| Size reduction | Pruning + quantization |
| Inference speed | Quantization alone |
| Accuracy critical | Knowledge distillation |
| Development time | Quantization (minimal retraining) |

---

Key Actionable Items:

- Evaluate existing ML models for memory usage and potential compression.
- Implement post-training quantization first; switch to quantization-aware training if accuracy suffers.
- Explore structured pruning to remove redundant parameters, followed by fine-tuning.
- Consider knowledge distillation when accuracy retention is critical, especially for deploying lightweight models.
- Adjust batch size dynamically based on available hardware for optimal training efficiency.
- Select appropriate hardware (CPU, GPU, TPU) for training and inference according to workload and cost.
- Monitor emerging research and hardware support for sparse computation and neural architecture search to improve models further.

Memory Optimizations for Machine Learning


13:20 - 13:50, 28th of May (Wednesday) 2025 / DEV ARCHITECTURE STAGE

As Machine Learning continues to forge its way into diverse industries and applications, optimizing computational resources, particularly memory, has become a critical aspect of effective model deployment. This session, "Memory Optimizations for Machine Learning," aims to offer an exhaustive look into the specific memory requirements in Machine Learning tasks, including Large Language Models (LLMs), and the cutting-edge strategies to minimize memory consumption efficiently.
We'll begin by demystifying the memory footprint of typical Machine Learning data structures and algorithms, elucidating the nuances of memory allocation and deallocation during model training phases. The talk will then focus on memory-saving techniques such as data quantization, model pruning, and efficient mini-batch selection. These techniques offer the advantage of conserving memory resources without significant degradation in model performance.
A special emphasis will be placed on the memory footprint of LLMs during inferencing. LLMs, known for their immense size and complexity, pose unique challenges in terms of memory consumption during deployment. We will explore the factors contributing to the memory footprint of LLMs, such as model architecture, input sequence length, and vocabulary size. Additionally, we will discuss practical strategies to optimize memory usage during LLM inferencing, including techniques like model distillation, dynamic memory allocation, and efficient caching mechanisms.
By the end of this session, attendees will have a comprehensive understanding of memory optimization techniques for Machine Learning, with a particular focus on the challenges and solutions related to LLM inferencing.

TRACK:
Cloud, DevOps, Software Architecture
TOPICS:
ML

Tejas Chopra

Netflix