Foundation Model Training
From Infrastructure to Evaluation, Debugging & Optimization
Virtual Summit, April 30th
Join us for an exclusive technical summit where leading foundation-model researchers and practitioners converge to tackle real-world challenges in foundation model training.
This immersive event bridges theory and practice, giving AI researchers and practitioners who train foundation models a rare opportunity to exchange battle-tested approaches to infrastructure scaling, model-internals debugging, evaluation, and pre-training optimization.
Focus Areas
1) Infrastructure Debugging & Monitoring
- Diagnosing performance bottlenecks in multi-GPU / multi-node setups
- Instrumenting pipelines for deep observability (profiling GPU utilization, data flow, etc.)
- Correlating infrastructure metrics with model states such as loss and gradients in real time (see the sketch after this list)
- Failure detection and recovery strategies in distributed or HPC environments
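To give a flavor of this focus area, here is a minimal sketch (illustrative only, not taken from any speaker's material) of correlating infrastructure metrics with model state: each training step logs the loss, the global gradient norm, and allocated GPU memory side by side. The tiny model, synthetic batch, and print-based logging are placeholder assumptions; in practice these signals would be sent to a metrics backend.

    import torch
    import torch.nn as nn

    def log_step(step, loss, model):
        # Global gradient norm across all parameters that received gradients.
        grad_norm = torch.stack(
            [p.grad.norm() for p in model.parameters() if p.grad is not None]
        ).norm().item()
        # Allocated GPU memory, reported next to model-side signals for correlation.
        mem_gb = torch.cuda.memory_allocated() / 1e9 if torch.cuda.is_available() else 0.0
        print(f"step={step} loss={loss.item():.4f} grad_norm={grad_norm:.3f} gpu_mem_gb={mem_gb:.2f}")

    model = nn.Linear(512, 512)  # placeholder for a real foundation model
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for step in range(3):
        x = torch.randn(8, 512)  # placeholder batch
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        log_step(step, loss, model)
        opt.step()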
2) Model Internals & Debugging
- Techniques for analyzing attention and activation patterns (layer-by-layer visualizations)
- Identifying and fixing gradient issues such as vanishing, exploding, or partially inactive gradients (see the sketch after this list)
- Debugging architectural or layer-level bottlenecks
- Leveraging interpretability to guide early-phase debugging (during pre-training)
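As a small illustration of the gradient-debugging item above (a sketch under simplified assumptions, not the presenters' code), the snippet below prints per-layer gradient norms after a backward pass and flags values that look vanishing or exploding; the toy MLP and the thresholds are placeholders.

    import torch
    import torch.nn as nn

    # Placeholder network; a real run would inspect a transformer's named parameters.
    model = nn.Sequential(
        nn.Linear(64, 64), nn.Tanh(),
        nn.Linear(64, 64), nn.Tanh(),
        nn.Linear(64, 1),
    )
    x = torch.randn(32, 64)
    loss = model(x).pow(2).mean()
    loss.backward()

    # Layer-by-layer gradient norms; the thresholds are illustrative, not canonical.
    for name, p in model.named_parameters():
        if p.grad is not None:
            norm = p.grad.norm().item()
            flag = "  <-- suspicious" if norm < 1e-6 or norm > 1e2 else ""
            print(f"{name:12s} grad_norm={norm:.2e}{flag}")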
3) Evaluation
- Designing targeted test sets and adversarial evaluations for foundation models
- Error analysis frameworks to uncover overlooked failures or biases (see the sketch after this list)
- Establishing benchmarks for generalization, robustness, and emergent capabilities
- Integrating evaluation signals back into hyperparameter tuning and model iteration
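To make the error-analysis item concrete, here is a minimal, framework-free sketch of slice-based evaluation: predictions are grouped by a metadata tag and accuracy is reported per slice, so failures hidden by the aggregate score become visible. The slices and records are entirely synthetic assumptions.

    from collections import defaultdict

    # Synthetic evaluation records: each has a slice tag and a correctness flag.
    examples = [
        {"slice": "short", "correct": True},
        {"slice": "short", "correct": True},
        {"slice": "long",  "correct": False},
        {"slice": "long",  "correct": True},
        {"slice": "code",  "correct": False},
    ]

    per_slice = defaultdict(lambda: [0, 0])  # slice -> [num_correct, num_total]
    for ex in examples:
        per_slice[ex["slice"]][0] += int(ex["correct"])
        per_slice[ex["slice"]][1] += 1

    overall = sum(c for c, _ in per_slice.values()) / len(examples)
    print(f"overall accuracy: {overall:.2f}")
    for name, (c, t) in sorted(per_slice.items()):
        print(f"slice={name:6s} accuracy={c / t:.2f} (n={t})")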
4) Pre-Training Optimization
- Hyperparameter optimization at foundation-model scale (e.g., population-based training)
- Data pipeline throughput (streaming, multi-threaded I/O, sharding)
- Memory-saving strategies for large context windows, such as activation checkpointing and gradient sharding (see the sketch after this list)
- Accelerating convergence (curriculum learning, dynamic batching, advanced scheduling)
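As a taste of the memory-saving strategies mentioned above, the sketch below applies PyTorch's activation checkpointing to a toy feed-forward block: intermediate activations inside the checkpointed function are recomputed during the backward pass instead of being stored, trading compute for memory. The block and tensor sizes are illustrative assumptions, not a recommended configuration.

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    # Placeholder feed-forward block; in practice this would be a transformer layer.
    block = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))

    x = torch.randn(16, 256, requires_grad=True)
    # Activations inside `block` are recomputed on backward rather than cached;
    # use_reentrant=False is the mode recommended in recent PyTorch releases.
    y = checkpoint(block, x, use_reentrant=False)
    y.pow(2).mean().backward()
    print("gradient shape after checkpointed backward:", tuple(x.grad.shape))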
Event Schedule
12:15 PM to 12:45 PM ET
Turbocharging Foundation Model Training: Cutting-Edge Strategies for Faster Convergence
Rahul Raja, Staff Software Engineer
12:45 PM to 1:15 PM ET
Scaling Down, Powering Up: Can Efficient Training Beat Scaling Laws?
Malikeh Ehghaghi, Machine Learning Research Scientist
1:15 PM to 1:45 PM ET
Fireside Chat on the State of FM Training Report
Paulina Prachnio, Co-Founder | Kilian Kluge, CRO Editor
1:45 PM to 2:15 PM ET
Training Arctic Embed -- Text Embedding Models For Search
Luke Merrick, AI Research