Foundation Model Training

From Infrastructure to Evaluation, Debugging & Optimization

Virtual Summit, April 30th

Join us for an exclusive technical summit where leading researchers and practitioners converge to tackle real-world challenges in foundation model training.

This immersive event bridges theory and practice, offering AI researchers and practitioners who train foundation models a rare opportunity to exchange battle-tested approaches to infrastructure scaling, model-internals debugging, evaluation, and optimization.

Focus Areas

1) Infrastructure Debugging & Monitoring
  • Diagnosing performance bottlenecks in multi-GPU / multi-node setups
  • Instrumenting pipelines for deep observability (profiling GPU utilization, data flow, etc.)
  • Correlating infrastructure metrics with model states (loss, gradients) in real time (see the sketch after this list)
  • Failure detection and recovery strategies in distributed or HPC environments
2) Model Internals & Debugging
  • Techniques for analyzing attention and activation patterns (layer-by-layer visualizations)
  • Identifying and fixing gradient issues (vanishing, exploding, partial inactivity)
  • Debugging architectural or layer-level bottlenecks
  • Leveraging interpretability to guide early-phase debugging (during pre-training)
3) Evaluation
  • Designing targeted test sets and adversarial evaluations for foundation models
  • Error analysis frameworks to uncover overlooked failures or biases
  • Establishing benchmarks for generalization, robustness, and emergent capabilities
  • Integrating evaluation signals back into hyperparameter tuning and model iteration
4) Pre-Training Optimization
  • Hyperparameter optimization at foundation-model scale (e.g., population-based training)
  • Data pipeline throughput (streaming, multi-threaded I/O, sharding)
  • Memory-saving strategies for large context windows (activation checkpointing, gradient sharding)
  • Accelerating convergence (curriculum learning, dynamic batching, advanced scheduling)
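
To make the "correlating infrastructure metrics with model states" item under Infrastructure Debugging & Monitoring concrete, here is a minimal PyTorch sketch (not drawn from any of the talks) that logs loss, gradient norm, and GPU memory side by side at every step; the model, optimizer, and logging destination are placeholders.

    # Minimal sketch: log loss, gradient norm, and GPU memory together at every step
    # so infrastructure metrics can be correlated with model state in real time.
    # The model, optimizer, and logging destination are placeholders.
    import torch
    import torch.nn as nn

    def grad_global_norm(model: nn.Module) -> float:
        # L2 norm over all parameter gradients (0.0 if nothing has a gradient yet).
        norms = [p.grad.norm() for p in model.parameters() if p.grad is not None]
        return torch.norm(torch.stack(norms)).item() if norms else 0.0

    def train_step(model, batch, targets, optimizer, loss_fn, step):
        optimizer.zero_grad()
        loss = loss_fn(model(batch), targets)
        loss.backward()
        optimizer.step()

        # One record per step: model state next to infrastructure state.
        record = {
            "step": step,
            "loss": loss.item(),
            "grad_norm": grad_global_norm(model),
            "gpu_mem_mb": torch.cuda.memory_allocated() / 2**20 if torch.cuda.is_available() else 0.0,
        }
        print(record)  # swap for the experiment tracker of your choice
        return record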

Event Schedule

12:10 PM to 12:15 PM ET

Opening Remarks

12:15 PM to 12:45 PM ET

Turbocharging Foundation Model Training: Cutting-Edge Strategies for Faster Convergence

Rahul Raja, Staff Software Engineer

12:45 PM to 1:15 PM ET

Scaling Down, Powering Up: Can Efficient Training Beat Scaling Laws

Malikeh Ehghaghi, Machine Learning Research Scientist

1:15 PM to 1:45 PM ET

Fireside Chat on the State of FM Training Report

Paulina Prachnio, Co-Founder | Kilian Kluge, Editor

1:45 PM to 2:15 PM ET

Training Arctic Embed -- Text Embedding Models For Search

Luke Merrick, AI Research

2:15 PM to 2:45 PM ET

Pretraining on AMD MI300X using ScalarLM

Greg Diamos, Founder, MLCommons

Event Speakers

Rahul Raja

Staff Software Engineer, LinkedIn

Talk: Turbocharging Foundation Model Training: Cutting-Edge Strategies for Faster Convergence

Malikeh Ehghaghi

Machine Learning Research Scientist, Vector Institute

Talk: Scaling Down, Powering Up: Can Efficient Training Beat Scaling Laws

Paulina Prachnio

Co-Founder, neptune.ai

Talk: Fireside Chat on the State of FM Training Report

Kilian Kluge

Editor, neptune.ai

Talk: Fireside Chat on the State of FM Training Report

Greg Diamos

Founder, MLCommons

Talk: Pretraining on AMD MI300X using ScalarLM

Luke Merrick

AI Research, Snowflake

Talk: Training Arctic Embed -- Text Embedding Models For Search

Join Our Community

Our goal is to provide an open, inclusive community of ML practitioners who share projects, best practices, and case studies. Join our open group to meet the community and share your work with practitioners from around the world.

Talk: Turbocharging Foundation Model Training: Cutting-Edge Strategies for Faster Convergence

Presenter:
Rahul Raja, Staff Software Engineer, LinkedIn

About the Speaker:
Rahul Raja is a Staff Engineer at LinkedIn, specializing in search, machine learning infrastructure, and recommender systems. With expertise in vector search, AI-powered recommendations, and large-scale ML systems, he has contributed to research in LLMs, NLP, and generative AI. Rahul has reviewed for top AI conferences, including ICLR and NAACL, and has spoken at events like Meta’s Systems @Scale conference.

Talk Track: Pre-Training Optimization

Talk Technical Level: 5

Talk Abstract:
Training large-scale foundation models is a resource-intensive process that often requires extensive time and computational power. In this session, we will explore advanced techniques to significantly reduce the training time and accelerate convergence without sacrificing model quality. From dynamic batching and mixed precision training to curriculum learning and gradient accumulation, we will cover a variety of strategies designed to optimize training workflows. Attendees will also learn how cutting-edge techniques like adaptive optimization, efficient data pipelines, and distributed training frameworks can streamline the process. This session aims to provide actionable insights and best practices to help researchers and practitioners improve their model training efficiency, ultimately leading to faster development cycles and more scalable foundation models.

What You’ll Learn
– Advanced techniques for faster convergence (e.g., dynamic batching, mixed precision).
– Optimized workflows with curriculum learning and adaptive optimization.
– Insights on distributed training (Horovod, DeepSpeed).
– Memory optimization via activation checkpointing and gradient compression.
– Efficient data pipeline strategies for quicker training.
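
As a concrete illustration of two of the techniques listed above, here is a minimal PyTorch sketch combining automatic mixed precision (torch.cuda.amp) with gradient accumulation; the model, data loader, and accum_steps value are illustrative placeholders, not the speaker's implementation.

    # Minimal sketch: automatic mixed precision (torch.cuda.amp) combined with
    # gradient accumulation. The model, data loader, and accum_steps are placeholders.
    import torch

    def train_epoch(model, loader, optimizer, loss_fn, accum_steps=4):
        scaler = torch.cuda.amp.GradScaler()
        model.train()
        optimizer.zero_grad()
        for i, (batch, targets) in enumerate(loader):
            # Forward pass and loss in mixed precision.
            with torch.cuda.amp.autocast():
                loss = loss_fn(model(batch), targets) / accum_steps  # scale for accumulation
            # Gradients accumulate across micro-batches on the scaled loss.
            scaler.scale(loss).backward()
            # Step the optimizer only every accum_steps micro-batches.
            if (i + 1) % accum_steps == 0:
                scaler.step(optimizer)
                scaler.update()
                optimizer.zero_grad()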

Talk: Scaling Down, Powering Up: Can Efficient Training Beat Scaling Laws

Presenter:
Malikeh Ehghaghi, Machine Learning Research Scientist, Vector Institute

About the Speaker:
Malikeh is a machine learning researcher at the Vector Institute, where she works under the supervision of Professor Colin Raffel. She is a bilingual researcher fluent in Farsi and English who immigrated to Canada in 2019. She earned an MScAC degree in Computer Science from the University of Toronto and has over five years of industry research experience at companies such as Winterlight Labs, Cambridge Cognition, and Arcee AI. Her work focuses on Modular ML, Model Merging, Efficient LLMs, and Interpretability and Fairness, with publications in top conferences like EMNLP, COLING, ACL, MICCAI, and AAAI. Beyond AI, Malikeh is passionate about psychology, history, and politics. She finds inspiration in music and nature and plays the Santur, a traditional Persian instrument.

Talk Track: Pre-Training Optimization

Talk Technical Level: 5

Talk Abstract:
The traditional belief that scaling up model parameters and data volume is the sole path to enhanced performance in large language models (LLMs) is being challenged by innovative strategies that prioritize efficiency and cost-effectiveness. DeepSeek’s success story serves as a testament to the potential of thoughtful data engineering and meticulous model design in achieving superior AI performance without incurring prohibitive costs. This presentation delves into an overview of state-of-the-art data-centric and model-centric strategies for training language models, aiming to achieve optimal performance at minimal costs. We first talk about the rise of small language models (SLMs) as cost-efficient alternatives to dense large language models. On the data-centric side, we explore techniques such as data mixing, filtering, or deduplication to enhance dataset quality. On the model-centric front, we cover advanced approaches including pruning, distillation, parameter-efficient finetuning, quantization, and model merging to streamline model architectures without compromising performance. Together, these approaches demonstrate that strategic data preparation and model training can produce superior language models without the massive financial investments traditionally considered necessary for scaling AI systems.

What You’ll Learn
Key topics and areas in the efficient training of large language models
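
As one concrete example of the data-centric techniques the abstract mentions, here is a minimal sketch of exact-match deduplication via normalization and hashing; production pipelines typically layer fuzzy methods such as MinHash on top, and the normalization rules here are illustrative only.

    # Minimal sketch: exact-match deduplication by hashing normalized documents.
    # Real pipelines usually add fuzzy methods (e.g., MinHash) on top of this;
    # the normalization rules here are illustrative only.
    import hashlib

    def normalize(text):
        # Collapse whitespace and lowercase so trivial variants hash identically.
        return " ".join(text.lower().split())

    def deduplicate(docs):
        seen = set()
        for doc in docs:
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                yield doc

    corpus = ["The cat sat.", "the  cat   sat.", "A different document."]
    print(list(deduplicate(corpus)))  # ['The cat sat.', 'A different document.']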

Talk: Fireside Chat on the State of FM Training Report

Presenters:
Paulina Prachnio, Co-Founder, neptune.ai | Kilian Kluge, Editor, neptune.ai

About the Speakers:
Paulina is a co-founder at neptune.ai, where she collaborates with dozens of top AI research organizations to deeply understand their workflows, infrastructure challenges, and evolving needs in foundation model training. With over seven years in the MLOps space, she ensures that Neptune addresses the needs of teams training their own foundation models, helping them monitor training stability in real time, debug issues quickly, and analyze training jobs at scale.

Kilian Kluge is the developmental and technical editor for Neptune’s blog, covering generative AI, ML engineering, and current academic research.

Talk Track:
TBA

Talk Technical Level: 3

Talk Abstract:
Join Paulina Prachnio, co-founder and CRO at neptune.ai, and Kilian Kluge, editor of the State of FM Training Report, for a focused discussion on the challenges and best practices of training foundation models across teams of all sizes. This session will examine the report’s key findings, including why companies decide to train foundation models, best practices in hiring for foundation model teams, and current trends like on-premise training infrastructure and models for modalities beyond text and images. A must-attend for those navigating the technical and organizational complexities of building foundation models.

What You’ll Learn
TBA

Talk: Pretraining on AMD MI300X using ScalarLM

Presenter:
Greg Diamos, Founder, MLCommons

About the Speaker:
Greg is a founder of MLPerf™, the industry standard benchmark for deep learning performance, and several AI startups including Lamini. He was a founding engineer at Baidu’s Silicon Valley AI Lab (SVAIL), where he co-invented and wrote the framework for the first 1,000-GPU CUDA training cluster and trained the first deep learning speech and language model deployed in production to billions of users. At Baidu, he discovered deep learning scaling laws, authoring the first paper on the topic two years before OpenAI and helping motivate the development of LLMs. Greg has been a CUDA SW Architect at NVIDIA and an AMD Fellow. Greg holds a PhD from Georgia Tech.

Talk Track: Pre-Training Optimization

Talk Technical Level: 5

Talk Abstract:
The emergence of large language models (LLMs) has revolutionized AI, but hardware-specific optimization remains a significant challenge. In this talk, Greg Diamos shares his experience building ScalarLM, a novel framework that unifies training and inference workloads for LLMs on AMD MI300X GPUs. Diamos discusses the unique architectural considerations that influenced ScalarLM’s design, highlighting how the framework leverages the MI300X’s high memory bandwidth and compute density to achieve superior performance for both training and serving scenarios. The presentation covers ScalarLM’s innovative memory management techniques, its dynamic kernel fusion approach, and the custom CDNA3 architecture optimizations that enable efficient scaling from single-GPU deployments to multi-node clusters. Diamos also addresses the challenges encountered during development, including HIP programming model adaptations and workload-specific tuning, while providing quantitative performance comparisons against existing frameworks. This talk offers valuable insights for researchers and engineers working to optimize LLM workloads across diverse hardware platforms.

What You’ll Learn
Learn how to train reasoning models on AMD MI300X GPUs
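
As a small companion to this talk (ScalarLM's own APIs are not reproduced here), the sketch below simply checks that the installed PyTorch build targets ROCm/HIP and can see an accelerator before a pretraining job is launched on MI300X-class hardware.

    # Minimal sanity check (not ScalarLM-specific): confirm the installed PyTorch
    # build targets ROCm/HIP and can see an accelerator before launching a job.
    import torch

    def describe_accelerator():
        # torch.version.hip is a version string on ROCm builds and None otherwise.
        if torch.version.hip:
            backend = "ROCm/HIP"
        elif torch.version.cuda:
            backend = "CUDA"
        else:
            backend = "CPU-only"
        print(f"PyTorch {torch.__version__}, backend: {backend}")
        if torch.cuda.is_available():  # the torch.cuda API also covers ROCm devices
            for i in range(torch.cuda.device_count()):
                print(f"  device {i}: {torch.cuda.get_device_name(i)}")
        else:
            print("  no GPU visible to this process")

    if __name__ == "__main__":
        describe_accelerator()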

Talk: Training Arctic Embed -- Text Embedding Models For Search

Presenter:
Luke Merrick, AI Research, Snowflake

About the Speaker:
Luke is a machine learning practitioner with a passion for transforming the bleeding edge of AI research into practical tools. His work at Snowflake focuses on applying advances in deep learning to improve information retrieval workflows through projects like the Snowflake Arctic Embed model family.

Talk Track: Pre-Training Optimization

Talk Technical Level: 4

Talk Abstract:
Key Audience Takeaways:
– Basic understanding of embedding models, contrastive training, and differences from generative LMs
– Best practices for training high-quality text embedding models
– Concrete case studies building practical model training systems

What You’ll Learn
Outline:
– Introduction – Text embedding models, their rise to prominence in search, and how neural search works in general.
– Data-driven ML for text embedding models – Challenges in creating high-quality datasets. Data collection, filtering, and composition. Data diversity and source stratification.
– Training – Two-stage training process: contrastive pretraining and contrastive fine-tuning, our data-parallel training implementations in pure PyTorch and via DeepSpeed ZeRO stage 1.
– Iterative improvements – Importance of hard-negative mining in fine-tuning dataset construction. Ablation study results. Dimensionality reduction of embedding output for downstream efficiency.
– Q&A
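
To ground the contrastive-training piece of the outline above, here is a minimal sketch of an in-batch-negative contrastive (InfoNCE) loss in PyTorch; it illustrates the general technique only, not the Arctic Embed implementation, and the temperature and embedding sizes are illustrative.

    # Minimal sketch of an in-batch-negative contrastive (InfoNCE) loss; this
    # illustrates the general technique only, not the Arctic Embed implementation.
    import torch
    import torch.nn.functional as F

    def info_nce_loss(query_emb, doc_emb, temperature=0.05):
        # query_emb, doc_emb: (batch, dim); row i of doc_emb is the positive for
        # query i, and the other rows in the batch serve as in-batch negatives.
        q = F.normalize(query_emb, dim=-1)
        d = F.normalize(doc_emb, dim=-1)
        logits = q @ d.T / temperature                     # scaled cosine similarities
        labels = torch.arange(q.size(0), device=q.device)  # positives on the diagonal
        return F.cross_entropy(logits, labels)

    # Usage with random stand-ins for encoder outputs:
    queries, docs = torch.randn(8, 256), torch.randn(8, 256)
    print(info_nce_loss(queries, docs).item())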