Name: Maximizing Training Throughput Using Torch.Compile and FSDP - Linsong Chu & Antoni Viros i Martin, IBM Research; Brian Vaughan, IBM
Start: 2024-09-18T13:55:00-0700
End: 2024-09-18T14:20:00-0700

September 18-19, 2024
San Francisco, California
Note: The schedule is subject to change.

Wednesday September 18, 2024 1:55pm - 2:20pm PDT

Room B

torch.compile is a graph compilation technique that improves GPU utilization. A key challenge in getting torch.compile to perform well is minimizing (or eliminating) graph breaks. This is not trivial: even the Llama implementation provided by Meta has many graph breaks, which reduce training throughput. In this talk we discuss (1) how we addressed these challenges in order to train a model using torch.compile; (2) how we combined torch.compile with FSDP and selective activation checkpointing to maximize training throughput; (3) how model quality compares between models trained with and without compile; and (4) the best setup we have found for different model sizes in the Llama family, achieving the highest throughput and model FLOPs utilization (MFU), e.g. 68% MFU for the 7B model on A100 GPUs.
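To give a sense of scale for the MFU figure quoted in the abstract, here is a minimal sketch of how MFU is commonly estimated. It is not the speakers' measurement code. It assumes the widely used approximation of 6 FLOPs per parameter per training token (which ignores attention FLOPs), a nominal 7e9 parameter count, and the A100's published dense BF16 peak of 312 TFLOPS; the function names and the 5000 tokens/s throughput are hypothetical values chosen for illustration.

```python
# Rough MFU estimate, assuming ~6 FLOPs per parameter per training token.
# This approximation ignores attention FLOPs, so real figures will differ.

A100_BF16_PEAK_FLOPS = 312e12  # published dense BF16 peak for one A100
N_PARAMS_7B = 7e9              # nominal parameter count (assumption)


def estimate_mfu(n_params, tokens_per_sec_per_gpu, peak_flops_per_gpu):
    """Return achieved training FLOPs/s as a fraction of the hardware peak."""
    achieved_flops = 6.0 * n_params * tokens_per_sec_per_gpu
    return achieved_flops / peak_flops_per_gpu


def tokens_per_sec_for_mfu(n_params, target_mfu, peak_flops_per_gpu):
    """Invert the estimate: per-GPU throughput implied by a given MFU."""
    return target_mfu * peak_flops_per_gpu / (6.0 * n_params)


# Hypothetical throughput, for illustration only.
mfu = estimate_mfu(N_PARAMS_7B, 5000.0, A100_BF16_PEAK_FLOPS)

# Per-GPU throughput that 68% MFU would imply under the same assumptions.
needed = tokens_per_sec_for_mfu(N_PARAMS_7B, 0.68, A100_BF16_PEAK_FLOPS)

print(f"MFU at 5000 tokens/s/GPU: {mfu:.1%}")
print(f"tokens/s/GPU implied by 68% MFU: {needed:.0f}")
```

Under these assumptions, 68% MFU for a 7B model corresponds to roughly five thousand tokens per second per A100. The exact figure depends on how FLOPs are counted, so treat it as an order-of-magnitude check only.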

Speakers

Antoni Viros i Martin

Staff Research Scientist, IBM Research

Antoni is currently a Research Scientist at IBM Research, investigating optimization approaches for ML inference and training, with a focus on open-source technologies such as PyTorch. He holds a PhD in Aerospace Engineering from Texas A&M University, and has previously worked at...

Linsong Chu

Senior Technical Staff Member, IBM Research

Linsong is an STSM at IBM Research, focusing on FSDP, torch.compile, and FP8 for pre-training.

Brian Vaughan

Senior Technical Staff Member, IBM

An STSM at IBM focusing on foundation models.

Breakout Sessions

Audience Any

PyTorch Conference 2024
