Name: A Distributed Stateful Dataloader for Large-Scale Pretraining - Davis Wertheimer, IBM & Linsong Chu, IBM Research
Start: 2024-09-18T14:55:00-0700
End: 2024-09-18T15:20:00-0700

September 18-19, 2024
San Francisco, California
View More Details & Registration
Note: The schedule is subject to change.

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for PyTorch Conference 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

This schedule is automatically displayed in Pacific Daylight Time (UTC-7). To see the schedule in your preferred timezone, please select from the drop-down located at the bottom of the menu to the right.

IMPORTANT NOTE: Timing of sessions and room locations are subject to change.

Wednesday September 18, 2024 2:55pm - 3:20pm PDT

Room C

Large-scale model pretraining crucially relies on specialized and dedicated dataloaders that can, for example, partition and stream data asynchronously across multiple processes and physical nodes. In this talk we discuss one of the torch-native dataloaders we built and use at IBM Research for addressing these needs. Intended for use in large-scale model pretraining, particularly in research settings where rapid iteration between datasets may be required, our dataloader is distributed, stateful, checkpointable, composable and rescalable – while remaining a simple extension of the existing PyTorch dataloading framework. It automatically and invisibly handles data sharding, shuffling, subdataset weighting, checkpoint saving and loading, and custom user-defined preprocessing functions, with minimal overhead and high throughput. We discuss these properties and how we achieved them, such as reducing overhead by implementing a custom LCG random number generator, and demonstrate proof of concept on production-scale training of a 7B parameter Llama model over 4 trillion tokens.

Speakers

Davis Wertheimer

Staff Research Scientist, IBM

Davis Wertheimer earned his Ph.D. in Computer Science at Cornell University in 2022, conducting research under Bharath Hariharan on few-shot learning and machine learning under constraints. He now researches and develops AI models for IBM, training and accelerating large language... Read More →

LINSONG CHU

Senior Technical Staff Member, IBM Research

Linsong is a STSM at IBM Research, focusing on FSDP, torch compile and FP8 in the area of pre-training.

Wednesday September 18, 2024 2:55pm - 3:20pm PDT
Room C

Breakout Sessions

Audience Any

PyTorch Conference 2024

Davis Wertheimer

LINSONG CHU

Sign up or log in to save this to your schedule, view media, leave feedback and see who's attending!