September 18-19, 2024
San Francisco, California
Note: The schedule is subject to change.

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for PyTorch Conference 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

This schedule is automatically displayed in Pacific Daylight Time (UTC-7). To see the schedule in your preferred timezone, please select from the drop-down located at the bottom of the menu to the right.

IMPORTANT NOTE: Timing of sessions and room locations are subject to change.

Festival Pavilion - Breakout Room B
Wednesday, September 18
 

11:10am PDT

Sponsored Session: NeMo-Aligner: A Scalable Toolkit for Model Alignment - Gerald Shen & Jimmy Zhang, NVIDIA
Wednesday September 18, 2024 11:10am - 11:35am PDT
Aligning AI models with human values and preferences is essential for making them safe and helpful. However, building an efficient and scalable toolkit for alignment can be challenging, especially when applied to state-of-the-art foundation models with billions or trillions of parameters. NeMo-Aligner is an open-source, optimized, and scalable toolkit that implements alignment algorithms such as Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), SteerLM, and Self-Play Fine-Tuning (SPIN). This talk will introduce NeMo-Aligner and show the steps we took to design and optimize the toolkit around various alignment algorithms. In particular, we discuss the RLHF implementation, where we observe close to a 7x speedup and excellent scaling performance by adding TRT-LLM integration, carefully orchestrating communication, and utilizing fast training kernels. We are able to align state-of-the-art open-source models with NeMo-Aligner and hope our framework enables the community to performantly customize, fine-tune, and align foundation models at any scale.
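For context on the preference-optimization family named above, here is a minimal, illustrative sketch of the DPO objective in PyTorch. This is not NeMo-Aligner's implementation; the tensor names and the beta value are assumptions for the example.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss (Rafailov et al., 2023).

    Each argument is a tensor of per-sequence log-probabilities summed over
    response tokens; beta controls how far the policy may drift from the
    reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the reward margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```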
Speakers
Gerald Shen

Engineer, NVIDIA
Gerald Shen is a member of the NVIDIA NeMo NLP Team specializing in model alignment. He leads the development of the NeMo-Aligner toolkit, a scalable toolkit to align large language models. This toolkit has been used to align models at NVIDIA with algorithms such as reinforcement...
Jimmy Zhang

Machine Learning Engineer, NVIDIA
Jimmy Zhang is a Senior Deep Learning Architect at NVIDIA. His work focuses on researching and developing the performance of deep learning frameworks, including NeMo and Megatron-LM. He completed his M.S. at UIUC, where he was mentored by Professor Rakesh Kumar.
Wednesday September 18, 2024 11:10am - 11:35am PDT
Festival Pavilion - Breakout Room B

11:40am PDT

Lightning Talk: HieroGlyph2Text: A PyTorch-Powered Pipeline for Automated Egyptian Hieroglyph Translation from Image - Susi Gentsch, University of Bonn
Wednesday September 18, 2024 11:40am - 11:50am PDT
HieroGlyph2Text is an innovative PyTorch-powered pipeline that automates the detection and classification of Egyptian hieroglyphs from large image inputs and attempts their translation. It addresses the challenge of decoding and translating ancient hieroglyphic inscriptions, traditionally a time-consuming and specialized task. The pipeline leverages PyTorch to create custom models: 1. Object Detection: YOLOv8 accurately detects individual hieroglyphs within images. 2. Image Classification: A custom ResNet model built with PyTorch achieves state-of-the-art accuracy in assigning Gardiner Codes to hieroglyphs. 3. Translation: The Gardiner Code outputs from the ResNet model are integrated with Llama3, a large language model (LLM), using Retrieval-Augmented Generation (RAG) and a custom dataset based on Gardiner Codes and their respective descriptions and ideograms. Key highlights include accurate hieroglyph detection and state-of-the-art classification performance through an optimized ResNet model. This pipeline lays the groundwork for collaboration with subject matter experts to refine the translation process and democratize access to ancient Egyptian hieroglyphic knowledge.
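As a rough sketch of how such a detect-then-classify pipeline can be glued together in PyTorch: the weights, file names, and class count below are placeholders, not the speaker's actual code.

```python
import torch
from PIL import Image
from torchvision import models, transforms
from ultralytics import YOLO  # YOLOv8

detector = YOLO("hieroglyph_yolov8.pt")  # hypothetical fine-tuned detector weights
classifier = models.resnet50()
# Gardiner's sign list has on the order of 700+ signs; the head size is an assumption.
classifier.fc = torch.nn.Linear(classifier.fc.in_features, 763)
classifier.eval()

preprocess = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

image = Image.open("inscription.jpg")
gardiner_ids = []
for box in detector(image)[0].boxes.xyxy:      # one (x1, y1, x2, y2) box per detected glyph
    crop = image.crop(tuple(box.tolist()))
    with torch.no_grad():
        logits = classifier(preprocess(crop).unsqueeze(0))
    gardiner_ids.append(int(logits.argmax()))  # index into a Gardiner Code lookup table
# gardiner_ids would then feed the RAG + Llama3 translation stage.
```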
Speakers
Susi Gentsch

Student, University of Bonn
Driven by applying deep learning to real-world challenges, Susi is a Computer Science student finishing her degree at the University of Bonn. Her projects include teaching a robot to autonomously detect and collect trash using YOLOv5 and ROS, and adapting YOLOv5 to identify archaeological...
Wednesday September 18, 2024 11:40am - 11:50am PDT
Festival Pavilion - Breakout Room B

11:55am PDT

Lightning Talk: Mobile Computational Photography with PyTorch: Low-Light Denoising - Alexis Baudron, Sony
Wednesday September 18, 2024 11:55am - 12:05pm PDT
Over the last decade, smartphone cameras have improved significantly, becoming the primary device people use for capturing everyday moments and high-quality photographs. This progress is largely due to advances in computational photography and novel image sensors. Computational photography enables great images from compact mobile cameras, enhancing photos through various techniques such as multi-shot merging. Despite these advancements, challenges such as noise, artifacts, and distortions persist, especially in low-light conditions where limited light increases noise levels. In this lightning talk, we will explore how PyTorch can be used to design and optimize deep learning networks for real-time low-light denoising. We will dive into noise modeling, data generation, physics-aware models, and advanced network architectures for effective denoising in challenging low-light scenarios. Attendees will gain practical insights into the latest advancements in mobile computational photography using PyTorch.
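As an illustration of the noise modeling and data generation steps mentioned above, a common approach is to synthesize training pairs with a Poisson-Gaussian (shot plus read) noise model; the parameter values below are illustrative, not calibrated to any real sensor.

```python
import torch

def add_shot_read_noise(clean, gain=0.01, read_std=0.005):
    """Synthesize a noisy low-light frame from a clean one.

    Shot noise is signal-dependent (Poisson in photon counts); read noise
    is signal-independent (Gaussian). Parameters here are made up.
    """
    shot = torch.poisson(clean / gain) * gain          # photon (shot) noise
    return shot + read_std * torch.randn_like(clean)   # sensor read noise

clean = torch.rand(1, 3, 256, 256) * 0.1  # a dim scene in [0, 0.1]
noisy = add_shot_read_noise(clean)        # (noisy, clean) training pair
```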
Speakers
Alexis Baudron

Senior AI Researcher, Sony
Alexis Baudron is a Senior AI Researcher at Sony, where his team specializes in building AI models to tackle complex computer vision challenges. His background is in computational photography, developing advanced techniques for image enhancement and artifact removal. Alexis earned...
Wednesday September 18, 2024 11:55am - 12:05pm PDT
Festival Pavilion - Breakout Room B

2:10pm PDT

Maximizing Training Throughput Using Torch.Compile and FSDP - Linsong Chu & Antoni Viros i Martin, IBM Research; Brian Vaughan, IBM
Wednesday September 18, 2024 2:10pm - 2:35pm PDT
torch.compile is a graph compilation technique that improves GPU utilization. A key challenge in getting torch.compile to perform well is minimizing (or eliminating) graph breaks; this isn't trivial, as even the Llama implementation provided by Meta has many graph breaks, resulting in reduced training throughput. In this talk we discuss: 1. how we addressed these challenges in order to train a model using torch.compile, 2. how we combined torch.compile with FSDP and selective activation checkpointing to achieve maximum training throughput, 3. a model quality comparison between models trained with and without compile, and lastly 4. the best setup we have for different model sizes in the Llama family to achieve the maximum throughput and MFU numbers (e.g., 68% MFU for the 7B model on A100 GPUs!).
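A minimal sketch of that workflow, with a toy module standing in for a Llama block (the FSDP step assumes a distributed process group has already been initialized, e.g. via torchrun):

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096))

# 1) Surface graph breaks before committing to a long training run.
explanation = torch._dynamo.explain(model)(torch.randn(8, 4096))
print(explanation.graph_break_count, explanation.break_reasons)

# 2) Compose FSDP with torch.compile; use_orig_params=True is required for
#    the compiled module to trace through the sharded parameters.
model = torch.compile(FSDP(model, use_orig_params=True))
```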
Speakers
Antoni Viros i Martin

Research Scientist, IBM Research
Antoni is currently a Research Scientist at IBM Research, investigating optimization approaches for ML inference and training, with a focus on open-source technologies such as PyTorch. He holds a PhD in Aerospace Engineering from Texas A&M University, and has previously worked at...
Linsong Chu

Senior Technical Staff Member, IBM Research
Linsong is an STSM at IBM Research, focusing on FSDP, torch.compile, and FP8 in the area of pre-training.
Brian Vaughan

Senior Technical Staff Member, IBM
An STSM at IBM focusing on foundation models.
Wednesday September 18, 2024 2:10pm - 2:35pm PDT
Festival Pavilion - Breakout Room B

2:40pm PDT

Running State-of-Art Gen AI Models on-Device with NPU Acceleration - Felix Baum, Qualcomm
Wednesday September 18, 2024 2:40pm - 3:05pm PDT
Since the boom of generative AI, the industry has been moving toward on-device AI inference, which is not only a trend but a necessity for saving costs while achieving the best inference performance and ultra-low latency at the lowest possible power. In this session we go over the new features added to the Qualcomm AI Stack and how it works with the public release of ExecuTorch 1.0. We will discuss how to run traditional workloads as well as GenAI use cases, including the latest version of Llama, on a mobile device using the Qualcomm Hexagon NPU.
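For orientation, the basic ExecuTorch export flow looks roughly like the sketch below (a toy module; delegating to the Hexagon NPU would additionally apply Qualcomm's QNN partitioner, which is omitted here):

```python
import torch
from executorch.exir import to_edge

class Tiny(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x)

# Capture the model with torch.export, lower to the Edge dialect, and
# serialize an ExecuTorch program for the on-device runtime.
exported = torch.export.export(Tiny(), (torch.randn(1, 8),))
program = to_edge(exported).to_executorch()
with open("tiny.pte", "wb") as f:
    f.write(program.buffer)
```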
Speakers
Felix Baum

Senior Director of Product Management, Qualcomm
Felix Baum has an extensive background of over two decades in the embedded industry, where he has excelled both as an embedded developer and a product manager. Currently he is responsible for AI Software Products at Qualcomm. Prior to that, he led efforts for various real-time operating...
Wednesday September 18, 2024 2:40pm - 3:05pm PDT
Festival Pavilion - Breakout Room B
  Breakout Sessions

3:10pm PDT

TorchInductor CPU Backend Advancements: New Features and Performance Improvements - Jiong Gong & Leslie Fang, Intel
Wednesday September 18, 2024 3:10pm - 3:35pm PDT
This presentation provides an update on the latest advancements in the TorchInductor CPU backend since the last conference to bring best-in-class CPU performance to broad DL workloads. We will discuss new features and performance enhancements, including:
• Max-autotune support with codegen for GEMMs, boosting performance for GEMM-related operations
• Enhanced vectorized codegen support, now covering all data types beyond floating point, with flexible vector factors and optimized loop scheduling
• Comprehensive quantization support, including weight-only quantization (WoQ), and optimizations for dynamic quantization and quantization-aware training
• Improved attention support, featuring attention masks and optimizing softmax via FlashAttention-2, etc.
• AOTInductor support, enabling high-performance inference with frozen weights
• Native Windows support, with improved vectorization capabilities
These advancements, combined with ongoing optimizations, have resulted in significant performance improvements since PyTorch 2.1, demonstrated through extensive benchmarks and large language models (LLMs).
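For example, the max-autotune mode in the first bullet is exposed directly through torch.compile; a trivial GEMM-heavy toy function is enough to exercise it on a CPU-only machine:

```python
import torch

def mlp(x, w1, w2):
    return torch.relu(x @ w1) @ w2

# mode="max-autotune" enables Inductor's GEMM autotuning and codegen.
compiled = torch.compile(mlp, mode="max-autotune")

x = torch.randn(64, 512)
w1 = torch.randn(512, 512)
w2 = torch.randn(512, 512)
out = compiled(x, w1, w2)  # runs through the TorchInductor CPU backend
```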
Speakers
Leslie Fang

Software Engineer, Intel
Leslie is a software engineer from Intel who has worked on PyTorch performance optimization on x86 servers for the past 4 years. Currently, he is mainly focusing on the feature domains of Quantization, Autocast, and the Inductor CPP/OpenMP backend in stock PyTorch.
Jiong Gong

Principal Engineer, Intel
Jiong is a software architect from Intel who works on PyTorch framework optimizations. He is the PyTorch module maintainer for CPU and compiler.
Wednesday September 18, 2024 3:10pm - 3:35pm PDT
Festival Pavilion - Breakout Room B
  Breakout Sessions

4:00pm PDT

[HALIDE] A Halide Backend for TorchInductor - Jason Ansel, Meta
Wednesday September 18, 2024 4:00pm - 4:10pm PDT
This talk will focus on a new Halide backend for TorchInductor, which is in addition to the existing Triton and C++ backends.  The Halide backend is meant to serve as a reference backend to make it easier to extend TorchInductor to support new backend compilers and hardware devices.  Halide has been the inspiration (either in ideas or through forking) of numerous other compiler projects, so it is a good starting point for adding new backends that follow a Halide-like model.
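Selecting the Halide backend amounts to flipping Inductor config knobs; the names below follow the upstream work adding the backend and may differ across PyTorch versions, so treat this as a sketch:

```python
import torch
import torch._inductor.config as inductor_config

# Assumed config names from the Halide-backend work: the defaults are
# "cpp" for CPU and "triton" for CUDA.
inductor_config.cpu_backend = "halide"
inductor_config.cuda_backend = "halide"

fn = torch.compile(lambda x: torch.sin(x) + x)
print(fn(torch.randn(16)))  # codegen now goes through Halide
```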
Speakers
Jason Ansel

Research Scientist, Meta
Jason Ansel is a Research Scientist at Meta AI and a technical lead for PyTorch compilers. He started the TorchDynamo and TorchInductor projects, which bring flexible graph capture and a high performance compiler to PyTorch 2. He received a Ph.D. from MIT CSAIL in 2014 with research...
Wednesday September 18, 2024 4:00pm - 4:10pm PDT
Festival Pavilion - Breakout Room B

4:10pm PDT

[MLIR] Enabling Composition of Kernels and Compilers - Jacques Pienaar, Google
Wednesday September 18, 2024 4:10pm - 4:20pm PDT
Hand-written kernels and compilers have both long been part of the toolbox for providing efficient and broad coverage. These approaches have often been positioned as being at odds with one another, and indeed the software solutions on either side have sometimes made it so. MLIR, since its inception, has aimed to enable general, beneficial composition instead: rather than treating kernels as a black-box escape hatch, treat them as a peer in solving serving needs. This is not magic, and it requires consideration of how best to combine the two. In this talk I'll present the approach and its effect in both IREE and OpenXLA.
Speakers
Jacques Pienaar

SWE, Google
Jacques Pienaar is a lead of the ML Compiler Systems Research team at Google Deepmind. In this role he focuses on accelerating and simplifying machine learning for high-performance model deployment across various architectures. He is one of the founders of MLIR, a founding member...
Wednesday September 18, 2024 4:10pm - 4:20pm PDT
Festival Pavilion - Breakout Room B

4:20pm PDT

[TRITON] Maximizing Kernel Development Productivity Under Performance Constraints - Philip Tillet, OpenAI
Wednesday September 18, 2024 4:20pm - 4:30pm PDT
Machine Learning research workflows are often bottlenecked by the development of compute kernels for new algorithms and GPU architectures. This process can be daunting, and often requires a careful trade-off between productivity and performance. In this talk, we will discuss how Triton -- a mid-level programming language for kernel development -- approaches this multi-objective optimization problem, and the design decisions that were made to that effect.
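For readers new to Triton, the canonical vector-add kernel (adapted from the Triton tutorials) shows the mid-level programming model: blocked offsets, masks, and an explicit launch grid.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)                 # each program handles one block
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n                             # guard the ragged last block
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
```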
Speakers
Phil Tillet

Member Of Technical Staff, OpenAI
Phil first began working with GPUs in 2011 as a contributor to the ViennaCL library. He then received his B.S. from Telecom SudParis (France) in 2012, his M.S. from NCTU (Taiwan) in 2014, and his Ph.D. from Harvard University in 2020. He joined OpenAI full time in 2020 to pursue his...
Wednesday September 18, 2024 4:20pm - 4:30pm PDT
Festival Pavilion - Breakout Room B

4:30pm PDT

[TVM] Universally Deploy Large-language Models via ML Compilation - Tianqi Chen, CMU & OctoAI
Wednesday September 18, 2024 4:30pm - 4:40pm PDT
Deploying deep learning models on various devices has become an important topic. Machine learning compilation is an emerging field that leverages compiler and automatic search techniques to accelerate AI models. ML compilation brings a unique set of challenges: emerging machine learning models; increasing hardware specialization, which brings a diverse set of acceleration primitives; and a growing tension between flexibility and performance. In this talk, I discuss our experience in bringing foundational models to a variety of devices and hardware environments through machine learning compilation.
Speakers
Tianqi Chen

Assistant Professor, CMU
Tianqi Chen is currently an Assistant Professor at the Machine Learning Department and Computer Science Department of Carnegie Mellon University. He is also the Chief Technologist of OctoAI. He received his PhD. from the Paul G. Allen School of Computer Science & Engineering at the...
Wednesday September 18, 2024 4:30pm - 4:40pm PDT
Festival Pavilion - Breakout Room B

4:40pm PDT

[MOJO] Lifting PT to New Heights with MAX and Mojo - Mikhail Zolotukhin, Modular
Wednesday September 18, 2024 4:40pm - 4:50pm PDT
In this talk we'll peek into Modular's inference engine: how it builds on and works with PyTorch, and what is unique about it. We will look into how the Mojo language can be used to define performant kernels and what optimizations the inference engine can perform. We will also talk briefly about our experience of developing a third-party backend for torch.compile.
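The torch.compile seam that third-party engines like Modular's plug into is small: a backend is any callable that takes the captured FX graph and example inputs and returns a callable. The stand-in below just prints the graph and falls back to eager execution.

```python
import torch

def my_backend(gm: torch.fx.GraphModule, example_inputs):
    gm.graph.print_tabular()  # inspect what Dynamo captured
    return gm.forward         # a real backend would return compiled code

fn = torch.compile(lambda x: torch.relu(x) + 1, backend=my_backend)
fn(torch.randn(4))
```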
Speakers
Mikhail Zolotukhin

Software Engineering Manager, Modular
Mikhail is an open source enthusiast with contributions ranging from GCC and LLVM to PyTorch. Currently he is at Modular leading a team working on integration of Modular's inference stack with PyTorch.
Wednesday September 18, 2024 4:40pm - 4:50pm PDT
Festival Pavilion - Breakout Room B

4:50pm PDT

Together Goes Brrr: Threading Research & Production with Torch Compile - Pragaash Ponnusamy, together.ai
Wednesday September 18, 2024 4:50pm - 5:00pm PDT
The deployment of large language models for inference at scale is inherently complex, often requiring intricate optimizations across compute-bound and memory-bound regimes. This talk explores how PyTorch's torch.compile has revolutionized the optimization landscape for LLM serving at Together AI. Through its sophisticated Dynamo tracer and Inductor backend, torch.compile has transformed the approach to critical performance bottlenecks in both the prefill and decode phases of inference. We examine how automatic vertical fusion, epilogue optimization, and adaptive kernel generation across batch sizes for GEMV and GEMM workloads address key efficiency concerns, from CUDA graph captures and optimized all-reduce strategies to custom kernel registrations. The presentation highlights Together AI's journey in leveraging torch.compile to streamline the transition from research to production, significantly simplifying the deployment process even for custom architectures. By automating many performance-critical optimizations, torch.compile has not only enhanced inference efficiency but also democratized high-performance LLM deployment. We'll conclude by sharing key lessons learned and best practices gleaned from Together AI's experience deploying torch.compile to production, serving billions of user queries and navigating the complexities of large-scale LLM inference.
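As a flavor of one of the techniques above: the decode phase is dominated by many small, latency-bound kernels, and compiling it with mode="reduce-overhead" lets Inductor replay the step as a CUDA graph. The function below is a toy stand-in for a single-token forward, not Together AI's code.

```python
import torch

@torch.compile(mode="reduce-overhead")  # wraps the step in CUDA graph capture/replay
def decode_step(hidden):
    return torch.nn.functional.gelu(hidden @ hidden.T)  # placeholder compute

h = torch.randn(1, 4096, device="cuda")
for _ in range(3):  # early iterations warm up and capture; later ones replay
    out = decode_step(h)
```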
Speakers
Pragaash Ponnusamy

Senior Staff AI/ML Researcher, Together AI
Wednesday September 18, 2024 4:50pm - 5:00pm PDT
Festival Pavilion - Breakout Room B

5:00pm PDT

DL Compiler Panel Discussion - Philip Tillet, OpenAI; Jason Ansel, Meta; Jacques Pienaar, Google; Tianqi Chen, CMU & OctoAI; Mikhail Zolotukhin, Modular; Peng Wu, Meta
Wednesday September 18, 2024 5:00pm - 5:30pm PDT
Since the release of PyTorch 2 in 2023, torch.compile() has spurred significant new thinking around DL compiler designs at the framework level. In this session, we invite leaders in this space to share their insights based on real experiences of building DL compilers – Triton, TorchInductor, Halide, TVM, OpenXLA, and Mojo – and growing their ecosystems. We also invite a ‘compiler user representative,’ together.ai, to share their recent journey of redesigning the LLM inference stack around torch.compile(). Each leader will give a 10-minute lightning talk, followed by an engaging panel discussion.
Speakers
Peng Wu

Engineering Manager, Meta
Dr. Peng Wu is the engineering manager of the PyTorch Compiler team at Meta. Dr. Wu spent over a decade at IBM research, working on many aspects of programming systems. She then founded the Programming Technologies Lab at Huawei and led its growth for six years. At Meta, she...
Phil Tillet

Member Of Technical Staff, OpenAI
Phil first began working with GPUs in 2011 as a contributor to the ViennaCL library. He then received his B.S. from Telecom SudParis (France) in 2012, his M.S. from NCTU (Taiwan) in 2014, and his Ph.D. from Harvard University in 2020. He joined OpenAI full time in 2020 to pursue his...
Mikhail Zolotukhin

Software Engineering Manager, Modular
Mikhail is an open source enthusiast with contributions ranging from GCC and LLVM to PyTorch. Currently he is at Modular leading a team working on integration of Modular's inference stack with PyTorch.
Tianqi Chen

Assistant Professor, CMU
Tianqi Chen is currently an Assistant Professor at the Machine Learning Department and Computer Science Department of Carnegie Mellon University. He is also the Chief Technologist of OctoAI. He received his PhD. from the Paul G. Allen School of Computer Science & Engineering at the...
Jacques Pienaar

SWE, Google
Jacques Pienaar is a lead of the ML Compiler Systems Research team at Google Deepmind. In this role he focuses on accelerating and simplifying machine learning for high-performance model deployment across various architectures. He is one of the founders of MLIR, a founding member...
Jason Ansel

Research Scientist, Meta
Jason Ansel is a Research Scientist at Meta AI and a technical lead for PyTorch compilers. He started the TorchDynamo and TorchInductor projects, which bring flexible graph capture and a high performance compiler to PyTorch 2. He received a Ph.D. from MIT CSAIL in 2014 with research...
Wednesday September 18, 2024 5:00pm - 5:30pm PDT
Festival Pavilion - Breakout Room B
 
Thursday, September 19
 

10:50am PDT

The Rise of `Transformers` in the Growing PyTorch Ecosystem - Arthur Zucker, Hugging Face
Thursday September 19, 2024 10:50am - 11:15am PDT
Explore how the `transformers` library grows and adapts to the fast-paced and ever-changing AI field to bring the best to the AI community.
Speakers
Arthur Zucker

Core Maintainer, Hugging Face
Arthur is a Core maintainer at Hugging Face, maintaining several critical libraries such as transformers and tokenizers. He is the owner of the text and LLM parts of Hugging Face's open-source toolkits, resulting in the implementations of LLaMa, Mistral, MoEs, etc and torch.compile...
Thursday September 19, 2024 10:50am - 11:15am PDT
Festival Pavilion - Breakout Room B

11:20am PDT

Training MoEs at Scale with PyTorch - Mihir Patel & Brian Chu, Databricks
Thursday September 19, 2024 11:20am - 11:45am PDT
Mixture-of-Experts (MoE) models are becoming an increasingly popular architecture choice for large language models (LLMs). In this talk, we describe how to train MoE models with PyTorch. After discussing various performance tradeoffs, we use PyTorch distributed tools like DTensor to build custom parallelism approaches, including expert parallelism via MegaBlocks. We then show how to get near-linear scaling to thousands of GPUs, combining PyTorch FSDP and HSDP with our parallelism strategies. We discuss many of the challenges of training at scale, including communication bottlenecks, hardware failures, and networking challenges. We further improve training at scale using tools like PyTorch Distributed Checkpointing for rapid saving and loading. We then highlight further optimizations to minimize challenges only present at scale, such as object store failures for large checkpoints.
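A minimal sketch of the HSDP-style setup described above, using a 2-D DeviceMesh (this assumes torchrun has launched 8 ranks; the shapes are illustrative, and the MegaBlocks expert-parallel pieces are omitted):

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# Replicate across 2 groups and shard within groups of 4 (2 x 4 = 8 ranks).
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))

model = torch.nn.Linear(4096, 4096).cuda()
model = FSDP(model, device_mesh=mesh,
             sharding_strategy=ShardingStrategy.HYBRID_SHARD)
```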
Speakers
Mihir Patel

Research Engineer, Databricks
Mihir Patel is a Research Engineer at MosaicML / Databricks, where he works on distributed training at scale and serves as the tech lead for Composer, an open-source deep learning training library. His primary focus is on large model training, and he has helped build several open...
Brian Chu

Research Engineer, Databricks
Brian is a Research Engineer at MosaicML / Databricks, where he contributes to Composer and Foundry, open-source libraries for training LLMs. He has been involved in the DBRX project and products like the Databricks finetuning and pretraining API. Prior to joining Databricks, Brian...
Thursday September 19, 2024 11:20am - 11:45am PDT
Festival Pavilion - Breakout Room B

11:50am PDT

Lightning Talk: Empowering Developers: Tools and Resources for Running Generative AI on Arm CPUs - Pareena Verma, Arm
Thursday September 19, 2024 11:50am - 12:00pm PDT
As the demand for accessible and scalable AI solutions grows, leveraging CPUs for generative AI offers significant advantages in cost, energy efficiency, and widespread availability. This session aims to equip developers with the ecosystem of tools, resources, and technical content needed to effectively run generative AI use cases on Arm CPUs. We have launched a range of easily digestible tutorials for developers, part of our Learning Paths on https://learn.arm.com/, which demonstrate how you can easily and efficiently run small and large language models on Arm-based devices. Learn about end-to-end workflows to accelerate PyTorch-based sentiment analysis models from Hugging Face on Arm servers with optimizations in Arm Compute Library kernels for fp32 and bfloat16. Use the new KleidiAI library to accelerate LLMs with AI frameworks, and build an Android chat app on your Arm mobile device with ExecuTorch and XNNPACK. Find out about our roadmap for learning content demonstrating the feasibility and successful deployment of generative AI on Arm-based devices. Help us shape the support that we offer developers.
Speakers
Pareena Verma

Principal Solutions Architect, Arm
Pareena is a Principal Solutions Architect at Arm. She has extensive experience working with software developers and SoC architects on numerous Arm based projects involving usage of modeling, ML frameworks, compilers, debuggers and virtual prototyping simulation tools. Pareena holds...
Thursday September 19, 2024 11:50am - 12:00pm PDT
Festival Pavilion - Breakout Room B

12:00pm PDT

Lightning Talk: Optimized PyTorch Inference on aarch64 Linux CPUs - Sunita Nadampalli, Amazon (AWS)
Thursday September 19, 2024 12:00pm - 12:10pm PDT
Over the last two years we've optimized the performance of PyTorch on Arm processors. The optimizations have included changes to ATen, C10, MKLDNN operators, the GEMM backend, and TorchInductor. In many cases, instead of writing our own kernels we integrated the Arm Compute Library, used fastmath kernels with formats like bf16, implemented operator caching, and selected the optimal backend based on the input context. Through these optimizations we improved performance by over 2x. In this presentation we will first talk about how we approached this process, what the optimizations are, performance numbers for AWS Graviton3 processors across around 75 models, and CI/CD workflow details. Next, we will walk through a sample PyTorch application, showing basic usage, how to tune the runtime, and the resulting speedup. By the end of the presentation attendees will have learned about PyTorch performance optimizations on Arm processors, how to use them, and the areas where they can collaborate to further improve PyTorch for aarch64 CPUs.
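A sketch of what exercising these optimizations can look like in practice, following AWS's published Graviton guidance (the environment variables are real knobs from that guidance, but defaults and names can vary by PyTorch release):

```python
import os
# Set before importing torch: route eligible fp32 GEMMs through bf16
# fastmath kernels, enable transparent huge pages, and cache primitives.
os.environ["DNNL_DEFAULT_FPMATH_MODE"] = "BF16"
os.environ["THP_MEM_ALLOC_ENABLE"] = "1"
os.environ["LRU_CACHE_CAPACITY"] = "1024"

import torch
import torchvision

model = torchvision.models.resnet50(weights="DEFAULT").eval()
with torch.inference_mode():
    out = torch.compile(model)(torch.randn(1, 3, 224, 224))
```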
Speakers
Sunita Nadampalli

Software Development Manager, Amazon/AWS
Sunita Nadampalli is a Software Development Manager at AWS. She leads Graviton software performance optimizations for AI/ML and HPC workloads. She is passionate about open source software development and delivering high-performance and sustainable software solutions with Arm SoCs...
Thursday September 19, 2024 12:00pm - 12:10pm PDT
Festival Pavilion - Breakout Room B
  Lightning Talks
  • Audience Any
  • Slides Attached Yes

12:10pm PDT

Lightning Talk: AOTriton: Ahead of Time Triton Kernel Libraries on ROCm - Jeff Daily, AMD
Thursday September 19, 2024 12:10pm - 12:20pm PDT
Scaled dot product attention provides significant acceleration of the transformer layer through fusion of the multihead attention layer. There are several different algorithms to achieve this, but tiled attention via Flash Attention is a very popular approach. In PyTorch on the ROCm platform, this is currently achieved through ahead-of-time compiled (AOT) Triton kernels in a linkable archive. AMD's work to enable and package these kernels is done through AOTriton, which aims to use Triton's compiler and GPU kernels for faster development. AOTriton maintains an optimized set of tiling sizes and other parameters to provide optimized, pre-compiled Triton kernels. The differences between JIT and AOT are few but very important. Despite them, prototyping kernels in Triton is much faster than with template-based C++ libraries. In this presentation we will go into detail on the interaction layer between PyTorch and AOTriton, the structure of AOTriton, and how to add new Triton kernels to AOTriton.
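From the PyTorch side, the AOTriton kernels are reached through the regular scaled dot product attention API; dispatch can be pinned to the Flash Attention backend as below (shapes are arbitrary, and the context manager requires PyTorch 2.3+):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q, k, v = (torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

# Constrain dispatch to Flash Attention, which on ROCm is served by the
# pre-compiled AOTriton kernels described above.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```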
Speakers
Jeff Daily

Principal Member of Technical Staff, Advanced Micro Devices
Jeff Daily is the chief architect of the Machine Learning Software Engineering group supporting ML frameworks such as PyTorch and onnxruntime on AMD GPUs. He enjoys delivering open source software to answer the challenges of the rapidly-changing ML landscape. For over five years...
Thursday September 19, 2024 12:10pm - 12:20pm PDT
Festival Pavilion - Breakout Room B

2:15pm PDT

vLLM: Easy, Fast, and Cheap LLM Serving for Everyone - Woosuk Kwon & Xiaoxuan Liu, UC Berkeley
Thursday September 19, 2024 2:15pm - 2:40pm PDT
We will present vLLM, an open-source high-performance LLM inference engine built on top of PyTorch. Starting as a research project at UC Berkeley, vLLM has been one of the fastest and most popular LLM inference solutions in industry, reaching 20K+ stars and 350+ contributors. In this talk, we will cover how vLLM adopts various LLM inference optimizations and how it supports various AI accelerators such as AMD GPUs, Google TPUs, and AWS Inferentia. Also, we will discuss how vLLM benefits from PyTorch 2 and its ecosystem.
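The user-facing API is intentionally small; a minimal offline-inference example (the model name is just one example of an HF-format checkpoint):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["What is PagedAttention?"], params)
print(outputs[0].outputs[0].text)
```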
Speakers
Lily Liu

Student, UCB
Lily (Xiaoxuan) Liu is a PhD student at UC Berkeley, working with Professors Ion Stoica and Alvin Cheung. Her research focuses on machine learning systems, particularly optimizing latency for LLM inference and addressing memory bottlenecks in LLM systems. Her recent work explores...
Woosuk Kwon

PhD Student, UC Berkeley
Woosuk Kwon is a Ph.D. student at UC Berkeley, advised by Prof. Ion Stoica. He is interested in building practical, flexible, and high-performance software systems for emerging applications such as large language models. Recently, he has been developing vLLM, a high-performance open-source...
Thursday September 19, 2024 2:15pm - 2:40pm PDT
Festival Pavilion - Breakout Room B

2:45pm PDT

Torchtitan: Large-Scale LLM Training Using Native PyTorch 3D Parallelism - Wanchao Liang, Meta & Linsong Chu, IBM Research
Thursday September 19, 2024 2:45pm - 3:10pm PDT
torchtitan is a proof of concept for large-scale LLM training using native PyTorch. It is a repo that showcases PyTorch's latest distributed training features in a clean, minimal codebase. We showcase end-to-end enablement of large-scale training features: 1. 3D/4D parallelism 2. Efficient distributed checkpoint save/load/resharding 3. Many efficient training techniques, including Float8, torch.compile, and activation checkpointing.
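Item 2 builds on torch.distributed.checkpoint (DCP); a minimal sketch of the save/load flow (assumes an initialized process group, e.g. via torchrun):

```python
import torch
import torch.distributed.checkpoint as dcp

model = torch.nn.Linear(8, 8)
state = {"model": model.state_dict()}

dcp.save(state, checkpoint_id="checkpoints/step-100")  # each rank writes its own shards
dcp.load(state, checkpoint_id="checkpoints/step-100")  # loading reshards across a new world size
model.load_state_dict(state["model"])
```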
Speakers
Wanchao Liang

Software Engineer, Meta Platforms, Inc.
Software Engineer at Meta on the PyTorch team, Tech Lead for PyTorch Distributed training. Author of torchtitan, Tensor Parallel, and DTensor, a fundamental distributed abstraction for performing distributed computation. Previously worked on the TorchScript compiler and ONNX.
Linsong Chu

Senior Technical Staff Member, IBM Research
Linsong is an STSM at IBM Research, focusing on FSDP, torch.compile, and FP8 in the area of pre-training.
Thursday September 19, 2024 2:45pm - 3:10pm PDT
Festival Pavilion - Breakout Room B

3:15pm PDT

Slaying OOMs - Mark Saroufim & Jane Xu, Meta
Thursday September 19, 2024 3:15pm - 3:40pm PDT
Have you ever hit an OOM (and wished you had more VRAM)? Who hasn't! Hop on the bus with us and feel the road become smoother as we talk about stacking together techniques like FSDP2 + QLoRA + CPU offloading + fused Adam (thanks, Intel) + more in PyTorch native. We will give an overview of these techniques as well as the hard edges we solved in their composition. Curious for more? Or... still OOMing? We also plan on discussing our more researchy work on offloading, pagedness, and low-precision optimizers.
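Two of those ingredients in miniature: FSDP-style CPU offloading and the fused Adam kernel. The talk pairs FSDP2 with QLoRA; this sketch shows the FSDP1-flavored knobs for brevity, under an initialized process group.

```python
import torch
from torch.distributed.fsdp import CPUOffload, FullyShardedDataParallel as FSDP

model = torch.nn.Linear(4096, 4096).cuda()

# Park sharded parameters in host RAM between uses, trading VRAM for PCIe traffic.
model = FSDP(model, cpu_offload=CPUOffload(offload_params=True))

# Fused Adam performs the update in a single multi-tensor kernel; recent
# releases also ship a CPU fused path (the Intel contribution noted above).
opt = torch.optim.AdamW(model.parameters(), lr=1e-4, fused=True)
```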
Speakers
Jane Xu

SWE, Meta
I'm Jane and I work on the PyTorch core library! Tell me your favorite optimizer, complain to me about your latest OOM, teach me about what you’re excited about.
Mark Saroufim

Software Engineer, Meta
Mark Saroufim is a PyTorch Engineer at Meta working on inference, compilers and community.
Thursday September 19, 2024 3:15pm - 3:40pm PDT
Festival Pavilion - Breakout Room B

4:05pm PDT

Understanding the LLM Inference Workload - Mark Moyou, NVIDIA
Thursday September 19, 2024 4:05pm - 4:30pm PDT
Understanding how to effectively size a production-grade LLM deployment requires understanding of the model(s), the compute hardware, quantization and parallelization methods, KV Cache budgets, input and output token length predictions, model adapter management, and much more. This talk covers:
- Why LLM inference is different from standard deep learning inference
- An overview of current and future NVIDIA GPUs: which GPU(s) for which models, and why
- Understanding the importance of building inference engines
- A deep recap of the attention mechanism, along with the different types of popular attention mechanisms used in production
- A deep dive on KV Cache and managing KV Cache budgets
- Parallelism (reducing latency): mainly tensor parallelism, but data, sequence, pipeline, and expert parallelism will be highlighted
- Quantization methods on weights, activations, and KV Cache to reduce engine sizes for more effective GPU utilization
- Increasing throughput with in-flight batching and other techniques
- Detailed performance analysis of LLM deployments, looking at time to first token, inter-token latencies, LLM deployment characterizations, and more that can help reduce deployment costs
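To make the KV Cache budget item above concrete, a back-of-envelope calculation for a Llama-2-7B-like configuration (the shapes are illustrative):

```python
# KV Cache bytes = 2 (K and V) x layers x kv_heads x head_dim
#                  x sequence length x batch size x bytes per element
layers, kv_heads, head_dim = 32, 32, 128     # Llama-2-7B-like shape
seq_len, batch, bytes_per_elt = 4096, 8, 2   # fp16/bf16 elements

kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elt
print(f"{kv_bytes / 2**30:.0f} GiB")  # 16 GiB, already comparable to the weights
```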
Speakers
Mark Moyou

Sr. Data Scientist, NVIDIA
Dr. Mark Moyou is a Senior Data Scientist at NVIDIA, working with enterprise clients on AI strategy and deploying machine learning applications to production. He is the host of the Caribbean Tech Pioneers Podcast and The AI Portfolio Podcast, and is the Director of the Optimized AI Confere...
Thursday September 19, 2024 4:05pm - 4:30pm PDT
Festival Pavilion - Breakout Room B

4:35pm PDT

Intel GPU in Upstream PyTorch: Expanding GPU Choices and Enhancing Backend Flexibility - Eikan Wang & Min Jean Cho, Intel
Thursday September 19, 2024 4:35pm - 5:00pm PDT
The integration of Intel GPU support into PyTorch marks a pivotal enhancement of the PyTorch device and runtime layers. We generalized the PyTorch device and runtime to accommodate streaming devices. This generalization not only facilitates the deployment of PyTorch on ubiquitous hardware but also makes the integration of different hardware backends easier. In addition, PyTorch with Intel GPU supports various Intel GPUs from the data center to the client, which enriches and democratizes the PyTorch hardware ecosystem. Particularly in AI PC scenarios, where Intel's integrated and discrete GPUs are prevalent, PyTorch with Intel GPU can deliver promising performance and an improved out-of-the-box experience, extending PyTorch's applicability significantly.
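From the user's perspective, the generalized runtime surfaces as one more device string; a minimal sketch against a PyTorch build with XPU support (upstream from 2.4 onward):

```python
import torch

device = "xpu" if torch.xpu.is_available() else "cpu"
model = torch.nn.Linear(1024, 1024).to(device)
model = torch.compile(model)  # Inductor generates kernels for the XPU backend
out = model(torch.randn(32, 1024, device=device))
```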
Speakers
Eikan Wang

AI Frameworks Engineer, Intel
Eikan is a staff engineer from Intel and a DL framework tech lead with full-stack experience in DL, from various AI applications to frameworks, libraries, and DL compilers. He is actively optimizing the torch.compile stack for Intel platforms, including optimizing the Inductor C++/OpenMP...
Min Jean Cho

Deep Learning Software Engineer, Intel Corporation
Thursday September 19, 2024 4:35pm - 5:00pm PDT
Festival Pavilion - Breakout Room B

5:05pm PDT

Implementing a Custom Torch.Compile Backend - A Case Study - Maanav Dalal & Yulong Wang, Microsoft
Thursday September 19, 2024 5:05pm - 5:30pm PDT
This presentation will dive into the development of the ONNXRuntime (ORT) backend for torch.compile. We'll cover the implementation process, starting with a PyTorch 2.0 generated FX graph, highlighting the unique challenges encountered when serving ORT-specific scenarios and how we solved them. Attendees will gain insights into optimizing performance, overcoming integration hurdles, and achieving efficient execution. Whether you're a developer looking to extend PyTorch's capabilities for your own use cases, keen to learn about ONNX Runtime, or interested in backend performance optimization and the many steps we've taken to get where we are now, this session promises valuable takeaways and practical knowledge.
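For reference, torch.compile ships with an ONNX Runtime-based backend registered under the name "onnxrt" (it requires the onnxruntime package); a minimal sketch:

```python
import torch

def f(x, y):
    return torch.nn.functional.gelu(x) + y

compiled = torch.compile(f, backend="onnxrt")  # FX graph -> ONNX -> ONNX Runtime
print(compiled(torch.randn(8), torch.randn(8)))
```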
Speakers
Yulong Wang

Software Engineer, Microsoft
Maanav Dalal

Program Manager, Microsoft
PM @Microsoft, working on the ONNX Exporter team. I adore learning about consumer tech and experimenting with bleeding edge software. I'm passionate about creating delightful user experiences.
Thursday September 19, 2024 5:05pm - 5:30pm PDT
Festival Pavilion - Breakout Room B
 