Name: Understanding the LLM Inference Workload - Mark Moyou, NVIDIA
Start: 2024-09-19T16:35:00-0700
End: 2024-09-19T17:00:00-0700

September 18-19, 2024
San Francisco, California
View More Details & Registration
Note: The schedule is subject to change.

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for PyTorch Conference 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

This schedule is automatically displayed in Pacific Daylight Time (UTC-7). To see the schedule in your preferred timezone, please select from the drop-down located at the bottom of the menu to the right.

IMPORTANT NOTE: Timing of sessions and room locations are subject to change.

Thursday September 19, 2024 4:35pm - 5:00pm PDT

Room B

Understanding how to effectively size a production grade LLM deployment requires understanding of the model(s), the compute hardware, quantization and parallelization methods, KV Cache budgets, input and output token length predictions, model adapter management and much more. - Why LLM inference is different to standard deep learning inference - Current and future NVIDIA GPU overview - which GPU(s) for which models and why - Understanding the importance of building inference engines - Deep recap on the attention mechanism along with different types of popular attention mechanisms used in production - Deep dive on KV Cache and managing KV Cache budgets - Parallelism (reducing latency) - mainly tensor parallelism, but data, sequence, pipeline, and expert parallelism will be highlighted - Quantization methods on weights, activations, and KV Cache to reduce engine sizes for more effective GPU utilization - Increasing throughput with inflight batching and other techniques - Detailed performance analysis of LLM deployments looking at Time to first token, inter-token latencies, llm deployment characterizations, and more that can help reduce deployment costs

Speakers

Mark Moyou

Sr. Data Scientist, NVIDIA

Dr. Mark Moyou Senior Data Scientist at NVIDIA working with enterprise clients on AI strategy and deploying machine learning applications to production. He is the host of the Caribbean Tech Pioneers Podcast, The AI Portfolio Podcast and is the Director of the Optimized AI Confere... Read More →

Thursday September 19, 2024 4:35pm - 5:00pm PDT
Room B

Breakout Sessions

Audience Intermediate

PyTorch Conference 2024

Mark Moyou

Sign up or log in to save this to your schedule, view media, leave feedback and see who's attending!