Title: The Magic 'M' in LLM is Data
Speaker: Dr. Ritesh Sarkhel, Applied Scientist, Amazon (Seattle, United States)
Details: Tue, 5 Nov 2024, 2:00 PM @ SSB 334
Abstract: Being able to leverage more data than ever has been one of the key factors behind recent breakthroughs in computer vision, speech recognition, and natural language understanding technologies. The unprecedented leverage provided by today's machine learning (ML) models, however, is bottlenecked by the availability of high-quality data needed to train and benchmark these models. This is particularly true for the current generation of models, e.g., Large Language Models (LLMs). Identifying the right data sources, extracting and annotating data from them, and choosing the right data mix for training have a far-reaching impact on our long-term goal of democratizing powerful ML models (e.g., LLMs) beyond a handful of Western countries. It is usually the responsibility of a data pipeline to perform these data operations (e.g., clustering, outlier detection, transformation) and prepare a high-quality training subset from a large-scale corpus. Scaling up existing data operators to identify 'good' training samples and discard 'bad' ones, however, is difficult due to the diversity of modern data sources. In this talk, I will describe some of the challenges of building a robust data pipeline for enterprise-scale ML models. I will categorize these challenges and describe the key lessons that we have learned from deploying ML models at Amazon-scale. Finally, I will introduce some of our recent work to address these challenges in a cost-efficient and generalizable way.
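To make the abstract's notion of a data pipeline concrete, here is a minimal sketch of the kind of chained data operators it describes: simple stages (deduplication, outlier detection) that filter 'bad' samples and keep a higher-quality training subset from a raw corpus. The operator names, the length-based quality heuristic, and the z-score threshold are all hypothetical illustrations for this announcement, not the speaker's actual system.

```python
# Hypothetical sketch of a data-quality pipeline: each stage takes a list of
# text documents and returns a filtered list. Real pipelines use far richer
# operators (clustering, transformation, annotation) at much larger scale.
from statistics import mean, stdev


def dedupe(corpus):
    """Drop exact-duplicate documents, keeping the first occurrence."""
    seen, kept = set(), []
    for doc in corpus:
        if doc not in seen:
            seen.add(doc)
            kept.append(doc)
    return kept


def drop_length_outliers(corpus, z_max=2.0):
    """Discard documents whose length is a z-score outlier (crude heuristic)."""
    lengths = [len(doc) for doc in corpus]
    mu, sigma = mean(lengths), stdev(lengths)
    if sigma == 0:  # all documents are the same length; nothing to filter
        return list(corpus)
    return [doc for doc, n in zip(corpus, lengths) if abs(n - mu) / sigma <= z_max]


def build_training_subset(corpus):
    """Chain the operators: each stage filters samples judged 'bad'."""
    return drop_length_outliers(dedupe(corpus))
```

A pipeline like this generalizes poorly across diverse sources — a length threshold tuned for web text misfires on code or tables — which is exactly the scaling difficulty the talk addresses.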
Bio: Ritesh is an Applied Scientist at Amazon. He is a founding member of the team responsible for developing Rufus -- the world's first trained-from-scratch, shopping-focused large language model to serve real customer traffic at Amazon-scale. He obtained his Ph.D. from The Ohio State University in 2022. Ritesh's research focus lies at the intersection of MLOps and multimodal data mining. He has published multiple papers on related topics in peer-reviewed journals and conferences, including NeurIPS, SIGMOD, VLDB, and IJCAI. His work has been recognized with several patents granted by the IPO and USPTO.