Fixing models by fixing datasets: A note from Feature Store Summit 2021

Yik San Chan, notes, ml-sys
This article is a write-up of a talk given by Atindriyo Sanyal at Feature Store Summit 2021. Atindriyo is the co-founder and president of Galileo. He previously led Machine Learning Data Quality and Observability efforts at Uber and is the co-architect of the world’s first Feature Store - Michelangelo Palette.

Introduction

ML has a simple equation: prediction = model(data). To get better predictions, we need both better models and better data. While model architectures are maturing quickly, tooling to ensure data quality has not caught up yet. In his talk, Atindriyo introduced common data quality challenges and how to tackle them.

Data quality challenges

Here are the data quality challenges faced across the data lifecycle.

Find and curate datasets

The goal here is to find and curate data that is high-value and representative, and that gives us the maximum lift with the minimum amount of data.

To find such datasets, we could filter features by their redundancy with one another and by their relevance to the labels (see Uber's practice).
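To make the redundancy and relevance filter concrete, here is a minimal sketch. It is not Uber's method: it uses absolute Pearson correlation as a simple stand-in for both measures, and the feature names, data, and thresholds (`min_relevance`, `max_redundancy`) are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_features(features, label, min_relevance=0.3, max_redundancy=0.9):
    """Greedy filter: keep features relevant to the label, skip near-duplicates.

    features: dict mapping a feature name to its column of values.
    Thresholds are illustrative assumptions, not recommended defaults.
    """
    # Relevance: how strongly each feature correlates with the label.
    relevance = {name: abs(pearson(col, label)) for name, col in features.items()}
    ranked = sorted(relevance, key=relevance.get, reverse=True)

    selected = []
    for name in ranked:
        if relevance[name] < min_relevance:
            continue  # too weakly related to the label
        # Redundancy: skip a feature that nearly duplicates one already kept.
        if any(abs(pearson(features[name], features[kept])) > max_redundancy
               for kept in selected):
            continue
        selected.append(name)
    return selected


# Hypothetical toy data: two near-duplicate distance features, one useful
# feature, and one feature that is mostly noise.
label = [1, 2, 3, 4, 5, 6]
distance_miles = [1.0, 2.1, 2.9, 4.2, 5.1, 5.8]
features = {
    "distance_miles": distance_miles,
    "distance_km": [d * 1.609 for d in distance_miles],
    "hour": [2, 1, 4, 3, 6, 5],
    "noise": [3, 1, 4, 1, 5, 2],
}
selected = select_features(features, label)
print(selected)
```

On this toy data the filter keeps only one of the two distance features, keeps `hour`, and drops `noise`. A real pipeline would likely use a measure that also captures non-linear relationships, such as mutual information.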

To curate the datasets, there are at least 2 ways.

Identify problems in datasets

Once we have the datasets, we want to identify problems in them. Atindriyo did not define each problem in this part of the talk, and I have little context on the topic, so what follows is simply a transcription of what he said.

Dataset problems include regions of model underperformance, robustness across sub-populations, similar and dissimilar examples, and noisy data and labels. What can we do to eliminate these problems?
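The talk did not go into detail here, so the following is only my own sketch of one way to surface the first two problems: compute accuracy per slice of the data and flag the slices that fall well below the overall accuracy. The row layout (`label`, `prediction`, `city`) and the `tolerance` value are assumptions for illustration.

```python
from collections import defaultdict


def accuracy_by_slice(rows, slice_key):
    """Return {slice value: accuracy} for rows holding 'label' and 'prediction'."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        key = row[slice_key]
        total[key] += 1
        correct[key] += int(row["label"] == row["prediction"])
    return {key: correct[key] / total[key] for key in total}


def underperforming_slices(rows, slice_key, tolerance=0.1):
    """Slices whose accuracy is more than `tolerance` below the overall accuracy."""
    overall = sum(r["label"] == r["prediction"] for r in rows) / len(rows)
    per_slice = accuracy_by_slice(rows, slice_key)
    return {key: acc for key, acc in per_slice.items() if acc < overall - tolerance}


# Hypothetical evaluation rows: the model is right on every SF example
# but on only one of the four NYC examples.
rows = (
    [{"city": "SF", "label": 1, "prediction": 1} for _ in range(4)]
    + [{"city": "NYC", "label": 1, "prediction": 1}]
    + [{"city": "NYC", "label": 1, "prediction": 0} for _ in range(3)]
)
weak = underperforming_slices(rows, "city")
print(weak)
```

With real data, small slices give noisy accuracy estimates, so a minimum slice size or a confidence interval would be a sensible addition.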

Detect data quality problems in production

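Problems include train-serve skew (feature values at serving time differ from those seen in training) and feature and label drift (distributions shift over time), among others.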
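As an illustration of how skew or drift in a single numeric feature might be detected, here is a minimal sketch of the Population Stability Index (PSI), which compares the binned distribution of a feature in training against the one seen in production. The talk did not prescribe this metric. The bin count and the `eps` floor are my assumptions.

```python
from math import log


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference and a live sample.

    Bins are equal-width over the range of `expected`. Values of `actual`
    outside that range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the feature is constant

    def bin_fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor at eps so empty bins do not cause log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature values: serving data identical to training,
# and serving data shifted upwards by 30.
training = [float(v) for v in range(100)]
same = list(training)
shifted = [v + 30 for v in training]

psi_same = psi(training, same)
psi_shifted = psi(training, shifted)
print(psi_same, psi_shifted)
```

A common rule of thumb treats a PSI above roughly 0.25 as a significant shift, though the right threshold depends on the feature and the use case. The same comparison applied to label distributions gives a rough check for label drift.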

My take

Data quality in ML is not discussed enough, given how fundamental it is to making a model useful in practice. I cannot wait to see the launch of Galileo, and if you know of any other startups working on this problem, please let me know!

© Yik San Chan.