Are You Collecting Data the Right Way? The Value of Budget, Segmentation, and Expertise.
In practice, large institutions, from humanitarian agencies to GenAI platforms, run complex data pipelines in which a single dataset serves many use cases. Optimizing data collection for one use can quietly degrade its reliability for others. This raises a fundamental question: how should institutions design data-collection operations so that the data they gather is reliably useful across use cases?
In this talk, we develop a reliability-aware framework for adaptive data collection. We study the canonical problem of estimating group-level means under a fixed sampling budget. The main theoretical contribution is a sharp characterization of what makes data collection intrinsically hard or easy. We derive tight, instance-dependent regret bounds that identify a “reliability frontier,” showing how achievable reliability is governed jointly by the budget, how uncertainty is distributed across groups, and how informative the chosen statistical model is.
We apply this framework to real FAO “Data in Emergencies” microdata from Afghanistan. Relative to current uniform sampling practice, the adaptive design reduces the required number of interviews by approximately 75%, while coming within 10% of the (unobservable) optimal benchmark.
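The gap between uniform and adaptive designs can be illustrated with a toy sketch. This is not the talk's actual algorithm; it uses a simple two-phase, Neyman-style rule (pilot uniformly, estimate each group's spread, then allocate the remaining budget in proportion to the estimated standard deviations), and all group variances and budget figures below are illustrative:

```python
# Toy sketch (illustrative only): two-phase adaptive sampling for
# group-mean estimation under a fixed budget, compared with uniform
# allocation. All parameters are made up for the example.
import numpy as np

rng = np.random.default_rng(0)

# Three groups with very different outcome variability.
group_sds = np.array([0.5, 2.0, 8.0])
true_means = np.array([1.0, 3.0, 5.0])
budget = 600  # total number of interviews we can afford

def draw(g, n):
    """Sample n outcomes from group g."""
    return rng.normal(true_means[g], group_sds[g], size=n)

def worst_group_se(alloc):
    """Largest (theoretical) standard error of a group mean under an allocation."""
    return max(group_sds[g] / np.sqrt(alloc[g]) for g in range(len(alloc)))

# Uniform design: split the budget evenly across groups.
uniform = np.full(3, budget // 3)

# Adaptive design: spend a small uniform pilot, estimate each group's
# spread, then give the rest of the budget to groups in proportion to
# their estimated standard deviation (equalizing standard errors).
pilot = 30  # interviews per group in the pilot phase
sd_hat = np.array([draw(g, pilot).std(ddof=1) for g in range(3)])
remaining = budget - 3 * pilot
extra = np.floor(remaining * sd_hat / sd_hat.sum()).astype(int)
adaptive = pilot + extra

print("uniform worst-group SE :", round(worst_group_se(uniform), 3))
print("adaptive worst-group SE:", round(worst_group_se(adaptive), 3))
```

Under the same budget, the adaptive design concentrates interviews on the high-variance group, so its worst-group standard error is lower; equivalently, it can match the uniform design's reliability with far fewer interviews.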

