STOC2023

What Makes a Good Fisherman? Linear Regression under Self-Selection Bias

Yeshwanth Cherapanamjeri, Constantinos Daskalakis, Andrew Ilyas, Manolis Zampetakis

4 citations

Abstract

In the classical setting of self-selection, the goal is to learn k models simultaneously, from observations (x(i), y(i)) where y(i) is the output of one of k underlying models on input x(i). In contrast to mixture models, where we observe the output of a randomly selected model (and therefore the selection of which model is observed is exogenous), in self-selection models the observed model depends on the realized outputs of the underlying models themselves, as determined by some known selection criterion (e.g., we might observe the highest output, the smallest output, or the median output of the k models), and is thus endogenous. In known-index self-selection, the identity of the observed model output is observable; in unknown-index self-selection, it is not. Self-selection has a long history in Econometrics (going back to the works of Roy, Gronau, Lewis, Heckman, and others) and many applications in various theoretical and applied fields, including treatment effect estimation, imitation learning, learning from strategically reported data, and learning from markets at disequilibrium.