WWW2025

The Cost of Balanced Training-Data Production in an Online Data Market

Augustin Chaintreau, Roland Maio, Juba Ziani

Cited by 1

Abstract

Many ethical issues in machine learning are connected to the training data. Online data markets are an important source of training data, facilitating both production and distribution. Recently, a trend has emerged of for-profit "ethical" participants in online data markets. This trend raises a fascinating question: Can online data markets sustainably and efficiently address ethical issues in the broader machine-learning economy? In this work, we study this question in a stylized model of an online data market. We investigate the effects of intervening in the data market to achieve balanced training-data production. The model reveals the crucial role of market conditions. In small and emerging markets, an intervention can drive the data producers out of the market, so that the cost of fairness is maximal. Yet, in large and established markets, the cost of fairness can vanish (as a fraction of overall welfare) as the market grows. Our results suggest that "ethical" online data markets can be economically feasible under favorable market conditions, and motivate more models to consider the role of data production and distribution in mediating the impacts of ethical interventions.

CCS CONCEPTS

• Theory of computation → Market equilibria; • Information systems → E-commerce infrastructure; World Wide Web; • Computing methodologies → Machine learning; Model development and analysis; • Social and professional topics → Computing / technology policy.

Summary of contributions. We investigate this question in a stylized model of an online data market. Our main contributions are as follows.

• We revisit a well-known model of Agarwal et al. [2]. Our modeling contribution is to formulate a specialized variant