WWW2025

Linear-Time Algorithms for Representative Subset Selection From Data Streams

Shuang Cui, Kai Han, Jing Tang

2 citations

Abstract

Representative subset selection from data streams is a critical problem with wide-ranging applications in web data mining and machine learning, such as social media marketing, big data summarization, and recommendation systems. This problem is often framed as maximizing a monotone submodular function subject to a knapsack constraint, where each data element in the stream has an associated cost, and the goal is to select elements within a budget B to maximize revenue. However, existing algorithms typically rely on restrictive assumptions about the costs of data elements, and their performance bounds heavily depend on the budget B. As a result, these algorithms are only effective in limited scenarios and have super-linear time complexity, making them unsuitable for large-scale data streams. In this paper, we introduce the first linear-time streaming algorithms for this problem, without any assumptions on the data stream, while also minimizing memory usage. Specifically, our single-pass streaming algorithm achieves an approximation ratio of 1/8-ε under O (n) time complexity and O(k log 1/ε) space complexity, where k is the largest cardinality of any feasible solution. Our multi-pass streaming algorithm improves this to a (1/2-ε)-approximation using only three passes over the stream, with O (n/ε log 1/ε) time complexity and O(k/ε log 1/ε) space complexity. Extensive experiments across various applications related to web data mining and social media marketing demonstrate the superiority of our algorithms in terms of both effectiveness and efficiency.