NeurIPS2023
PAC Learning Linear Thresholds from Label Proportions
Anand Brahmbhatt, Rishi Saket, Aravindan Raghuveer
11 citations
Abstract
Learning from label proportions (LLP) is a generalization of supervised learning in which the training data is available as sets or bags of feature-vectors (instances) along with the average instance-label of each bag. The goal is to train a good instance classifier. While most previous works on LLP have focused on training models on such training data, computational learnability of LLP was only recently explored by [25, 26] who showed worst case intractability of properly learning linear threshold functions (LTFs) from label proportions. However, their work did not rule out efficient algorithms for this problem on natural distributions. In this work we show that it is indeed possible to efficiently learn LTFs using LTFs when given access to random bags of some label proportion in which feature-vectors are, conditioned on their labels, independently sampled from a Gaussian distribution N(µ, Σ). Our work shows that a certain matrix -formed using covariances of the differences of feature-vectors sampled from the bags with and without replacement -necessarily has its principal component, after a transformation, in the direction of the normal vector of the LTF. Our algorithm estimates the means and covariance matrices using subgaussian concentration bounds which we show can be applied to efficiently sample bags for approximating the normal direction. Using this in conjunction with novel generalization error bounds in the bag setting, we show that a low error hypothesis LTF can be identified. For some special cases of the N(0, I) distribution we provide a simpler mean estimation based algorithm. We include an experimental evaluation of our learning algorithms along with a comparison with those of [25, 26] and random LTFs, demonstrating the effectiveness of our techniques. * -equal contribution a distribution on (x, f (x)), a hypothesis h ∈ H of arbitrarily high accuracy on that distribution, for any unknown f ∈ C. If H = C we say that C is properly learnable, for e.g. LTFs are known to be properly learnable using linear programming ([3]). This notion can be extended to the LLP setting -which for brevity we call PAC-LLP -as follows: distribution D is over bags and their label proportions (B, σ(B, f )) where B = x 1 , . . . , x q is a bag of feature vectors and σ(B, f ) = Avg f (x) | x ∈ B. A bag (B, σ(B, f )) is said to be satisfied by h iff σ(B, h) = σ(B, f ), and the accuracy of h is the fraction of bags satisfied by it. With the above notion of PAC-LLP, [25, 26] studied the learnability of LTFs and rather disturbingly showed that for any constant ε > 0 it is NP-hard to PAC-LLP learn an LTF using an LTF which satisfies (1/q + ε)fraction of the bags when all bags are of size at most q. This is in contrast to the supervised learning (i.e, with unit-sized bags) in which an LTF can be efficiently learnt by an LTF using linear programming. Their work also gave a convex programming algorithms to find an LTFs satisfying (2/5)-fraction of bags of size ≤ 2, (1/12)-fraction of bags of size ≤ 3. While these results show that PAC-LLP learning LTFs using LTFs is intractable on hard bag distributions, they raise the question of whether the problem is tractable on natural distributions that may arise out of real world scenarios. We answer the above question in the affirmative when the feature-vectors are distributed according to some (unknown) Gaussian distribution D = N(µ, Σ) in d-dimensions. Gaussian distributions are ubiquitous in machine learning and in many applications the input data distribution is modeled as multivariate Gaussians, and several previous works [8, 30] have studied learnability in Gaussian distributions. An unkown target LTF is given by f (x) := pos r T * x + c * where r * 2 = 1. Let D a be the distribution of x ← D conditioned on f (x) = a, for a ∈ 0, 1. Using this we formalize the notion of a distribution O on bags of size q and average label k/q: a random bag B sampled from O consists of k iid samples from D 1 and (qk) iid samples from D 0 . The case of k ∈ 0, q is uninteresting as all instances in such bags are either labeled 0 or 1 and traditional PAC-learning for LTFs can be employed directly. Unlike [25, 26] our objective is to directly maximize the instance-level level accuracy on D. With this setup we informally describe our main result. Our PAC-LLP LTF Learner (Informal): Assuming mild conditions on Σ, µ and c * , for any q, k ∈ Z + s.t. 1 ≤ k ≤ q -1 and ε, δ > 0, there is an algorithm that samples at most m bags from O and runs in time O(t + m) and with probability 1δ produces an LTF h s.t. where t, m are fixed polynomials in d, q, (1/ε), log(1/δ). We also obtain a more efficient algorithm when k = q/2, µ = 0, c * = 0 and Σ = I. The ambiguity in the case of k = q/2 is inherent since bags of label proportion 1/2 consistent with an LTF f (x) are also consistent with (1f (x)). Remark 1.1 (Mixtures of (q, k)) The training data could consist of bags of different sizes and label proportions