NeurIPS 2020

Continuous Meta-Learning without Tasks

James Harrison, Apoorva Sharma, Chelsea Finn, Marco Pavone

Cited by 83

Abstract

Meta-learning is a promising strategy for learning to efficiently learn using data gathered from a distribution of tasks. However, the meta-learning literature thus far has focused on the task-segmented setting, where at train time, offline data is assumed to be split according to the underlying task, and at test time, the algorithms are optimized to learn in a single task. In this work, we enable the application of generic meta-learning algorithms to settings where this task segmentation is unavailable, such as continual online learning with unsegmented time series data. We present meta-learning via online changepoint analysis (MOCA), an approach which augments a meta-learning algorithm with a differentiable Bayesian changepoint detection scheme. The framework allows both training and testing directly on time series data without segmenting it into discrete tasks. We demonstrate the utility of this approach on three nonlinear meta-regression benchmarks as well as two meta-image-classification benchmarks.

Preliminaries

Meta-Learning. The core idea of meta-learning is to directly optimize the few-shot learning performance of a machine learning model over a distribution of learning tasks, such that this learning performance generalizes to other tasks from this distribution. A meta-learning method consists of two phases: meta-training and online adaptation. Let $\theta$ be the parameters of this model learned in meta-training. During online adaptation, the model uses context data $D_t = (x_{1:t}, y_{1:t})$ from within one task to compute statistics $\eta_t = f_\theta(D_t)$, where $f$ is a function parameterized by $\theta$. For example, in MAML [10], the statistics are the neural network weights after gradient updates computed using $D_t$. For recurrent network-based meta-learning algorithms, these statistics correspond to the hidden state of the network. For a simple nearest-neighbors model, $\eta$ may simply be the context data.
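To make the adaptation map $\eta_t = f_\theta(D_t)$ concrete, the following is a minimal sketch under stated assumptions, not the method used in the paper. It assumes a one-dimensional Bayesian linear model on a hypothetical feature map `feature(theta, x) = tanh(theta * x)`, so that the statistics are just two running sums. The names `Stats`, `feature`, `init_stats`, `update_stats`, `adapt`, and `posterior_mean_weight` are illustrative and do not come from the source.

```python
import math
from typing import NamedTuple, Sequence, Tuple


class Stats(NamedTuple):
    """Statistics eta_t for a 1-D Bayesian linear model (illustrative)."""
    precision: float  # prior precision plus the sum of phi(x)^2
    moment: float     # sum of phi(x) * y


def feature(theta: float, x: float) -> float:
    """Hypothetical feature map phi_theta(x); theta is the meta-learned parameter."""
    return math.tanh(theta * x)


def init_stats(prior_precision: float = 1.0) -> Stats:
    """eta_0: the statistics before any context data from the task is seen."""
    return Stats(precision=prior_precision, moment=0.0)


def update_stats(theta: float, eta: Stats, x: float, y: float) -> Stats:
    """Fold one context pair (x, y) into the statistics."""
    phi = feature(theta, x)
    return Stats(eta.precision + phi * phi, eta.moment + phi * y)


def adapt(theta: float, context: Sequence[Tuple[float, float]]) -> Stats:
    """eta_t = f_theta(D_t): apply the update to every pair in the context data."""
    eta = init_stats()
    for x, y in context:
        eta = update_stats(theta, eta, x, y)
    return eta


def posterior_mean_weight(eta: Stats) -> float:
    """Posterior mean of the single regression weight implied by eta."""
    return eta.moment / eta.precision


if __name__ == "__main__":
    context = [(0.5, 1.0), (-1.0, -2.0), (2.0, 1.5)]
    eta_t = adapt(0.8, context)
    print(eta_t, posterior_mean_weight(eta_t))
```

In this sketch the statistics are sums, so they do not depend on the order of the context data and can be updated one observation at a time. That is a property of this particular model choice; MAML-style or recurrent adaptation rules are generally order-dependent.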
The model then makes predictions by using these statistics to define a conditional distribution on $y$ given new inputs $x$, which we write $y \mid x, D_t \sim p_\theta(y \mid x, \eta_t)$. Adopting a Bayesian perspective, we refer to $p_\theta(y \mid x, \eta_t)$ as the posterior predictive distribution. The performance of this model on this task can be evaluated through the negative log-likelihood of task data under this posterior predictive distribution,

$$\mathcal{L}(D_t, \theta) = \mathbb{E}_{x, y \sim \mathcal{T}_i}\left[-\log p_\theta(y \mid x, \eta_t)\right].$$

Meta-learning algorithms, broadly, aim to optimize the parameters $\theta$ such that the model performs well across a distribution of tasks,

$$\min_\theta \; \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})}\left[\mathbb{E}_{D_t \sim \mathcal{T}_i}\left[\mathcal{L}(D_t, \theta)\right]\right].$$

Across most meta-learning algorithms, both the update rule $f_\theta(\cdot)$ and the prediction function are chosen to be differentiable operations, such that the parameters can be optimized via stochastic gradient descent. Given a dataset