CVPR2020

Advisable Learning for Self-Driving Vehicles by Internalizing Observation-to-Action Rules

Jinkyu Kim, Suhong Moon, Anna Rohrbach, Trevor Darrell, John F. Canny

Abstract

Deep neural perception and control networks are likely to be a key component of self-driving vehicles. These models need to be explainable: they should provide easy-to-interpret rationales for their behavior so that passengers, insurance companies, law enforcement, developers, etc., can understand what triggered a particular behavior. Explanations may be triggered by the neural controller (introspective explanations) or informed by the neural controller's output (rationalizations). Our work has focused on the challenge of generating introspective explanations of deep models for self-driving vehicles. In Chapter 3, we begin by exploring the use of visual explanations. These explanations take the form of real-time highlighted regions of an image that causally influence the network's output (steering control). In the first stage, we use a visual attention model to train a convolutional network end-to-end from images to steering angle. The attention model highlights image regions that potentially influence the network's output. Some of these are true influences, but some are spurious. We then apply a causal filtering step to determine which input regions actually influence the output. This produces more succinct visual explanations and more accurately exposes the network's behavior. In Chapter 4, we add an attention-based video-to-text model to produce textual explanations of model actions, e.g., "the car slows down because the road is wet". The attention maps of the controller and the explanation model are aligned so that explanations are grounded in the parts of the scene that mattered to the controller. We explore two approaches to attention alignment: strong and weak alignment. These explainable systems represent an externalization of tacit knowledge. The network's opaque reasoning is simplified to a situation-specific dependence on a visible object in the image.
This makes such systems brittle and potentially unsafe in situations that do not match the training data. In Chapter 5, we propose to address this issue by augmenting training data with natural language advice from a human. Advice includes guidance about what to do and where to attend. We present the first step toward advice-giving, where we train an end-to-end

Acknowledgments

I would like to thank my advisor, John Canny, for his patience, support, and motivation throughout my PhD studies. With his guidance, I have been able to discover a research field I am truly passionate about. Additionally, I would like to thank Anca Dragan for serving on my qualification committee, and Trevor Darrell and David Whitney for serving on both my qualification and dissertation committees. I would also like to thank all those who have mentored me through my PhD, in particular