AAAI2024

Abstract and Explore: A Novel Behavioral Metric with Cyclic Dynamics in Reinforcement Learning

Anjie Zhu, Peng-Fei Zhang, Ruihong Qiu, Zetao Zheng, Zi Huang, Jie Shao

1 citation

Abstract

The twenty years since the publication of the first edition of this book have seen tremendous progress in artificial intelligence, propelled in large part by advances in machine learning, including advances in reinforcement learning. Although the impressive computational power that became available is responsible for some of these advances, new developments in theory and algorithms have been driving forces as well. In the face of this progress, a second edition of our 1998 book was long overdue, and we finally began the project in 2012. Our goal for the second edition was the same as our goal for the first: to provide a clear and simple account of the key ideas and algorithms of reinforcement learning that is accessible to readers in all the related disciplines. The edition remains an introduction, and we retain a focus on core, online learning algorithms. This edition includes some new topics that rose to importance over the intervening years, and we expanded coverage of topics that we now understand better. But we made no attempt to provide comprehensive coverage of the field, which has exploded in many di↵erent directions. We apologize for having to leave out all but a handful of these contributions. As in the first edition, we chose not to produce a rigorous formal treatment of reinforcement learning, or to formulate it in the most general terms. However, our deeper understanding of some topics since the first edition required a bit more mathematics to explain; we have set o↵ the more mathematical parts in shaded boxes that the nonmathematically-inclined may choose to skip. We also use a slightly di↵erent notation than was used in the first edition. In teaching, we have found that the new notation helps to address some common points of confusion. It emphasizes the di↵erence between random variables, denoted with capital letters, and their instantiations, denoted in lower case. For example, the state, action, and reward at time step t are denoted S t , A t , and R t , while their possible values might be denoted s, a, and r. Along with this, it is natural to use lower case for value functions (e.g., v ⇡ ) and restrict capitals to their tabular estimates (e.g., Q t (s, a)). Approximate value functions are deterministic functions of random parameters and are thus also in lower case (e.g., v(s,w t ) ⇡ v ⇡ (s)). Vectors, such as the weight vector w t (formerly ✓ t ) and the feature vector x t (formerly t ), are bold and written in lowercase even if they are random variables. Uppercase bold is reserved for matrices. In the first edition we used special notations, P a ss 0 and R a ss 0 , for the transition probabilities and expected rewards. One weakness of that notation is that it still did not fully characterize the dynamics of the rewards, giving only their expectations, which is su cient for dynamic programming but not for reinforcement learning. Another weakness xiii xiv Preface to the Second Edition is the excess of subscripts and superscripts. In this edition we use the explicit notation of p(s 0 , r|s, a) for the joint probability for the next state and reward given the current state and action. All the changes in notation are summarized in a table on page xix. The second edition is significantly expanded, and its top-level organization has been changed. After the introductory first chapter, the second edition is divided into three new parts. The first part (Chapters 2-8) treats as much of reinforcement learning as possible without going beyond the tabular case for which exact solutions can be found. We cover both learning and planning methods for the tabular case, as well as their unification in n-step methods and in Dyna. Many algorithms presented in this part are new to the second edition, including UCB, Expected Sarsa, Double learning, tree-backup, Q( ), RTDP, and MCTS. Doing the tabular case first, and thoroughly, enables core ideas to be developed in the simplest possible setting. The second part of the book (Chapters 9-13) is then devoted to extending the ideas to function approximation. It has new sections on artificial neural networks, the fourier basis, LSTD, kernel-based methods, Gradient-TD and Emphatic-TD methods, average-reward methods, true online TD( ), and policygradient methods. The second edition significantly expands the treatment of o↵-policy learning, first for the tabular case in Chapters 5-7, then with function approximation in Chapters 11 and 12. Another change is that the second edition separates the forward-view idea of n-step bootstrapping (now treated more fully in Chapter 7) from the backwardview idea of eligibility traces (now treated independently in Chapter 12). The third part of the book has large new chapters on reinforcement learning's relationships to psychology (Chapter 14) and neuroscience (Chapter 15), as well as an updated case-studies chapter including Atari game playing, Watson's wagering strategy, and the Go playing programs AlphaGo and AlphaGo Zero (Chapter