Introduction to Reinforcement Learning

A course on reinforcement learning.

Introduction to Reinforcement Learning (Fall 2025)

You can find the Spring 2021 version of this course here.

This is an introductory course on reinforcement learning (RL) and sequential decision-making under uncertainty, with an emphasis on understanding the theoretical foundations. We study how dynamic programming methods such as value and policy iteration can be used to solve sequential decision-making problems with known models, and how those approaches can be extended to solve reinforcement learning problems, where the model is unknown. Other topics include, but are not limited to, function approximation in RL, policy gradient methods, model-based RL, and balancing the exploration-exploitation trade-off. The course will be delivered as a mix of lectures and readings of classic and recent papers assigned to students. As the emphasis is on understanding the foundations, you should expect to work through mathematical details and proofs. Required background includes being comfortable with probability theory and statistics, calculus, linear algebra, optimization, and (supervised) machine learning.
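To make the two regimes concrete, below is a minimal, self-contained sketch (Python/NumPy) that solves a made-up three-state MDP twice: once by value iteration with the model (P, R) in hand, and once by tabular Q-learning from sampled transitions only. Everything here, the toy MDP, the hyperparameters, and the variable names, is an illustrative assumption, not course code.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up 3-state, 2-action MDP (illustrative only, not from the course).
n_states, n_actions, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # R[s, a]

# Planning with a known model: value iteration.
# Repeatedly apply the Bellman optimality operator
#   (T V)(s) = max_a [ R(s, a) + gamma * sum_{s'} P(s' | s, a) V(s') ].
V = np.zeros(n_states)
for _ in range(1000):
    Q_star = R + gamma * P @ V            # action-values under the current V
    V_new = Q_star.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:  # stop when the sup-norm change is tiny
        break
    V = V_new

# Learning with an unknown model: tabular Q-learning.
# The agent only sees sampled transitions (s, a, r, s'), never P or R directly.
Q = np.zeros((n_states, n_actions))
alpha, eps, s = 0.1, 0.1, 0
for _ in range(50_000):
    a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
    s_next = rng.choice(n_states, p=P[s, a])  # environment draws s' ~ P(. | s, a)
    r = R[s, a]
    # Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a').
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print("value iteration V*: ", np.round(V_new, 3))
print("Q-learning  max_a Q:", np.round(Q.max(axis=1), 3))
```

The two estimates should roughly agree; the only structural difference is where the expectation over next states comes from: computed exactly from P in the planning loop versus estimated from samples in the learning loop.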


Announcements:


Teaching Staff

Time and Location:

Reading

The course material is based on Foundations of Reinforcement Learning. This is a live document that will change as we progress through the course. If you find a typo or mistake, please let me know. I collect the list of reported ones here.

Some other useful textbooks (incomplete list):

Lectures

This is a tentative schedule and will change adaptively.

Note on videos: The videos will be publicly available on YouTube. If you don’t feel comfortable being recorded, make sure to turn off your camera when asking questions (though I really prefer to see all of your faces when presenting a lecture, so it doesn’t feel like I am talking to a void!).


| Week (date) | Topics | Lectures | Reading |
| --- | --- | --- | --- |
| 1 (Aug 25) | Introduction to Reinforcement Learning (Part I) | slides (Intro), video (Intro), slides (Intro – annotated – Part I) | Chapter 1 of FRL |
| 1’ (Sept 1) | (No lecture) Tutorials: Math Background Review (Probability, Linear Algebra, Optimization) | | |
| 2 (Sept 8) | Introduction to Reinforcement Learning (Part II); Tutorial: Q-Learning | slides (Intro – annotated – Part II), Tutorial: Q-Learning (incomplete), Tutorial: Q-Learning (complete) | |
| 3 (Sept 15) | Structural Properties of Markov Decision Processes (Part I); Tutorial (online): PyTorch | slides (MDP), video (MDP – Part I), video (MDP – Part II), slides (MDP – annotated), Tutorial: PyTorch, Tutorial: TorchRL | Chapter 2 of FRL |
| 4 (Sept 22) | Structural Properties of Markov Decision Processes (Part II) | | |
| 5 (Sept 29) | Planning with a Known Model (Part I); Tutorial (online): RL Environments | slides (Planning), video (Planning), slides (Planning – annotated), RL Environments | Chapter 3 of FRL |
| 6 (Oct 6) | Planning with a Known Model (Part II); Learning from a Stream of Data (Part I) (Friday) | slides (Learning from Stream), video (Learning from Stream – Part I), video (Learning from Stream – Part II), slides (Learning from Stream – annotated) | Chapter 4 of FRL |
| 7 (Oct 20) | Learning from a Stream of Data (Part II); Tutorial: RL in Brain | Tutorial: A Neural Substrate of Prediction and Reward | |
| 8 (Oct 27) | Value Function Approximation (Part I) | slides (VFA), video (VFA – Part I), video (VFA – Part II), video (VFA – Part III) | Chapter 5 of FRL |
| 9 (Nov 3) | Value Function Approximation (Part II) | | |
| 10 (Nov 10) | Value Function Approximation (Part III) | | |
| 11 (Nov 17) | Policy Search Methods (Part I); Policy Search Methods (Part II – Guest Lecture) | slides (PS), video (PS – Part I), video (PS – Part II) | Chapter 6 of FRL |
| 12 (Nov 24) | Model-based RL | slides | Chapter 7 of FRL |
| 13 (Dec 1) | Presentations | | |

Assignments and Coursework

These are the main components of the course. The details are described below. You need to use … to submit your solutions.

Homework Assignments

There will be five homework assignments. Your grade will be the average of your top four. Details will be posted.

This is a tentative schedule of the homework assignments. We release each homework as soon as the corresponding lecture is finished, and it will be due two weeks after that. The deadline is 16:59. The exact dates may change depending on the pace of lectures.

| Homework # | Out | Due | Materials | TA Office Hours |
| --- | --- | --- | --- | --- |
| Homework 1 | Sept 23 | Oct 7 | Questions | Friday (Sept 26, Oct 3) |
| Homework 2 | Oct 21 | Nov 5 | Questions, Code | Friday (Oct 24, Oct 31) |
| Homework 3 | Nov ~10 | Nov ~24 | Questions, Code | |
| Homework 4 | Nov ~21 | Dec ~5 | Questions, Code | |

Research Project

Read the instructions here!

Reading Assignments

The following papers are a combination of classic papers in RL, topics that we did not cover in the lectures, and active research areas. You need to choose five (5) papers from the list, depending on your interests. Please read them and try to understand them as much as possible. It is not essential that you understand a paper completely or go through the details of its proofs (if there are any), but you should put some effort into it.

After reading each paper:

It is OK to discuss a paper with a Language Model (LM) after you have read it yourself, but avoid using an LM to write your summary. The act of writing itself tremendously helps you understand the paper better; if you delegate it to a machine, you lose the opportunity to learn and solidify your understanding. I also suggest that you avoid using an LM to come up with ideas: this is a practice for your creativity, not a machine’s.

These five assignments contribute 10% to your final mark; the reading assignments are only lightly evaluated. You should submit the summaries in two batches. The first batch should include two papers and is due on November 21st. The second batch should include the other three papers and is due on December 12th. In both cases, submit a single PDF file before 5PM.

We will post the papers as the course progresses. Please read and summarize them as we post them, so you won’t have a large workload close to the end of the semester.

Note that this is an incomplete and biased list; I have many favourite papers that are not included in this short list.

Legend: