## CSL Visitors Scholar Lecture: Partially Observed Stochastic Control: Filter Stability, Near Optimality of Finite-Window Policies and their Q-learning Convergence

- Event Type: Lecture
- Sponsor: Coordinated Science Lab
- Location: CSL Auditorium (B02)
- Date: Oct 25, 2021, 3:00 pm
- Speaker: Professor Serdar Yüksel, Queen's University, Canada
We study approximation results for optimization of partially observed stochastic control models (POMDPs) under finite memory/window policies and present explicit rates of convergence (in memory length). We then present an associated finite-memory Q-learning algorithm and show its convergence to the near-optimal policy corresponding to the finite memory approximation.

Key to our analysis are controlled filter stability and stochastic non-linear observability. Filter stability refers to the correction of an incorrectly initialized filter for a partially observed dynamical system as measurements accumulate. This problem has been studied extensively in the control-free context, but outside the classical Kalman filter setup few studies exist for the controlled case. We present explicit conditions for filter stability under a new definition of stochastic observability, as well as a rate-of-convergence result for stability via a joint Dobrushin coefficient analysis involving both the measurement and transition kernels.
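The filter-stability phenomenon can be illustrated numerically. Below is a minimal sketch, assuming a hypothetical two-state hidden Markov model (the kernels `T` and `O` are invented for illustration, not taken from the talk): two Bayes filters, one initialized correctly and one not, process the same measurement sequence, and the total-variation gap between them shrinks. The last line computes the Dobrushin ergodicity coefficient of the transition kernel, the quantity whose joint analysis with the measurement kernel drives the rate result mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state HMM (illustrative only):
# T[i, j] = P(x' = j | x = i), O[i, y] = P(y | x = i).
T = np.array([[0.7, 0.3],
              [0.4, 0.6]])
O = np.array([[0.9, 0.1],
              [0.2, 0.8]])

def bayes_update(belief, y):
    """One filter step: predict through T, then correct with the likelihood O[:, y]."""
    predicted = belief @ T
    unnormalized = predicted * O[:, y]
    return unnormalized / unnormalized.sum()

# Run a correctly and an incorrectly initialized filter on the same measurements.
x = 0
belief_true = np.array([1.0, 0.0])    # correct prior
belief_wrong = np.array([0.1, 0.9])   # incorrect prior
tv_gap = []
for _ in range(50):
    x = rng.choice(2, p=T[x])
    y = rng.choice(2, p=O[x])
    belief_true = bayes_update(belief_true, y)
    belief_wrong = bayes_update(belief_wrong, y)
    tv_gap.append(0.5 * np.abs(belief_true - belief_wrong).sum())

# Dobrushin coefficient of T: delta(T) = min_{i,k} sum_j min(T[i,j], T[k,j]).
# delta > 0 gives geometric merging of the two filters at rate at most (1 - delta).
delta = min(np.minimum(T[i], T[k]).sum() for i in range(2) for k in range(2) if i != k)
print("Dobrushin coefficient:", delta)
print("initial vs final TV gap:", tv_gap[0], tv_gap[-1])
```

For this kernel the coefficient is 0.7, so the prediction step alone contracts the gap by a factor of at most 0.3 per step and the two filters merge rapidly regardless of the wrong prior.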

These stability results are then applied to POMDPs to arrive at approximate MDPs whose solutions are near optimal. In POMDPs, the existence of optimal policies has in general been established by reducing the original partially observed stochastic control problem to a fully observed one on the belief space. However, computing a near-optimal policy for this fully observed model is challenging even under approximate finite models. We present an alternative reduction, tailored to our approximation analysis via filter stability, and arrive at an approximate finite MDP model. Finally, we establish the convergence of an associated Q-learning algorithm for control policies that use such a finite history of past observations and control actions (by viewing the finite window as a 'state') and show near optimality of the limit Q functions. As a corollary, this analysis establishes near optimality of classical Q-learning for continuous-state-space MDPs (by lifting them to POMDPs with approximating quantizers viewed as measurement kernels) under mild continuity conditions.
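The "finite window as a state" idea can be sketched as ordinary tabular Q-learning whose lifted state is the tuple of the last N observations and the last N-1 actions. The toy POMDP below (transition, measurement, and reward arrays) is invented for illustration and is not from the talk; the point is only the mechanics of the window update.

```python
from collections import defaultdict

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy POMDP: 2 hidden states, 2 observations, 2 actions.
# T[a] is the transition kernel under action a; O the measurement kernel; R[x, a] the reward.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],     # action 0
              [[0.5, 0.5], [0.6, 0.4]]])    # action 1
O = np.array([[0.85, 0.15],
              [0.1, 0.9]])
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])
gamma, alpha, N = 0.9, 0.1, 2               # discount, step size, window length

# Q-table indexed by the lifted "state": (last N observations, last N-1 actions).
Q = defaultdict(lambda: np.zeros(2))

x = 0
y = int(rng.choice(2, p=O[x]))
window = ((y,), ())
for _ in range(20000):
    obs_hist, act_hist = window
    # Epsilon-greedy exploration over the window-state Q-table.
    a = int(rng.integers(2)) if rng.random() < 0.1 else int(np.argmax(Q[window]))
    r = R[x, a]
    x = int(rng.choice(2, p=T[a, x]))
    y = int(rng.choice(2, p=O[x]))
    # Slide the window: append the new observation/action, keep the most recent entries.
    next_window = ((obs_hist + (y,))[-N:], (act_hist + (a,))[-(N - 1):])
    # Standard Q-learning update on the lifted finite MDP.
    Q[window][a] += alpha * (r + gamma * Q[next_window].max() - Q[window][a])
    window = next_window
```

Since rewards lie in [0, 1], every learned value stays in [0, 1/(1-gamma)]; the talk's result is that, under filter stability, the limit of such window-state Q functions is near optimal for the original POMDP, with error decaying in N.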