Mixed Procedures For Stochastic Optimization

This is part of ongoing work with Professor Felisa Vázquez-Abad of Hunter College. This is the preliminary draft submitted for the WODES 2016 conference. The particularly tricky bit this time around was our decision to use a less trivial example scenario: instead of just a deterministic function plus a normally distributed noise term, we chose to construct a more complicated stochastic function. Crucially, since we wanted to speak to the usefulness of gradient estimation, we wanted a function where IPA derivative estimation could be used.

To make this concrete, let $X(\theta)$ be some random variable controlled by the parameter $\theta$ and let $h$ be some real-valued measurable mapping where $\mathbb{E}\left[h\left(X(\theta)\right)\right]$ is defined for any $\theta\in\Theta$. How do we define $\dfrac{d}{d\theta}\mathbb{E}\left[h\left(X(\theta)\right)\right]$?

One way to think about it is by a loose analogy with weak derivatives: we are looking for a measurable mapping $\psi$ such that $\dfrac{d}{d\theta}\mathbb{E}\left[h\left(X(\theta)\right)\right] = \mathbb{E}\left[\psi\left(h, X(\theta),\theta\right)\right]$. Put this way, if we know that the random variable $h\left(X(\theta)\right)$ is almost surely differentiable in $\theta$ and that swapping expectation and differentiation is permitted (e.g. through a dominated-convergence-type result), then we have an expression for $\dfrac{d}{d\theta}\mathbb{E}\left[h\left(X(\theta)\right)\right]$; indeed:

$$\dfrac{d}{d\theta}\mathbb{E}\left[h\left(X(\theta)\right)\right] = \mathbb{E}\left[\dfrac{d}{d\theta}h\left(X(\theta)\right)\right] = \mathbb{E}\left[\left(\dfrac{\partial}{\partial\theta}X(\theta)\right) h^\prime\left(X(\theta)\right)\right]$$

where $h^\prime$ is just the usual derivative of $h$. The quantity $\left(\dfrac{\partial}{\partial\theta}X(\theta)\right)$ is called the sample path derivative of $X(\theta)$, and this approach for estimating the derivative of an expected value is called infinitesimal perturbation analysis (IPA).
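As a rough illustration of the chain-rule estimator, here is a minimal sketch on a toy problem that is not the function used in the paper. It assumes $X(\theta) = \theta E$ with $E \sim \mathrm{Exp}(1)$ and $h(x) = x^2$, so that $\mathbb{E}\left[h\left(X(\theta)\right)\right] = 2\theta^2$ and the exact derivative is $4\theta$. The sample path derivative is $\dfrac{\partial}{\partial\theta}X(\theta) = E$, and the function name `ipa_gradient` is made up for this example.

```python
import random


def ipa_gradient(theta, n_samples=100_000, seed=0):
    """IPA estimate of d/dtheta E[h(X(theta))] for a toy model.

    Assumed toy model (not the paper's function):
    X(theta) = theta * E with E ~ Exp(1), and h(x) = x**2.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        e = rng.expovariate(1.0)      # E ~ Exp(1), does not depend on theta
        x = theta * e                 # X(theta)
        dx_dtheta = e                 # sample path derivative of X(theta)
        h_prime = 2.0 * x             # h'(x) for h(x) = x**2
        total += dx_dtheta * h_prime  # one sample of (dX/dtheta) * h'(X)
    return total / n_samples


theta = 1.5
estimate = ipa_gradient(theta)
exact = 4.0 * theta  # d/dtheta of 2 * theta**2
print(f"IPA estimate: {estimate:.3f}, exact derivative: {exact:.3f}")
```

The same random draws `e` are used for both $X(\theta)$ and its sample path derivative, which is what distinguishes IPA from a finite-difference estimate built from two independent simulations.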

In the paper, we needed a function that not only supported IPA but still had a relatively small curvature so that the non-gradient method we were discussing would converge in reasonable time.

Conference paper

Mixed Procedures For Stochastic Optimization - WODES 2016

Tags: stochastic-optimization optimization