Mixed Procedures For Stochastic Optimization - WODES 2016
This is part of ongoing work with Professor Felisa Vázquez-Abad of Hunter College; what follows is the preliminary draft submitted to the WODES 2016 conference. The particularly tricky bit this time around was our decision to use a less trivial example scenario: instead of just a deterministic function plus a normally distributed noise term, we constructed a more complicated stochastic function. Crucially, since we wanted to speak to the usefulness of gradient estimation, the function had to be one where IPA (infinitesimal perturbation analysis) derivative estimation could be applied.
Put another way, let \(X(\theta)\) be some random variable controlled by the parameter \(\theta\) and let \(h\) be some real-valued measurable mapping where \(\mathbb{E}\left[h\left(X(\theta)\right)\right]\) is defined for any \(\theta\in\Theta\). How do we define \(\dfrac{d}{d\theta}\mathbb{E}\left[h\left(X(\theta)\right)\right]\)?
One way to think about it is by a kind of stretchy analogy with weak derivatives: we are looking for a measurable mapping \(\psi\) such that \(\dfrac{d}{d\theta}\mathbb{E}\left[h\left(X(\theta)\right)\right] = \mathbb{E}\left[\psi\left(h, X(\theta),\theta\right)\right]\). Put this way, if we know that the random variable \(h\left(X(\theta)\right)\) is differentiable in \(\theta\) (pathwise) and that swapping expectation and differentiation is permitted (e.g., via some bounded convergence theorem-like result), then we have an expression for \(\dfrac{d}{d\theta}\mathbb{E}\left[h\left(X(\theta)\right)\right]\); indeed:
\(\dfrac{d}{d\theta}\mathbb{E}\left[h\left(X(\theta)\right)\right] = \mathbb{E}\left[\dfrac{d}{d\theta}h\left(X(\theta)\right)\right] = \mathbb{E}\left[\left(\dfrac{\partial}{\partial\theta}X(\theta)\right) h^\prime\left(X(\theta)\right)\right]\)
where \(h^\prime\) is just the usual derivative of \(h\). The quantity \(\dfrac{\partial}{\partial\theta}X(\theta)\) is called the sample path derivative of \(X(\theta)\), and this approach to estimating the derivative of an expected value is called infinitesimal perturbation analysis (IPA).
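To make this concrete, here is a minimal simulation sketch. It is not from the paper; the exponential example, the choice \(h(x) = x^2\), and all names in the code are my own illustrative assumptions. Take \(X(\theta) = -\theta\ln(U)\) with \(U\sim\mathrm{Uniform}(0,1)\), so that \(X(\theta)\) is exponential with mean \(\theta\) and the sample path derivative is \(\dfrac{\partial}{\partial\theta}X(\theta) = -\ln(U) = X(\theta)/\theta\). With \(h(x) = x^2\) we have \(\mathbb{E}\left[h\left(X(\theta)\right)\right] = 2\theta^2\), so the true derivative is \(4\theta\), which gives a closed form to check the IPA estimator against:

```python
import numpy as np

rng = np.random.default_rng(42)

def ipa_estimate(theta, n=100_000):
    """IPA estimate of d/dtheta E[h(X(theta))] where
    X(theta) = -theta * ln(U), U ~ Uniform(0,1)  (exponential, mean theta)
    h(x)     = x**2, so E[h(X(theta))] = 2*theta**2 and the true
    derivative is 4*theta.
    """
    u = 1.0 - rng.random(n)         # U in (0, 1], keeps log(u) finite
    x = -theta * np.log(u)          # sample path of X(theta)
    dx_dtheta = -np.log(u)          # sample path derivative: dX/dtheta = X/theta
    # average of h'(X) * dX/dtheta, with h'(x) = 2x
    return np.mean(2.0 * x * dx_dtheta)

theta = 1.5
print(ipa_estimate(theta))  # ~6.0, i.e. close to 4 * theta
```

Each sample of \(2X\cdot(X/\theta)\) is an unbiased single-run estimate of the derivative; averaging over many replications drives the variance down, and no resimulation at a perturbed \(\theta\) is needed, which is exactly what makes IPA attractive over finite differences.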
In the paper, we needed a function that not only supported IPA but also had relatively small curvature, so that the non-gradient method we were discussing would converge in reasonable time.