This is part of ongoing work with Professor Felisa Vázquez-Abad of Hunter College; it is the preliminary draft submitted to the WODES 2016 conference. The particularly tricky bit this time around was our decision to use a less trivial example scenario: instead of just a deterministic function plus a normally distributed noise term, we constructed a more complicated stochastic function. Crucially, since we wanted to speak to the usefulness of gradient estimation, we needed a function to which IPA derivative estimation could be applied.

Put another way, let $$X(\theta)$$ be some random variable controlled by the parameter $$\theta$$ and let $$h$$ be some real-valued measurable mapping where $$\mathbb{E}\left[h\left(X(\theta)\right)\right]$$ is defined for any $$\theta\in\Theta$$. How do we define $$\dfrac{d}{d\theta}\mathbb{E}\left[h\left(X(\theta)\right)\right]$$?

One way to think about it is by a kind of stretchy analogy with weak derivatives: we are looking for a measurable mapping $$\psi$$ such that $$\dfrac{d}{d\theta}\mathbb{E}\left[h\left(X(\theta)\right)\right] = \mathbb{E}\left[\psi\left(h, X(\theta),\theta\right)\right]$$. Put this way, if we know that the random variable $$h\left(X(\theta)\right)$$ is differentiable in $$\theta$$ (almost surely) and that swapping expectation and differentiation is permitted (e.g., through some dominated convergence-style result), then we have an expression for $$\dfrac{d}{d\theta}\mathbb{E}\left[h\left(X(\theta)\right)\right]$$; indeed:

$$\dfrac{d}{d\theta}\mathbb{E}\left[h\left(X(\theta)\right)\right] = \mathbb{E}\left[\dfrac{d}{d\theta}h\left(X(\theta)\right)\right] = \mathbb{E}\left[\left(\dfrac{\partial}{\partial\theta}X(\theta)\right) h^\prime\left(X(\theta)\right)\right]$$

where $$h^\prime$$ is just the usual derivative of $$h$$. The quantity $$\left(\dfrac{\partial}{\partial\theta}X(\theta)\right)$$ is called the sample path derivative of $$X(\theta)$$ and this approach for estimating the derivative of an expected value is called infinitesimal perturbation analysis (IPA).
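To make the IPA recipe concrete, here is a minimal Monte Carlo sketch in Python. The model is a toy of my own choosing for illustration (not the function from the paper): $$X(\theta) = \theta Z$$ with $$Z \sim \mathrm{Exp}(1)$$ and $$h(x) = x^2$$, so the sample path derivative is $$\partial X/\partial\theta = Z$$ and the IPA estimator averages $$Z \cdot h^\prime(\theta Z) = 2\theta Z^2$$. Analytically, $$\mathbb{E}[h(X(\theta))] = \theta^2\,\mathbb{E}[Z^2] = 2\theta^2$$, whose derivative is $$4\theta$$, which gives us a ground truth to check against.

```python
import random

def ipa_estimate(theta, n=100_000, seed=42):
    """Monte Carlo IPA estimate of d/dtheta E[h(X(theta))].

    Toy model (illustration only, not the paper's function):
        X(theta) = theta * Z,  Z ~ Exp(1),  h(x) = x**2.
    Then dX/dtheta = Z and h'(x) = 2x, so each sample of the
    IPA estimator is Z * h'(theta * Z) = 2 * theta * Z**2.
    The true derivative is d/dtheta (2 * theta**2) = 4 * theta.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        z = rng.expovariate(1.0)   # Z ~ Exp(1)
        x = theta * z              # sample path X(theta)
        total += z * (2.0 * x)     # (dX/dtheta) * h'(X(theta))
    return total / n

# For theta = 1.5 the estimate should hover near 4 * 1.5 = 6.
print(ipa_estimate(1.5))
```

Note that the same simulated paths that estimate $$\mathbb{E}\left[h\left(X(\theta)\right)\right]$$ also yield the derivative estimate, which is exactly what makes IPA attractive: no extra simulations at perturbed values of $$\theta$$ are needed.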

In the paper, we needed a function that not only supported IPA but also had relatively small curvature, so that the non-gradient method we were discussing would converge in reasonable time.
