This is part of ongoing work with Professor Felisa Vázquez-Abad of Hunter College. This is the preliminary draft submitted for the WODES 2016 conference. The particularly tricky bit this time around was our decision to use a less trivial example scenario: instead of just a deterministic function plus a normally distributed noise term, we chose to construct a more complicated stochastic function. Crucially, since we wanted to speak to the usefulness of gradient estimation, we wanted a function where IPA derivative estimation could be used.

Put another way, let $X(\theta)$ be some random variable controlled by the parameter $\theta$ and let $h$ be some real-valued measurable mapping where $\mathbb{E}\left[h\left(X(\theta)\right)\right]$ is defined for any $\theta\in\Theta$. How do we define $\dfrac{d}{d\theta}\mathbb{E}\left[h\left(X(\theta)\right)\right]$?

One way to think about it is by a kind of stretchy analogy with weak derivatives: we are looking for a measurable mapping $\psi$ such that $\dfrac{d}{d\theta}\mathbb{E}\left[h\left(X(\theta)\right)\right] = \mathbb{E}\left[\psi\left(h, X(\theta),\theta\right)\right]$. Put this way, if we know that the random variable $h\left(X(\theta)\right)$ is differentiable in $\theta$ and that swapping expectation and differentiation is permitted (e.g., through some bounded-convergence-type result), then we have an expression for $\dfrac{d}{d\theta}\mathbb{E}\left[h\left(X(\theta)\right)\right]$; indeed:

$$\dfrac{d}{d\theta}\mathbb{E}\left[h\left(X(\theta)\right)\right] = \mathbb{E}\left[\dfrac{d}{d\theta}h\left(X(\theta)\right)\right] = \mathbb{E}\left[\left(\dfrac{\partial}{\partial\theta}X(\theta)\right) h^\prime\left(X(\theta)\right)\right]$$

where $h^\prime$ is just the usual derivative of $h$. The quantity $\left(\dfrac{\partial}{\partial\theta}X(\theta)\right)$ is called the sample path derivative of $X(\theta)$ and this approach for estimating the derivative of an expected value is called infinitesimal perturbation analysis (IPA).
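To make the IPA formula concrete, here is a minimal sketch on a toy case. Everything specific in it is an assumption chosen for illustration and is not the example from the paper: $X(\theta) = -\theta\ln U$ with $U$ uniform (so $X(\theta)$ is exponential with mean $\theta$), and $h(x) = x^2$. Holding $U$ fixed, the sample path derivative is $-\ln U$, and since $\mathbb{E}\left[X(\theta)^2\right] = 2\theta^2$ the exact derivative is $4\theta$, which gives something to compare against.

```python
import math
import random


def ipa_derivative_estimate(theta, n_samples, seed=0):
    """IPA estimate of d/dtheta E[h(X(theta))] for an illustrative toy case:
    X(theta) = -theta * ln(U) with U uniform on (0, 1], and h(x) = x**2.
    (Both choices are assumptions for this sketch, not the paper's example.)
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        u = 1.0 - rng.random()      # uniform on (0, 1], so log(u) is finite
        x = -theta * math.log(u)    # sample of X(theta), exponential with mean theta
        dx_dtheta = -math.log(u)    # sample path derivative, with U held fixed
        h_prime = 2.0 * x           # h'(x) for h(x) = x**2
        total += dx_dtheta * h_prime
    return total / n_samples


theta = 2.0
estimate = ipa_derivative_estimate(theta, n_samples=200_000)
exact = 4.0 * theta  # d/dtheta of E[X(theta)^2] = 2 * theta**2
print(f"IPA estimate: {estimate:.3f}, exact: {exact:.3f}")
```

The estimator only needs one simulation run per sample: the same uniform draw produces both $X(\theta)$ and its sample path derivative, so no second run at a perturbed $\theta$ is required.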

In the paper, we needed a function that not only supported IPA but still had a relatively small curvature so that the non-gradient method we were discussing would converge in reasonable time.

Conference paper