## Summary
I recently gave a talk for the CMMID LSHTM statistical inference theme on ["Stan - without the scary parts"](https://samabbott.co.uk/presentations/2022/stan-an-introduction-without-the-scary-parts.pdf) and realised that I really needed a bit of a refresher on the scary parts. Below are some notes I took whilst putting together the slides and links to further resources.
## Notes on individual resources
- Tutorial and code demo: https://arogozhnikov.github.io/2016/12/19/markov_chain_monte_carlo.html
### Hamiltonian Monte Carlo and NUTS
By: Jeffrey W. Miller
Slides: https://jwmi.github.io/BMB/18-Hamiltonian-Monte-Carlo-and-NUTS.pdf
#### Why
- Traditional MCMC algorithms take highly correlated steps, which leads to inefficient sampling.
#### What
- HMC employs a dynamical systems approach to more quickly traverse the parameter space and hence improve MCMC mixing.
- Requires us to be able to compute the gradient of the log target density, similar to gradient-based optimisation methods.
- Basic idea:
- Sample an auxiliary variable $z$ on the reals, where $z$ given $x$ is normally distributed with mean 0.
- Jointly transform $x$ and $z$, using Hamiltonian dynamics, such that $p(x, z)$ stays roughly constant.
- Use a Metropolis-Hastings step to accept or reject the transformed $(x, z)$.
- The transformation is done by running a dynamical system with Hamiltonian $H(x, z)$ forward in time.
$H(x, z) = -\log \pi(x) + \frac{1}{2} \sum^d_{i=1} z_i^2 / m_i$
> Hamiltonian of a system is the sum of the kinetic energies of all the particles, plus the potential energy of the particles associated with the system.
The abstract idea is to run the following dynamical system:
$\frac{\partial x_i}{\partial t} = \frac{\partial H}{\partial z_i} = z_i / m_i$
$\frac{\partial z_i}{\partial t} = -\frac{\partial H}{\partial x_i} = \frac{\partial}{\partial x_i} \log \pi(x)$
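To make the pieces above concrete, here is a minimal sketch in Python (the standard normal target and unit masses $m_i = 1$ are my own illustrative assumptions, not something from the slides):

```python
import numpy as np

def log_pi(x):
    # Log target density, up to an additive constant: standard normal.
    return -0.5 * np.sum(x ** 2)

def grad_log_pi(x):
    # Gradient of the log target density.
    return -x

def hamiltonian(x, z, m=1.0):
    # H(x, z) = -log pi(x) + (1/2) * sum_i z_i^2 / m_i
    return -log_pi(x) + 0.5 * np.sum(z ** 2 / m)

def dynamics_rhs(x, z, m=1.0):
    # dx_i/dt = z_i / m_i and dz_i/dt = (d/dx_i) log pi(x)
    return z / m, grad_log_pi(x)
```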
#### Physical interpretation
- $x$ moves like a ball rolling on the surface given by the negative log target density.
- In this picture:
- $x$ = position coordinates
- $z$ = momentum coordinates ($z = mv$)
- $-\log \pi(x)$ = potential energy
- $\sum_i \frac{1}{2} m_i v^2_i$ = kinetic energy = $\frac{1}{2} \sum^d_{i=1} \frac{z_i^2}{m_i}$
- The Hamiltonian represents the total energy of the system
- Due to conservation of energy, the Hamiltonian remains constant as the dynamical system evolves over time.
- Therefore $p(x, z) \propto \exp(-H(x, z))$ also remains constant as $(x, z)$ evolves according to the dynamical system.
- Dynamics are defined by the following system:
- Velocity = time derivative of position = momentum divided by mass
- Force = mass × acceleration = time derivative of momentum = gradient of the log target density
> Example 1: If the target density is flat in some region, then the gradient of the log target density is 0, so there is no acceleration and $x$ moves at a constant velocity through this region.
> Example 2: If the target density is not flat, then the force equals the gradient of the log target density, so $x$ accelerates in the direction of the gradient, i.e. towards regions of higher density.
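As a quick numerical check of the conservation claim above (a toy example of my own): for a standard normal target with unit mass the exact Hamiltonian flow is a rotation in $(x, z)$ space, so $H$ stays constant along the trajectory.

```python
import numpy as np

def hamiltonian(x, z):
    # H for a standard normal target with unit mass: x^2/2 + z^2/2.
    return 0.5 * np.sum(x ** 2) + 0.5 * np.sum(z ** 2)

x0, z0 = np.array([1.0]), np.array([0.5])
for t in np.linspace(0.0, 2 * np.pi, 9):
    # The exact flow rotates (x, z), so H is conserved exactly.
    x = x0 * np.cos(t) + z0 * np.sin(t)
    z = -x0 * np.sin(t) + z0 * np.cos(t)
    print(f"t = {t:4.2f}  H = {float(hamiltonian(x, z)):.6f}")  # always 0.625
```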
#### Discretisation
- In practice we need to discretise, which means the total energy (Hamiltonian) is no longer exactly constant over time.
- HMC uses a leapfrog discretisation, which allows the Hamiltonian to oscillate but prevents drift over time.
![[2022-02-24-hmc-alg.png]]
- Step 2 propagates the dynamical system forward in time using a discrete approximation.
- Using a leapfrog approximation (steps a-c) mitigates how much the error grows over time. Steps a and c are identical half-updates to $z$ that sandwich the full update of $x$ (a runnable sketch follows this list).
- This means that across consecutive leapfrog iterations the adjacent c and a half-updates can be combined into a single full update.
- As $\epsilon$ goes to 0 the accuracy of the numerical approximation increases, but the number of steps $L$ needed to travel the same distance increases. It is convenient to choose these parameters such that $\epsilon L = 1$ to maintain this balance.
- HMC only works for strictly positive target densities; various approaches exist for dealing with this.
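Below is a rough sketch of a single HMC transition using the leapfrog scheme just described, with the merged half-updates from above (again assuming the toy standard normal target of my earlier sketch; `epsilon` and `L` are the tuning parameters covered in the next section):

```python
import numpy as np

rng = np.random.default_rng(1)

def log_pi(x):
    # Toy target: standard normal (an illustrative assumption).
    return -0.5 * np.sum(x ** 2)

def grad_log_pi(x):
    return -x

def hmc_step(x, epsilon=0.1, L=10):
    z = rng.standard_normal(x.shape)             # resample momentum: z | x ~ N(0, I)
    x_new, z_new = x.copy(), z.copy()
    z_new += 0.5 * epsilon * grad_log_pi(x_new)  # (a) half update of z
    for step in range(L):
        x_new += epsilon * z_new                 # (b) full update of x
        if step < L - 1:
            # Adjacent (c) and (a) half updates merged into one full update.
            z_new += epsilon * grad_log_pi(x_new)
    z_new += 0.5 * epsilon * grad_log_pi(x_new)  # final (c) half update of z
    # Metropolis-Hastings accept/reject based on the change in H.
    h_old = -log_pi(x) + 0.5 * np.sum(z ** 2)
    h_new = -log_pi(x_new) + 0.5 * np.sum(z_new ** 2)
    return x_new if rng.uniform() < np.exp(h_old - h_new) else x

x, samples = np.zeros(2), []
for _ in range(1000):
    x = hmc_step(x)
    samples.append(x)
```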
#### HMC tuning parameters
- HMC requires choosing the covariance matrix of $z$ (the mass matrix), the step size $\epsilon > 0$ used in the discretisation, and the number of leapfrog steps $L$.
- Theory roughly suggests that adjusting $\epsilon$ to give an acceptance rate of approximately 65% is optimal.
- Adapting the tuning parameters throughout the full MCMC run can cause convergence failure.
- The "No-U-Turn Sampler" (NUTS) is a method for adapting HMC to the local properties of the target distribution.
- The tuning parameters can be randomly perturbed throughout the MCMC run without causing convergence issues, which can help stop the algorithm getting stuck (see the sketch after this list).
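A minimal sketch of the random perturbation idea (reusing `hmc_step` from the earlier sketch; the ±20% jitter range is an arbitrary choice of mine):

```python
import numpy as np

rng = np.random.default_rng(2)
base_epsilon, base_L = 0.1, 10

x, samples = np.zeros(2), []
for _ in range(1000):
    # Draw a fresh step size around a base value each iteration, rather
    # than adapting it from past samples during the run.
    epsilon = base_epsilon * rng.uniform(0.8, 1.2)
    x = hmc_step(x, epsilon=epsilon, L=base_L)  # hmc_step as sketched above
    samples.append(x)
```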
#### Adapting MCMC over time
- Mixing can be improved by adapting the sampler based on the MCMC samples themselves.
- Roughly, NUTS propagates the Hamiltonian dynamics until the trajectory starts to turn back towards where it started.
- "Going back" means the momentum vector $z$ points back towards the starting position.
## Tags
#blog/2022
#stan
#bayesian-statistics