Full article: Reversible Jump PDMP Samplers for Variable Selection

Formulae display: $MathJax Logo$ ?Mathematical formulae have been encoded as MathML and are displayed in this HTML version using MathJax in order to improve their display. Uncheck the box to turn MathJax off. This feature requires Javascript. Click on a formula to zoom.

Abstract

A new class of Markov chain Monte Carlo (MCMC) algorithms, based on simulating piecewise deterministic Markov processes (PDMPs), has recently shown great promise: they are nonreversible, can mix better than standard MCMC algorithms, and can use subsampling ideas to speed up computation in big data scenarios. However, current PDMP samplers can only sample from posterior densities that are differentiable almost everywhere, which precludes their use for model choice. Motivated by variable selection problems, we show how to develop reversible jump PDMP samplers that can jointly explore the discrete space of models and the continuous space of parameters. Our framework is general: it takes any existing PDMP sampler, and adds two types of trans-dimensional moves that allow for the addition or removal of a variable from the model. We show how the rates of these trans-dimensional moves can be calculated so that the sampler has the correct invariant distribution. We remove a variable from a model when the associated parameter is zero, and this means that the rates of the trans-dimensional moves do not depend on the likelihood. It is, thus, easy to implement a reversible jump version of any PDMP sampler that can explore a fixed model. Supplementary materials for this article are available online.

Keywords:

1 Introduction

There is currently much interest in developing MCMC algorithms based on simulating piecewise deterministic Markov processes (PDMPs). These are continuous time Markov processes that have deterministic dynamics between a set of event times, and the randomness in these processes only comes through the random event times and potentially random transitions at the events (see Davis Citation1993, for an introduction to PDMPs).

The idea of simulating PDMPs to sample from a target distribution of interest originated in statistical physics (Peters and de With Citation2012; Michel, Kapfer, and Krauth Citation2014), but has recently been proposed as an alternative to standard MCMC to sample from posterior distributions in Bayesian Statistics, with algorithms such as the Bouncy Particle Sampler (Bouchard-Côté, Vollmer, and Doucet Citation2018) and the ZigZag algorithm (Bierkens and Roberts Citation2017; Bierkens, Fearnhead, and Roberts Citation2019) amongst others (Vanetti et al. Citation2017; Markovic and Sepehri Citation2018; Wu and Robert Citation2020; Michel, Durmus, and Sénécal Citation2020; Bierkens et al. Citation2020). See Fearnhead et al. (Citation2018) for an introduction to this area.

To sample from a density $π (θ)$ most current PDMP samplers introduce a velocity component, v, of the same dimension as $θ$ , and have deterministic dynamics that correspond to a constant velocity model (though see Vanetti et al. Citation2017, for alternative PDMP algorithms). At the random events the velocity component changes. Algorithms differ in terms of the event rate and how the velocity changes at each event, but each has a simple recipe for choosing these so that the resulting PDMP has $π (θ)$ as its invariant distribution. These recipes depend on $π (θ)$ through the gradient of $log π (θ)$ , which importantly means that $π (θ)$ only needs to be known up to proportionality, but also that $π (θ)$ needs to be differentiable almost everywhere. The advantages of PDMP samplers are that they are nonreversible, and thus can mix more quickly than standard reversible MCMC algorithms (Diaconis, Holmes, and Neal Citation2000), and, when sampling from posterior distributions, they can use a small sample of data points at each iteration whilst still targeting the true posterior distribution (Bierkens, Fearnhead, and Roberts Citation2019).

However, the restriction to sampling from densities that are differentiable means that current PDMP samplers cannot be used in model choice problems. The aim of this article is to address this limitation, with a particular motivation of PDMP samplers that can be used in variable selection problems that are common in, for example, linear regression and generalized linear models. We show how to design efficient PDMP samplers which allow movement between different models.

A simple way to implement PDMP samplers for variable selection problems is to use continuous spike-and-slab priors on the parameters (George and McCulloch Citation1993; Ishwaran and Rao Citation2005), which, rather than setting some parameters exactly to 0, have priors that place substantial mass close to 0. With such a prior, the resulting posterior density is differentiable, and existing PDMP samplers can be used (Goldman, Sell, and Singh Citation2021). However, such an approach has three disadvantages. First, under such a prior it can be hard to interpret the results as we do not formally get posterior probabilities on whether certain variables should be included in the model. Second, they introduce an extra tuning parameter to the prior which governs the shape of the spike of the component. Third, as we show in Section 2.2, using PDMP samplers to sample from the resulting posterior can be computationally inefficient: the samplers will need to simulate many events so that the parameters associated to variables that should not be in the model are kept close to 0.

We demonstrate how to adapt existing PDMP samplers to variable selection problems. Specifically they evolve as the PDMP sampler when exploring the posterior associated with a given model, but with two additional events: if any parameter value hits 0 the PDMP jumps to the smaller model where the corresponding variable is removed; whilst with some rate there are events that reintroduce variables into the model. We show in Section 3 how to calculate the rate and transition for these new types of event so that the sampler has the correct invariant distribution. To calculate these we need different techniques than those used for existing PDMP samplers, as we need to account for the behavior of the process when parameters hit zero. The techniques we use are most similar to those in Bierkens et al. (Citation2018), which considers PDMPs with restricted domains. However, in that paper the dynamics at the boundary of the domain could be chosen so that the net flow of probability at the boundary is zero; whereas we need to balance the probability flow out of a model which occurs when a parameter hits zero with the flow into the model caused by the events that reintroduce variables.

Our approach is not the only way of extending PDMP samplers to the variable selection problem. One could also introduce moves that propose adding or deleting a variable from the model regardless of the current state of the PDMP. Such ideas have been proposed for other continuous-time samplers, see Grenander and Miller (Citation1994) or Phillips and Smith (Citation1996) for their use within jump-diffusion samplers, and Stephens (Citation2000) for their use within continuous-time Markov jump process samplers. Exactly the same type of moves between models could be implemented for PDMPs. However, we believe our approach, that only allows removing variables when the corresponding parameter value hits zero, has important practical advantages. Most importantly, because we add or remove a variable when its corresponding parameter is 0 the rates of these events do not depend on the likelihood, but only the prior. This makes simulating these events simple—in fact for many common priors the rate at which we add a variable will be constant. Thus, any PDMP sampler suitable for a fixed model can easily be extended to allow for variable selection. By comparison the moves and rates described in Grenander and Miller (Citation1994) would involve rates that depend on the likelihood ratio between the current and new model. Simulating events with this rate will be challenging, as the rate will vary, in a complicated way, when the parameter values change. Calculating this rate will be costly as it requires evaluating the log-likelihood, which is a sum over the data points. Furthermore, most algorithms to simulate events require bounds on the rates, and calculating good bounds will be problem specific and potentially difficult. After a preprint of this work appeared (Chevallier, Fearnhead, and Sutton Citation2020), a related method, called the sticky PDMP (Bierkens et al. Citation2021), has been proposed. For variable selection problems, the main difference between the two approaches is that the sticky PDMP sampler remembers the velocity of a component when it is re-added back into the model, and thus will continue in the same direction as it was moving before that variable was removed.

The approach we present is generic, in that it can take any current PDMP sampler and be used to obtain a version that can be applied to the variable selection problem. We call the new class of samplers reversible jump PDMP samplers, due to the analogy with reversible jump MCMC (Green Citation1995), though the class of moves we allow are less general than those available for standard reversible jump MCMC. We show how to derive reversible-jump versions of both ZigZag and the Bouncy Particle Sampler in Section 4, before investigating empirically these algorithms on both logistic regression and robust linear regression models.

Proofs of all theorems are relegated to the supplementary materials. Code for implementing the new reversible jump PDMP samplers is available from the R package rjpdmp available from the Comprehensive R Archive Network (CRAN).

2 PDMP Samplers and Model Choice

2.1 Variable Selection

We will consider model selection problems that arise from variable selection. The general framework is that we have a vector of parameters, $θ = (θ_{1}, \dots, θ_{d})$ , and each model is characterized by setting some subset of the θ_js to 0. This is a common setting across linear models, generalized linear models and various extensions.

To make ideas concrete, consider a linear model $Y = \sum_{j = 1}^{d} X_{j} θ_{j} + ϵ$ where Y is a vector of response variables, each $X_{j}$ is a vector of covariates, and ϵ is an additive noise vector. When d is large it is common to fit such a model under a sparsity assumption, namely that many of the θ_js are 0.

In a Bayesian analysis, such a sparsity assumption is encapsulated in our choice of prior on $θ$ . To aid interpretation of the variable selection priors it is common to introduce a latent variable $γ = (γ_{1}, \dots, γ_{d})'$ where $γ_{j} = 1$ if the covariate $X_{j}$ is included in the model, that is, if $θ_{j} \neq 0$ . We let $| γ | = \sum γ_{j}$ be the number of covariates included in the corresponding model. Indexing $θ_{γ}$ as the sub-vector of $θ$ with only the selected variables, any prior can be written in a hierarchical form where we have a prior on $γ$ , then conditional on $γ$ we have a prior on $θ_{γ}$ , and set all remaining entries of $θ$ to 0.

A special case is where each component of $θ$ is independent of the others. In which case the prior can be written as $θ_{j} \sim w_{j} g_{j} (θ_{j}) + (1 - w_{j}) δ_{0} (θ_{j}), j = 1, \dots, d,$ where $w_{j} \in (0, 1)$ is the prior probability that $γ_{j} = 1$ , δ₀ is a Dirac measure at zero and $g_{j} (θ_{j})$ is a distribution that models our prior beliefs for θ_j conditional on that variable being included in the model. Bayesian approaches to variable selection that put a probability mass on $θ_{j} = 0$ in this way will be referred to as Dirac spike and slab methods. Notable examples of these methods include Mitchell and Beauchamp (Citation1988), Kuo and Mallick (Citation1998), Geweke (Citation1996), Smith and Kohn (Citation1996), and Bottolo and Richardson (Citation2010).

While this formulation is natural from a modeling perspective, sampling from the resulting posterior distribution can be challenging, with, for example, MCMC samplers that use gradient information such as Hamiltonian Monte Carlo (Neal Citation2011) not being applicable. To circumvent this issue it is common to use an approximation to this prior which replaces the point mass at 0 with a density that is peaked around 0, such as (1) $θ_{j} \sim w_{j} N (0, τ_{j}^{2}) + (1 - w_{j}) N (0, τ_{j}^{2} c_{j}^{2}), j = 1, \dots, d,$ (1) where c_j is taken small so that $N (0, c_{j}^{2} τ_{j}^{2})$ approximates the Dirac spike. We will refer to Bayesian variable selection methods that replace the Dirac in the prior with a continuous approximation as continuous spike-and-slab methods.

2.2 PDMP Samplers

Piecewise Deterministic Markov Processes (PDMPs) are an emerging class of nonreversible continuous-time samplers. We will consider sampling from a distribution with density $π (θ)$ defined on some space $X$ . Current samplers augment the state to include a velocity vector and sample from a distribution on $E = X \times V$ . In the following, for $z \in E$ we will use the notation $z = (θ, v)$ with $θ \in X$ a position and $v \in V$ a velocity.

A PDMP can be defined by (i) deterministic dynamics between a set of random event times; (ii) the state-dependent rate at which events occur, $λ (z)$ ; and (iii) a probability distribution for the change in state at each event, with density $q (z' | z)$ . We will consider PDMPs whose deterministic trajectories follow a differential equation of the form: $\frac{d (θ_{t}, v_{t})}{d t} = (v_{t}, Φ (θ_{t}, v_{t}))$ with $Φ : E \to V$ a smooth function. This setting contains the usual PDMP samplers such as ZigZag (Bierkens, Fearnhead, and Roberts Citation2019), the Bouncy Particle Sampler (Bouchard-Côté, Vollmer, and Doucet Citation2018), or the Coordinate Sampler (Wu and Robert Citation2020).

For example, the ZigZag algorithm to simulate from $π (θ)$ for $θ \in R^{d}$ , would introduce a velocity vector $v \in {- 1, 1}^{d}$ , and deterministic dynamics $\frac{d (θ_{t}, v_{t})}{d t} = (v_{t}, 0),$ which are the dynamics of a constant velocity model. For ZigZag, events occur at a rate that depends on the gradient of $π (θ)$ in each component of the velocity, and at an event one component of the velocity is switched. The rate at which the jth component of the velocity, $v_{t}^{j}$ , is switched is just $\max {0, - v_{t}^{j} \frac{\partial}{\partial θ^{j}} log π (θ_{t})} .$

These rates depend on the target distribution just through the gradient of the log of the target—which importantly means that we need only know the target distribution up to proportionality. Algorithm 1 gives computational pseudocode for simulating from the ZigZag process.

We can apply current PDMPs, such as ZigZag, to the Bayesian variable selection problem if we use the continuous spike-and-slab prior (1). Realizations of such a sampler are shown in the left two plots of as we vary how concentrated the spike distribution is. For the more concentrated case, the sampler becomes inefficient as it involves many switching events when the state variable is close to 0.

Fig. 1 Sample paths of PDMPs implementing variable selection in one dimension. The left and centre plots show the trajectories for a continuous spike-and-slab prior $0.5 N (0, τ^{2}) + 0.5 N (0, τ^{2} c^{2})$ where $τ^{2} = 16$ . As c decreases the spike component in the mixture approaches a Dirac mass. The figure on the right is the limiting process where we set the velocity to zero allowing the variable to stay fixed at zero.

Intuitively as we make the variance of the spike component of the prior tend to 0 the prior converges to a prior with a point mass at 0; furthermore we can observe the output of our PDMP sampler “converging” to a process shown in the right-hand plot of . Here, rather than the state having periods where it oscillates around 0, it has periods of time where $θ_{j} = 0$ and thus has many fewer events. This limiting process motivates the new class of PDMP samplers that we develop.

3 Reversible Jump PDMP Samplers

Let $M = (M_{1}, M_{2}, \dots)$ be a set of models indexed by a parameter k. Each model, $M_{k}$ , has a corresponding state space $X^{k}$ of dimension d_k. For the sake of clarity, we will limit ourselves to variable selection. In our case $X^{k}$ has a specific form: $X^{k} = \prod_{i} R^{γ_{i}^{k}}$ where we abuse notation by using $R^{0} = {0}$ and where $(γ_{i}^{k})$ is a sequence of numbers in {0, 1} representing whether a variable is enabled for model $M_{k}$ . Let π be the target posterior defined on $X = \cup_{k} X^{k}$ . We further assume that the restriction of π to each $X^{k}$ has a density, and denote this by $π_{k} (θ)$ . The first ingredient of our sampler is a collection of PDMPs defined for each model. Each PDMP sampler adds a velocity space $V^{k}$ to the space $X^{k}$ and samples from the space $E_{k} = X^{k} \times V^{k}$ . Finally, for each model $M_{k}$ , the associated PDMP sampler has an invariant distribution proportional to ν_k, where $ν_{k} (θ, v) = π_{k} (θ) p_{k} (v | θ) (θ, v) \in X^{k} \times V^{k},$ for some set of conditional densities $p_{k} (v | θ)$ .

Remark 1

. For samplers such as ZigZag, $p_{k} (d v | θ)$ is a measure with support on a discrete set—it places probability mass $1 / 2^{k}$ on each of the possible $2^{k}$ velocities allowed by ZigZag. By choosing $V^{k}$ to be the support of $p_{k} (d v | θ), p_{k} (d v | θ)$ still has a density—and integrals can be interpreted as sums over the support points. This allows us to treat all samplers within the same framework.

The second ingredient of our reversible jump PDMP sampler is a set of jumps between models. For our case of variable selection, we only allow adding or removing one variable at a time. Hence, let $\begin{matrix} T = {(i, j) | there exists a k such that γ_{k}^{i} = 1, γ_{k}^{j} = 0; \\ and γ_{l}^{i} = γ_{l}^{j}, l \neq k} \end{matrix}$ be a set of pairs of transitions between models, with these ordered (i, j) such that model $M_{j}$ is obtained from $M_{i}$ by removing one of the variables in $M_{i}$ . For transition $i \to j$ , we define an active boundary $Γ_{i, j} = X^{j} \times V^{i} \subset E_{i}$ , a subspace of E_i. Each trajectory passing through $Γ_{i, j}$ has some probability $p_{i, j}$ of jumping to E_j using a deterministic jump function $g_{i, j} : Γ_{i, j} \to E_{j}$ . We assume that the jump function does not change the parameter $θ$ . So, if a trajectory $z_{t}$ of our process has a left limit at time t, $z_{t -}$ in $Γ_{i, j}$ , then with some probability $p_{i, j}, z_{t} = g_{i, j} (z_{t -}) \in E_{j}$ with $θ_{t} = θ_{t -}$ . For transition $j \to i$ , we introduce a Poisson rate, $β_{i, j} (z)$ , and a jump kernel, $Q_{i, j} (\cdot | z)$ , such that if the trajectory is in E_j, then with rate $β_{i, j} (z_{t -}), z_{t}$ is drawn from $Q_{i, j} (\cdot | z_{t -}) \in Γ_{i, j}$ . We impose symmetry in the jumps between models so that for any $z' \sim Q_{i, j} (\cdot | z), g_{i, j} (z') = z$ , that is, the probability measure $Q_{i, j} (\cdot | z)$ is supported by $g_{i, j}^{- 1} (z)$ .

As an example, for the ZigZag sampler, where velocities are plus or minus 1 in each component, we could choose the function $g_{i, j} (z)$ that sets the velocity associated with the variable that is removed from the model to 0 and keeps other components unchanged. The reverse transition, defined by $Q_{i, j}$ , would then need to sample the velocity component associated with the added velocity component from ${- 1, 1}$ , and leave other components unchanged. If we have a sampler where the velocity is constrained to lie in the sphere, that is $v \cdot v = 1$ we cannot just set a component v to 0, as the resulting velocity will no longer lie on the sphere. Instead, $g_{i, j}$ could set the appropriate component of v to 0 and then renormalize the velocity. The reverse transition could propose a value of the velocity for the added component, from $[- 1, 1]$ , and then rescale the other velocity components.

3.1 Invariant Distribution

Let $ν = \sum_{k} ν_{k}$ be a measure on $\cup_{k} E_{k}$ , where ν_k is the invariant distribution of the base PDMP restricted to E_k. By construction, the $θ$ -marginal distribution of ν is π and this section provides sufficient conditions on $Q_{i, j}, p_{i, j}$ , and $β_{i, j}$ for ν to be an invariant distribution of the process.

To get a directly usable conditions on $β_{i, j}, p_{i, j},$ and $Q_{i, j}$ that ensure the target measure is invariant, some additional notation must be introduced. When jumping from $z = (θ, v) \in E_{j}$ to $z' = (θ, v') \in E_{i}$ , the dimension of the velocity vector needs to be increased by one, but the position is unchanged. (Strictly speaking, the position also increases in dimension by one, but the additional position variable is set to 0.) Hence, $g_{i, j}^{- 1} (z)$ is a one-dimensional manifold, which can be related to a subset, U say, of $R$ by introducing a function $G_{i, j} : U \times E_{j} \to E_{i}$ such that for any $α \in U$ , we have $G_{i, j} (α, z) \in g_{i, j}^{- 1} ({z})$ and for a fixed $z \in E_{j}, α \mapsto G_{i, j} (α, z)$ is a one-to-one mapping from U to $g_{i, j}^{- 1} (z)$ (similar ideas are seen in reversible jump MCMC; see Green Citation1995). For a given $z \in E_{j}$ , it is natural to rewrite the jump kernel $Q_{i, j}$ in terms of $α \in U$ , and henceforth we abuse notation and write $Q_{i, j} (\cdot | z)$ as a density on U. That is we can simulate the transition from E_j to E_i, by simulating $α \sim Q_{i, j} (\cdot | z)$ and then setting $z' = G_{i, j} (α, z)$ .

So for the ZigZag sampler, where $g_{i, j}$ sets a component of the velocity to 0, $g_{i, j}^{- 1} (z)$ , will be equal to the set of two states that have the same position as z and whose velocities are identical to that of z except for the component associated with the added variable. The two velocities in this set would correspond to replacing the velocity of 0 for that component with either 1 or –1. In this case U would just be the value of the velocity associated with the added component. We can simulate from the reverse transition by simulating U and calculating the new velocity from the old state z and U—this mapping defines $G_{i, j}$ .

Finally, define $n_{i, j}$ to be a normal to the boundary $Γ_{i, j}$ and let (2) $ν_{i, j} (z) = ν (z) | 〈 v, n_{i, j} 〉 |$ (2) be an unnormalized density on $Γ_{i, j}$ and let ${\bar{ν}}_{i, j}$ be the pushforward measure on E_j of $ν_{i, j}$ by $g_{i, j}$ : (3) ${\bar{ν}}_{i, j} (B) = \int_{g_{i, j}^{- 1} (B)} ν_{i, j} (z) d z for any B \subset E_{j} measurable .$ (3)

Informally this is the measure under $ν_{i, j}$ associated with values of $θ \in Γ_{i, j}$ that would be mapped by $g_{i, j}$ to the set B.

For our example of the ZigZag sampler, $| 〈 v, n_{i, j} 〉 | = 1$ as velocities are defined so that they are ±1 in all directions normal to the boundaries. Thus, $ν_{i, j} (z)$ is just the un-normalized density of z under $ν (z)$ restricted to $z \in Γ_{i, j}$ . The push-forward measure ${\tilde{ν}}_{i, j}$ for a set B in the space of the smaller model is just the measure under $ν_{i, j}$ of all the values of the state z on the boundary $Γ_{i, j}$ of the larger model that would be mapped to B when we moved model. That is, for a state in B we consider the pair of states that have the same position and whose velocity is the same except that the velocity of 0 at the removed component is replaced with 1 or –1, and calculate the measure of all such pairs of states corresponding to states in B.

We now present our conditions on $β_{i, j}, Q_{i, j}$ and $p_{i, j}$ . It is helpful to consider separately the cases where the space of velocities is continuous and the case where it is discrete.

Theorem 1

. Assume the space of velocities is continuous, that for each k, the base PDMP on E_k has ν_k as its invariant distribution, and define $λ (z)$ for $z \in E_{k}$ to be the jump rate of the base PDMP on E_k. Further assume:

the mean jump rate of the base PDMPs is finite, that is, $\int λ (z) d ν < \infty$ ;
for all k, π_k is in C¹, that is, continuously differentiable, on $X_{k}$ ;
for any k, and any $v \in V^{k}, \int_{X_{k}} | \nabla π_{k} (θ) \cdot v | d θ < \infty$ .

Then the measure ν is invariant if for every $(i, j) \in T$ , the following conditions hold (4) $β_{i, j} (z) = p_{i, j} \frac{{\bar{ν}}_{i, j} (z)}{ν (z)} for z \in E_{j}$ (4) (5) $Q_{i, j} (α | z) = \frac{ν_{i, j} (G_{i, j} (α, z)) | J_{G_{i, j}} (α, z) |}{{\bar{ν}}_{i, j} (z)} for z \in E_{j} and α \in U,$ (5) where $J_{G_{i, j}}$ denotes the Jacobian associated with the transformation $G_{i, j}$ .

Proof.

See supplementary materials. □

Intuitively, this result can be understood as a detailed balance condition: we balance the probability flow for each jump $z \to z'$ from E_i to E_j, with that of the reverse jump $z' \to z$ .

For discrete velocity spaces the result is slightly simpler, as the Jacobian term is not required. In this case we replace (5) with the condition (6) $Q_{i, j} (α | z) = \frac{ν_{i, j} (G_{i, j} (α, z))}{{\bar{ν}}_{i, j} (z)} for z \in E_{j} and α \in U .$ (6)

3.2 A Reversible Jump PDMP Algorithm

Pseudo-code outlining how we can simulate the resulting Reversible Jump PDMP is given in Algorithm 2. In Lines 2–12, we loop over possible events—which can either be events where we jump to a model with an additional active variable, a component of the PDMP position hitting zero, or events within the base PDMP—and simulate the times for each of these possible events. In Lines 13–17 we calculate when the first of these possible events occurs and update the process to that time. In Lines 18–32 we then update the PDMP state based on which type of event has occurred.

In this algorithm, the additional computational cost of the reversible jump moves, over and above the cost of simulating the base PDMP sampler, is proportional to the number of variables. For each active variable we have to calculate the next time its position hits 0, which involves solving a scalar linear equation. For each inactive variable we need to simulate the time at which we next introduce the variable. As we show in the next section, in most cases the rates for these event are simple, often constant, and importantly do not depend on the likelihood. In practice more efficient implementations than Algorithm 2 will be possible that reuse previously simulated times of events. For example, if an event does not change the velocity associated with a given variable, then we do not need to recalculate the time that variable hits 0.

4 Example Samplers

4.1 ZigZag Sampler

We first derive the jump rates and transitions for the ZigZag sampler described in Section 2.2.

We choose $g_{i, j}$ to be the orthogonal projection, that is the projection that sets the velocity of the disabled variable to 0. Let $(i, j) \in T$ be a transition. For any $v \in V^{i}$ , we have $| 〈 v, n_{i, j} 〉 | = 1$ , thus from our definition of $ν_{i, j} (z)$ in (2) $ν_{i, j} (z) = ν (z) = π_{i} (θ) 2^{- | γ_{i} |} .$

For $z \in E_{j}$ , since a velocity component in ${- 1, 1}$ is projected to 0, then from (3) ${\bar{ν}}_{i, j} (z) = 2 π_{i} (θ) 2^{- | γ_{i} |} = π_{i} (θ) 2^{- | γ_{j} |} .$

Since $g_{i, j}$ is the projection that sets to 0 the disabled variable, $g_{i, j}^{- 1} (θ, v) = {(θ, v + n_{i, j}), (θ, v - n_{i, j})}$ and denoting the new velocity of the added component by $α \in {- 1, 1}$ , we chose $G_{i, j} (α, θ, v) = α n_{i, j} + v .$

Furthermore, we have from the version of Theorem 1 for discrete velocity spaces that (7) $β_{i, j} (z) = p_{i, j} \frac{π_{i} (θ)}{π_{j} (θ)}$ (7) (8) $Q_{i, j} (α | z) = 1 / 2 for α \in {- 1, 1} .$ (8)

For our variable selection problem, the ratio of the posterior density that appears in $β_{i, j}$ will simplify to the ratio of the priors as the likelihood terms are common and cancel. If we have independent priors on the parameters for each variable this term will be a constant, which simplifies the simulation of the events at which we add new variables into our model. This comment applies also to the rates for the Bouncy Particle Sampler which we derive next. See the Supplementary Material for pseudocode for simulating from this process.

4.2 Bouncy Particle Sampler

We consider two versions of the Bouncy Particle Sampler. The first version has velocities on the unit sphere, so $V^{i} = {v \in R^{| γ^{k} |} such that | | v | | = 1} .$

Like the ZigZag sampler, the deterministic dynamics are given by a constant velocity model. The event rate for sampling from a density $π_{k} (θ)$ is $λ (z) = \max {0, - v \cdot \nabla_{θ} log π_{k} (θ)},$ with the velocity reflecting in the normal to $log π_{k} (θ)$ at an event: if $n_{k} (x)$ is the normal to $log π_{k} (θ)$ , then the new velocity is $v' = v - 2 (\frac{v \cdot n_{k} (x)}{n_{k} (x) \cdot n_{k} (x)}) n_{k} (x) .$

The Bouncy Particle Sampler also often has refresh events, at which a completely new velocity is sampled.

Extending the Bouncy Particle Sampler to the variable selection problem requires a more careful analysis than for the ZigZag sampler due to the geometry of the velocity space. Since velocities lie in the unit sphere, we choose $g_{i, j}$ to be the orthogonal projection followed by a rescaling. Hence, $g_{i, j}^{- 1} (θ, v) = {(θ, α n_{i, j} + \sqrt{1 - α^{2}} v) | α \in [- 1, 1]}$ and denoting the new velocity of the added component by $α \in [- 1, 1]$ , we chose $G_{i, j} (α, θ, v) = α n_{i, j} + \sqrt{1 - α^{2}} v .$

The following proposition states how to choose the jump rate, $β_{i, j}$ and the density for α.

Proposition 1

. For the Bouncy Particle Sampler with velocities on the unit sphere, (4) and (5) are satisfied if, for all $(i, j) \in T$ with $| γ^{j} | > 0$ : $\begin{matrix} β_{i, j} (z) = p_{i, j} \frac{π_{i} (θ)}{π_{j} (θ)} \frac{2 A_{sphere} (| γ^{j} |)}{A_{sphere} (| γ^{i} |)} \frac{1}{| γ^{j} |} \\ Q_{i, j} (α | z) = \frac{| α | | γ^{j} | {\sqrt{1 - α^{2}}}^{| γ^{j} | - 2}}{2} for z \in E_{j} and α \in (- 1, 1) \end{matrix}$ with $A_{sphere} (| γ^{i} |) = \frac{Γ (\frac{| γ^{i} |}{2})}{2 π^{\frac{| γ^{i} |}{2}}}$ the area of the unit sphere of $R^{| γ^{i} |}$ ; and if $| γ^{j} | = 0$ , where the Bouncy Particle Sampler and ZigZag are equivalent, we use (7) and (8).

The second version of the Bouncy Particle Sampler has velocities in $R^{| γ^{k} |}$ , with their density being standard Gaussian and independent of $θ$ . The dynamics are as previously. The geometry of this case is simpler and we choose $g_{i, j}$ to be the orthogonal projection of the velocity. Hence, $g_{i, j}^{- 1} (θ, v) = {(θ, v + α n_{i, j}) | α \in R}$ and we chose $G_{i, j} (α, θ, v) = α n_{i, j} + v .$

Proposition 2

. For the Bouncy Particle Sampler with Gaussian velocities, (4) and (5) are satisfied if, for all $(i, j) \in T$ : $\begin{matrix} β_{i, j} (z) = p_{i, j} \frac{π_{i} (θ)}{π_{j} (θ)} \frac{2}{\sqrt{2 π}} \\ Q_{i, j} (α | z) = 2 | α | e^{- \frac{1}{2} α^{2}} for z \in E_{j} and α \in R . \end{matrix}$

5 Simulation Study

In this section we demonstrate the potential advantage of our new samplers compared to alternative approaches for Bayesian variable selection. To compare between different samplers we consider the Monte Carlo estimates of the posterior probabilities of inclusion, the posterior means for the regression coefficients, and the posterior means conditioned on the model. For a given sampler the statistical efficiency is measured by the mean squared error of the sampler, denoted by $σ_{sampler}^{2}$ , and is calculated using R runs of the sampler as (9) $σ_{sampler}^{2} = \frac{1}{R} \sum_{i = 1}^{R} {({\hat{q}}_{r} - q)}^{2}$ (9) where ${\hat{q}}_{r}$ for $r = 1, \dots, R$ are the estimates of a quantity of interest from the R runs, and q is either the exact posterior quantity of interest, if available, or it is the estimate from an independent long run of an MCMC method. In multiple dimensions the statistical efficiency is measured as the median $σ_{sampler}^{2}$ over all dimensions. To compare the performance of different samplers we also consider a measure of efficiency relative to a reference sampler. If we denote the reference sampler by ref, then we define Relative Statistical Efficiency (RSE) and Relative Efficiency (RE) by $RSE = \frac{σ_{ref}^{2} n_{ref}}{σ_{sampler}^{2} n_{sampler}}, RE = \frac{σ_{ref}^{2} t_{ref}}{σ_{sampler}^{2} t_{sampler}},$ where $n_{sampler}$ and $n_{ref}$ are the number of iterations of the algorithms and $t_{sampler}$ and $t_{ref}$ are the computation times of the algorithms. The RSE measures the relative efficiency of the algorithms per iteration whereas the RE measures the efficiency per second. For interpretation, an RSE or RE value of 2 implies that the sampler is 2 times more efficient than the reference method. The sensitivity of the methods to the choice of reversible jump parameters $p_{i, j}$ and regular PDMP tuning parameters is explored empirically in the Supplementary Material. Based on these results we fixed $p_{i, j} = 0.6$ for all i and j in all reversible jump PDMP samplers and implemented BPS with normal velocity distribution with a fixed refreshment rate of 0.1 that was constant across models. We set this rate to be 0.1 based on results from some initial tuning runs.

5.1 Logistic Regression

First we compare PDMP based samplers with existing MCMC methods for a logistic regression model with spike and slab priors on the regression coefficients. The MCMC competitors are a collapsed Gibbs sampler, and two reversible jump samplers. The Gibbs sampler uses the Polya-Gamma augmentation of Polson, Scott, and Windle (Citation2013) to make efficient moves through model space. We implemented one reversible jump MCMC sampler using the NIMBLE software package (de Valpine et al. Citation2017) using an independent normal proposal for selected variables. The other is a reversible-jump version of HMC, which, at each iteration, with probability 1/2 uses an HMC move within a model, and otherwise uses a reversible jump move between models. More details on the samplers are given in the Supplementary Material.

The logistic regression model has a d-dimensional regression parameter $θ \in R^{d}$ , and a binary response $y_{i} \in {0, 1}$ which is distributed as $P (y_{i} = 1 | x_{i}, θ) = \frac{exp (\sum_{j = 1}^{d} x_{i, j} θ_{j})}{1 + exp (\sum_{j = 1}^{d} x_{i, j} θ_{j})}$ where $x_{i}$ is the d-vector of covariates for observation i. In our simulation study, each vector $x_{i}$ is simulated from a multivariate normal with mean zero and d × d covariance matrix $Σ$ . We use a prior that is independent for each θ_j and where $θ_{j} \sim \frac{10}{d} N (0, 10) + (1 - \frac{10}{d}) δ_{0}$ , which corresponds to a prior that favors models with 10 selected variables. Data was generated using this model and the following choices for $θ$ and covariance matrix $Σ$ :

A pair of correlated covariates, one of which is in the model: $θ = {(1, 0, \dots 0)}^{T}$ with $Σ_{2, 1} = Σ_{2, 1} = 0.9, Σ_{i, i} = 1$ , and $Σ_{i, j} = 0$ otherwise.
Structured correlation between all covariates with six active covariates:
$θ = {(3, 3, - 2, 3, 3, - 2, 0, \dots, 0)}^{T}$ with $Σ_{i, j} = exp (- | i - j |)$ .
No correlation between covariates and six active covariates:
$θ = {(3, 3, - 2, 3, 3, - 2, 0, \dots, 0)}^{T}$ , with $Σ_{i, i} = 1$ and $Σ_{i, j} = 0$ if $i \neq j$ .
Multiple pairs of correlated variables with six active covariates:
$θ = {(3, 3, - 2, 3, 3, - 2, 0, \dots, 0)}^{T}$ with $Σ_{i + d / 2, i} = Σ_{i, i + d / 2} = 0.9$ for $1 \leq i \leq 6, Σ_{i, i} = 1$ and $Σ_{i, j} = 0$ otherwise.

These simulation scenarios are analogous to others previously considered in the literature for linear regression. Scenarios 1 and 3 are similar to those considered by Wang et al. (Citation2011) and Zanella and Roberts (Citation2019) while Scenario 2 is similar to one considered by Yang, Wainwright, and Jordan (Citation2016). Scenario 4 is an extension of Scenario 1 to allow for more correlated variables (and results for a further extension that allows for higher correlation is shown in the supplementary materials). We present results for both ZigZag and the Bouncy Particle Sampler with Gaussian distributed velocities in (very similar results are obtained using the Bouncy Particle Sampler with velocities uniformly on the unit sphere).

Table 1 Scenario 1 (pair of correlated variables): Relative efficiencies for methods, against a Reversible Jump algorithm, for the marginal posterior means (Mean) and marginal posterior probabilities of inclusion (PI).

Download CSV Display Table

Table 2 Scenario 2 (General correlation): Relative efficiency for methods, against a Reversible Jump algorithm, for the marginal posterior means (Mean) and marginal posterior probabilities of inclusion (PI). Bold figures show the best performing sampler.

Download CSV Display Table

Table 3 Scenario 3 (No correlation): Relative efficiency for methods, against a Reversible Jump algorithm, for the marginal posterior means (Mean) and marginal posterior probabilities of inclusion (PI).

Download CSV Display Table

Table 4 Scenario 4 (multiple correlated pairs): Relative efficiency for methods, against a Reversible Jump algorithm, for the marginal posterior means (Mean) and marginal posterior probabilities of inclusion (PI).

Download CSV Display Table

Across the four scenarios, Gibbs variable selection performs the best in low sample sizes (n < d). PDMP methods are competitive with Gibbs sampling when the number of predictors and sample size is comparable and offer substantial efficiency gains for larger sample sizes. Smaller gains can also be seen when the dimension is increased with fixed sample size. Both BPS and ZigZag methods offer similar relative efficiencies across the experiments. The greatest efficiency gains for our PDMP methods are seen in with offering lower gains in performance. This may be due to smaller models being more likely in Scenario 1, as these require less computational effort for the PDMP methods since fewer gradient calculations are required to simulate within these lower dimensional models. RJ-HMC offers improvement to the simpler independent normal reversible jump sampler for larger sample sizes. While HMC offers better within model moves it comes at a higher computational cost and no real advantage to moving through model space.

Unlike the reversible jump approaches, the Gibbs sampler makes efficient moves through model space. This sampler makes use of marginalization over the parameter values $θ$ to more effectively jump between models. However, when the number of observations is large there is increased computational requirement for sampling the Polya-Gamma random variables. This results in less efficient overall sampling than the PDMP samplers when n is large.

The use of subsampling methods for Bayesian variable selection problems is a recently emerging area (Song et al. Citation2020; Buchholz, Ahfock, and Richardson Citation2019). One of the attractions of PDMP samplers is that they can be implemented in a way where they only access a small subset of data at each iteration, while still targeting the true posterior. We now investigate how these ideas work in the variable selection problem, by comparing the efficiency of three implementations of ZigZag with that of the Gibbs sampler, and see how this depends on the number of observations. These are ZigZag using the full data, ZigZag using subsampling with a global bound, and ZigZag with subsampling control variates (see Bierkens, Fearnhead, and Roberts Citation2019, for details of both subsampling approaches).

Standard application of control variates requires calculation of the gradient at a reference point using the full likelihood. Due to the trans-dimensional nature of variable selection problems, a full gradient calculation is not well defined. For this reason we choose to make use of control variates defined for a fixed model $M$ where the gradient is well defined. These control variates are only used when the sampler is in this model. For certain problems, such as generalized linear models, an O(n) initial calculation relating to the likelihood can be reused to define control variates for multiple models with certain parameters set to zero. See the supplementary material for more details.

Results are shown in . In this simulation we take the independent prior $θ_{j} \sim \frac{1}{15} N (0, 10) + (1 - \frac{1}{15}) δ_{0}$ , favoring models with a single selected variable. Despite our simplistic implementation, these results indicate that Zig-Zag with control variates is becoming increasingly efficient relative to Zig-Zag using the full dataset as the number of samples increases. Furthermore we see evidence of super-efficiency—whilst the computational cost per ESS of the Gibbs sampler is expected to be linear in the number of observations, the relative efficiency plots suggest that this is sublinear for ZigZag with control variates.

Fig. 2 Log-log plots of efficiency, relative to the Gibbs sampler, of different samplers as we vary the number of observations. Plotted are the relative efficiencies for the posterior mean conditional on model $M^{*}$ where $M^{*}$ corresponds to the true data generated model. The dataset was generated with a 15-dimensional regression parameter $θ = (1, 1, 0, 0, \dots, 0)$ . The methods run are the Zig-Zag applied to the full dataset (zz, black), Zig-Zag with subsampling using global bounds (ss, blue), Zig-Zag with control variates (cv, magenta) and Gibbs sampling (Gibbs, green). All methods were initialized at the location of the control variate. Methods were given the same computational budget, for details see the supplementary materials.

5.2 Robust Regression

As mentioned in the introduction, a common approach to Bayesian variable selection is to use continuous spike-and-slab priors for each parameter rather than try to sample from the joint posterior of model and parameters. Such an approach is attractive as it enables standard gradient-based samplers, such as Hamiltonian Monte Carlo, to be used. We now compare such an approach, implemented with the popular Stan software (Carpenter et al. Citation2017), to our PDMP samplers. Our aim is to both investigate the computational efficiencies of the two approaches and to show the differences in posterior that we obtain from these different types of prior. Our comparison is based on a robust linear regression model.

In particular, we model the errors in our linear regression model as a mixture of normals with different variances. Thus, $\begin{matrix} Y = \sum_{i = 1}^{d} X_{j} θ_{j} + ϵ, \\ ϵ \sim \frac{1}{2} N (0, 1) + \frac{1}{2} N (0, 10^{2}) . \end{matrix}$

The continuous variable selection prior we consider is the regularized horseshoe (Piironen and Vehtari Citation2017a, Citation2017b) $\begin{matrix} θ_{j} \sim N (0, τ^{2} {\tilde{λ}}_{j}), \\ {\tilde{λ}}_{j} = \frac{c^{2} λ_{j}}{c^{2} + τ^{2} λ_{j}}, \\ λ_{j} \sim C^{+} (0, 1) \end{matrix}$ for $j = 1, \dots, d$ where $C^{+} (0, 1)$ denotes the half-Cauchy distribution for the standard deviation λ_j. The regularized horseshoe is a variation of the horseshoe prior (Carvalho, Polson, and Scott Citation2010) that offers a continuous approximation of a Dirac spike and slab where the slab is a normal distribution with finite variance c². The hyperparameter τ controls the global shrinkage of the variables toward zero. In Carvalho, Polson, and Scott (Citation2010) it was shown that for standard linear regression the optimal choice for a fixed value of τ is $τ_{0} = σ \frac{d_{0}}{(d - d_{0}) \sqrt{n}}$ where d₀ is a the number of nonzero variables in the sparse model and σ is the noise variance. In line with this, we compare the regularized horseshoe prior with fixed hyperparameter $τ_{0} = \frac{d_{0}}{(d - d_{0}) \sqrt{n}}$ against the spike-and-slab prior $θ_{j} \sim \frac{d_{0}}{d} N (0, c^{2}) + (1 - \frac{d_{0}}{d}) δ_{0}$ for $j = 1, \dots, d$ . MCMC for the model using a horseshoe prior was performed by Stan’s implementation of NUTS (Hoffman and Gelman Citation2014).

We first compare the variable selection dynamics for a simple model with d = 4 variables, n = 120 observations and regression parameter $θ = {(0.5, 0.5, 0, 0)}^{T}$ . The covariate values and residuals were generated as independent draws from a standard normal and the prior expected model size is set to $d_{0} = 1$ . Example output for the PDMP samplers and the Stan implementation is shown in . The posteriors show the horseshoe prior replicating the effect of the spike-and-slab through shrinking the coefficients toward zero, but it is not able to give exact zeros.

Fig. 3 Dynamics of the samplers on a robust regression example with spike and slab or horseshoe prior. The top row shows the posterior for θ₁ and θ₂, bottom row shows the estimates for θ₂ and θ₃. The spike-and-slab distributions are sampled using the reversible jump PDMP samplers with reversible jump parameter 0.6 and refreshment for the BPS methods set to 0.5. All methods are shown with 10³ samples (red) and the PDMP dynamics are shown in black. Sampling with the Horseshoe prior was implemented in Stan using NUTS. Both Stan and PDMP methods were run for the same computing time. To aid visualization only the first 30% of the PDMP trajectories are shown.

Fig. 3 Dynamics of the samplers on a robust regression example with spike and slab or horseshoe prior. The top row shows the posterior for θ1 and θ2, bottom row shows the estimates for θ2 and θ3. The spike-and-slab distributions are sampled using the reversible jump PDMP samplers with reversible jump parameter 0.6 and refreshment for the BPS methods set to 0.5. All methods are shown with 103 samples (red) and the PDMP dynamics are shown in black. Sampling with the Horseshoe prior was implemented in Stan using NUTS. Both Stan and PDMP methods were run for the same computing time. To aid visualization only the first 30% of the PDMP trajectories are shown.

We now compare the reversible jump PDMP methods in terms of their sampling efficiency for a higher dimensional problem. The dataset is generated for d = 200 variables and n = 100 observations with regression parameter $θ = {(2, 2, 2, 2, 0, 0, 0 \dots 0)}^{T}$ . The covariates for each observation were drawn from an AR(1) process with lag-1 correlation of 0.5. The residuals were generated from a standard Cauchy distribution.

We ran Stan with the default settings for a burnin of 1000 iterations and then for 16, 32, 64,…, 2048 iterations. We ran the reversible jump PDMP samplers for the same wall clock time for both burnin and subsequent iterations. The experiment was repeated 50 times and the resulting boxplots of the posterior mean of θ₁ are shown in . For the same computational budget, the reversible jump PDMP methods are able to attain better performance as can be seen by the lower Monte Carlo variability of their estimates. However, it is also apparent from this simulation that HMC is less susceptible to local modes, perhaps due to the horseshoe prior being continuous.

Fig. 4 Sampling efficiency for reversible jump PDMP versus Stan for the robust regression example. The PDMP samplers are ZigZag (zz), Bouncy Particle Sample with normally distributed velocities (bps_n) and with velocities distributed uniformly on the sphere (bps_s). The top figure shows boxplots of the posterior mean of θ₁ for increasing computational budget, with outliers from the sampler removed for visualization purposes. These removed outliers correspond to times that the sampler has become stuck in a local mode where $θ_{1} = 0$ . The subplot shows the full results including outliers from the samplers. The Stan sampler is sampling from a different posterior to the PDMP methods, and this is seen in the estimates converging to slightly different values; but Monte Carlo efficiency can be assessed by comparing the variability of the estimates. The bottom figure shows the number of times that the samplers did not find the global mode.

The predictive abilities of the methods are compared in . Here an additional n = 100 observations were drawn from the same model and these were used as a hold-out dataset to validate the posterior predictive ability. The predictive ability is defined in terms of the Monte Carlo estimate of the mean square prediction performance $\frac{1}{100} \sum_{i = 1}^{100} {(y_{i} - x_{i}^{T} θ)}^{2}$ where $θ$ is replaced by the samples generated by either Stan or samples given by a discretization of time for the reversible jump PDMP samplers. The reversible jump PDMP samplers all give the same predictive performance for large iteration numbers while Stan, which uses a horseshoe prior, performs slightly worse.

Fig. 5 Predictive ability of reversible jump PDMP versus Stan for the robust regression example. The PDMP samplers are ZigZag (zz), Bouncy Particle Sample with normally distributed velocities (bps_n) and with velocities distributed uniformly on the sphere (bps_s). The predictive ability is measured by Monte Carlo estimates of the mean square predictive performance.

6 Discussion

We have shown how PDMP samplers can be extended so that they can sample from the joint posterior over model and parameters in variable selection problems. There are a number of open challenges that stem from this work. As with any MCMC algorithm, the reversible jump PDMP samplers have tuning parameters. The additional tuning parameters are the probabilities of moving between models when parameters hit zero. As a default, we recommend setting these probabilities all to the same value, and our simulation results were based on choosing this value after empirically evaluating the performance of the samplers on one simple problem. Whilst the samplers mixed well, it is likely that better mixing could be achieved if more informed choices of tuning parameters were made, and theory for guiding such choices is needed (e.g., see Sherlock and Thiery Citation2020, for theory on choosing the refresh rate of the Bouncy Particle Sampler).

The form of our reversible jump PDMP samplers is based on particular features of the variable selection problem. In other model choice settings, different trans-dimensional moves may be needed. The theory we developed should be able to be adapted to give rules for choosing rates of such moves. For example, in the case of sampling from mixture models one could introduce moves that remove a component when that component’s weight hits zero; and when we add a component we simulate new values for the component parameters from the prior. An advantage of such a construction would be that the rate of adding or removing components would not depend on the likelihood. Also our trans-dimensional moves are reversible, that is they balance probability flow from model i to model j by the flow of probability from model j to model i—it would be interesting to see if nonreversible trans-dimensional moves could be constructed.

It is likely that the reversible jump PDMP samplers will still struggle in situations where the posterior is multi-modal with well separated modes. For such cases it would be interesting to try and incorporate ideas such as tempering (Marinari and Parisi Citation1992) to allow for better mixing across modes. Also better mixing may be possible if we could introduce non-reversibility into exploring models, as in Power and Goldman (Citation2019) and Gagnon and Maire (Citation2020), though it is difficult to see how to incorporate lifting ideas that introduce such non-reversibility to our samplers. Finally, whilst we have considered only models with a finite number of variables, we believe the theory and PDMPs would extend to models with a countable number of variables, such as for polynomial regression. The challenge in using such PDMPs will be efficiently sampling when new variables are introduced into the model. If the rate at which each variable is added is simple and we can analytically calculate the sum of these rates, then this should be possible by simulating when a variable is added with a rate equal to this sum. At each such event we then simulate which variable to add.

Supplemental material

Supplemental Material

Download PDF (14 MB)

Supplemental Material

Download PDF (205.9 KB)

Supplementary Materials

The supplementary material contains all proofs; results of empirical investigation of tuning parameters; additional results for the logistic regression model; details of implementations of the algorithms for the examples; and pseudo-code for the reversible-jump ZigZag algorithm.

Additional information

Funding

This research was supported by EPSRC grants EP/R018561 and EP/R034710.

References

Bierkens, J., Bouchard-Côté, A., Doucet, A., Duncan, A. B., Fearnhead, P., Lienart, T., Roberts, G. O., and Vollmer, S. J. (2018), “Piecewise Deterministic Markov Processes for Scalable Monte Carlo on Restricted Domains,” Statistics & Probability Letters, 136, 148–154.
Web of Science ®Google Scholar
Bierkens, J., Fearnhead, P., and Roberts, G. (2019), “The zig-zag Process and Super-Efficient Sampling for Bayesian Analysis of Big Data,” The Annals of Statistics, 47, 1288–1320. DOI: 10.1214/18-AOS1715.
Web of Science ®Google Scholar
Bierkens, J., Grazzi, S., Kamatani, K., and Roberts, G. (2020), “The Boomerang Sampler,” in International Conference on Machine Learning, PMLR, pp. 908–918.
Google Scholar
Bierkens, J., Grazzi, S., van der Meulen, F., and Schauer, M. (2021), “Sticky PDMP Samplers for Sparse and Local Inference Problems,” arXiv:2103.08478.
Google Scholar
Bierkens, J., and Roberts, G. (2017), “A Piecewise Deterministic Scaling Limit of Lifted Metropolis–Hastings in the Curie–Weiss Model,” The Annals of Applied Probability, 27, 846–882. DOI: 10.1214/16-AAP1217.
Web of Science ®Google Scholar
Bottolo, L., and Richardson, S. (2010), “Evolutionary Stochastic Search for Bayesian Model Exploration,” Bayesian Analysis, 5, 583–618. DOI: 10.1214/10-BA523.
Web of Science ®Google Scholar
Bouchard-Côté, A., Vollmer, S. J., and Doucet, A. (2018), “The Bouncy Particle Sampler: A Nonreversible Rejection-Free Markov Chain Monte Carlo method,” Journal of the American Statistical Association, 113, 855–867. DOI: 10.1080/01621459.2017.1294075.
Web of Science ®Google Scholar
Buchholz, A., Ahfock, D., and Richardson, S. (2019), “Distributed Computation for Marginal Likelihood based Model Choice,” arXiv.1910.04672.
Google Scholar
Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017), “Stan: A Probabilistic Programming Language,” Journal of Statistical Software, 76, 1–32. DOI: 10.18637/jss.v076.i01.
PubMed Web of Science ®Google Scholar
Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010), “The Horseshoe Estimator for Sparse Signals,” Biometrika, 97, 465–480. DOI: 10.1093/biomet/asq017.
Web of Science ®Google Scholar
Chevallier, A., Fearnhead, P., and Sutton, M. (2020), “Reversible Jump PDMP Samplers for Variable Selection,” arXiv:2010.11771.
Google Scholar
Davis, M. (1993), Markov Models and Optimization, New York: Chapman Hall.
Google Scholar
de Valpine, P., Turek, D., Paciorek, C., Anderson-Bergman, C., Temple Lang, D., and Bodik, R. (2017), “Programming with Models: Writing Statistical Algorithms for General Model Structures with NIMBLE,” Journal of Computational and Graphical Statistics, 26, 403–413. DOI: 10.1080/10618600.2016.1172487.
Web of Science ®Google Scholar
Diaconis, P., Holmes, S., and Neal, R. M. (2000), “Analysis of a Nonreversible Markov Chain Sampler,” Annals of Applied Probability, 10, 726–752.
Web of Science ®Google Scholar
Fearnhead, P., Bierkens, J., Pollock, M., and Roberts, G. O. (2018), “Piecewise Deterministic Markov Processes for Continuous-Time Monte Carlo,” Statistical Science, 33, 386–412. DOI: 10.1214/18-STS648.
Web of Science ®Google Scholar
Gagnon, P., and Maire, F. (2020), “Lifted Samplers for Partially Ordered Discrete State-Spaces,” arXiv:2003.05492.
Google Scholar
George, E. I., and McCulloch, R. E. (1993), “Variable Selection via Gibbs Sampling,” Journal of the American Statistical Association, 88, 881–889. DOI: 10.1080/01621459.1993.10476353.
Web of Science ®Google Scholar
Geweke, J. (1996), “Variable Selection and Model Comparison in Regression,” in Bayesian Statistics 5, pp. 609–620, Oxford: Oxford University Press.
Google Scholar
Goldman, J. V., Sell, T., and Singh, S. S. (2021), “Gradient-based Markov Chain Monte Carlo for Bayesian Inference with Non-differentiable Priors,” Journal of the American Statistical Association, 1–12. DOI: 10.1080/01621459.2021.1909600.
PubMed Web of Science ®Google Scholar
Green, P. J. (1995), “Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination,” Biometrika, 82, 711–732. DOI: 10.1093/biomet/82.4.711.
Web of Science ®Google Scholar
Grenander, U., and Miller, M. I. (1994), “Representations of Knowledge in Complex Systems,” Journal of the Royal Statistical Society, Series B, 56, 549–581. DOI: 10.1111/j.2517-6161.1994.tb02000.x.
Google Scholar
Hoffman, M. D., and Gelman, A. (2014), “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo,” Journal of Machine Learning Research, 15, 1593–1623.
Web of Science ®Google Scholar
Ishwaran, H., and Rao, J. S. (2005), “Spike and Slab Variable Selection: Frequentist and Bayesian Strategies,” The Annals of Statistics, 33, 730–773. DOI: 10.1214/009053604000001147.
Web of Science ®Google Scholar
Kuo, L., and Mallick, B. K. (1998), “Variable Selection for Regression Models,” Sankhya Series B 60, 65–81.
Google Scholar
Marinari, E., and Parisi, G. (1992), “Simulated Tempering: A New Monte Carlo Scheme,” EPL (Europhysics Letters), 19, 451. DOI: 10.1209/0295-5075/19/6/002.
Google Scholar
Markovic, J., and Sepehri, A. (2018), “Bouncy Hybrid Sampler as a Unifying Device,” arXiv:1802.04366 .
Google Scholar
Michel, M., Durmus, A., and Sénécal, S. (2020), “Forward Event-Chain Monte Carlo: Fast Sampling by Randomness Control in Irreversible Markov Chains,” Journal of Computational and Graphical Statistics, 29, 689–702. DOI: 10.1080/10618600.2020.1750417.
Web of Science ®Google Scholar
Michel, M., Kapfer, S. C., and Krauth, W. (2014), “Generalized Event-Chain Monte Carlo: Constructing Rejection-Free Global-Balance Algorithms from Infinitesimal Steps,” The Journal of Chemical Physics, 140, 054116. DOI: 10.1063/1.4863991.
PubMed Web of Science ®Google Scholar
Mitchell, T. J., and Beauchamp, J. J. (1988), “Bayesian Variable Selection in Linear Regression,” Journal of the American Statistical Association, 83, 1023–1032. DOI: 10.1080/01621459.1988.10478694.
Web of Science ®Google Scholar
Neal, R. M. (2011), “MCMC using Hamiltonian Dynamics,” in Handbook of Markov Chain Monte Carlo, eds. A. Brooks, A. Gelman, G. L. Jones and X. Meng, Boca Raton, FL: pp. 113–162, Chapman & Hall/CRC.
Google Scholar
Peters, E. A., and de With, G. (2012), “Rejection-Free Monte Carlo Sampling for General Potentials,” Physical Review E, 85, 026703. DOI: 10.1103/PhysRevE.85.026703.
PubMed Web of Science ®Google Scholar
Phillips, D. B., and Smith, A. F. (1996), “Bayesian Model Comparison via Jump Diffusions,” in Markov Chain Monte Carlo in Practice, eds. W. R. Gilks, S. Richardson and D. Spiegelhalter, pp. 215–240, Boca Raton, FL: Chapman & Hall, CRC.
Google Scholar
Piironen, J., and Vehtari, A. (2017a), “On the Hyperprior Choice for the Global Shrinkage Parameter in the Horseshoe Prior,” Vol. 54 of Proceedings of Machine Learning Research, PMLR, pp. 905–913.
Google Scholar
Piironen, J., and Vehtari, A. (2017b), “Sparsity Information and Regularization in the Horseshoe and other Shrinkage Priors,” Electronic Journal of Statistics, 11, 5018–5051.
Web of Science ®Google Scholar
Polson, N. G., Scott, J. G., and Windle, J. (2013), “Bayesian Inference for Logistic Models using Pólya–Gamma Latent Variables,” Journal of the American Statistical Association, 108, 1339–1349. DOI: 10.1080/01621459.2013.829001.
Web of Science ®Google Scholar
Power, S., and Goldman, J. V. (2019), “Accelerated Sampling on Discrete Spaces with Non-reversible Markov Processes,” arXiv:1912.04681.
Google Scholar
Sherlock, C., and Thiery, A. H. (2020), “A Discrete Bouncy Particle Sampler,” Biometrika (to appear). DOI: 10.1093/biomet/asab013.
Web of Science ®Google Scholar
Smith, M., and Kohn, R. (1996), “Nonparametric Regression using Bayesian Variable Selection,” Journal of Econometrics, 75, 317–343. DOI: 10.1016/0304-4076(95)01763-1.
Web of Science ®Google Scholar
Song, Q., Sun, Y., Ye, M., and Liang, F. (2020), “Extended Stochastic Gradient Markov Chain Monte Carlo for Large-Scale Bayesian Variable Selection,” Biometrika, 107, 997–1004.
PubMed Web of Science ®Google Scholar
Stephens, M. (2000), “Bayesian Analysis of Mixture Models with an Unknown Number of Components-an Alternative to Reversible Jump Methods,” Annals of Statistics, 28, 40–74.
Web of Science ®Google Scholar
Vanetti, P., Bouchard-Côté, A., Deligiannidis, G., and Doucet, A. (2017), “Piecewise-Deterministic Markov Chain Monte Carlo,” arXiv:1707.05296.
Google Scholar
Wang, S., Nan, B., Rosset, S., and Zhu, J. (2011), “Random Lasso,” Annals of Applied Statistics, 5, 468–485.
PubMed Web of Science ®Google Scholar
Wu, C., and Robert, C. P. (2020), “Coordinate Sampler: A Non-reversible Gibbs-like MCMC Sampler,” Statistics and Computing, 30, 721–730. DOI: 10.1007/s11222-019-09913-w.
Web of Science ®Google Scholar
Yang, Y., Wainwright, M. J., and Jordan, M. I. (2016), “On the Computational Complexity of High-Dimensional Bayesian Variable Selection,” The Annals of Statistics, 44, 2497–2532. DOI: 10.1214/15-AOS1417.
Web of Science ®Google Scholar
Zanella, G., and Roberts, G. (2019), “Scalable Importance Tempering and Bayesian Variable Selection,” Journal of the Royal Statistical Society, Series B, 81, 489–517. DOI: 10.1111/rssb.12316.
Google Scholar

Reversible Jump PDMP Samplers for Variable Selection

Abstract

1 Introduction