Abstract
We present a method to solve the problem of choosing a set of adverts to display to each of a sequence of web users. The objective is to maximise user clicks over time and to do so we must learn about the quality of each advert in an online manner by observing user clicks. We formulate the problem as a novel variant of a contextual combinatorial multi-armed bandit problem. The context takes the form of a probability distribution over the user's latent topic preference, and rewards are a particular nonlinear function of the selected set and the context. These features ensure that optimal sets of adverts are appropriately diverse. We give a flexible solution method which combines submodular optimisation with existing bandit index policies. User state uncertainty creates ambiguity in interpreting user feedback which prohibits exact Bayesian updating, but we give an approximate method that is shown to work well.
Acknowledgements
This work was funded by a Google Faculty Research Award. James Edwards was supported by the EPSRC-funded STOR-i CDT (EP/H023151/1).
Disclosure statement
No potential conflict of interest was reported by the authors.
Appendix A: Derivations for Section 4
A1. Updating equations for arm weights
PCM. The joint distribution for all weights is updated after each feedback step as given below. In the following, note that the likelihood term simplifies because x is independent of the weights. The posterior belief for the weights after a user action then follows by Bayes' theorem.

TCM. The updating equation for TCM is similar to that for PCM, except that the likelihood changes due to the user threshold u:

(A1)
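The updates above can be sketched in code. The following is a minimal illustration only: the function name, the treatment of the observed outcome as direct evidence about the arm, and the fractional-count scheme are assumptions standing in for the deterministic updating scheme of Section 4.2, not the paper's exact equations.

```python
import numpy as np

def approx_update(alpha, beta, q, clicked):
    """Approximate Bayesian update for the Beta weights of one arm.

    alpha, beta : arrays of Beta parameters, one pair per latent topic x.
    q           : probability of each latent topic for this user (the context).
    clicked     : whether the user clicked the arm (treated here, for
                  simplicity, as the arm having been considered).

    The latent topic x is unobserved, so each topic's update is weighted
    by its posterior responsibility given the observed outcome.
    """
    p = alpha / (alpha + beta)          # posterior mean click prob per topic
    like = p if clicked else (1.0 - p)  # outcome likelihood per topic
    r = q * like
    r = r / r.sum()                     # responsibility of each topic
    if clicked:
        alpha = alpha + r               # fractional success counts
    else:
        beta = beta + r                 # fractional failure counts
    return alpha, beta
```

Because the responsibilities sum to one, each observation adds exactly one (fractional) pseudo-count in total, mirroring the exact update that would apply if x were known.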
A2. Derivation of the posterior weight distribution
The posterior depends on the weights, on y, x, q and A. For ease of reading, the rest of this section uses w and W to stand, respectively, for a single weight and the full collection of weights. Bayes' theorem is used to condition the outcome on x, which allows the conditional independence of arms under PCM to factorise the posterior into a simple formula. Substituting in the likelihood and cancelling gives

(A2)

where the last step uses the conditional independence of the arms given x.

It remains to find the single-arm click probability. Under PCM this is easily found since, given x, the probability of clicking any arm a considered by the user is the same as its independent click probability (as though it were the only arm in the set), and is independent of all weights except that of arm a. That is, for a single arm a,

(A3)

where the expectation is of the weight of arm a. Under PCM the outcome of any arm, given that it is considered by the user, is independent of the other arms, so (A2) and (A3) can be combined to give the required update.
A3. Updating for TCM with known x
Adapting (A1) from Section A1, the update for known x under TCM follows directly.
Appendix B: Lemma B.1
The following lemma is used in the proof of Theorem 5.1, which is given in Appendix C. Both this lemma and the proof of Theorem 5.1 use the following notation.

Each arm a has a stochastic index under the multiple-action Thompson sampling policy, obtained by sampling from the arm's posterior. Under SEQ the arm chosen in slot one is the one with the highest index.
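As an illustrative sketch (the array names and the Beta-posterior representation of each arm are assumptions), the slot-one choice can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

def slot_one_arm(alphas, betas):
    """Pick the slot-one arm by Thompson sampling: draw a stochastic
    index for each arm from its Beta posterior and return the arm
    with the highest sampled index."""
    indices = rng.beta(alphas, betas)   # one stochastic index per arm
    return int(np.argmax(indices))
```

Later slots are filled by further steps of the SEQ policy, which are omitted here.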
Lemma B.1.
Define the set of times at which the relevant selection occurs, and from the problem parameters define the constants used below. If these constants are strictly positive, then under the deterministic updating scheme given in Section 4.2, using any click model from Section 3.2, the stated bound holds for any arm and state and for any δ1, δ2 satisfying the conditions assumed below.
Proof.
For each such arm we will bound the expected rates at which the parameters of its Beta posterior increase as the arm a is selected over time: an upper bound on the growth of the α parameter and a lower bound on the growth of the β parameter. This gives an asymptotic upper bound, strictly less than 1, on each posterior mean as t → ∞. Showing that the sampled indices then concentrate below 1 gives the required result. Throughout, a is an arbitrary arm and x an arbitrary state from the sets defined above.
Let the parameters of the Beta prior placed on the weight take their given initial values; then an upper bound on the α parameter at time T is simply

(B1)

since the α parameter can increase by at most one at each of the relevant times and is unchanged at all other times.
For a lower bound on the β parameter we consider only times when arm a is shown and yt = 0. Then yt = 0 guarantees that arm a is considered by the user, and the failure to click can therefore be attributed to the arm itself. Hence,

(B2)
These bounds hold at all times since the β parameters cannot decrease. For PCM the probability of no click is no larger than the corresponding probability for TCM, so the probability that yt = 0 can be bounded below: for any t,

(B3)
We can now give a lower bound on the expected growth of the β parameter, where the expectation is joint over the states, the clicks yt and the selections up to time T, and I1 is just the priors for W. Using (B2) and (B3), we have at any time T,

(B4)
The resulting constant is strictly positive, since one factor is positive by the problem definition and the other by the assumption given in the statement of the lemma. Combining (B1) and (B4) gives, for any state, a bound from which the strong law of large numbers yields, for sufficiently large T and conditional on I1,

(B5)
Rewriting the posterior mean in terms of these parameters, it follows from (B5) that

(B6)

for any δ1 satisfying the condition in the statement of the lemma.
Then, using the variance of a Beta distribution together with (B4), the posterior variance vanishes as T grows, and so for any δ2 the sampled indices satisfy

(B7)
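For reference, the standard Beta moments used in the two bounds above are, for \(w \sim \mathrm{Beta}(\alpha, \beta)\),

```latex
\mathbb{E}[w] = \frac{\alpha}{\alpha + \beta},
\qquad
\operatorname{Var}(w) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}
\;\le\; \frac{1}{4(\alpha+\beta+1)},
```

so the posterior variance vanishes as the total count α + β grows.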
By definition the index of an arm is built from its sampled weights. Therefore, to complete the proof it suffices that the required convergence holds as t → ∞ for all arms and states under consideration and for any δ1, δ2 satisfying the conditions above, which follows from (B6) and (B7). □
Appendix C: Proof of Theorem 5.1
The proof will assume that there is a non-empty set of arms whose members are sampled only finitely often as t → ∞, and will show that this leads to a contradiction. Under this assumption each such arm is selected only finitely many times, and so there exists a finite time M after which no arm in this set is selected, even as t → ∞.
Let the remaining arms, which are sampled infinitely often, form a set that must be non-empty, and define the constants as in the proof of Lemma B.1. These constants are strictly positive, one by the problem definition and the other by the given condition. Fix some δ1 and δ2 satisfying the conditions of Lemma B.1. Then by Lemma B.1 the conclusion of the lemma holds for all the relevant arms and states.
So there exists a finite random time T > M such that

(C1)
Then for all t > T we have

(C2)

since no arm in the finitely-sampled set is selected at times t > T > M, and so its posterior is unchanged over these times. The relevant probability is strictly positive for all b, x because each weight has a Beta distribution with support (0, 1).
Combining (C1) and (C2),

(C3)

for all t > T. Therefore the selection probabilities sum to infinity, and by the Extended Borel–Cantelli Lemma (Corollary 5.29 of Breiman, 1992) some arm in the supposedly finitely-sampled set is selected infinitely often, which contradicts the assumption. Since that set was of arbitrary size, it follows that every arm is sampled infinitely often.