Selecting multiple web adverts: A contextual multi-armed bandit with state uncertainty

Pages 100-116 | Received 06 Oct 2017, Accepted 08 Oct 2018, Published online: 20 Feb 2019
 

Abstract

We present a method to solve the problem of choosing a set of adverts to display to each of a sequence of web users. The objective is to maximise user clicks over time and to do so we must learn about the quality of each advert in an online manner by observing user clicks. We formulate the problem as a novel variant of a contextual combinatorial multi-armed bandit problem. The context takes the form of a probability distribution over the user's latent topic preference, and rewards are a particular nonlinear function of the selected set and the context. These features ensure that optimal sets of adverts are appropriately diverse. We give a flexible solution method which combines submodular optimisation with existing bandit index policies. User state uncertainty creates ambiguity in interpreting user feedback which prohibits exact Bayesian updating, but we give an approximate method that is shown to work well.

Acknowledgements

This work was funded by a Google Faculty Research Award. James Edwards was supported by the EPSRC funded EP/H023151/1 STOR-i CDT.

Disclosure statement

No potential conflict of interest was reported by the authors.

Appendix A: Derivations for Section 4

A1. Updating equations for arm weights

PCM. The joint distribution for all weights given a feedback step is updated as given below. In the following, note that p(·|q,x) simplifies to p(·|x) and that $p(w_A, x \mid q, A) = p(w_A)\, q_x$ since x is independent of $w_A$. Write $A^-$ for the set of displayed arms that were not clicked (so $A^- = A \setminus \{a_{m^*}\}$ if $y = 1$ and $A^- = A$ if $y = 0$). The posterior belief for the weights after a user action is
$$
p(w_A \mid y, m^*, q, A) = \sum_{x=1}^n p(w_A, x \mid y, m^*, q, A)
= \sum_{x=1}^n \frac{p(y, m^* \mid w_A, x, q, A)\, p(w_A, x \mid q, A)}{p(y, m^* \mid q, A)}
= \frac{1}{p(y, m^* \mid q, A)} \sum_{x=1}^n \Big\{ \Big[ (w_{a_{m^*},x})^y \prod_{a \in A^-} (1 - w_{a,x}) \Big]\, p(w_A)\, q_x \Big\}
= \frac{p(w_A)}{p(y, m^* \mid q, A)} \sum_{x=1}^n \Big[ q_x\, (w_{a_{m^*},x})^y \prod_{a \in A^-} (1 - w_{a,x}) \Big].
$$
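As a concrete illustration, the PCM likelihood term $(w_{a_{m^*},x})^y \prod (1 - w_{a,x})$ (product over the displayed arms that were not clicked) can be sketched in Python for fixed point-mass weights. The function name and array layout are illustrative, not from the paper:

```python
import numpy as np

def pcm_likelihood(y, m_star, w_x):
    """Likelihood p(y, m* | w_A, x) under PCM for a single latent state x.

    y      : 1 if the user clicked, else 0
    m_star : slot index of the clicked arm (ignored when y == 0)
    w_x    : click weights w_{a,x} of the displayed arms, in slot order
    """
    w_x = np.asarray(w_x, dtype=float)
    if y == 1:
        others = np.delete(w_x, m_star)       # arms shown but not clicked
        return w_x[m_star] * np.prod(1.0 - others)
    return np.prod(1.0 - w_x)                 # no click: every arm rejected
```

Summing `q[x] * pcm_likelihood(y, m_star, w[:, x])` over states x then gives the unnormalised mixture in the display above.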

TCM. The updating equation for $w_A$ under TCM is similar to that for PCM except that $p(y, m^* \mid w_A, x, q)$ changes due to the user threshold u:
$$
p(w_A \mid y, m^*, q, A) = \sum_{x=1}^n p(w_A, x \mid y, m^*, q, A)
= \sum_{x=1}^n \frac{p(y, m^* \mid w_A, x, q, A)\, p(w_A, x \mid q, A)}{p(y, m^* \mid q, A)}
= \frac{p(w_A)}{p(y, m^* \mid q, A)} \int_0^1 \sum_{x=1}^n \Big[ q_x\, \big(\mathbb{1}\{w_{a_{m^*},x} > u\}\big)^y \prod_{a \in A^-} \mathbb{1}\{w_{a,x} \le u\} \Big]\, du, \tag{A1}
$$
where $A^-$ denotes the displayed arms that were not clicked.

A2. Derivation of $\tilde q$

The posterior $\tilde q = (\tilde q_1, \ldots, \tilde q_n)$ depends on $W_A$, y, $m^*$, q and A. For ease of reading, the rest of this section uses w and W to stand for $w_A$ and $W_A$ respectively. Bayes' theorem is used to condition the outcome on x, which allows the conditional independence of arms under PCM to factorise the expression into a simple formula:
$$
\tilde q_x = p(x \mid W, y, m^*, q, A)
= \int p(x, w \mid W, y, m^*, q, A)\, dw
= \int p(x \mid w, y, m^*, q, A)\, p(w \mid W, y, m^*, q, A)\, dw
= \int \frac{p(y, m^* \mid x, w, q, A)\, p(x \mid w, q, A)}{p(y, m^* \mid w, W, q, A)}\, p(w \mid W, y, m^*, q, A)\, dw.
$$

Then, substituting in
$$
p(w \mid W, y, m^*, q, A) = \frac{p(w, y, m^* \mid W, q, A)}{p(y, m^* \mid W, q, A)} = \frac{p(y, m^* \mid w, W, q, A)\, p(w \mid W)}{p(y, m^* \mid W, q, A)},
$$
and cancelling gives
$$
\tilde q_x = \int \frac{p(y, m^* \mid x, w, q, A)\, p(x \mid w, q, A)\, p(w \mid W)}{p(y, m^* \mid W, q, A)}\, dw
= \frac{q_x \int p(y, m^* \mid x, w, q, A)\, p(w \mid W)\, dw}{\sum_{\tilde x} q_{\tilde x} \int p(y, m^* \mid \tilde x, \tilde w, q, A)\, p(\tilde w \mid W)\, d\tilde w}, \tag{A2}
$$

where the last step uses $p(x \mid w, q, A) = p(x \mid q) = q_x$.

It remains to find $\int p(y, m^* \mid x, w, q, A)\, p(w \mid W)\, dw$. Under PCM this is easily found since, given x, the probability of clicking any arm a considered by the user equals its independent click probability (as though it were the only arm in the set) and is independent of all weights except $w_{a,x}$. That is, for a single arm a,
$$
\int p(y, m^* \mid x, w, q, A)\, p(w \mid W)\, dw = \int p(y, m^* \mid x, w_{a,x}, q, A)\, p(w_{a,x} \mid W_{a,x})\, dw_{a,x} = (\mu_{a,x})^y (1 - \mu_{a,x})^{1-y}, \tag{A3}
$$
where $\mu_{a,x} = \alpha_{a,x} / (\alpha_{a,x} + \beta_{a,x})$ is the expectation of $W_{a,x}$. Under PCM the outcome of any arm, given that it is considered by the user, is independent of the other arms, so (A2) and (A3) can be combined to give
$$
\tilde q_x = \frac{q_x\, (\mu_{a_{m^*},x})^y \prod_{a \in A^-} (1 - \mu_{a,x})}{\sum_{j=1}^n \Big[ q_j\, (\mu_{a_{m^*},j})^y \prod_{a \in A^-} (1 - \mu_{a,j}) \Big]},
$$
where $A^-$ denotes the displayed arms that were not clicked.
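The combined update of the state belief can be sketched in a few lines of Python, replacing each weight by its posterior mean as in (A3). The function name and the (arms × states) layout of `mu` are illustrative assumptions:

```python
import numpy as np

def update_q(q, y, m_star, mu):
    """Posterior state belief q~ under PCM, using posterior means mu[a, x] = E[W_{a,x}].

    q      : prior state probabilities, shape (n,)
    y      : 1 if the user clicked, else 0
    m_star : slot index of the clicked arm (ignored when y == 0)
    mu     : posterior mean click weights, shape (num_arms, n)
    """
    q, mu = np.asarray(q, float), np.asarray(mu, float)
    if y == 1:
        # clicked arm contributes mu; every other displayed arm contributes (1 - mu)
        lik = mu[m_star] * np.prod(np.delete(1.0 - mu, m_star, axis=0), axis=0)
    else:
        lik = np.prod(1.0 - mu, axis=0)       # no click at all
    post = q * lik
    return post / post.sum()
```

For example, with two states and two arms, a click on an arm whose mean weight is high in state 1 shifts the belief sharply towards state 1.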

A3. Updating for TCM with known x

Adapting (A1) in Section A1, the update for known x under TCM is
$$
p(w_A \mid y, m^*, x) = \frac{p(w_A)}{p(y, m^* \mid q)} \int_0^1 q_x\, \big(\mathbb{1}\{w_{a_{m^*},x} > u\}\big)^y \prod_{a \in A^-} \mathbb{1}\{w_{a,x} \le u\}\, du
= \frac{q_x\, p(w_A)}{p(y, m^* \mid q)} \int_0^1 \big(\mathbb{1}\{w_{a_{m^*},x} > u\}\big)^y\, \mathbb{1}\Big\{u > \max_{a \in A^-} (w_{a,x})\Big\}\, du
= \frac{q_x\, p(w_A)}{p(y, m^* \mid q)} \Big[ (w_{a_{m^*},x})^y - \max_{a \in A^-} (w_{a,x}) \Big],
$$
where $A^-$ is the set of displayed arms that were not clicked.
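The final step above integrates out the uniform threshold u. A quick numerical check of that integral for the clicked case ($y = 1$), using illustrative weight values, confirms the closed form $w_{a_{m^*},x} - \max_{a} w_{a,x}$:

```python
import numpy as np

# Grid approximation (illustrative values) of
#   ∫_0^1 1{w_click > u} · 1{u > w_other_max} du = w_click - w_other_max
# when the clicked weight exceeds the largest unclicked weight.
w_click, w_other_max = 0.7, 0.3
u = np.linspace(0.0, 1.0, 1_000_001)
integrand = (w_click > u) & (u > w_other_max)
approx = integrand.mean()                     # fraction of grid inside the interval
```

The grid mean agrees with $0.7 - 0.3 = 0.4$ to within the grid spacing.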

Appendix B: Lemma B.1

The following lemma is used in the proof of Theorem 5.1 which is given in Appendix C. Both this lemma and the proof of Theorem 5.1 use the following notation.

Let $R^{TS}(a, W_{a,t}, q_t \mid I_t) = q_t \cdot \tilde w_{a,t}$ denote the stochastic index for the multiple-action Thompson sampling policy for a single arm a, where each $\tilde w_{a,t,x} \sim W_{a,t,x}$. Then under SEQ the arm chosen in slot one is the one with the highest index: $a_{t,1} = \arg\max_{a \in A} R^{TS}(a, W_{a,t}, q_t \mid I_t)$.
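The stochastic index and the slot-one selection can be sketched as follows, with Beta posteriors parameterised per (arm, state). The array shapes and function names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def ts_index(alpha, beta, q):
    """Stochastic indices R_TS(a) = q · w~_a, one Beta sample per (arm, state).

    alpha, beta : Beta posterior parameters, shape (num_arms, n)
    q           : state belief, shape (n,)
    """
    w_tilde = rng.beta(alpha, beta)    # sample w~_{a,x} ~ Beta(alpha[a,x], beta[a,x])
    return w_tilde @ q                 # shape (num_arms,)

def first_slot_arm(alpha, beta, q):
    """Slot-one choice under SEQ: the arm with the highest sampled index."""
    return int(np.argmax(ts_index(alpha, beta, q)))
```

With a sharply concentrated posterior favouring one arm, the sampled index almost surely selects that arm for slot one.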

Lemma B.1.

Let $\tau_{a,T}$ be the set of times $t = 1, \ldots, T$ at which $a \in A_t$. Let $q_* = \inf_{t,x} \Pr(q_{t,x} = 1)$ and $w^* = \max_{a \in A,\, x} w_{a,x}$ and, from these, set $\eta = q_* (1 - w^*)^m$. If $q_* > 0$ then, under the deterministic updating scheme given in Section 4.2 using any click model from Section 3.2,
$$
\Pr\Big( R^{TS}(a, W_{a,T}, q_T \mid I_T) \le \frac{1}{1 + \eta - \delta_1} + \delta_2 \Big) \to 1 \quad \text{as } |\tau_{a,T}| \to \infty
$$
for any $a \in A$ and any $\delta_1, \delta_2$ such that $\eta > \delta_1 > 0$ and $\delta_2 > 0$.

Proof.

For any $a \in A$ and $x = 1, \ldots, n$ we will bound the expected rate at which $\alpha_{a,t,x}$ and $\beta_{a,t,x}$ increase as arm a is selected over time (an upper bound for $\alpha_{a,t,x}$ and a lower bound for $\beta_{a,t,x}$). This gives an asymptotic upper bound, strictly less than 1, on each posterior mean $\mu_{a,t,x} = \mathbb{E}[W_{a,t,x}]$ as $|\tau_{a,t}| \to \infty$. Showing that $\mathrm{Var}(W_{a,t,x}) \to 0$ as $|\tau_{a,t}| \to \infty$ then gives the required result. Throughout, a is an arbitrary arm in A and x an arbitrary state in $\{1, \ldots, n\}$.

Let $\alpha_{a,0,x}$ and $\beta_{a,0,x}$ be the values of the parameters of the Beta prior placed on $w_{a,x}$; then an upper bound for $\alpha_{a,T,x}$, $T \ge 1$, is simply
$$
\alpha_{a,T,x} \le \alpha_{a,0,x} + |\tau_{a,T}| \tag{B1}
$$
since $\alpha_{a,T,x}$ can increase by at most one at times when $a \in A_t$ and is unchanged at other times.

For a lower bound on $\mathbb{E}[\beta_{a,T,x}]$ we consider only times when $a \in A_t$, $q_{t,x} = 1$ and $y_t = 0$. Then $y_t = 0$ guarantees that arm a is considered by the user, and $q_{t,x} = 1$ means the failure to click can be attributed to $w_{a,x}$. Hence, for $t \ge 1$,
$$
\beta_{a,t+1,x} \mid (q_{t,x} = 1,\ y_t = 0,\ a \in A_t,\ \beta_{a,t,x}) = \beta_{a,t,x} + 1. \tag{B2}
$$

At all times $\beta_{a,t+1,x} \ge \beta_{a,t,x}$ since the $\beta$ parameters cannot decrease. For PCM, $\Pr(y_t = 0 \mid q_{t,x} = 1, A_t, w_{A_t}) = \prod_{b \in A_t} (1 - w_{b,x})$, which is no larger than the corresponding probability for TCM. The probability that $y_t = 0$ can therefore be bounded below. Let $w^* = \max_{b \in A,\, x} w_{b,x}$ and $q_* = \inf_{t,x} \Pr(q_{t,x} = 1)$; then for any $A_t \subseteq A$,
$$
\Pr(y_t = 0 \mid A_t, w_A) \ge q_* (1 - w^*)^m. \tag{B3}
$$
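The PCM no-click probability conditional on the state dominates the worst-case term $(1 - w^*)^m$, which is what drives (B3). A quick numerical check with illustrative weights:

```python
import numpy as np

# Illustrative check: conditional on state x, under PCM
#   Pr(y_t = 0) = prod_b (1 - w_{b,x}) >= (1 - w_star)**m,
# where w_star is the largest weight and m the number of displayed arms.
w = np.array([0.2, 0.5, 0.35])        # weights of the m displayed arms in state x
p_no_click = np.prod(1.0 - w)
w_star, m = w.max(), len(w)
```

Multiplying by any $q_* \le 1$ only loosens the bound, matching the form of (B3).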

We can now give a lower bound on $\mathbb{E}[\beta_{a,T,x} \mid I_1]$, where the expectation is joint over all $q_t$, $y_t$, $m_t^*$ for $t = 1, \ldots, T$, and $I_1$ consists only of the priors for W. Using (B2) and (B3), at any time T we have
$$
\mathbb{E}[\beta_{a,T,x} \mid I_1] \ge \beta_{a,0,x} + \sum_{t \in \tau_{a,T}} \big[ \Pr(q_{t,x} = 1)\, \Pr(y_t = 0 \mid q_{t,x} = 1,\ a \in A_t,\ w_{A_t}) \big] \ge |\tau_{a,T}|\, q_* (1 - w^*)^m. \tag{B4}
$$

Let $\eta = q_* (1 - w^*)^m$ and note that $\eta > 0$ since $w^* < 1$ by the problem definition and $q_* > 0$ by the assumption in the statement of the lemma. Combining (B1) and (B4) gives, for any $\tau_{a,T}$,
$$
\mathbb{E}\Big[ \frac{\beta_{a,T,x}}{\alpha_{a,T,x}} \,\Big|\, I_1 \Big] \ge \frac{1}{\alpha_{a,0,x} + |\tau_{a,T}|}\, \mathbb{E}[\beta_{a,T,x} \mid I_1] \ge \frac{|\tau_{a,T}|\, \eta}{\alpha_{a,0,x} + |\tau_{a,T}|},
$$
and so, by the strong law of large numbers, for sufficiently large $|\tau_{a,T}|$ and conditional on $I_1$,
$$
\frac{\beta_{a,T,x}}{\alpha_{a,T,x}} \ge \frac{|\tau_{a,T}|\, \eta}{\alpha_{a,0,x} + |\tau_{a,T}|} \to \eta. \tag{B5}
$$

Note that
$$
\mu_{a,T,x} = \frac{\alpha_{a,T,x}}{\alpha_{a,T,x} + \beta_{a,T,x}} = \frac{1}{1 + \beta_{a,T,x}/\alpha_{a,T,x}},
$$
and so from (B5),
$$
\Pr\Big( \mu_{a,T,x} \le \frac{1}{1 + \eta - \delta_1} \Big) \to 1 \quad \text{as } |\tau_{a,T}| \to \infty \tag{B6}
$$
for any $\delta_1$ such that $\eta > \delta_1 > 0$.

Then, using the variance of a Beta distribution and (B4), we have
$$
\mathrm{Var}(W_{a,T,x}) = \frac{\alpha_{a,T,x}\, \beta_{a,T,x}}{(\alpha_{a,T,x} + \beta_{a,T,x})^2 (\alpha_{a,T,x} + \beta_{a,T,x} + 1)} < \frac{(\alpha_{a,T,x} + \beta_{a,T,x})^2}{(\alpha_{a,T,x} + \beta_{a,T,x})^2 (\alpha_{a,T,x} + \beta_{a,T,x} + 1)} = \frac{1}{\alpha_{a,T,x} + \beta_{a,T,x} + 1} \to 0
$$
as $|\tau_{a,T}| \to \infty$, and so for any $\delta_2 > 0$ the sampled $\tilde w_{a,T,x} \sim W_{a,T,x}$ satisfy
$$
\Pr(\tilde w_{a,T,x} \le \mu_{a,T,x} + \delta_2) \to 1 \quad \text{as } |\tau_{a,T}| \to \infty. \tag{B7}
$$
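The variance bound used above, $\mathrm{Var}(\mathrm{Beta}(\alpha, \beta)) < 1/(\alpha + \beta + 1)$, is easy to verify numerically for a few illustrative parameter pairs:

```python
# Illustrative check of Var(Beta(a, b)) = ab / ((a+b)^2 (a+b+1)) < 1 / (a+b+1),
# which holds because ab < (a+b)^2 for all a, b > 0.
variances = []
for a, b in [(1.0, 1.0), (2.0, 5.0), (30.0, 70.0)]:
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    variances.append((var, 1.0 / (a + b + 1.0)))
```

As $\alpha + \beta$ grows with repeated updates, both sides shrink towards zero, concentrating the posterior.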

By definition $R^{TS}(a, W_{a,t}, q_t \mid I_t) = \sum_{x=1}^n q_{t,x}\, \tilde w_{a,t,x} \le \max_x \tilde w_{a,t,x}$, where $\tilde w_{a,t,x} \sim W_{a,t,x}$. Therefore, to complete the proof it is sufficient that $\Pr\big( \tilde w_{a,T,x} < 1/(1 + \eta - \delta_1) + \delta_2 \big) \to 1$ as $|\tau_{a,T}| \to \infty$ for all $a \in A$, $x = 1, \ldots, n$ and any $\delta_1, \delta_2$ such that $\eta > \delta_1 > 0$ and $\delta_2 > 0$, which follows from (B6) and (B7). □

Appendix C: Proof of Theorem 5.1

The proof assumes that there is a non-empty set of arms $A_F \subseteq A$ whose members are sampled only finitely often as $t \to \infty$, and shows that this leads to a contradiction. Under this assumption $\sum_{b \in A_F} |\tau_{b,\infty}| < \infty$, and so the last time at which any arm in $A_F$ is selected, $M = \max_{b \in A_F} \max(\tau_{b,\infty})$, is finite.

Let $A_I = A \setminus A_F$ be the set of arms sampled infinitely often (which must be non-empty). Let $w^* = \max_{a \in A,\, x} w_{a,x}$ and $\eta = q_* (1 - w^*)^m$ as in the proof of Lemma B.1. Note that $\eta > 0$ since $w^* < 1$ by the problem definition and $q_* > 0$ by the given condition. Fix some $0 < \delta_1 < \eta$ and $0 < \delta_2 < 1 - 1/(1 + \eta - \delta_1)$. Then by Lemma B.1, for all $a \in A_I$,
$$
\Pr\Big( R^{TS}(a, W_{a,t}, q_t) \le \frac{1}{1 + \eta - \delta_1} + \delta_2 \Big) \to 1 \quad \text{as } t \to \infty.
$$

So there exists a finite random time $T > M$ such that
$$
\Pr\Big( R^{TS}(a, W_{a,t}, q_t) \le \frac{1}{1 + \eta - \delta_1} + \delta_2 \Big) > 1 - \delta_2 \quad \text{for } t > T,\ \forall a \in A_I. \tag{C1}
$$

Let $\epsilon = \min_{b \in A_F} \big[ \Pr\big( R^{TS}(b, W_{b,T}, q_T \mid I_T) > 1/(1 + \eta - \delta_1) + \delta_2 \big) \big]$. Then for all $t > T$ and $b \in A_F$ we have
$$
\Pr\Big( R^{TS}(b, W_{b,t}, q_t \mid I_t) > \frac{1}{1 + \eta - \delta_1} + \delta_2 \Big) \ge \epsilon, \tag{C2}
$$
since no arm in $A_F$ is selected at times $t > T > M$, so $W_{b,t}$ is unchanged over these times. We know that $\epsilon > 0$ since $\Pr\big( \tilde w_{b,T,x} > 1/(1 + \eta - \delta_1) + \delta_2 \big) > 0$ for all b, x, because $1/(1 + \eta - \delta_1) + \delta_2 < 1$ and $W_{b,T,x}$ is a Beta distribution with support (0, 1).

Combining (C1) and (C2),
$$
\Pr\big[ R^{TS}(b, W_{b,t}, q_t \mid I_t) > R^{TS}(a, W_{a,t}, q_t \mid I_t),\ \forall a \in A_I \big] > \epsilon (1 - \delta_2)^{|A_I|} \tag{C3}
$$
for all $t > T$. Therefore
$$
\sum_{t=T}^{\infty} \Pr(b \in A_t \text{ for some } b \in A_F) > \sum_{t=T}^{\infty} \epsilon (1 - \delta_2)^{|A_I|} = \infty.
$$

Using the Extended Borel–Cantelli Lemma (Corollary 5.29 of Breiman, 1992) it follows that $\sum_{b \in A_F} |\tau_{b,\infty}| = \infty$, which contradicts the assumption that $|\tau_{b,\infty}|$ is finite for all $b \in A_F$. Hence some arm in $A_F$ is selected infinitely often and, since $A_F$ was arbitrary, it follows that $A_F = \emptyset$.
