Abstract
We present a method to solve the problem of choosing a set of adverts to display to each of a sequence of web users. The objective is to maximise user clicks over time and to do so we must learn about the quality of each advert in an online manner by observing user clicks. We formulate the problem as a novel variant of a contextual combinatorial multi-armed bandit problem. The context takes the form of a probability distribution over the user's latent topic preference, and rewards are a particular nonlinear function of the selected set and the context. These features ensure that optimal sets of adverts are appropriately diverse. We give a flexible solution method which combines submodular optimisation with existing bandit index policies. User state uncertainty creates ambiguity in interpreting user feedback which prohibits exact Bayesian updating, but we give an approximate method that is shown to work well.
Acknowledgements
This work was funded by a Google Faculty Research Award. James Edwards was supported by the EPSRC-funded STOR-i CDT (EP/H023151/1).
Disclosure statement
No potential conflict of interest was reported by the authors.
Appendix A: Derivations for Section 4
A1. Updating equations for arm weights
PCM. The joint distribution for all weights is updated after each feedback step as given below. In the following, note that the likelihood term simplifies because x is independent of the weights. The posterior belief for the weights after a user action then follows by Bayes' theorem.

TCM. The updating equation for TCM is similar to that for PCM, except that the likelihood changes due to the user threshold u:

(A1)
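The updates above can be sketched in code. The following is a minimal illustration only: the function name, the treatment of the observed outcome as direct evidence about the arm, and the fractional-count scheme are assumptions standing in for the deterministic updating scheme of Section 4.2, not the paper's exact equations.

```python
import numpy as np

def approx_update(alpha, beta, q, clicked):
    """Approximate Bayesian update for the Beta weights of one arm.

    alpha, beta : arrays of Beta parameters, one pair per latent topic x.
    q           : probability of each latent topic for this user (the context).
    clicked     : whether the user clicked the arm (treated here, for
                  simplicity, as the arm having been considered).

    The latent topic x is unobserved, so each topic's update is weighted
    by its posterior responsibility given the observed outcome.
    """
    p = alpha / (alpha + beta)          # posterior mean click prob per topic
    like = p if clicked else (1.0 - p)  # outcome likelihood per topic
    r = q * like
    r = r / r.sum()                     # responsibility of each topic
    if clicked:
        alpha = alpha + r               # fractional success counts
    else:
        beta = beta + r                 # fractional failure counts
    return alpha, beta
```

Because the responsibilities sum to one, each observation adds exactly one (fractional) pseudo-count in total, mirroring the exact update that would apply if x were known.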
A2. Derivation of the posterior weight distribution
The posterior depends on the weights, on y, x, q and A. For ease of reading, the rest of this section uses w and W to stand, respectively, for a single weight and the full collection of weights. Bayes' theorem is used to condition the outcome on x, which allows the conditional independence of arms under PCM to factorise the posterior into a simple formula. Substituting in the likelihood and cancelling gives

(A2)

where the last step uses the conditional independence of the arms given x.

It remains to find the single-arm click probability. Under PCM this is easily found since, given x, the probability of clicking any arm a considered by the user is the same as its independent click probability (as though it were the only arm in the set), and is independent of all weights except that of arm a. That is, for a single arm a,

(A3)

where the expectation is of the weight of arm a. Under PCM the outcome of any arm, given that it is considered by the user, is independent of the other arms, so (A2) and (A3) can be combined to give the required update.
A3. Updating for TCM with known x
Adapting (A1) from Section A1, the update for known x under TCM follows directly.
Appendix B: Lemma B.1
The following lemma is used in the proof of Theorem 5.1, which is given in Appendix C. Both this lemma and the proof of Theorem 5.1 use the following notation.

Each arm a has a stochastic index under the multiple-action Thompson sampling policy, obtained by sampling from the arm's posterior. Under SEQ the arm chosen in slot one is the one with the highest index.
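As an illustrative sketch (the array names and the Beta-posterior representation of each arm are assumptions), the slot-one choice can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

def slot_one_arm(alphas, betas):
    """Pick the slot-one arm by Thompson sampling: draw a stochastic
    index for each arm from its Beta posterior and return the arm
    with the highest sampled index."""
    indices = rng.beta(alphas, betas)   # one stochastic index per arm
    return int(np.argmax(indices))
```

Later slots are filled by further steps of the SEQ policy, which are omitted here.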
Lemma B.1.
Define the set of times at which the relevant selection occurs, and from the problem parameters define the constants used below. If these constants are strictly positive, then under the deterministic updating scheme given in Section 4.2, using any click model from Section 3.2, the stated bound holds for any arm and state and for any δ1, δ2 satisfying the conditions assumed below.
Proof.
For each such arm we will bound the expected rates at which the parameters of its Beta posterior increase as the arm a is selected over time: an upper bound on the growth of the α parameter and a lower bound on the growth of the β parameter. This gives an asymptotic upper bound, strictly less than 1, on each posterior mean as t → ∞. Showing that the sampled indices then concentrate below 1 gives the required result. Throughout, a is an arbitrary arm and x an arbitrary state from the sets defined above.
Let the parameters of the Beta prior placed on the weight take their given initial values; then an upper bound on the α parameter at time T is simply

(B1)

since the α parameter can increase by at most one at each of the relevant times and is unchanged at all other times.
For a lower bound on the β parameter we consider only times when arm a is shown and yt = 0. Then yt = 0 guarantees that arm a is considered by the user, and the failure to click can therefore be attributed to the arm itself. Hence,

(B2)
These bounds hold at all times since the β parameters cannot decrease. For PCM the probability of no click is no larger than the corresponding probability for TCM, so the probability that yt = 0 can be bounded below: for any t,

(B3)
We can now give a lower bound on the expected growth of the β parameter, where the expectation is joint over the states, the clicks yt and the selections up to time T, and I1 is just the priors for W. Using (B2) and (B3), we have at any time T,

(B4)
The resulting constant is strictly positive, since one factor is positive by the problem definition and the other by the assumption given in the statement of the lemma. Combining (B1) and (B4) gives, for any state, a bound from which the strong law of large numbers yields, for sufficiently large T and conditional on I1,

(B5)
Rewriting the posterior mean in terms of these parameters, it follows from (B5) that

(B6)

for any δ1 satisfying the condition in the statement of the lemma.
Then, using the variance of a Beta distribution together with (B4), the posterior variance vanishes as T grows, and so for any δ2 the sampled indices satisfy

(B7)
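For reference, the standard Beta moments used in the two bounds above are, for \(w \sim \mathrm{Beta}(\alpha, \beta)\),

```latex
\mathbb{E}[w] = \frac{\alpha}{\alpha + \beta},
\qquad
\operatorname{Var}(w) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}
\;\le\; \frac{1}{4(\alpha+\beta+1)},
```

so the posterior variance vanishes as the total count α + β grows.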
By definition the index of an arm is built from its sampled weights. Therefore, to complete the proof it suffices that the required convergence holds as t → ∞ for all arms and states under consideration and for any δ1, δ2 satisfying the conditions above, which follows from (B6) and (B7). □
Appendix C: Proof of Theorem 5.1
The proof will assume that there is a non-empty set of arms whose members are sampled only finitely often as t → ∞, and will show that this leads to a contradiction. Under this assumption each such arm is selected only finitely many times, and so there exists a finite time M after which no arm in this set is selected, even as t → ∞.
Let the remaining arms, which are sampled infinitely often, form a set that must be non-empty, and define the constants as in the proof of Lemma B.1. These constants are strictly positive, one by the problem definition and the other by the given condition. Fix some δ1 and δ2 satisfying the conditions of Lemma B.1. Then by Lemma B.1 the conclusion of the lemma holds for all the relevant arms and states.
So there exists a finite random time T > M such that

(C1)
Then for all t > T we have

(C2)

since no arm in the finitely-sampled set is selected at times t > T > M, and so its posterior is unchanged over these times. The relevant probability is strictly positive for all b, x because each weight has a Beta distribution with support (0, 1).
Combining (C1) and (C2),

(C3)

for all t > T. Therefore the selection probabilities sum to infinity, and by the Extended Borel–Cantelli Lemma (Corollary 5.29 of Breiman, 1992) some arm in the supposedly finitely-sampled set is selected infinitely often, which contradicts the assumption. Since that set was of arbitrary size, it follows that every arm is sampled infinitely often.