
Robust sequential design for piecewise-stationary multi-armed bandit problem in the presence of outliers

Yaping Wang, Zhicheng Peng, Riquan Zhang & Qian Xiao
Pages 122-133 | Received 03 Oct 2020, Accepted 10 Mar 2021, Published online: 12 Apr 2021

ABSTRACT

The multi-armed bandit (MAB) problem studies sequential decision making in the presence of uncertainty and partial feedback on rewards. Its name comes from imagining a gambler at a row of slot machines who needs to decide how many times and in which order to play each machine. It is a classic reinforcement learning problem which is fundamental to many online learning problems. In many practical applications of the MAB, the reward distributions may change at unknown time steps and outliers (extreme rewards) often exist. Current sequential design strategies may struggle in such cases, as they tend to infer additional change points to fit the outliers. In this paper, we propose a robust change-detection upper confidence bound (RCD-UCB) algorithm which can distinguish the real change points from the outliers in piecewise-stationary MAB settings. We show that the proposed RCD-UCB algorithm can achieve a nearly optimal regret bound on the order of $O(\sqrt{SKT\log T})$, where $T$ is the number of time steps, $K$ is the number of arms and $S$ is the number of stationary segments. We demonstrate its superior performance compared to some state-of-the-art algorithms in both simulation experiments and real data analysis. (See https://github.com/woaishufenke/MAB_STRF.git for the codes used in this paper.)

This article is part of the following collections:
Special Issue on Experimental Design

1. Introduction

The multi-armed bandit (MAB) problem, originally introduced by Thompson (Citation1933), studies how a decision maker adaptively selects one from a set of alternative arms based on the historical observations of each arm and receives a reward accordingly (Lai & Robbins, Citation1985). The MAB is a fundamental problem in many online learning applications, such as online recommendation systems (Li et al., Citation2016), dynamic spectrum access in communication systems (Alaya-Feki et al., Citation2008) and computational advertising (Buccapatnam et al., Citation2017). A common goal of the sequential designs, i.e. the bandit algorithms, is to minimise the regret of the decision maker, where the regret refers to the expected difference between the total rewards collected by always playing the arm with the highest expected reward and the total rewards obtained by the algorithm (Auer, Citation2002). To achieve this goal, decision makers need to trade off exploring the environment to find the most profitable arms against exploiting the current empirically best arms as often as possible.

There is a rich literature studying classic MAB problems (Lattimore & Szepesvári, Citation2020), including the stochastic bandit models (Lai & Robbins, Citation1985) and the adversarial bandit models (Auer, Cesa-Bianchi, Freund et al., Citation2002). The former assumes that the reward distributions of all arms are time-invariant, and the latter assumes that the reward distributions of all arms change adversarially at all time steps. Yet, neither of these two assumptions may be realistic in many real-world applications, where the reward distributions do vary with time but much less frequently than the adversarial bandit model assumes (Alaya-Feki et al., Citation2008; Yu & Mannor, Citation2009). In this paper, we focus on such piecewise-stationary MAB problems where the reward distributions are piecewise-constant and may shift at some unknown time steps called the change points; that is, we focus on MAB problems with mean changes.

In the current literature, two major approaches have been proposed for piecewise-stationary bandit problems: the passively adaptive policies and the actively adaptive policies (Liu et al., Citation2018). The passively adaptive policies adapt to the changes by adjusting the weights on the rewards. Specifically, the discounted upper confidence bound (D-UCB) algorithm discounts the weights on old rewards and thus allocates larger weights to the recent ones when computing the UCB index of each arm (Kocsis & Szepesvári, Citation2006). The D-UCB algorithm achieves a regret bound on the order of $O(K\sqrt{ST}\log T)$, where $T$ is the number of time steps, $K$ is the number of arms and $S$ is the number of stationary segments (Garivier & Moulines, Citation2011). Based on the D-UCB, the Sliding-Window UCB (SW-UCB) algorithm uses only the most recent $\tau$ rewards in computing the UCB index, which achieves a regret of $O(K\sqrt{ST\log T})$ (Garivier & Moulines, Citation2011). Other passively adaptive policies include the EXP3.S (Auer, Cesa-Bianchi, Freund et al., Citation2002), the SHIFTBAND (Auer, Citation2002) and the Rexp3 (Besbes et al., Citation2014) algorithms.

The actively adaptive policies monitor the reward distributions with a change-detection (CD) algorithm, and their bandit algorithms are reset once a change point is detected. The actively adaptive policies often perform better than the passively adaptive policies in practice (Cao et al., Citation2019). For detecting the change points, the CUSUM-UCB algorithm (Liu et al., Citation2018) adopts the CUSUM method, and the Adapt-EvE algorithm (Hartland et al., Citation2007) and the adaptive SW-UCL algorithm (Srivastava et al., Citation2014) employ the Page-Hinkley test (Hinkley, Citation1971). Other available actively adaptive policies include the Bayesian CD algorithm (Mellor & Shapiro, Citation2013), the windowed mean-shift detection algorithm (Yu & Mannor, Citation2009) and the EXP3.R algorithm (Allesiardo & Féraud, Citation2015). To the best of our knowledge, the Monitored-UCB (M-UCB) algorithm by Cao et al. (Citation2019) is currently the most efficient method. The M-UCB achieves a nearly optimal regret bound on the order of $O(\sqrt{SKT\log T})$ without strong parametric assumptions, and performs very well in practical applications. In this paper, we use the M-UCB as the benchmark method.

Unexpected data changes may significantly affect the data quality, which can invalidate the classic MAB algorithms (Cao et al., Citation2019). Such data offsets may come from trend changes or from outliers, where an outlier means a data point with an unusually large or small value. In the current literature, most methods on non-stationary MAB problems consider only the former and ignore the latter (Allesiardo & Féraud, Citation2015). In many state-of-the-art algorithms for piecewise-stationary MAB problems (Cao et al., Citation2019; Kaufmann et al., Citation2012; Liu et al., Citation2018), once their CD algorithms identify a change point, the embedded bandit algorithms reset the current optimal arms and try to learn new ones. Since they often cannot distinguish real trend changes from large outliers, their CD algorithms tend to infer additional change points to fit the outliers. When there are many scattered outliers, the reset strategy adopted in these algorithms greatly reduces their efficiency and increases the computational complexity. In particular, the M-UCB algorithm (Cao et al., Citation2019) relies on comparing the statistical distances between data segments against thresholds to test the significance of the trend offsets in the local data, which can be very sensitive to outliers.

In this paper, we propose a robust change-detection upper confidence bound (RCD-UCB) algorithm which can distinguish the real change points from the outliers in non-stationary MAB problems. The proposed RCD-UCB algorithm includes a new trend offset detection that uses truncated loss functions to eliminate the impact of outliers. We show that the RCD-UCB algorithm can achieve a nearly optimal regret bound on the order of $O(\sqrt{SKT\log T})$ for piecewise-stationary MAB problems, which is of the same order as the bound in Cao et al. (Citation2019). We demonstrate the superior performance of the RCD-UCB via three simulation studies and a real data analysis on metalwork factory machining, where the proposed algorithm significantly reduces the cumulative regrets compared to some currently popular methods.

The remainder of this paper is organised as follows. In Section 2, we first formulate the piecewise-stationary MAB problem, then introduce the proposed RCD-UCB algorithm, and finally discuss its theoretical regret bound. In Section 3, we demonstrate the superior performance of the RCD-UCB via three simulation studies and a real data example. Section 4 concludes this work and discusses some future work. A brief description of the classic UCB1 algorithm (Auer, Cesa-Bianchi et al., Citation2002) and all proofs are relegated to the Appendix.

2. The RCD-UCB algorithm for piecewise-stationary MAB problems in the presence of outliers

2.1. Problem formulation

In an MAB problem, denote $\mathcal{K}=\{1,\dots,K\}$ as the set of arms and $\mathcal{T}=\{1,\dots,T\}$ as the set of time slots. At each time slot $t\in\mathcal{T}$, the learning agent chooses an arm $A_t\in\mathcal{K}$ and receives a reward $X_{A_t,t}\in[0,1]$, which can be generalised to any bounded interval. The reward sequence $\{X_{k,t}\}_{t=1,\dots,T}$ for arm $k\in\mathcal{K}$ can be seen as a series of independent random variables from potentially different distributions. Let $E(X_{k,t})$ be the expected reward of arm $k$ at time slot $t$, and let $k_t$ be the arm with the maximum expected reward at time $t$, i.e. $E(X_{k_t,t})=\max_{k\in\mathcal{K}}E(X_{k,t})$ for all $t\in\mathcal{T}$. The learning agent wants to make a series of right decisions about the played arms $\{A_t,\ t\in\mathcal{T}\}$ to maximise the expected cumulative reward $E(\sum_{t=1}^{T}X_{A_t,t})$ over the entire $T$ time periods (Srivastava et al., Citation2014). Equivalently, it is to minimise the $T$-step cumulative regret (Cao et al., Citation2019): $$R(T)=\sum_{t=1}^{T}\max_{k\in\mathcal{K}}E(X_{k,t})-E\left(\sum_{t=1}^{T}X_{A_t,t}\right),\qquad(1)$$ i.e. the expected total loss of playing the arms $\{A_t,\ t\in\mathcal{T}\}$.
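To make the interaction protocol and the regret in Equation (1) concrete, the following R sketch runs a generic policy against a (possibly non-stationary) environment; the function names run_bandit, choose_arm and draw_reward are our own illustrative placeholders and are not part of the paper's code.

```r
# Minimal sketch (our own illustration): one run of a bandit policy and its
# cumulative (pseudo-)regret as in Equation (1). 'mu' is a T x K matrix of the
# expected rewards E(X_{k,t}); 'choose_arm' and 'draw_reward' are placeholders
# to be supplied by the user.
run_bandit <- function(mu, choose_arm, draw_reward) {
  T <- nrow(mu); K <- ncol(mu)
  regret <- numeric(T)
  history <- list()
  for (t in seq_len(T)) {
    a <- choose_arm(t, K, history)        # policy picks an arm in 1..K
    x <- draw_reward(t, a)                # environment returns X_{A_t, t}
    history[[t]] <- c(arm = a, reward = x)
    regret[t] <- max(mu[t, ]) - mu[t, a]  # per-step expected regret of the chosen arm
  }
  cumsum(regret)                          # running cumulative regret; last entry ~ R(T)
}
```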

For the piecewise-stationary MAB scenario, define $F_t(k)$ as the reward distribution of arm $k$ at time $t$. The reward $X_{k,t}$ is independently sampled from $F_t(k)$, both across arms and across time slots. Here, $F_t(k)$ can be of various types, such as uniform, Bernoulli and exponential distributions. When outliers exist, unusually small or large rewards are encountered, which may be generated by the reward distributions themselves with small probability or may simply come from data collection errors.

Let $S$ be the number of piecewise-stationary segments in the reward sequence: $$S=1+\sum_{t=1}^{T-1} I\{E(X_{k,t})\neq E(X_{k,t+1})\ \text{for at least one}\ k\in\mathcal{K}\},$$ where $I\{\cdot\}$ represents the indicator function. Here, we have $S-1$ change points denoted by $p_1,p_2,\dots,p_{S-1}$, which are defined to be the time slots at which the changes occur. For notational consistency, we set $p_0=0$ and $p_S=T$. Within the same segment, the rewards follow the same distribution, but across different segments, the reward distributions can be different. Similar to Liu et al. (Citation2018) and Cao et al. (Citation2019), we use the means to describe the trends in the non-stationary data. Let $\mu_k^{(j)}=E(X_{k,p_{j-1}+1})=\cdots=E(X_{k,p_j})$ be the mean reward of the $k$th arm in the $j$th data segment, where $k\in\mathcal{K}$ and $j=1,\dots,S$. We consider the cases where there exists at least one arm $k\in\mathcal{K}$ such that $\mu_k^{(j)}\neq\mu_k^{(j+1)}$ ($j=1,\dots,S-1$) and $|\mu_k^{(j)}-\mu_k^{(j+1)}|$ is not very small (see Assumption 2.1(b)) for detectability, which excludes infinitesimal mean shifts and is a reasonable assumption in practice (Liu et al., Citation2018). There is no requirement on the shape of the reward distribution. Note that, if we set $S=1$ in our framework, it becomes the classic stochastic bandit model (Besbes et al., Citation2014); if we set $S=T$, it becomes the classic adversarial bandit model (Garivier & Moulines, Citation2011).

2.2. The RCD-UCB algorithm

In this part, we propose a novel algorithm framework for piecewise-stationary MAB problems in the presence of outliers. It improves the current change-detection UCB framework by incorporating a data-driven tail truncation strategy that can distinguish the real change points from the outliers. Current popular tail truncation methods to restrict unexpected changes caused by extreme values include:

(1) The Huber loss: $L(\mu_j,\mu_j')=(\mu_j-\mu_j')^2$ if $|\mu_j-\mu_j'|<a$, and $2a|\mu_j-\mu_j'|-a^2$ otherwise.
(2) The biweight loss: $L(\mu_j,\mu_j')=(\mu_j-\mu_j')^2$ if $|\mu_j-\mu_j'|<a$, and $a^2$ otherwise.
(3) If interest lies in changes in the $u$th quantile for $0<u<1$: $L(\mu_j,\mu_j')=2u(\mu_j-\mu_j')$ if $\mu_j>\mu_j'$, and $2(1-u)(\mu_j'-\mu_j)$ otherwise.
In particular, if $u=0.5$, the loss function (3) reduces to $|\mu_j-\mu_j'|$. The proposed algorithm framework can incorporate various tail truncation methods. Yet, we do not recommend using the loss function (3), since the mean changes are of interest here. Compared to the Huber loss, our proposed algorithm with the biweight loss generally has better performance according to our simulation explorations. Thus, in this paper, we focus on using the biweight loss in the change point detection, where the parameter $a$ is determined via a data-driven approach.
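As a small illustration, the three losses can be written as functions of the difference $d=\mu_j-\mu_j'$; the R sketch below is ours (the function names huber_loss, biweight_loss and quantile_loss are not from the paper's code), with $a$ playing the role of the truncation threshold.

```r
# Sketch (ours) of the three truncated losses in Section 2.2, as functions of
# the mean difference d = mu_j - mu_j'.
huber_loss    <- function(d, a) ifelse(abs(d) < a, d^2, 2 * a * abs(d) - a^2)
biweight_loss <- function(d, a) ifelse(abs(d) < a, d^2, a^2)
quantile_loss <- function(d, u) ifelse(d > 0, 2 * u * d, 2 * (1 - u) * (-d))

# With u = 0.5 the quantile loss reduces to |d|:
quantile_loss(c(-0.3, 0.2), u = 0.5)   # 0.3 0.2
```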

In Algorithm 1, we show our robust change detection (RCD) method using the biweight loss. It considers a change point detection strategy based on comparing running sample means over a sliding window. Here, the window width $w$ and the statistical distance threshold $b$ are tuning parameters, which can be chosen empirically or based on the theoretical results in Section 2.3. The selected $w$ must be even, since the means of the first and second halves of the running samples $(Y_1,\dots,Y_w)$ are compared. The biweight loss function caps the absolute difference of the sample means at $a$, where the parameter $a$ is updated through historical data and controlled by the tuning parameter $\alpha$ in Algorithm 2. This simple RCD algorithm requires minimal parameter specification and is thus computationally efficient.
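The following R sketch gives one possible reading of Algorithm 1 (whose pseudo-code is not reproduced in this excerpt): the last $w$ rewards of the played arm are split into two halves, the absolute difference of the half means is capped at $a$ by the biweight rule, and a change is flagged when the capped discrepancy, rescaled to the half-window sums, exceeds the threshold $b$. The function name rcd_detect and the exact scale on which $b$ is applied are our assumptions.

```r
# A simplified sketch of the RCD detector (our reading of Algorithm 1, not the
# authors' code). 'y' holds the last w rewards of the currently played arm.
rcd_detect <- function(y, a, b) {
  w <- length(y)                          # window width, assumed even
  first  <- y[1:(w / 2)]
  second <- y[(w / 2 + 1):w]
  diff_mean <- mean(second) - mean(first)
  capped <- sign(diff_mean) * min(abs(diff_mean), a)  # biweight-style cap at a
  stat <- (w / 2) * abs(capped)           # rescaled to the half-window sum scale (assumption)
  list(change = (stat > b),               # declare a change point if the threshold b is exceeded
       offset = abs(diff_mean))           # raw offset distance, recorded when not significant
}
```

When $a$ is still zero or very small at the beginning of an exploration, the capped statistic cannot exceed $b$, which matches the behaviour described later in this section that the detector always returns False in that situation.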

Next, we show the proposed RCD-UCB algorithm in Algorithm 2, whose parameters mainly include the total number of time slots $T$, the number of arms $K$, the policy rotation parameter $\gamma$, the delay parameter $D$ and the outlier truncation probability $\alpha$. The tuning parameter $\gamma$ controls the fraction of uniform sampling used for feeding the RCD algorithm (line 4), which can be determined based on the theoretical results in Section 2.3. For the $k$th arm, let $a_k$ denote the upper bound used for calculating the truncation distance in the biweight loss function (line 3 in Algorithm 1), and let $L_k$ be the list used to record the offset distances (line 15) when a detection is not significant. The tuning parameter $\alpha\in(0,1)$ controls the values of $a_k$, which are updated as the upper $\alpha$ quantiles of $L_k$. We set the tuning parameter $D$ to guarantee that $L_k$ has at least $D$ elements before calculating the quantiles. Let $\tau$ denote the latest time at which a change point was detected, and let $n_k$ be the number of observations of the $k$th arm after time $\tau$.

Here, we briefly introduce the workflow of the RCD-UCB in Algorithm 2. At each time $t$, the RCD-UCB decides whether to perform a uniform sampling exploration (line 4) or a UCB1 exploration (line 6), such that the fraction of time slots used for uniform sampling is roughly $\gamma$. Note that the arms $A_1,\dots,A_K$ are played sequentially in the first $K$ time slots. When calculating the UCB1 index (Auer, Cesa-Bianchi et al., Citation2002; Lattimore & Szepesvári, Citation2020), only the observations since the last detection time $\tau$ ($Z_{k,1},\dots,Z_{k,n_k}$ for the $k$th arm) are used (line 7); refer to Appendix A for details on the UCB1 algorithm and index. Next, when the cumulative data volume $n_{A_t}$ is greater than the data window width $w$, we adaptively perform the RCD in Algorithm 1 (line 12). If a change point is detected, the exploration is reset (line 13); otherwise, the exploration continues and the offset distance is recorded in the list $L_k$ (line 15). When $L_k$ includes at least $D$ historical non-significant offset distances, the current value of $a_{A_t}$ is updated by the $(1-\alpha)\times 100\%$ quantile (line 17). Based on the empirical results, the tuning parameter $\alpha$ can be chosen from $[0.01, 0.1]$. As $\alpha$ increases, the RCD-UCB becomes more robust against possibly many outliers but less sensitive for identifying change points. Our simulation experiments in Section 3 show that $\alpha=0.025$ (or 0.05) often gives satisfactory results.

As in Cao et al. (Citation2019) and Liu et al. (Citation2018), we assume that the initial $w$-length data are stable (i.e. no change occurs) and that there are reasonably many observations between two consecutive real change points. The parameter $a_k$ is used to truncate the offset distances and is updated through historical data. If the current value of $a_{A_t}$ is zero or smaller than the tuning parameter $b$, the RCD algorithm will always return False. This situation occurs at the beginning of each exploration. As the exploration continues, we can expect $a_{A_t}$ to increase gradually until the next change point is found.
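Putting the pieces together, the following R skeleton summarises our reading of Algorithm 2 (the paper's pseudo-code is not reproduced here, so the rotation rule for the forced exploration, the restart of $a_k$ and $L_k$ after a detection, and all function names are our assumptions). It reuses rcd_detect from the previous sketch.

```r
# Skeleton of the RCD-UCB loop (our reading of Algorithm 2; not the authors' code).
rcd_ucb <- function(draw_reward, K, T, w, b, gamma, D, alpha) {
  tau <- 0                                          # time of the last detected change
  Z <- replicate(K, numeric(0), simplify = FALSE)   # rewards per arm since tau
  a <- rep(0, K)                                    # data-driven biweight caps a_k
  L <- replicate(K, numeric(0), simplify = FALSE)   # recorded non-significant offsets
  arms <- integer(T)
  for (t in seq_len(T)) {
    r <- (t - tau - 1) %% floor(K / gamma) + 1
    if (r <= K) {
      At <- r                                       # forced uniform exploration (~ gamma fraction)
    } else {
      n <- sapply(Z, length)
      ucb <- sapply(seq_len(K), function(k)
        if (n[k] == 0) Inf else mean(Z[[k]]) + sqrt(2 * log(t - tau) / n[k]))
      At <- which.max(ucb)                          # UCB1 index on the post-tau rewards
    }
    Z[[At]] <- c(Z[[At]], draw_reward(t, At))
    arms[t] <- At
    if (length(Z[[At]]) >= w) {
      res <- rcd_detect(tail(Z[[At]], w), a[At], b)
      if (res$change) {                             # change detected: restart the exploration
        tau <- t
        Z <- replicate(K, numeric(0), simplify = FALSE)
        a <- rep(0, K)                              # restart of caps/offsets is our reading
        L <- replicate(K, numeric(0), simplify = FALSE)
      } else {
        L[[At]] <- c(L[[At]], res$offset)           # record the offset distance
        if (length(L[[At]]) >= D)                   # quantile update of the cap
          a[At] <- as.numeric(quantile(L[[At]], probs = 1 - alpha))
      }
    }
  }
  arms
}
```

A call such as rcd_ucb(draw_reward, K = 3, T = 10000, w = 100, b = 15, gamma = 0.19, D = 200, alpha = 0.025) would mimic the settings used in Section 3.1.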

2.3. Theoretical results on performance analysis

In this part, we analyse the regret upper bound of the proposed RCD-UCB algorithm and specify its tuning parameters. We show the RCD-UCB can achieve a nearly optimal regret bound with the same order as the M-UCB by Cao et al. (Citation2019). Note that the M-UCB assumes no outliers.

For the $k$th arm on the $i$th piecewise-stationary data segment, we define the sub-optimal gap $\Delta_k^{(i)}$ as the difference in expected rewards between the optimal arm in $\mathcal{K}$ and the selected arm: $$\Delta_k^{(i)}=\max_{k'\in\mathcal{K}}\mu_{k'}^{(i)}-\mu_k^{(i)},\quad 1\le i\le S,\ k\in\mathcal{K}.$$ The sub-optimal gaps can be used to characterise the regret. Let $a_k^{(i)}$ be the final truncation threshold for the $k$th arm at the $i$th change point. Define the amplitude of the change for the $k$th arm at the $i$th change point as $$\delta_k^{(i)}=\min\left\{|\mu_k^{(i+1)}-\mu_k^{(i)}|,\ a_k^{(i)}\right\},\quad 1\le i\le S-1,\ k\in\mathcal{K}.$$
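As a small illustrative example of ours (not taken from the paper), consider $K=2$ arms with segment means $\mu^{(1)}=(0.6,0.3)$ and $\mu^{(2)}=(0.2,0.5)$, and truncation thresholds $a_1^{(1)}=a_2^{(1)}=0.3$ at the first change point. Then the sub-optimal gaps on the first segment are $\Delta_1^{(1)}=0$ and $\Delta_2^{(1)}=0.3$, while the change amplitudes are $\delta_1^{(1)}=\min\{|0.2-0.6|,0.3\}=0.3$ and $\delta_2^{(1)}=\min\{|0.5-0.3|,0.3\}=0.2$; the truncation keeps a single extreme-looking shift from dominating the amplitude.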

Assumption 2.1

The learning agent can choose $w$ and $\gamma$ such that (a) $S<T/L$ and $p_{i+1}-p_i>L$ for all $0\le i\le S-1$, where $L=(w+2D)K/\gamma$; and (b) for all $1\le i\le S-1$, there exists $k\in\mathcal{K}$ such that $\delta_k^{(i)}\ge 2\sqrt{\log(2KT^2)/w}+2\sqrt{\log(2T)/w}$.

Assumption 2.1 is standard in the existing literature (Cao et al., Citation2019; Liu et al., Citation2018). Intuitively, Assumption 2.1(a) guarantees that we have reasonably many observations (more than $L$) between two consecutive change points, so that Algorithm 2 can collect at least $w+D$ samples from every arm to feed the RCD in Algorithm 1. Assumption 2.1(b) guarantees that at least one arm has a large enough change amplitude at each change point, so that the RCD algorithm is able to detect the change quickly with limited information while keeping the false alarm rate and the detection delay under control. We would like to remark that Assumption 2.1 is needed for proving the theoretical results, but the RCD-UCB algorithm can still perform well in practice when Assumption 2.1 is not satisfied, even though the theoretical guarantees then no longer apply. Similarly, the parameter $S$ is assumed to be known for proving the theoretical results, but the RCD-UCB can perform well with unknown $S$ in practice.

Theorem 2.2

Let the lower bound $\delta=\min_{i=1,\dots,S-1}\max_{k\in\mathcal{K}}\delta_k^{(i)}$ and assume $\delta>0$. Under the piecewise-stationary scenario, if we run Algorithm 2 with fixed $w$ and $D$, $b=[w\log(2KT^2)/2]^{1/2}$, and $\gamma=\sqrt{(S-1)K\min(w/2,\,b/\delta+3\sqrt{w})/(2T)}$, then the upper bound for the regret $R(T)$ is $O(\sqrt{SKT\log T})$, where $T$ is the number of time steps, $K$ is the number of arms and $S$ is the number of stationary segments.

The regret upper bound in Theorem 2.2 has the same order as the one in Cao et al. (Citation2019) and is thus nearly optimal. Theorem 2.2 also provides guidance on the tuning parameters. Here, we set $w\ge(4/\delta^2)[(\log(2KT^2))^{1/2}+(\log(2T))^{1/2}]^2$ to meet Assumption 2.1(b), and choose $b$ and $\gamma$ to simplify the regret upper bound in Theorem 2.2. Detailed proofs are relegated to the Appendix.
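For illustration, the following R helper computes these theoretical tuning parameters for a hypothetical instance; it follows the reconstructed formulas above (in particular the square-root form of $\gamma$), so the function name rcd_ucb_params and the exact expressions are our assumptions, and the resulting values tend to be conservative compared with the hand-chosen settings used in Section 3.

```r
# Illustrative computation (ours) of w, b and gamma as suggested by Theorem 2.2,
# for a hypothetical instance with K arms, horizon T, S segments and smallest
# detectable change amplitude 'delta'.
rcd_ucb_params <- function(K, T, S, delta, D = 200) {
  w <- ceiling((4 / delta^2) * (sqrt(log(2 * K * T^2)) + sqrt(log(2 * T)))^2)
  if (w %% 2 == 1) w <- w + 1                  # the RCD window must be even
  b <- sqrt(w * log(2 * K * T^2) / 2)
  gamma <- sqrt((S - 1) * K * min(w / 2, b / delta + 3 * sqrt(w)) / (2 * T))
  list(w = w, b = b, gamma = gamma, D = D)
}
rcd_ucb_params(K = 3, T = 10000, S = 5, delta = 0.4)
```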

3. Experimental results

In this section, we apply the proposed RCD-UCB algorithm to three simulation experiments and a real case study where outliers may exist. We choose the M-UCB algorithm (Cao et al., Citation2019), currently the most popular piecewise-stationary MAB method, as the benchmark. We also report the performance of other existing approaches, including the UCB1 (Auer, Cesa-Bianchi et al., Citation2002), the D-UCB (Kocsis & Szepesvári, Citation2006) and the SW-UCB (Garivier & Moulines, Citation2011) algorithms.

3.1. Simulation experiment 1

We first consider a piecewise-stationary MAB problem with $K=3$ arms, $T=10000$ time periods and $S=5$ stationary segments (i.e. $S-1=4$ change points). We assume the change points are located evenly over the time horizon, i.e. a change occurs every 2000 time periods. The reward distributions of all arms are assumed to be normal and piecewise-stationary, with $\mu_k^{(i)}$ randomly selected from $\{0.05,0.1,0.15,\dots,0.7\}$ ($i=1,\dots,S$ and $k=1,\dots,K$) and a common standard deviation $\sigma=0.1$. Figure 1 shows the expected rewards for each arm over the time horizon. At each time step $t$ ($t=1,\dots,T$), a Bernoulli trial $Z_t\in\{0,1\}$ with probability $P(Z_t=1)=0.025$ decides whether an outlier $Y_t\sim N(2,\sigma^2)$ replaces the original reward data at the current moment. That is, the rewards received by the learning agent in this MAB problem are $$X_{k,t}=I\{Z_t=0\}\tilde{X}_{k,t}+I\{Z_t=1\}Y_t,\qquad(2)$$ where $\tilde{X}_{k,t}\sim N(\mu_k^{(i)},\sigma^2)$ for $p_{i-1}+1\le t\le p_i$.
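For concreteness, the reward mechanism in Equation (2) can be simulated as in the R sketch below; the object names (seg, mu_seg, draw_reward) and the random seed are our own illustrative choices rather than the paper's code.

```r
# Sketch (ours) of the reward generation in Equation (2): piecewise-stationary
# normal rewards contaminated by N(2, sigma^2) outliers at rate 0.025.
set.seed(1)
K <- 3; T <- 10000; S <- 5; sigma <- 0.1
seg <- rep(1:S, each = T / S)                  # a change every 2000 time periods
mu_seg <- matrix(sample(seq(0.05, 0.7, by = 0.05), S * K, replace = TRUE), S, K)
mu <- mu_seg[seg, ]                            # T x K matrix of expected rewards
draw_reward <- function(t, k) {
  if (rbinom(1, 1, 0.025) == 1) rnorm(1, mean = 2, sd = sigma)       # outlier Y_t
  else rnorm(1, mean = mu[t, k], sd = sigma)                          # X~_{k,t}
}
draw_reward(1, 2)
```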

Figure 1. Expected rewards for arms in the simulation experiment 1.


We first compare the proposed RCD-UCB with the benchmark method, the M-UCB algorithm (Cao et al., Citation2019). In the RCD-UCB, we set the tuning parameter $D=200$ and consider four different settings for $\alpha$: 0.01, 0.025, 0.05 and 0.1. According to Theorem 2.2 in this paper and Remark 1 in Cao et al. (Citation2019), we set the other tuning parameters as $w=100$, $b=15$ and $\gamma=0.19$ for both the RCD-UCB and the M-UCB algorithms. We replicate the simulation experiments 100 times for all algorithms and report the average results. In Figure 2, we display the expected cumulative regrets for the M-UCB and the four RCD-UCB policies (RCD-UCB.S1 to RCD-UCB.S4, using $\alpha=0.01, 0.025, 0.05$ and 0.1, respectively). It is seen that all four RCD-UCB policies receive much smaller regrets than the M-UCB algorithm. These four policies perform similarly, with the RCD-UCB.S2 and RCD-UCB.S3 slightly better in terms of the regret at the end time $T$.

Figure 2. Expected cumulative regrets for the M-UCB and RCD-UCB with different α values in the simulation experiment 1.


Figure 3. Expected rewards for arms in simulation experiment 2.


Figure 4. Expected cumulative regrets for M-UCB and RCD-UCB with different α values in the simulation experiment 2.


In Table 1, we list the average number of detected change points and its standard deviation for each algorithm over the 100 replications. The M-UCB declares about 7 change points, much more than the true value of 4. When $\alpha=0.025$ and $\alpha=0.05$, the average numbers of change points detected by the RCD-UCB are close to 4, which explains the superior performance of the RCD-UCB.S2 and RCD-UCB.S3 policies. In the RCD-UCB algorithm, the number of detected change points decreases as $\alpha$ increases. As discussed in Section 2.2, there is a trade-off when choosing the tuning parameter $\alpha$: a larger $\alpha$ (i.e. a larger truncation probability) makes the RCD-UCB more robust towards possibly many outliers, but at the same time makes the algorithm more conservative in identifying change points.

Table 1. Average number (AVE) and standard deviation (SD) of detected change points by each algorithm in the simulation experiment 1.

In addition, we compare the RCD-UCB.S2 ($\alpha=0.025$) with the UCB1, the D-UCB and the SW-UCB algorithms. In Table 2, we show the average cumulative regrets at time $T$ and their standard deviations over 100 replications for the different algorithms. According to Garivier and Moulines (Citation2011), we set $\gamma=1-0.25\sqrt{(S-1)/T}$ in the D-UCB and $\tau=2\sqrt{2T\log(T)/(S-1)}$ in the SW-UCB here. From Table 2, we can see that the RCD-UCB gives much smaller (roughly less than half) $T$-step cumulative regrets compared to the UCB1, the D-UCB and the SW-UCB algorithms. In addition, we report the average computational time (in seconds) for each algorithm in Table 2. All codes were run in R on a laptop with an Intel i5 1.60 GHz CPU. We can see that all algorithms are very fast for running an MAB problem with $T=10000$ time periods. The RCD-UCB algorithm takes a bit more time than the other algorithms, but is still fast enough.
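As a quick numerical check of these settings (using our reconstruction of the two formulas above, which should be treated as an assumption), the discount factor and window width for this experiment can be computed as follows.

```r
# Illustrative computation (ours) of the D-UCB discount factor and the SW-UCB
# window width for the settings of simulation experiment 1 (S = 5, T = 10000).
S <- 5; T <- 10000
gamma_ducb <- 1 - 0.25 * sqrt((S - 1) / T)        # discount factor for D-UCB
tau_swucb  <- 2 * sqrt(2 * T * log(T) / (S - 1))  # window width for SW-UCB
c(gamma_ducb, round(tau_swucb))                   # about 0.995 and 429
```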

Table 2. Average cumulative regrets R(T) (AVE), standard deviations (SD) and computational time in seconds (TIME) for different algorithms in the simulation experiment 1.

3.2. Simulation experiment 2

Next, we consider an MAB problem where the reward distributions of all arms are assumed to be Bernoulli, with $K=4$ arms, $T=10000$ time periods and $S=9$ stationary segments (i.e. 8 change points). The mean rewards $\mu_k^{(i)}$ of all arms are randomly selected from $\{0.05,0.1,0.15,\dots,0.95\}$ ($i=1,\dots,S$ and $k=1,\dots,K$). Figure 3 shows the expected rewards for each arm over the time horizon. At each time step $t$ ($t=1,\dots,T$), a Bernoulli trial $Z_t\in\{0,1\}$ with probability $P(Z_t=1)=0.025$ decides whether we use an outlier $Y_t\sim\text{Bernoulli}(0.99)$ to replace the original reward data at the current moment. Thus, the rewards received by the learning agent in this MAB problem are $$X_{k,t}=I\{Z_t=0\}\tilde{X}_{k,t}+I\{Z_t=1\}Y_t,$$ where $\tilde{X}_{k,t}\sim\text{Bernoulli}(\mu_k^{(i)})$ for $p_{i-1}+1\le t\le p_i$.

We first compare the proposed RCD-UCB with the benchmark M-UCB. In the RCD-UCB, we set $D=200$ and consider four different settings for $\alpha$: 0.01, 0.025, 0.05 and 0.1, denoted as the RCD-UCB.S1 to RCD-UCB.S4 policies, respectively. Based on Theorem 2.2 in this paper and Remark 1 in Cao et al. (Citation2019), we use $w=100$, $b=12$ and $\gamma=0.15$ for both the RCD-UCB and the M-UCB algorithms. In Figure 4, we display the expected cumulative regrets for the M-UCB and the four RCD-UCB policies. From Figure 4, it is clear that all the RCD-UCB policies give smaller regrets compared to the M-UCB algorithm. Here, the RCD-UCB.S1 ($\alpha=0.01$) and the RCD-UCB.S2 ($\alpha=0.025$) provide the best performance. We list the average number of detected change points and its standard deviation for each algorithm over the 100 replications in Table 3. Note that the true number of change points here is $S-1=8$. Due to the existence of outliers, the M-UCB declares about 15 change points, which is much larger than the truth. In comparison, the RCD-UCB.S1 ($\alpha=0.01$) and the RCD-UCB.S2 ($\alpha=0.025$) identify 8.76 and 7.90 change points on average, respectively, which are very close to the truth.

Table 3. Average number (AVE) and standard deviation (SD) of detected change points by each algorithm in the simulation experiment 2.

In addition, we compare the RCD-UCB.S2 ($\alpha=0.025$) with the UCB1, the D-UCB and the SW-UCB algorithms in Table 4, which shows the average cumulative regrets at time $T$, their standard deviations and the average computational time (in seconds) for the different algorithms over 100 replications. Here we set $\gamma=1-0.25\sqrt{(S-1)/T}$ for the D-UCB and $\tau=2\sqrt{2T\log(T)/(S-1)}$ for the SW-UCB according to Garivier and Moulines (Citation2011). From Table 4, it is clear that the RCD-UCB gives much smaller (nearly one half) $T$-step cumulative regrets compared to the other methods. The computational efficiency of each algorithm here is similar to that in the simulation experiment 1.

Table 4. Average cumulative regrets R(T) (AVE), standard deviations (SD) and average computational time in seconds (TIME) for different algorithms in the simulation experiment 2.

3.3. Simulation experiment 3

In this simulation study, we aim to show that the proposed RCD-UCB algorithm can still perform well when there are no outliers. Here we consider the same MAB problem as in the simulation experiment 1, except that there are no outliers ($Z_t=0$ in Equation (2)). We run the RCD-UCB (with $\alpha=0.01$, 0.025, 0.05 and 0.1), M-UCB, UCB1, D-UCB and SW-UCB algorithms with the same settings of tuning parameters as those in the simulation experiment 1.

We list the average number of detected change points and its standard deviation for each of the four RCD-UCB algorithms and the M-UCB algorithm over 100 replications in Table 5. It is seen that the numbers of detected change points for all five algorithms are close to the true value 4. In addition, Table 6 shows the average cumulative regrets at time $T$ and their standard deviations over 100 replications for all eight algorithms. From Table 6, we can see that the M-UCB algorithm outperforms the UCB1, D-UCB and SW-UCB algorithms. The four RCD-UCB algorithms give similar $T$-step cumulative regrets to the M-UCB algorithm, while their standard deviations are slightly larger. This meets our expectation that the proposed RCD-UCB algorithm can also provide desirable performance in cases with no outliers, even though it is designed for cases with outliers.

Table 5. Average number (AVE) and standard deviation (SD) of detected change points by each algorithm in the simulation experiment 3.

Table 6. Average cumulative regrets R(T) (AVE) and standard deviations (SD) by each algorithm in the simulation experiment 3.

3.4. A real data analysis

We consider a real data example from a metalwork factory in China that machines a type of cold-rolled alloy product. There are several parallel production lines machining the same products; the cold-rolled alloys are annealed, reduced, strengthened and reshaped along the production lines. The elongation rate is one of the key indexes for evaluating the quality of cold-rolled alloy products: it measures the rate of elongation at breaking under the maximum load on the alloy, and a larger elongation rate is preferred. In this case study, the original data set includes the elongation rates of the cold-rolled alloys produced by $K=6$ parallel production lines over a long time period of $T=42842$ slots. At each time $t$ ($t=1,\dots,T$), six cold-rolled alloys of the same type from the six production lines are available, and we want to adaptively select one of the six to use. The aim is to maximise the overall elongation rate of the selected products. If we view the elongation rates as rewards, this is an MAB problem with $K=6$ arms.

Based on the original data, we group the elongation values of every 1000 successive time points for each arm. Figure 5 shows the average rewards for each arm over the time horizon, which range from 18.68 to 49.98. All six arms have non-stationary rewards and there are possibly many change points. Moreover, unexpected events (e.g. recording errors) may occur and cause the recorded rewards to be zero. Such outliers can significantly bias the detection of change points in the existing piecewise-stationary algorithms.
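To illustrate this preprocessing step, the R sketch below block-averages the measurements per production line; the data frame raw with columns line and elongation is hypothetical, and reading the grouping of every 1000 successive values as block-averaging (as in Figure 5) is our assumption.

```r
# Sketch (ours) of the grouping step on a hypothetical data frame 'raw' with
# columns 'line' (1..6) and 'elongation', ordered in time within each line.
group_rewards <- function(raw, block = 1000) {
  lapply(split(raw$elongation, raw$line), function(x) {
    idx <- ceiling(seq_along(x) / block)   # index of the 1000-observation block
    tapply(x, idx, mean)                   # block averages per production line
  })
}
```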

Figure 5. Average rewards for arms in the elongation data analysis.


To analyse this real data set, we run the proposed RCD-UCB, the benchmark M-UCB, the UCB1, the D-UCB and the SW-UCB algorithms. Specifically, we set the tuning parameters $w=200$, $b=80$ and $\gamma=0.01$ for the RCD-UCB and the M-UCB algorithms. Based on the simulation experiments in Sections 3.1 and 3.2, $\alpha=0.025$ is a good choice, so we set $\alpha=0.025$ and $D=100$ in the RCD-UCB. According to Garivier and Moulines (Citation2011), we set the tuning parameters $\gamma=0.99$ and $\tau=1000$ for the D-UCB and the SW-UCB, respectively.

Table 7 lists the values of the $T$-step cumulative regrets $R(T)$ for all algorithms. It is seen that the proposed RCD-UCB method performs the best among all algorithms and achieves a nearly 50% reduction in cumulative regrets compared to the M-UCB algorithm. The M-UCB performs slightly better than the D-UCB and SW-UCB algorithms. The UCB1 policy yields the largest regret and is much worse than the others, because the UCB1 does not take the non-stationary scenario into consideration. Figure 6 further plots the cumulative regrets for the D-UCB, the SW-UCB, the M-UCB and the RCD-UCB algorithms. From Figure 6, we can also see that the RCD-UCB outperforms all the other methods.

Figure 6. Cumulative regrets for different algorithms in the real data analysis.


Table 7. Cumulative regrets R(T) for different algorithms in the elongation data analysis.

4. Conclusion

In this paper, we consider a general setting of piecewise-stationary MAB problems and propose an RCD-UCB algorithm that is robust to outliers. The RCD-UCB has a simple formulation and is computationally efficient in practice. It can achieve a nearly optimal regret bound on the order of $O(\sqrt{SKT\log T})$ under some common assumptions. Most tuning parameters in the RCD-UCB can be specified based on the theoretical results. Yet, if some prior information on the MAB (e.g. the number of piecewise-stationary segments $S$ and the lower bound $\delta$) is unavailable in practice, the tuning parameters, including the window width $w$, the statistical distance threshold $b$, the exploration probability $\gamma$, the truncation probability $\alpha$ and the delay $D$, need to be chosen based on the practitioner's experience, which is typical in the existing literature (Cao et al., Citation2019; Liu et al., Citation2018). Specifically, a larger $\alpha$ will make the RCD-UCB more robust towards outliers, but at the same time more likely to miss real change points. Based on the simulation studies in this paper, a choice of $\alpha=0.025$ or 0.05 may be appropriate. A larger $D$ will lead to more stable initial estimates of the truncation thresholds, but may also result in a longer detection delay. As $D$ is often much smaller than the number of observations between consecutive change points, its impact is usually small in practice. In the current works on piecewise-stationary MAB problems (Cao et al., Citation2019; Liu et al., Citation2018), there is no systematic and automatic way to handle the tuning parameters when no prior information is available, and this will be an interesting topic for future research.

Acknowledgments

Wang was supported in part by NSFC (11901199 and 71931004) and Shanghai Sailing Program (19YF1412800). Zhang was supported in part by NSFC (11831008 and 11971171), the National Social Science Foundation Key Program (17ZDA091), and the 111 Project of China (B14019).

The authors thank the editor, the associate editor and the reviewers for their helpful comments.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

This work was supported by National Natural Science Foundation of China [11901199, 71931004, 11831008, 11971171], Shanghai Sailing Program [19YF1412800], National Social Science Foundation Key Program [17ZDA091] and the 111 Project of China [B14019].

Notes on contributors

Yaping Wang

Dr. Yaping Wang is an assistant professor in the School of Statistics at East China Normal University.

Zhicheng Peng

Mr. Zhicheng Peng received his master's degree in statistics from East China Normal University in 2020 and is now a researcher at the Ant Group.

Riquan Zhang

Dr. Riquan Zhang is a professor in the School of Statistics at East China Normal University.

Qian Xiao

Dr. Qian Xiao is an assistant professor in the Department of Statistics at the University of Georgia.

References

  • Alaya-Feki, A. B. H., Moulines, E., & LeCornec, A. (2008). Dynamic spectrum access with non-stationary multi-armed bandit. In 2008 IEEE 9th Workshop on Signal Processing Advances in Wireless Communications. (pp. 416–420). IEEE.
  • Allesiardo, R., & Féraud, R. (2015). Exp3 with drift detection for the switching bandit problem. In 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA) (pp. 1–7). IEEE.
  • Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3, 397–422. https://doi.org/10.5555/944919.944941
  • Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47, 235–256. https://doi.org/10.1023/A:1013689704352
  • Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32, 48–77. https://doi.org/10.1137/S0097539701398375
  • Besbes, O., Gur, Y., & Zeevi, A. (2014). Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems (pp. 199–207). MIT Press.
  • Buccapatnam, S., Liu, F., Eryilmaz, A., & Shroff, N. B. (2017). Reward maximization under uncertainty: leveraging side-observations on networks. Journal of Machine Learning Research, 18, 7947–7980. https://doi.org/10.5555/3122009.3242073
  • Cao, Y., Wen, Z., Kveton, B., & Xie, Y. (2019). Nearly optimal adaptive procedure with change detection for piecewise-stationary bandit. In Proceedings of Machine Learning Research (pp. 418–427). PMLR.
  • Garivier, A., & Moulines, E. (2011). On upper-confidence bound policies for switching bandit problems. In International Conference on Algorithmic Learning Theory (pp. 174–188).  Springer.
  • Hartland, C., Baskiotis, N., Gelly, S., Sebag, M., & Teytaud, O. (2007). Change point detection and meta-bandits for online learning in dynamic environments. In CAp 2007: 9è Conférence francophone sur l'apprentissage automatique (pp. 237–250). CEPADUES.
  • Hinkley, D. V. (1971). Inference about the change-point from cumulative sum tests. Biometrika, 58, 509–523. https://doi.org/10.1093/biomet/58.3.509
  • Kaufmann, E., Korda, N., & Munos, R. (2012). Thompson sampling: An asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory (pp. 199–213). Springer.
  • Kocsis, L., & Szepesvári, C. (2006). Discounted UCB. In 2nd PASCAL Challenges Workshop (Vol. 2). http://videolectures.net/pcw06_kocsis_diu/
  • Lai, T. L., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4–22. https://doi.org/10.1016/0196-8858(85)90002-8
  • Lattimore, T., & Szepesvári, C. (2020). Bandit algorithms. Cambridge University Press.
  • Li, S., Karatzoglou, A., & Gentile, C. (2016). Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 539–548). New York, NY: Association for Computing Machinery.
  • Liu, F., Lee, J., & Shroff, N. (2018). A change-detection based framework for piecewise-stationary multi-armed bandit problem. In Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press.
  • Mellor, J., & Shapiro, J. (2013). Thompson sampling in switching environments with Bayesian online change detection. In Artificial Intelligence and Statistics (pp. 442–450). Springer.
  • Srivastava, V., Reverdy, P., & Leonard, N. E. (2014). Surveillance in an abruptly changing world via multiarmed bandits. In 53rd IEEE Conference on Decision and Control (pp. 692–697). IEEE.
  • Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25, 285–294. https://doi.org/10.1093/biomet/25.3-4.285
  • Yu, J. Y., & Mannor, S. (2009). Piecewise-stationary bandit problems with side observations. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 1177–1184). New York, NY: Association for Computing Machinery.

Appendices

Appendix 1. The UCB1 Algorithm

Auer, Cesa-Bianchi et al. (Citation2002) proposed an upper confidence bound algorithm, denoted as the UCB1 policy, which has become a stepping stone for MAB problems. Following the notations in Section 2.1, we describe the UCB1 policy in Algorithm 3. To balance exploitation and exploration, the UCB1 algorithm first tries all arms once and then sequentially selects the arm with the highest upper confidence bound, i.e. the UCB1 index (line 7 in Algorithm 3).

Here we briefly summarise its mathematical background. If $X_1,\dots,X_n$ are independent, zero-mean and 1-subgaussian, we have $$P\left(\sum_{t=1}^{n}X_t\ge n\epsilon\right)\le\exp(-n\epsilon^2/2).$$ Let the right-hand side of this inequality be $\delta$ and solve for $\epsilon$. Then we have $$P\left(\frac{1}{n}\sum_{t=1}^{n}X_t\ge\sqrt{\frac{2}{n}\log\frac{1}{\delta}}\right)\le\delta.$$ Following the notations in this paper, when the learner decides its action in time slot $t$, a good candidate for the largest plausible estimate of the mean reward of arm $k$ is $$\frac{1}{n_k}\sum_{n=1}^{n_k}Z_{k,n}+\sqrt{\frac{2}{n_k}\log\frac{1}{\delta}}.$$ According to Auer, Cesa-Bianchi et al. (Citation2002), a good choice for the time-dependent $\delta$ is $1/t$. Thus, the UCB1 index for the $k$th arm is $\text{UCB}_k=n_k^{-1}\sum_{n=1}^{n_k}Z_{k,n}+\sqrt{2\log(t)/n_k}$. Please refer to Auer, Cesa-Bianchi et al. (Citation2002) for more details on the UCB1 algorithm.
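A small R sketch of this index (ours; ucb1_index is not from the paper's code) could look as follows, where Z is a list of the reward vectors observed for each arm since the last reset and t is the current time slot.

```r
# UCB1 index for each arm (our illustration of the formula above): the sample
# mean of the observed rewards plus the exploration bonus sqrt(2*log(t)/n_k).
ucb1_index <- function(Z, t) {
  sapply(Z, function(z)
    if (length(z) == 0) Inf else mean(z) + sqrt(2 * log(t) / length(z)))
}
# Example: three arms with a few observations each at time t = 10
Z <- list(c(0.4, 0.6), c(0.2), c(0.9, 0.8, 0.7))
which.max(ucb1_index(Z, t = 10))
```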

Appendix 2.

Proof of Theorem 2.2

The basic idea of this proof follows the same lines as those for classic change-detection MAB algorithms. The following lemmas from Cao et al. (Citation2019) are needed and are rephrased with our notations.

Lemma A.1

Regret bound in stationary scenarios

Consider a stationary scenario with $S=1$, $p_0=0$ and $p_1=T$. Then under Algorithm 2 with parameters $w$, $b$ and $\gamma$, we have that $$R(T)\le T\,P(\tau_1\le T)+\tilde{C}+\gamma T,$$ where $\tau_1$ is the first detection time and $\tilde{C}=8\sum_{k:\Delta_k^{(1)}>0}\log T/\Delta_k^{(1)}+(1+\pi^2/3+K)\sum_{k=1}^{K}\Delta_k^{(1)}$.

Note that the RCD-UCB in Algorithm 2 is stricter than the M-UCB algorithm by Cao et al. (Citation2019) for detecting change points. The probability of raising false alarms in the stationary scenario for the RCD-UCB cannot exceed that for the M-UCB. Thus, the following Lemma A.2 (Lemma 2 of Cao et al. (Citation2019)) holds for the RCD-UCB.

Lemma A.2

Probability of raising false alarms in the stationary scenario

Consider a stationary scenario with $S=1$. Then under Algorithm 2 with parameters $w<T$, $b$ and $\gamma$, we have that $$P(\tau_1\le T)< wK\left\{1-\left[1-2\exp(-2b^2/w)\right]^{T/w}\right\},$$ where $\tau_1$ is the first detection time.

With $L=(w+2D)K/\gamma$, the uniform sampling scheme (line 4 of Algorithm 2) guarantees that each arm is sampled at least $w/2+D$ times within any $L/2$ consecutive time slots. Thus, the following Lemma A.3 (Lemma 3 of Cao et al. (Citation2019)) holds, which ensures that the detection delay is no more than $L/2$ with a large probability.

Lemma A.3

Probability of achieving a successful detection with S = 2

Consider a piecewise-stationary scenario with $S=2$ and $L=(w+2D)K/\gamma$. Assume that $p_2-p_1>L/2$. For any $(\mu_1^{(1)},\mu_2^{(1)},\dots,\mu_K^{(1)})$ and $(\mu_1^{(2)},\mu_2^{(2)},\dots,\mu_K^{(2)})\in[0,1]^K$ satisfying $\delta_k^{(1)}\ge 2b/w+c$ for some $k\in\mathcal{K}$ and $c>0$, under Algorithm 2 we have that $$P(p_1<\tau_1\le p_1+L/2\mid \tau_1>p_1)\ge 1-2\exp(-wc^2/4).$$

Lemma A.4

Expected detection delay

Consider a piecewise-stationary scenario with $S=2$ and $L=(w+2D)K/\gamma$. Assume that $p_2-p_1>L/2$. For any $(\mu_1^{(1)},\mu_2^{(1)},\dots,\mu_K^{(1)})$ and $(\mu_1^{(2)},\mu_2^{(2)},\dots,\mu_K^{(2)})\in[0,1]^K$ satisfying $\delta_k^{(1)}\ge 2b/w+c$ for some $k\in\mathcal{K}$ and $c>0$, under Algorithm 2 we have that $$E(\tau_1-p_1\mid p_1<\tau_1\le p_1+L/2)\le\min\left\{L/2,\ \left(b/\delta_k^{(1)}+3\sqrt{w}\right)K/\gamma\right\}\Big/\left[1-2\exp(-wc^2/4)\right].$$

Based on Lemmas A.1–A.4 and Theorem 1 in Cao et al. (Citation2019), it holds that $$R(T)\le\sum_{i=1}^{S}\left[8\sum_{k:\Delta_k^{(i)}>0}\frac{\log(p_i-p_{i-1})}{\Delta_k^{(i)}}+\left(1+\frac{\pi^2}{3}+K\right)\sum_{k=1}^{K}\Delta_k^{(i)}\right]+\gamma T+\gamma^{-1}\sum_{i=1}^{S-1}2K\min\left\{\frac{w}{2},\ \frac{b}{\max_{k\in\mathcal{K}}\delta_k^{(i)}}+3\sqrt{w}\right\}+3S.$$ For each $i=1,\dots,S$, the term $8\sum_{k:\Delta_k^{(i)}>0}\log(p_i-p_{i-1})/\Delta_k^{(i)}+(1+\pi^2/3+K)\sum_{k=1}^{K}\Delta_k^{(i)}$ is a classic regret bound for the UCB1 algorithm with time length $p_i-p_{i-1}$, and the sum of these terms over $i=1,\dots,S$ is of order $O(\sqrt{SKT\log T})$ (Cao et al., Citation2019). The term $\gamma T$ cannot exceed $O(\sqrt{SKT(\log T+\log K)})=O(\sqrt{SKT\log T})$. The term $\gamma^{-1}\sum_{i=1}^{S-1}2K\min\{w/2,\ b/\max_{k\in\mathcal{K}}\delta_k^{(i)}+3\sqrt{w}\}$ cannot exceed $O(\sqrt{SKT})=o(\sqrt{SKT\log T})$, and the term $3S$ is of order $O(S)=o(\sqrt{SKT\log T})$. Thus the result follows.
