Abstract
Individualized treatment rules (ITRs) tailor treatments according to individual patient characteristics. They can significantly improve patient care and are thus becoming increasingly popular. The data collected during randomized clinical trials are often used to estimate the optimal ITRs. However, these trials are generally expensive to run, and, moreover, they are not designed to efficiently estimate ITRs. In this article, we propose a cost-effective estimation method from an active learning perspective. In particular, our method recruits only the “most informative” patients (in terms of learning the optimal ITRs) from an ongoing clinical trial. Simulation studies and real-data examples show that our active clinical trial method significantly improves on competing methods. We derive risk bounds and show that they support these observed empirical advantages. Supplementary materials for this article are available online.
Appendix A: Technical Proofs
A.1. Intrinsic Dimension of supp(Π)
We explain the meaning of “intrinsic dimension” introduced in Assumption (A3) here. We say that supp(Π) possesses a tree decomposition if
1. T_{1,1} = supp(Π), and {T_{i,j}}_{j=1}^{J(i)} forms a disjoint partition of supp(Π) for all i ⩾ 1.
2. Nested partition: for all i ⩾ 2 and j = 1, …, J(i), there exists a unique 1 ⩽ k ⩽ J(i − 1) such that T_{i,j} ⊂ T_{i−1,k}.
3. Bounded diameter: for all i ⩾ 1 and 1 ⩽ j ⩽ J(i), diam(T_{i,j}) ⩽ 2^{−i}.
4. Regularity: for any i ⩾ 1, 1 ⩽ j ⩽ J(i), and 0 < r ⩽ 2^{−i}, there exists 1 ⩽ d ⩽ p (d is the intrinsic dimension) such that for all x ∈ T_{i,j}, Π(B(x, r) ∩ T_{i,j}) ⩾ c_1 r^d Π(T_{i,j}).
A simple example that gives good intuition for the tree decomposition is the uniform distribution over the unit cube in ℝ^p. In this case, the tree decomposition is obtained by partitioning the unit cube into dyadic cubes, and d = p. If supp(Π) is contained in a proper subspace W of ℝ^p, then d = dim(W).
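As an illustration of the dyadic-cube example above, the nested partition of the unit cube can be generated programmatically. The sketch below is our own (the function names `dyadic_cells` and `cell_diameter` are hypothetical, not part of the paper); cell diameters shrink geometrically in the level, matching the bounded-diameter property up to a dimension-dependent constant.

```python
import itertools
import math

def dyadic_cells(level, p):
    """Return the dyadic cubes of side 2**-level partitioning [0, 1]**p.

    Each cell is represented by the integer coordinates of its
    lower-left corner; the cell itself is the product of intervals
    [k_m * 2**-level, (k_m + 1) * 2**-level) over m = 1, ..., p.
    """
    side = 2 ** level
    return list(itertools.product(range(side), repeat=p))

def cell_diameter(level, p):
    """Euclidean diameter of one dyadic cell: sqrt(p) * 2**-level."""
    return math.sqrt(p) * 2 ** (-level)

# Level i of the tree has 2**(i*p) disjoint cells, and every level-i
# cell splits into 2**p children at level i + 1 (nested partition).
cells = dyadic_cells(2, 3)
```

Here the intrinsic dimension equals the ambient dimension, d = p, since each cell carries a constant fraction of the mass of any small ball centered inside it.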
Let 𝔉_i be the sigma-algebra generated by the collection of sets {T_{i,j}, j = 1, …, J(i)} (a partition of supp(Π) at level i). A regular approximation of the active set AS_k in Algorithm 1 is then obtained by approximating AS_k with 𝔉_i-measurable sets.
A.2. Properties of Kernel Estimate
A.2.1. Preliminaries
For a measurable set S ⊂ supp(Π), define Π_S(dx) ≔ Π(dx | x ∈ S) as the conditional distribution on S, and define Q_h(x|S) accordingly. Since Π is assumed to be known, Q_h(x|S) can be computed directly. Accordingly, we modify the original kernel estimate (2.3) for η_j as follows: let {(X(i), A(i), R(i)), i = 1, …, N} be an iid sample from the conditional joint distribution of (X, A, R) given that X ∈ S, and set
(A.2)
We discuss properties of these estimators below.
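As a schematic sketch of such a restricted kernel estimate, assuming a Nadaraya–Watson-type form (this is our own simplification for illustration; the paper's estimator (A.2) instead exploits the known design distribution through Q_h, and its kernel K is compactly supported):

```python
import numpy as np

def kernel_estimate(x, X, A, R, a, h):
    """Nadaraya-Watson-type estimate of eta_a(x) = E[R | X = x, A = a]
    from a sample (X, A, R) drawn conditionally on X lying in S.

    x : query point, shape (p,)
    X : sampled covariates, shape (N, p)
    A : treatments in {-1, +1}, shape (N,)
    R : observed rewards, shape (N,)
    a : treatment arm whose mean response is estimated
    h : bandwidth
    """
    # Gaussian kernel weights (used purely for illustration).
    w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * h ** 2))
    w = w * (A == a)                 # keep only arm-a observations
    if w.sum() == 0:
        return 0.0                   # no arm-a data near x
    return float(np.sum(w * R) / w.sum())
```

Because the weights are normalized, the estimator reproduces a constant response exactly, which is a convenient sanity check.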
Let h > 0, j ⩾ 0, and h ⩽ 2^{−j}, and define
(A.3)
We next study upper and lower bounds for this quantity based on Assumptions (A1)–(A4). Since K is bounded and compactly supported, there exists R = R_K > 0 such that K(x) ⩽ ‖K‖_∞ I{x ∈ B(0, R_K)}. Let F > 0 be a large enough constant, namely F^d ⩾ 2c_2/c_1. Recall that c_1, c_2 are defined in (A.1). Note that Assumption (A2) implies the following:
(A.4)
and
(A.5)
In what follows, we set Q_h(x|S) ≔ Q_{h,0}(x) for brevity.
A.2.2. Some Bounds for the Kernel Estimators
In this subsection, we derive basic concentration inequalities for the kernel estimators of η_j, j = ±1, restricted to S, that is, the estimators defined in (A.2). The proofs of these results can be found in the online supplementary materials.
Lemma A.1.
For all t > 0 satisfying t + d log(1/h) ⩽ nh^d, with probability at least 1 − 2e^{−t},
where C = C(M, c_1, c_2, L, L_K, ‖K‖_∞, ℓ_K, R_K) is a constant.
The following corollary is immediate:
Corollary A.1.
Set h_n ≔ {Π(S)(t + d log(n/Π(S)))/n}^{1/(d+2)}. Then, under the assumptions of Lemma A.1, with probability at least 1 − 4e^{−t},
where the constant C is the same as in Lemma A.1.
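The bandwidth choice of Corollary A.1 can be computed directly from the formula above; a minimal sketch (the function name is ours):

```python
import math

def corollary_bandwidth(pi_S, t, d, n):
    """Bandwidth h_n = {Pi(S) * (t + d * log(n / Pi(S))) / n}^(1/(d+2))
    from Corollary A.1, balancing bias against stochastic error on S."""
    return (pi_S * (t + d * math.log(n / pi_S)) / n) ** (1.0 / (d + 2))
```

Note that h_n shrinks with n at the rate n^{−1/(d+2)} up to logarithmic factors, and that a smaller region S (smaller Π(S)) yields a smaller bandwidth for the same sample size, since the conditional sample is concentrated on S.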
A.3. Proof of Theorem 3.1
A.3.1. Comparison Inequality
Our Lemma A.2 below connects the risk of a treatment rule with the sup-norm distance between the corresponding estimate and f*.
Lemma A.2.
Under the margin assumption (A4),
Proof.
The argument repeats that of Lemma 5.1 in Audibert and Tsybakov (2007).
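For context, a comparison inequality of this type typically takes the following generic form (stated here in our own notation with an illustrative constant, following Audibert and Tsybakov (2007); the lemma's exact display may differ):

```latex
% Margin (low-noise) assumption with exponent \gamma:
%   \Pi\bigl(0 < |f^*(X)| \le t\bigr) \le C_\gamma\, t^{\gamma}
%   \quad \text{for all } t > 0.
% Comparison inequality: for any treatment rule induced by an
% estimate \widehat{f} of f^*,
R(\widehat{f}) - R(f^*) \;\le\; C\, \bigl\| \widehat{f} - f^* \bigr\|_\infty^{\,1+\gamma}.
```

The point of such a bound is that a sup-norm guarantee for the regression-type estimate translates into an excess-risk guarantee for the induced treatment rule, with the margin exponent γ amplifying the rate.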
A.3.2. Main Proof
Our main goal is to control the size of the set act_k defined by Algorithm 1. In turn, these bounds depend on the size of the confidence bands for f*(x) (denoted δ_k). Let L ⩽ N be the number of iterations performed by the algorithm before termination.
Let N_{act_k} ≔ ⌊N_k · Π(act_k)⌋ be the number of labels requested on the kth iteration of the algorithm. We first claim that the following bounds hold uniformly over all 1 ⩽ k ⩽ L with probability at least 1 − α:
(A.6)
where C_j = C_j(M, c_1, c_2, L, L_K, ‖K‖_∞, ℓ_K, R_K, γ), j = 1, 2. This claim will be proved later.
Consider the event of probability at least 1 − α on which both inequalities of (A.6) hold, and assume that it occurs. The second inequality of (A.6), together with the fact that N_k = 2N_{k−1} by definition, implies that the number of randomized subjects on each step 1 ⩽ k ⩽ L satisfies
with probability at least 1 − α. If N is the maximum number of randomized subjects the algorithm is allowed to request, then
and one easily deduces that on the last iteration L we have
(A.7)
Recall that N_L is defined in Algorithm 1.
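The doubling schedule N_k = 2N_{k−1} under a total budget N can be sketched as follows (the function name and initial batch size are our own illustration, not part of Algorithm 1):

```python
def doubling_schedule(N, N1):
    """Return per-iteration batch sizes N_1, N_2 = 2*N_1, ... of a
    procedure that doubles its batch each step (N_k = 2*N_{k-1}) and
    stops once the next batch would exceed the remaining budget N."""
    sizes, total, n = [], 0, N1
    while total + n <= N:
        sizes.append(n)      # request n labels on this iteration
        total += n
        n *= 2               # double the batch for the next iteration
    return sizes
```

Because the batch sizes grow geometrically, the number of iterations L is of order log2(N/N_1), and the final batch consumes a constant fraction of the total budget, which is why the bound on the last iteration drives the final rate.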
To obtain the risk bound of the theorem from (A.7), we apply Lemma A.2:
(A.8)
Since
whenever the bounds (A.6) hold, it remains to estimate the remaining term. Recalling the first inequality of (A.6) once again (for k = L), we get
where
, which together with (A.8) yields the final result.
It remains to show both inequalities of (A.6). We start with the first one. Note that by construction, for every k ⩾ 1 the samples (X(i, k), A(i, k), R(i, k)), i = 1, …, ⌊N_k Π(act_k)⌋, are conditionally independent given the data collected on steps 1, …, k − 1, with the conditional distribution of X(i, k) being Π_{act_k}. Thus, we can apply Corollary A.1 conditionally on the previously collected data, with S = act_k, to get that, with probability at least 1 − α/N,
It remains to integrate this bound with respect to the distribution of the data collected on the previous steps:
The union bound over all 1 ⩽ k ⩽ L ⩽ N gives the result.
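The union-bound step is elementary; spelled out in our own notation, with each per-iteration failure event controlled at level α/N:

```latex
% Each of the L \le N per-iteration bounds fails with probability at
% most \alpha/N, hence by the union bound
\mathbb{P}\Bigl(\,\bigcup_{k=1}^{L} \{\text{bound fails at step } k\}\Bigr)
  \;\le\; \sum_{k=1}^{L} \frac{\alpha}{N}
  \;\le\; N \cdot \frac{\alpha}{N} \;=\; \alpha .
```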
Finally, we prove the second inequality of (A.6), the bound on the size of the active sets act_k. This is where assumption (A3) on the tree decomposition and the margin assumption (A4) play the key role. To obtain the bound, we compare two estimators of f*: the first is the kernel estimator constructed by Algorithm 1 on step k, and the second is a piecewise-constant estimator with similar approximation properties. Namely, the latter is the L_2(Π)-projection of f* onto the linear space of piecewise-constant functions measurable with respect to the partition {T_{i,j}}. Recall that T_{i,j} is defined in the tree decomposition of Appendix A.1. As a result, we will be able to relate the "active sets" associated with these estimators, taking advantage of the fact that the active set associated with a piecewise-constant estimator is always a union of sets from the collection {T_{i,j}}.
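The fact that a piecewise-constant estimator's active set is a union of partition cells can be made concrete: a cell stays active exactly when the confidence band around its (constant) estimate still contains zero, so the sign of f* there is unresolved. A minimal sketch under our own naming:

```python
def active_cells(cell_estimates, delta):
    """Given a dict mapping cell id -> piecewise-constant estimate of
    f* on that cell, return the ids of 'active' cells: those on which
    the confidence band of half-width delta still contains 0, so the
    optimal treatment sign is not yet resolved. The active set is then
    the union of these cells of the partition."""
    return {cell for cell, fhat in cell_estimates.items() if abs(fhat) <= delta}
```

As the bands δ_k shrink across iterations, cells where |f*| is large are discarded first, and under the margin assumption the measure of the remaining active cells decays polynomially in δ_k.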
Consider the event of probability at least 1 − α on which
for any k ⩾ 0, where δ_k = 4Ch_k, and assume that this event occurs.
The following inclusions hold (for the relevant definitions, see Algorithm 1):
(A.9)
Indeed,
and
For each k, set
and note that
where the last two inequalities follow from part 3 of assumption (A3) given in Appendix A.1 and from the definition of m_k. Define τ_k ≔ max(5δ_k, 4L_{K,1}h_k) ⩽ C_5 δ_k, and let
be the band of size (3/2)τ_k around the piecewise-constant estimator, and
By a reasoning similar to the above, we have the inclusions
(A.10)
Moreover, by the definition of τ_k, we have the inequality 5δ_k/2 ⩽ τ_k/2. Hence (A.9) and (A.10) imply that
. It remains to note that, by (A.10) and the margin assumption (A4), Π(act_k) satisfies the bound claimed in the second inequality of (A.6). This completes the proof.
Supplementary Materials
Supplementary materials available online include: additional simulation results with normal biomarkers; sensitivity analysis regarding different prespecified parameters; performance of a doubly robust augmented inverse probability weighted estimator; and proof of Lemma A.1.
Acknowledgment
Guang Cheng was visiting SAMSI and on sabbatical at Princeton while this work was carried out and revised; he would like to thank SAMSI and the Princeton ORFE department for their hospitality and support. We thank Dr. A. John Rush and the investigators for the use of their data from the Nefazodone CBASP trial. The stimulant data used in this article were obtained from datasets distributed by the National Institute on Drug Abuse (NIDA).
Funding
Research Sponsored by NSF (DMS-0906497, CAREER Award DMS-1151692, DMS-1418042), Simons Fellowship in Mathematics, Office of Naval Research (ONR 11845813), and a grant from Indiana Clinical and Translational Sciences Institute.