Original Research

Local in Time Statistics for detecting weak gene expression signals in blood – illustrated for prediction of metastases in breast cancer in the NOWAC Post-genome Cohort

Pages 11-28 | Published online: 10 Jul 2017
 

Abstract

Background:

Functional genomics in a processual analysis covers the time-dependent changes in transcriptomics and epigenetics before diagnosis of a disease, reflecting changes in both lifestyle and disease processes. The aim of this paper is to explore the dynamic, time-dependent mechanisms of the metastatic processes, using blood transcriptomics and treating time as a continuous variable. To achieve this goal, we have developed new statistical methods based on statistics that are local in time.

Methods:

The new statistical method, Local in Time Statistics (LITS), is based on calculating statistics in moving windows and randomization. The method has been tested for the analysis of a dataset that collectively provides information on the blood transcriptome up to 8 years before breast cancer diagnosis. The dataset from the Norwegian Women and Cancer (NOWAC) Post-genome Cohort consists of 467 case-control pairs matched on birth year and time of blood sampling. The data for a pair are the difference in log2 gene expression between the case and control. The stratified analyses are based on important biological differences like metastatic versus non-metastatic cancer, and the mode of cancer detection, ie, screening-detected cancers versus clinically detected cancers. The dataset was used for examining whether the gene expression profile varies between cases and controls, with time, or between cases with and without metastases.

Results:

The null hypotheses of no differences between cases and controls, no time-dependent changes, and no differences between strata were all rejected. For screening-detected cancers, the probability of correct prediction of metastasis status was highest in year 1 before diagnosis, whereas for clinically detected cancers it was highest in years 3 and 4 before diagnosis. The predictor was not very sensitive to the number of genes included.

Conclusion:

Using a new statistical method, LITS, we have demonstrated time-dependent changes of the blood transcriptome up to 8 years before breast cancer diagnosis.

Acknowledgments

We are thankful to and impressed by the women who donated blood for this cancer research project. Bente Augdal, Merete Albertsen, and Knut Hansen were responsible for all infrastructure and administrative issues. We thank Clara-Cecilie Günther for preprocessing the data. This study was supported by a grant from the European Research Council (ERC-AdG 232997 TICE). The funders had no role in the design of the study; in the collection, analyses, and interpretation of the data; in the writing of the manuscript; or in the decision to submit for publication. Some of the data in this article are from the Cancer Registry of Norway. The Cancer Registry of Norway is not responsible for the analysis or interpretation of the data presented. Microarray service was provided by the Genomics Core Facility, Norwegian University of Science and Technology, and NMC – a national technology platform supported by the functional genomics program (FUGE) of the Research Council of Norway. The data will be stored at the European Genome-phenome Archive (EGA, https://www.ebi.ac.uk/ega/),[33] where they will be accessible on request.

Author contributions

All authors contributed toward data analysis, drafting and revising the paper and agree to be accountable for all aspects of the work.

Disclosure

The authors report no conflicts of interest in this work.

Supplementary material

Method

Adjusting for the batch effect

Here, we give a short description of the ComBat method developed by Johnson et al [1] for estimating the batch effects, and of how to use these estimates to adjust for the batch effects when computing sample means and standard deviations.

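The log2 gene expression value $Y_{ijg}$ for gene $g$ and sample $j$ from batch $i$ is modeled as

$$Y_{ijg} = \alpha_g + X\beta_g + \gamma_{ig} + \delta_{ig}\varepsilon_{ijg} \quad \text{and} \quad \varepsilon_{ijg} \sim \text{Normal}(0, \sigma^2),$$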

where

  • $\alpha_g$ is the overall gene expression,

  • $X$ is a design matrix for sample conditions,

  • $\beta_g$ is the vector of regression coefficients corresponding to $X$,

  • $\gamma_{ig}$ is the additive batch effect, and

  • $\delta_{ig}$ is the multiplicative batch effect.

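The batch-adjusted data $Y^*_{ijg}$ can then be computed as

$$Y^*_{ijg} = \frac{Y_{ijg} - \hat{\alpha}_g - X\hat{\beta}_g - \hat{\gamma}_{ig}}{\hat{\delta}_{ig}} + \hat{\alpha}_g + X\hat{\beta}_g.$$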
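As a minimal numerical sketch of the adjustment formula above, the following Python snippet simulates one gene in two batches and applies the adjustment using the true parameter values in place of the empirical Bayes estimates. The function name `adjust_batch`, all parameter values, and the choice to set the covariate term to zero are illustrative assumptions; this is not the ComBat implementation.

```python
import random
import statistics

def adjust_batch(y, alpha, xbeta, gamma, delta):
    """Apply Y* = (Y - alpha - Xbeta - gamma) / delta + alpha + Xbeta to one value."""
    return (y - alpha - xbeta - gamma) / delta + alpha + xbeta

random.seed(0)
alpha, xbeta, sigma = 8.0, 0.0, 0.5  # hypothetical values; covariate term set to zero
batches = {"batch1": (0.7, 1.5), "batch2": (-0.4, 0.6)}  # hypothetical (gamma, delta)

adjusted = {}
for name, (gamma, delta) in batches.items():
    # Simulate raw values from the model Y = alpha + Xbeta + gamma + delta * eps.
    raw = [alpha + xbeta + gamma + delta * random.gauss(0.0, sigma) for _ in range(2000)]
    adjusted[name] = [adjust_batch(y, alpha, xbeta, gamma, delta) for y in raw]
    print(name, statistics.mean(adjusted[name]), statistics.stdev(adjusted[name]))
```

After adjustment, both batches should have a mean close to the overall expression and a standard deviation close to the residual standard deviation, regardless of their batch parameters.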

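The estimates of the parameters $\alpha_g$, $\beta_g$, $\gamma_{ig}$, and $\delta_{ig}$ are computed using an empirical Bayes method.[1] Note that in the implementation of the method, the batch-adjusted data $Y^*_{ijg}$ are computed as $Y^*_{ijg} = \frac{\hat{\sigma}}{\hat{\delta}_{ig}}\left(\frac{Y_{ijg} - \hat{\alpha}_g - X\hat{\beta}_g}{\hat{\sigma}} - \hat{\gamma}'_{ig}\right) + \hat{\alpha}_g + X\hat{\beta}_g$, where $\hat{\gamma}'_{ig} = \hat{\gamma}_{ig}/\hat{\sigma}$ is the parameter that is estimated instead of $\hat{\gamma}_{ig}$.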
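That the implemented form agrees with the adjustment formula can be checked by direct substitution. In the short derivation below, $\hat{\gamma}'_{ig}$ denotes the rescaled additive estimate $\hat{\gamma}_{ig}/\hat{\sigma}$; the prime is notation assumed here for illustration.

```latex
\begin{aligned}
Y^*_{ijg}
&= \frac{\hat{\sigma}}{\hat{\delta}_{ig}}
   \left(\frac{Y_{ijg} - \hat{\alpha}_g - X\hat{\beta}_g}{\hat{\sigma}} - \hat{\gamma}'_{ig}\right)
   + \hat{\alpha}_g + X\hat{\beta}_g \\
&= \frac{Y_{ijg} - \hat{\alpha}_g - X\hat{\beta}_g - \hat{\sigma}\,\hat{\gamma}'_{ig}}{\hat{\delta}_{ig}}
   + \hat{\alpha}_g + X\hat{\beta}_g \\
&= \frac{Y_{ijg} - \hat{\alpha}_g - X\hat{\beta}_g - \hat{\gamma}_{ig}}{\hat{\delta}_{ig}}
   + \hat{\alpha}_g + X\hat{\beta}_g ,
\qquad \text{since } \hat{\sigma}\,\hat{\gamma}'_{ig} = \hat{\gamma}_{ig}.
\end{aligned}
```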

Both the expectation and the variance of a gene for the cases can vary with time and with stratum. We therefore cannot use the ComBat method for batch adjusting the dataset that consists of differences in log2 gene expression between cases and controls. Instead, we use ComBat to estimate the batch effects $\hat{\gamma}_{ig}$ and $\hat{\delta}_{ig}$ from a dataset that includes only the log2 gene expressions for the controls.

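Log2 gene expression data that are adjusted for the additive batch effect $\gamma_{ig}$, but not for the multiplicative batch effect $\delta_{ig}$, can then be computed as

$$\tilde{Y}_{ijg} = Y_{ijg} - \hat{\gamma}_{ig} = \mu_{Gg} + \hat{\delta}_{ig}\varepsilon_{ijg} \quad \text{where} \quad \varepsilon_{ijg} \sim \text{Normal}(0, \sigma_G^2) \text{ for group } G.$$

For case-control pair $c$ from batch $i$ with sample $j_1$ as control (from group $G_1$) and sample $j_2$ as case (from group $G_2$), we have the log2-expression difference $X_{g,c}$:

$$X_{g,c} = \tilde{Y}_{ij_2g} - \tilde{Y}_{ij_1g} = Y_{ij_2g} - Y_{ij_1g} = \mu_g + \hat{\delta}_{ig}\varepsilon_{g,c} \quad \text{where} \quad \varepsilon_{g,c} \sim \text{Normal}(0, \sigma^2).$$

We observe that $X_{g,c}$ is adjusted for the additive batch effect $\gamma_{ig}$, but not for the multiplicative batch effect $\delta_{ig}$.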
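The cancellation of the additive batch effect in a within-batch difference can be illustrated with a small sketch. All numerical values and variable names below are hypothetical and chosen only for illustration; the point is that subtracting the same additive estimate from both members of a pair leaves their difference unchanged, while the multiplicative factor still scales the noise.

```python
import random

random.seed(1)
gamma_hat, delta_hat = 0.7, 1.5  # hypothetical batch-effect estimates for one batch
mu_control, mu_case = 8.0, 8.3   # hypothetical group means for one gene
sigma = 0.4                      # hypothetical residual standard deviation

# Raw log2 expressions for a control (j1) and a case (j2) from the same batch.
y_control = mu_control + gamma_hat + delta_hat * random.gauss(0.0, sigma)
y_case = mu_case + gamma_hat + delta_hat * random.gauss(0.0, sigma)

# Difference of raw values versus difference of additively adjusted values.
diff_raw = y_case - y_control
diff_adjusted = (y_case - gamma_hat) - (y_control - gamma_hat)
print(diff_raw, diff_adjusted)
```

The two printed differences agree up to floating-point rounding, which is why the pairwise differences need no separate additive adjustment when both samples of a pair come from the same batch.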

We compute the estimate of $\mu_g$, $\hat{\mu}_g$, as the weighted average of $X_{g,c}$, where the weights are $1/\hat{\delta}_{ig}$, and we compute the estimate of $\sigma^2$, $\hat{\sigma}^2$, as $\frac{1}{n-1}\sum_{c=1}^{n}\left(\frac{X_{g,c} - \hat{\mu}_g}{\hat{\delta}_{ig}}\right)^2$. We will compare estimated sample means and standard deviations between genes. For each gene, we therefore multiply the estimates $\hat{\delta}_{ig}$ by a constant so that for this gene $\frac{1}{B}\sum_{i=1}^{B}\hat{\delta}_{ig} = 1$, where $B$ is the number of batches/runs.
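The estimates described above can be sketched for a single gene as follows. The function names, the toy data, and the normalization of the weighted average by the sum of the weights are assumptions made for illustration, not the authors' code.

```python
import math

def rescale_deltas(delta_by_batch):
    """Multiply the delta estimates by a constant so their mean over batches is 1."""
    mean_delta = sum(delta_by_batch.values()) / len(delta_by_batch)
    return {batch: d / mean_delta for batch, d in delta_by_batch.items()}

def weighted_mean(x, delta):
    """Weighted average of the differences with weights 1 / delta."""
    weights = [1.0 / d for d in delta]
    return sum(w * xi for w, xi in zip(weights, x)) / sum(weights)

def scaled_variance(x, delta, mu_hat):
    """(1 / (n - 1)) * sum over pairs of ((x - mu_hat) / delta) ** 2."""
    n = len(x)
    return sum(((xi - mu_hat) / d) ** 2 for xi, d in zip(x, delta)) / (n - 1)

# Hypothetical multiplicative batch estimates for two batches, rescaled to mean 1.
delta_hat = rescale_deltas({"b1": 3.0, "b2": 1.0})

# Hypothetical differences for five case-control pairs, each with its batch label.
pairs = [(0.9, "b1"), (-0.3, "b1"), (0.4, "b2"), (0.2, "b2"), (0.5, "b2")]
x = [value for value, _ in pairs]
delta = [delta_hat[batch] for _, batch in pairs]

mu_hat = weighted_mean(x, delta)
sigma_hat = math.sqrt(scaled_variance(x, delta, mu_hat))
print(mu_hat, sigma_hat)
```

Rescaling the multiplicative estimates to have mean 1 within each gene keeps the estimated means and standard deviations on a comparable scale across genes.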

Figure S1 Boxplots illustrating how the score used in the predictor depends on the number of genes included in the score.

Notes: The score has been normalized by dividing by the number of genes included in the score. The scores for the cases with metastases should be positive (lower panel), while the scores for the cases without metastases should be negative (upper panel). (A) Scores for case-control pairs around 6 months from the screening group. (B) Scores for case-control pairs around 2 years and 6 months from the clinical group.