News & Views

Conference Scene: Lessons Learned from the 5th Statistical Analysis Workshop of the Pharmacogenetics Research Network

Pages 297-303 | Published online: 17 Mar 2010

Abstract

The Pharmacogenomics Research Network holds a statistical analysis workshop every other year to share novel statistical methods and study designs for pharmacogenomics research, as well as insightful analyses of substantive ongoing studies. The 5th workshop was held 15 April 2009, in Rochester (MN, USA), in conjunction with the general Pharmacogenomics Research Network meeting. This summary of the ten contributed talks highlights a variety of timely topics, including identification of functional variants, how to maximize power using various study designs, and pathway analysis approaches. We also discuss the keynote invited presentation by Terry Speed, which provided an overview of statistical issues with next-generation sequence data with an emphasis on some statistical challenges in mRNA sequence data. Novel applications of Poisson regression models demonstrated innovative, yet practical, approaches to distinguish between biological and technical sources of variation in the counts of mRNA transcript reads. Overall, the workshop emphasized the need for diverse approaches to conducting pharmacogenomics studies, as well as the evolving nature of the field.

Pharmacogenomics is the study of the association of genetic variation with drug response, both favorable outcomes and adverse drug reactions. By understanding how genetic variation influences a drug’s efficacy or toxicity, pharmacogenomics aims to optimize drug therapy to ensure maximum efficacy with minimal adverse effects. This summary of the Pharmacogenomics Research Network (PGRN) Analysis Workshop V highlights newly developed methods in genetic epidemiology that focus on pharmacogenomic applications. The presentations are roughly grouped into four topics: optimizing resources/study design issues (Scott Weiss [Harvard Medical School, MA, USA], Brooke Fridley [Mayo Clinic, MN, USA], Dana C Crawford [Vanderbilt University, TN, USA], Wolfgang Sadee [The Ohio State University, OH, USA] and Dalin Li [University of Southern California, CA, USA]); pathway-based approaches (Andrei Rodin [University of Texas, TX, USA] and Lang Li [Indiana University, IN, USA]); novel ideas to improve genome-wide association studies (GWAS; Laura Yerges [University of Maryland, MD, USA], and Jun Yang and Yiping Fan [St Jude’s Research Hospital, TN, USA]); and statistical issues for the analysis of next-generation sequence data (Terry Speed [University of California at Berkeley, CA, USA]).

Optimizing resources/study designs

A theme repeated throughout this year’s workshop was the need for alternative approaches to best utilize available samples and maximize statistical power for various study designs. In particular, obtaining sufficient samples for rare adverse drug reactions is a challenge, so maximizing the use of available samples and prioritizing the genetic variants to study are important. Several analysis methods and novel approaches to identifying functional genetic variants were presented, as well as a reminder of the importance and effectiveness of utilizing samples from large consortia.

An advantage of family-based association tests (FBATs), such as the transmission disequilibrium test, is their robustness against population stratification. This is particularly attractive when combining datasets from multiple centers or studies that may be ethnically heterogeneous. The decreased statistical power of FBAT compared with traditional case–control designs was addressed by Scott Weiss, who described an integrative FBAT approach that combines family data ascertained through affected cases with unselected controls. The proposed method uses both within- and between-family information along with properly matched controls in a case–control comparison. Weiss and colleagues applied their approach to a GWAS using the Childhood Asthma Management Program (CAMP) data with 422 Caucasian nuclear families combined with control data from the Illumina (CA, USA) website. They identified the same gene, PDE4D, that was found in the initial GWAS, as well as new SNPs not detected by the traditional FBAT.
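For readers less familiar with the transmission disequilibrium test mentioned above, the following is a minimal sketch of its classical McNemar-type form. It is not the integrative FBAT described by Weiss; the function name and the counts used in the usage note are hypothetical.

```python
from math import erfc, sqrt


def tdt_statistic(transmitted, untransmitted):
    """Classical TDT: chi-square (1 df) and p-value.

    transmitted   -- heterozygous parents who passed the allele of interest
                     to an affected child
    untransmitted -- heterozygous parents who passed the other allele
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) transmissions")
    chi2 = (transmitted - untransmitted) ** 2 / total
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value
```

With hypothetical counts of 60 transmissions and 40 non-transmissions, `tdt_statistic(60, 40)` gives a chi-square of 4.0. Because only transmissions within families enter the statistic, differences in allele frequency between subpopulations do not bias it, which is the robustness property noted above.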

The presentation by Brooke Fridley highlighted the benefits of using both cell lines and patient samples for pharmacogenomics research. Fridley’s genome-wide approach first evaluated drug cytotoxicity in a cell-line model, followed by validation of statistical associations in a patient sample. Lymphoblast cell lines represent a powerful system for screening genome-wide expression to determine which expression probes are associated with IC50 phenotypes. The IC50 phenotype, the dose of a drug at which 50% of the cells are killed, is commonly used to measure drug sensitivity in cell-line studies, and has the potential to identify biomarkers useful for individualized therapy. This approach highlights that, although cell-line model systems can be used to identify candidate genomic regions, clinical studies that use both experimental and control arms are needed so that gene–drug interactions can be statistically evaluated.
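As an illustration of how an IC50 value might be derived from cytotoxicity measurements, the sketch below interpolates on the log-dose scale between the two doses that bracket 50% survival. This is an assumption made for illustration only; the summary does not state how IC50 values were estimated, and fitting a sigmoidal dose-response curve is a common alternative. The function name and data layout are hypothetical.

```python
from math import log10


def ic50_by_interpolation(doses, survival):
    """Estimate IC50 by log-linear interpolation.

    doses    -- positive doses in ascending order
    survival -- surviving fraction at each dose (1.0 = untreated level)
    Returns None when survival never crosses 50% in the tested range.
    """
    points = list(zip(doses, survival))
    for (d_lo, s_lo), (d_hi, s_hi) in zip(points, points[1:]):
        if s_lo >= 0.5 >= s_hi and s_lo != s_hi:
            # Fraction of the way from d_lo to d_hi at which survival hits 0.5.
            frac = (s_lo - 0.5) / (s_lo - s_hi)
            log_ic50 = log10(d_lo) + frac * (log10(d_hi) - log10(d_lo))
            return 10 ** log_ic50
    return None
```

Returning `None` rather than extrapolating keeps cell lines whose survival never reaches 50% from receiving an IC50 outside the tested dose range.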

When studying rare adverse drug reactions, it is often difficult to collect a sufficient number of cases, which limits power if the effects of the causal genes are not large. In the case of Torsades de Pointes (TdP), a rare drug-induced long QT syndrome characterized by ventricular tachycardia with a distinct electrocardiogram pattern, the use of available consortium cases provided the necessary subjects for study. Dana Crawford reported preliminary results of a study on candidate genes from potassium, sodium and calcium gene pathways, as well as a GWAS for TdP. Crawford compiled 239 cases pooled from the Leducq Alliance Against Sudden Cardiac Death consortium, including 206 subjects with TdP and 33 subjects without TdP who exhibited a marked increase in QT length when exposed to drugs. Controls came from two sources: 353 drug-exposed controls (i.e., patients treated with QT-prolonging antiarrhythmia drugs or drug-exposed normal volunteers), and 837 population-based controls from the KORA study in Germany. This study evaluated 19 candidate genes based on 1536 SNPs (1413 tagging SNPs, 20 nonsynonymous SNPs, and 103 genomic controls). Analyses contrasted the TdP cases to both sets of controls and were performed with and without the 33 long QT cases. Analyses included only subjects of European descent, and all analyses were adjusted for age and gender. Among all comparisons, a SNP in KCNE1, a potassium voltage-gated channel gene associated with long QT syndrome 5, was significantly associated with TdP (p < 1.8 × 10⁻⁴). This result was slightly stronger when restricting the comparison of cases to drug-induced controls (p = 8.4 × 10⁻⁵), suggesting that TdP cases may be etiologically distinct from drug-induced long-QT cases without TdP, although further follow-up is required. In a GWAS on the same case–control samples, several novel loci and genomic regions associated with TdP were identified. In the absence of a replication dataset for this rare outcome, other avenues of follow-up were discussed, including tests of function and other consortium opportunities for additional samples.

A more direct way to maximize power is to target polymorphisms that are most likely to affect gene function. This should increase the genetic effect size and reduce the multiple testing burden. Wolfgang Sadee presented a novel approach to detect regulatory factors in candidate genes based on allelic expression imbalance. Three types of functional polymorphisms were considered: cSNPs, which alter protein sequence and function; rSNPs, which alter transcription; and srSNPs, which alter RNA processing and translation. Sadee presented examples ranging from basic research to clinical studies that highlight the importance of regulatory polymorphisms for drug metabolism or drug dosing. Regulatory polymorphisms can have substantial penetrance in human traits and are abundant in candidate genes. To identify regulatory polymorphisms in a candidate gene, the approach starts by measuring mRNA expression of candidate genes in relevant regions, specifically for each allele of autosomal genes (or, in females, for X-linked genes). Any difference between alleles in the expression or processing of the mRNA (termed allelic expression imbalance [AEI]) reveals the presence of regulatory factors in the gene locus. The resultant AEI ratios then serve as quantitative phenotypes to scan for the responsible polymorphisms across the entire gene locus. Next, the molecular genetic mechanism is determined, such as AEI owing to differences in transcription, processing, or splicing – the most difficult step. Finally, clinical association studies are performed.
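The AEI ratio that serves as the quantitative phenotype can be sketched as follows. The normalization to the genomic DNA allelic ratio (expected to be 1:1 in a heterozygote) and the 1.2-fold cut-off are assumptions chosen for illustration, not details reported for Sadee's method, and all names are hypothetical.

```python
from math import log2


def aei_log2_ratio(mrna_ref, mrna_alt, gdna_ref, gdna_alt):
    """log2 allelic mRNA ratio, normalized by the gDNA allelic ratio.

    A value near 0 means both alleles are expressed equally; a value far
    from 0 suggests a cis-acting regulatory factor at the locus.
    """
    return log2((mrna_ref / mrna_alt) / (gdna_ref / gdna_alt))


def flag_aei(samples, threshold=log2(1.2)):
    """Return IDs of heterozygous samples whose |log2 ratio| exceeds threshold.

    samples -- dict mapping sample ID to
               (mrna_ref, mrna_alt, gdna_ref, gdna_alt) signal intensities
    threshold -- hypothetical cut-off (here a 1.2-fold imbalance)
    """
    return [sid for sid, signals in samples.items()
            if abs(aei_log2_ratio(*signals)) > threshold]
```

Only individuals heterozygous for a marker SNP in the transcript are informative, because the two alleles must be distinguishable in the mRNA.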

Dr Sadee presented an example related to CYP3A4 functional polymorphisms. A total of 50% of all drugs are metabolized by CYP3A4, yet the functional SNPs are still unclear. The AEI scanning method was used to search for potential functional SNPs in both introns and exons, and identified an intronic SNP that reduced mRNA expression. This SNP, which had not been reported in previous GWAS and is not in strong linkage disequilibrium with any other SNP, is associated with reduced statin maintenance dose. This promising approach is being developed further to simultaneously interrogate hundreds of SNPs.

To detect gene–environment interactions, Dalin Li proposed an alternative to the conventional case–control or case-only designs. He evaluated a modification of Bayesian model averaging (BMA). Standard BMA can improve power over the conventional designs by averaging over case-only and case–control analyses. However, it suffers from an increased type I error rate when there is moderate violation of the assumption that the gene is not associated with the environmental risk factor in the general population. Li’s extension of BMA averages over all plausible submodels of the case–control model, which includes the case-only model. Based on simulations, Li’s method appears more powerful than BMA and the case–control analysis when the gene and environment are independent. It is also more robust, with a smaller type I error rate than the BMA and case-only approaches when the independence assumption is violated. Increased power or increased robustness to violations of the independence assumption can be obtained with a more appropriate Bayesian prior specification and increases in the sample size.
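To make the contrast between the two conventional estimators concrete, the sketch below computes the case-only and case–control estimates of a multiplicative gene–environment interaction from 2 × 2 counts, and then combines them with a fixed weight. The weight is a hypothetical stand-in for a posterior model probability; this is neither standard BMA nor Li's extension, only an illustration of what "averaging over case-only and case–control analyses" means. All names and counts are assumptions.

```python
from math import exp, log


def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table laid out as (G+E+, G+E-, G-E+, G-E-)."""
    return (a * d) / (b * c)


def interaction_estimates(cases, controls):
    """Return (case_only, case_control) log interaction odds ratios.

    case_only    -- log G-E odds ratio among cases; valid only if G and E
                    are independent in the source population
    case_control -- log ratio of the G-E odds ratios in cases vs controls
    """
    log_or_cases = log(odds_ratio(*cases))
    log_or_controls = log(odds_ratio(*controls))
    return log_or_cases, log_or_cases - log_or_controls


def averaged_estimate(case_only, case_control, weight_case_only):
    """Weighted average of the two log-scale estimates (weight in [0, 1])."""
    return weight_case_only * case_only + (1 - weight_case_only) * case_control
```

When the gene and exposure are independent among controls, the control odds ratio is 1 and the two estimates coincide; the case-only estimate is then preferable because it has smaller variance. When independence fails, the case-only estimate is biased, which is the source of the inflated type I error described above.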

Pathway-based approaches

Network modeling of pharmacogenomics data was presented by Andrei Rodin. A Bayesian Network (BN) is a graphical model that represents a set of random variables and their relationships, where relationships are statistically determined by evaluating lack of conditional independence. Rodin presented BN models applied to three examples, including the Genetic Epidemiology of Responses to Antihypertensives (GERA) pharmacogenetics data. His strategy began with variable selection and ranking, followed by building and interpretation of the BN model and model validation. The first step highlights an important limitation of this approach: network models cannot accommodate a large number of variables. Therefore, some form of variable selection must be used for large-scale data. Questions remain regarding the choice of variable selection metrics and procedures. BN modeling software aimed specifically at genetic epidemiology data was developed in R. It incorporates the Akaike information criterion (AIC) and Bayesian information criterion (BIC) for model scoring, and both categorical and continuous variables (e.g., SNPs and metabolite levels) can be modeled by using multinomial or Gaussian probability models, respectively. It also allows for incorporation of prior expert knowledge (constraints). Given the limited scalability, it appears that the BN approach will be most useful for specifically chosen pathways. Results were presented from a study of blood pressure using the top 100 SNPs from a GWAS performed in 195 non-Hispanic whites classified as high and low responders. Varying the complexity penalty explicitly built into the AIC and BIC can modulate the degree of overfitting (dependency density) in the model, which might highlight SNPs that are related to blood pressure response.
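The effect of varying the complexity penalty can be illustrated with the standard definitions of the two criteria (lower scores are better). The `penalty_scale` argument and the log-likelihood values in the usage note are hypothetical; this sketch does not reproduce the scoring code of the BN software described above.

```python
from math import log


def penalized_score(log_likelihood, n_params, n_samples,
                    criterion="bic", penalty_scale=1.0):
    """AIC or BIC of a fitted model, with an adjustable complexity penalty.

    AIC = -2 ln L + 2k;  BIC = -2 ln L + k ln(n).
    penalty_scale < 1 weakens the penalty and favors denser networks;
    penalty_scale > 1 strengthens it and favors sparser networks.
    """
    if criterion == "aic":
        penalty = 2.0 * n_params
    elif criterion == "bic":
        penalty = n_params * log(n_samples)
    else:
        raise ValueError("criterion must be 'aic' or 'bic'")
    return -2.0 * log_likelihood + penalty_scale * penalty
```

For example, with 195 subjects, a hypothetical sparse network (log-likelihood of -100, 5 parameters) scores better under the full BIC penalty than a denser one (log-likelihood of -95, 10 parameters), but the ranking reverses when `penalty_scale` is reduced to 0.1. This is the sense in which the penalty modulates dependency density.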

The bootstrap is one way to evaluate the robustness of a particular edge connecting two variables in a BN model. However, care must be taken. Determining important variables is not trivial because it is unclear how to objectively report true bootstrap values when the variables ‘overlap’ (e.g., total cholesterol, low-density lipoprotein, high-density lipoprotein and triglycerides). Unless the variables are completely independent, the standard bootstrap values might severely underestimate the true strength of association. The R software, which implements a so-called ‘cumulative’ bootstrap technique, reports a table of pair-wise proportions (i.e., the number of bootstrap replicates in which one variable appears in the other variable’s immediate Markov neighborhood, divided by the total number of bootstrap replicates). Future work should consider model space shrinkage and ‘zooming’ visualization methods to handle the large number of variables, while allowing the user to decide whether to take the intersection or union of the bootstrap values.
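The standard (not the cumulative) bootstrap estimate of edge support can be sketched as below. To keep the example self-contained, a deliberately simple stand-in replaces BN structure learning: two variables are connected when their absolute Pearson correlation exceeds a threshold. That stand-in, the threshold and all function names are assumptions for illustration only.

```python
import random
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def learn_edges(data, threshold=0.5):
    """Stand-in structure learner (NOT a Bayesian network search)."""
    return {(u, v) for u, v in combinations(sorted(data), 2)
            if abs(pearson(data[u], data[v])) > threshold}


def bootstrap_edge_support(data, n_boot=200, seed=0, threshold=0.5):
    """Proportion of bootstrap replicates in which each edge is recovered.

    data -- dict mapping variable name to a list of per-subject values
    """
    rng = random.Random(seed)
    n = len(next(iter(data.values())))
    counts = {}
    for _ in range(n_boot):
        # Resample subjects with replacement, keeping rows intact.
        idx = [rng.randrange(n) for _ in range(n)]
        resampled = {name: [vals[i] for i in idx]
                     for name, vals in data.items()}
        for edge in learn_edges(resampled, threshold):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: c / n_boot for edge, c in counts.items()}
```

With strongly overlapping variables, a real structure learner may attach the outcome to a different member of the correlated group in each replicate, so each individual edge receives low support even though the group as a whole is consistently selected. That is the underestimation problem described above.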

Lang Li presented an overview of a physiologically-based pharmacokinetics (PBPK) model that integrates genetics, drug interaction and physiology parameters. Drug exposure predictions from this model can be performed at the individual level. It is a powerful translational tool for in vitro or in vivo prediction, and it can also be utilized for estimation and inference in the clinical setting. He presented ideas on how text mining of PubMed can be used to provide prior knowledge regarding different pharmacokinetic and pharmacogenetic parameters. The advantages of text mining to obtain prior information include potentially unbiased numerical data extraction, false-positive assessment, and data annotations (with the usual limits related to publication biases). Li also described the computational challenge of PBPK model fitting to clinical data, with two ideas on how to speed up the Markov Chain Monte Carlo (MCMC) algorithm. One idea was to use the first two moments to approximate the conditional distributions of subject-specific parameters in the MCMC, called Gibb-MAP. The other idea was to use singular component Gibbs for sampling when two parameters are not locally identifiable. Both methods achieved significant improvements in computational speed. Li emphasized that linking drug exposure with clinical outcomes is key for the application of this modeling framework for personalized medicine. This PBPK model framework serves as one of the building blocks in predicting drug exposure and response at the personal level.

Novel ideas to improve GWAS

The workshop presentations provided several ideas to improve the use of GWAS data. Laura Yerges sought to maximize the information contained in the Heredity and Phenotype Intervention (HAPI) Heart familial study by exploring whether genetic correlation between different measures of platelet aggregation can be used to test for evidence of pleiotropy. The HAPI Heart study was designed to determine the response to four short-term interventions affecting cardiovascular disease risk factors. For the aspirin intervention, Yerges studied 729 participants who did not take aspirin for 14 days and then had baseline blood drawn to measure pre-aspirin platelet aggregation. Participants then received 81 mg/day of aspirin for 14 days and had blood drawn a second time to measure post-aspirin platelet aggregation. Three different platelet aggregation agonists were studied: the direct agonist arachidonic acid, and two indirect agonists, ADP, and collagen administered at four different doses. Genetic correlation analysis of platelet aggregation provided evidence that some agonists have shared genes responsible for aggregation before and after aspirin. In addition, genetic correlation was high across indirect-agonist measures regardless of aspirin. By contrast, significant genetic correlation between direct and indirect agonists was observed only before aspirin administration. Yerges then examined the relationship between GWAS findings and shared genetic correlation and noted that pairs of traits with higher genetic correlation also tended to have more SNPs jointly associated with both traits. She concluded that different measures of whole-blood platelet aggregation have a shared genetic component, with some genes contributing to variation in platelet aggregation regardless of the agonist used. Moreover, combining information from multiple measures of a trait might enhance genetic signals for association analyses and might be useful in pharmacogenomics research.

Extended analyses of available GWAS data can provide useful insights beyond the initial aims of the original studies, as exemplified by Jun Yang who focused on racial differences in treatment outcome of childhood acute lymphoblastic leukemia (ALL). Although ALL treatment outcome has improved dramatically in recent years, there is a racial disparity for survival of childhood ALL. Prior studies have limitations, including questionable accuracy of self-reported race and the difficulty in classifying individuals with mixed racial backgrounds. Yang used GWAS data for three aims: to define ancestries in children with ALL; to test whether these genetically defined ancestries are associated with treatment outcome; and to identify genetic variants responsible for racial differences in outcome. To accomplish these aims, principal component analysis was performed on 683 ALL cases and 365 unaffected individuals genotyped on the Affymetrix (CA, USA) 600K chip. Unaffected controls included 210 HapMap samples (Yoruba in Ibadan, Nigeria [YRI], Caucasian Europeans in Utah [CEU], Han Chinese in Beijing [CHB] and Japanese in Tokyo [JPT]) and 105 American–Indians. The first three principal components clearly separated out African, East Asian and American–Indian/Hispanic ancestry. Yang found that the Hispanic-prominent principal component score was associated with higher risk of relapse in both cases and controls. This association was validated in a second cohort of 1605 ALL patients. Furthermore, the association remained after adjustment for self-reported race, as well as within self-reported Caucasians, and was prognostic for relapse after accounting for other known prognostic factors. This study illustrated that ancestry-related genetic variations might be responsible for racial differences in outcome of childhood ALL.

Imputation is widely used to combine data from multiple GWAS to increase statistical power and precision of estimates. While imputation for homogeneous ethnic populations is well established, imputation for mixed populations, such as African–Americans or Hispanics, has not been thoroughly explored. Yiping Fan presented an overview of imputation for a cohort of mixed ancestry children with leukemia. Fan used MACH software [1] to impute untyped markers and assess the imputation error rate using 100K and 500K Affymetrix data. A set of SNPs with call rates greater than 0.95 and minor allele frequency greater than 0.01 were utilized. For imputation, reference haplotypes from the four HapMap ethnic populations were used (90 CEU, 90 YRI, 45 CHB and 45 JPT). The program STRUCTURE was used to classify the patients into ethnic groups: those with more than 80% YRI ancestry were classified as ‘Black’; those with more than 90% CEU ancestry as ‘White’; and those with more than 90% CHB and JPT ancestry, as ‘Asian’. Subjects who did not meet these criteria were classified as ‘Other’. For this latter group, pooled HapMap samples (CEU, YRI, CHB and JPT) were used as a reference population. A total of 69 of the 450 subjects studied were classified differently from their self-declared race. MACH was run with 365,125 SNPs from the Affymetrix 500K platform and the HapMap reference samples to impute more than 1 million untyped SNPs. Approximately 80,000 imputed SNPs that were on the Affymetrix 100K array but not on the 500K array were used to calculate error rates. To measure imputation accuracy, the estimated probability (p) that an average imputed genotype matched an experimental genotype was obtained from MACH. Genotype error rates were calculated using all genotypes with p of 0.95 or more. In general, the imputation was worst among the African–American samples (error rate of 3.2% for p ≥ 0.95). SNPs with the lowest maximum R² between the imputed and typed SNPs had the highest error rates. Overall, the error rate in a general admixed population of Americans was approximately 1.5%, suggesting that this is a reasonable approach for increasing the number of SNPs without direct genotyping. This study highlights the value of imputation for GWAS and reiterates the importance of evaluating the quality of imputed SNPs. Given the probabilistic nature of imputation, it was suggested that a Bayesian model could be used to incorporate the imputation error rate when analyzing measured and imputed data in GWAS.
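The ancestry classification rule and the error-rate calculation described in this paragraph can be written down compactly. The thresholds are those reported above; the function names, the data layout and the assumption that ancestry proportions arrive as fractions between 0 and 1 are hypothetical, and the order in which the rules are checked is immaterial because the thresholds cannot be met simultaneously.

```python
def classify_ancestry(yri, ceu, chb_jpt):
    """Assign a reference group from estimated ancestry fractions (0 to 1)."""
    if yri > 0.80:
        return "Black"
    if ceu > 0.90:
        return "White"
    if chb_jpt > 0.90:
        return "Asian"
    return "Other"  # imputed against the pooled HapMap reference


def imputation_error_rate(calls, min_prob=0.95):
    """Discordance between imputed and experimentally typed genotypes.

    calls    -- iterable of (imputed_genotype, typed_genotype, probability),
                where probability is the estimated probability attached to
                the imputed call
    min_prob -- only calls at or above this probability are evaluated
    Returns None if no call passes the filter.
    """
    kept = [(imp, typed) for imp, typed, prob in calls if prob >= min_prob]
    if not kept:
        return None
    errors = sum(1 for imp, typed in kept if imp != typed)
    return errors / len(kept)
```

Filtering on the call probability trades coverage for accuracy: a stricter `min_prob` lowers the error rate but leaves fewer imputed genotypes available for analysis.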

Next-generation sequence data

The invited keynote speaker, Terry Speed, discussed statistical challenges with next-generation sequence data. After a brief overview of some of the sequencing technology of the Illumina genome analyzer, Speed focused on issues arising in the analysis of transcript levels of mRNA (i.e., mRNA-Seq technology). An important concern is accounting for sources of variation in the measured transcript levels, which are found as counts of short reads. A typical scientific question is whether transcript levels differ between experimentally defined groups, such as cancer versus no cancer (i.e., between-group variation). However, it is critical to account for variation caused by technical aspects of the assay, such as individual effects (between samples within a group), library-preparation effects (caused by the process of creating cDNA fragments that are fed into the machine), flow-cell effects (samples are placed on different flow cells that correspond to separate runs of the machine) and lane effects (each flow cell has eight lanes). To account for these sources of variation, multiplicative Poisson regression models were used. Application of these methods to a set of data that included Stratagene’s (CA, USA) universal human reference and human brain as sources of mRNA demonstrated that the Poisson models fit quite well. Further use of the Poisson model to look for regions enriched for expression relative to the local background was illustrated by novel regression approaches. Many of the details of Speed’s presentation can be found in a working paper by Bullard et al. [2]. Overall, Speed’s presentation highlighted the rapid advancements of technology, and the struggles of scientists to keep up with the technology and understand the analysis tools.
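A minimal sketch of the simplest multiplicative Poisson model may help fix ideas. It assumes only two factors, a gene effect and a lane effect, so that the expected count for gene g in lane l is a_g × b_l; the models discussed in the presentation included further terms (library preparation, flow cell, group), which are omitted here. For this two-factor model the maximum-likelihood fitted means have a closed form (row total × column total / grand total). Function names are hypothetical.

```python
def fit_multiplicative_poisson(counts):
    """Fitted means under the Poisson model mu[g][l] = a_g * b_l.

    counts -- list of rows, counts[g][l] = reads for gene g in lane l
    """
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def pearson_chi_square(counts, fitted):
    """Pearson goodness-of-fit statistic; large values indicate that the
    counts vary more than gene and lane (sequencing depth) effects explain."""
    return sum((obs - exp) ** 2 / exp
               for row_obs, row_exp in zip(counts, fitted)
               for obs, exp in zip(row_obs, row_exp)
               if exp > 0)
```

If the same library is run in several lanes and the statistic is close to its degrees of freedom, (genes − 1) × (lanes − 1), the lane-to-lane variation is consistent with Poisson sampling, and differences between lanes reduce to differences in sequencing depth.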

Summary & conclusion

As illustrated by the variety of presentations at the PGRN Analysis Workshop, pharmacogenetics research faces some unique challenges – such as limited numbers of subjects with adverse drug reactions – that require innovative study designs and analytic approaches. Not too long ago, the conventional pharmacogenomics approach entailed evaluating a limited number of candidate genes using conventional designs and statistical methods. However, results from such studies have not been very promising. Recent technological advances have rapidly shifted the field to much broader approaches, using a variety of statistical methods, study designs, platforms and sample types. It takes time to fully investigate new analysis tools and their limitations, yet we do not always have the luxury to do this before the technology changes yet again. Pathway analysis keeps coming up as a potential added tool for gene discovery, but we are still limited by our understanding of the biology. It is clear, however, that as we proceed with the analysis of extensive pharmacogenomic data, our need for a more complete understanding of biological complexity is increasing.

The challenges for the future are great. As we start to extract multiple data types (e.g., SNPs, copy-number variation and DNA or RNA sequencing), how can we optimally combine all this information? Collecting new samples is often impossible or prohibitive, so how can we best capitalize on existing data and samples? What should we do when genomic studies of rare events cannot be replicated? How do we confirm results or further the investigation? How do we deal with multiple small effects, or rare variants, that are buried among random noise, especially when sample sizes are generally small? Finally, where does epigenetics fit into all of this? As we investigate these and many other questions, workshops such as that held by the PGRN will continue to play an important role in helping to advance pharmacogenomics research.

Information resource

▪ Bayesian Belief Network Software (R libraries with a GUI) is available by contacting: [email protected]

Financial & competing interests

The PGRN Analysis Workshop was supported by the NIH Pharmacogenetics Research Network (PGRN, www.nigms.nih.gov/pharmacogenetics), and by the US Public Health Service, National Institutes of Health, Pharmacogenetics Research Network contract grant number UO1 GM61388 (Richard M Weinshilboum, MD, principal investigator). The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.

No writing assistance was utilized in the production of this manuscript.


Bibliography

  • Li Y, Abecasis GR: Mach 1.0: rapid haplotype reconstruction and missing genotype inference. Am. J. Hum. Genet. S79, 2290 (2006).
  • Bullard JH, Purdom EA, Hansen KD, Durinck S, Dudoit S: Statistical inference in mRNA-Seq: exploratory data analysis and differential expression. University of California, Berkeley Division of Biostatistics Working Paper Series, Paper 247 (2009).
