Review

A systematic review of datasets that can help elucidate relationships among gene expression, race, and immunohistochemistry-defined subtypes in breast cancer

Pages 417-429 | Received 26 Jan 2021, Accepted 06 Jul 2021, Published online: 19 Aug 2021

ABSTRACT

Scholarly requirements have led to a massive increase of transcriptomic data in the public domain, with millions of samples available for secondary research. We identified gene-expression datasets representing 10,214 breast-cancer patients in public databases. We focused on datasets that included patient metadata on race and/or immunohistochemistry (IHC) profiling of the ER, PR, and HER-2 proteins. This review provides a summary of these datasets and describes findings from 32 research articles associated with the datasets. These studies have helped to elucidate relationships between IHC, race, and/or treatment options, as well as relationships between IHC status and the breast-cancer intrinsic subtypes. We have also identified broad themes across the analysis methodologies used in these studies, including breast cancer subtyping, deriving predictive biomarkers, identifying differentially expressed genes, and optimizing data processing. Finally, we discuss limitations of prior work and recommend future directions for reusing these datasets in secondary analyses.

Introduction

In clinical practice, breast tumors are classified into histopathological subtypes based on immunohistochemistry (IHC) profiles. These classifications are useful for determining a patient’s diagnosis, estimating a patient’s prognosis,Citation1 and determining appropriate patient-management strategies and therapies.Citation2,Citation3 The IHC subtypes are defined by the combined expression of three cell-surface proteins: estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER-2).Citation4,Citation5 These cell-surface receptors, which are also referred to as IHC markers, are defined as positive or negative, depending on the level of expression of the receptors. While ER and PR are bound by estrogen and progesterone, respectively, there are no known ligands for HER-2, although it can be transactivated by epidermal growth factor (EGF)-like ligands.Citation6 The activation of these cell-surface receptors by the binding of ligands results in a cascade of reactions that culminates in transcriptional regulation of genes involved in modulating biological activities such as cell proliferation and metastasis.Citation7–9

An understanding of IHC status in relation to gene expression is an important step toward understanding the mechanisms by which these markers drive breast-tumor biology and toward providing better treatment options for breast-cancer patients. Relatively recent research has uncovered “intrinsic” molecular subtypes of breast tumors, identified via gene-expression profiling.Citation10 These gene-expression patterns reflect differences in transcriptional activity in the tumors across approximately 50 genes; these subtypes have also been correlated with clinical outcomes.Citation11,Citation12 The intrinsic subtypes, to a large extent, overlap with the IHC markers (see Table 1). Although the intrinsic subtypes and IHC markers are both useful and provide insights about breast-tumor biology, this review focuses on the IHC-based markers because they are used widely in clinical practice and because data for these markers are widely available in the public domain. However, later in the review, we briefly describe research on relationships between IHC markers and the intrinsic subtypes, thus helping to connect these two informative types of molecular data.

Table 1. Immunohistochemistry classifications and related intrinsic subtypes. Here we provide a mapping between the immunohistochemistry classifications used in breast cancer diagnosis/treatment and commonly associated intrinsic subtypes. While these may routinely overlap, there may be instances in which they do not.

Within the overall population of breast-cancer patients, a subgroup of tumors referred to as triple-negative breast cancers (TNBC) have been associated with more aggressive clinical behavior and worse outcomes than other breast tumors.Citation13 TNBCs lack ER and PR expression, as well as HER-2 amplification; they constitute 10–20% of diagnosed breast cancers.Citation14 Some of the characteristic features of TNBCs include earlier age at onset, more advanced stage at diagnosis, and aggressive tumor phenotypes.Citation15 There are currently no targeted drug therapies for managing TNBCs; thus they are currently managed using chemotherapies like anthracycline, taxanes, antimetabolites, platinum agents, and novel microtubule stabilizing agents.Citation16 However, only 31% of TNBC patients experience pathological complete responses after chemotherapy,Citation17 which demonstrates the need for targeted therapies. TNBCs also have a higher tendency to metastasizeCitation18,Citation19 and are associated with a higher rate of relapse and poorer survival than non-TNBC tumors.Citation20,Citation21 When compared with TNBCs, tumors that are ER+/PR+/HER2− are more responsive to hormonal treatments,Citation22 while tumors that are ER−/PR−/HER2+ are better treated with anti-HER2 therapies.Citation23 The incidence of TNBCs is disproportionately higher in African American (AA) women than in European American (EA) women,Citation15 with 20.8% of breast cancers in AA women being diagnosed as TNBC compared to 10.4% in EA women.Citation24 Also, the mortality rate of AA women diagnosed with TNBC is 42% higher than that of EA women.Citation25 On the other hand, Asian women present with less-advanced stages and have better survival (90% after 5 years) when compared against non-Hispanic whites.Citation26,Citation27 Such disparities have led researchers to study in more detail what factors may be associated with race in breast cancer. One way they have approached this task is via gene-expression analysis.Citation28,Citation29 Evaluating gene-expression differences between individuals of different ancestry may help to elucidate biological mechanisms behind those differences and thus help to explain why clinical disparities occur.

Due to requirements of research journals and funding agencies to make research data public, thousands of datasets are hosted in public databases across the world by various institutions. Though these datasets are publicly available, they are scattered across multiple repositories, have been generated using different technologies, and have inconsistent metadata. Even though such data are in the public domain, and publications that arose from them are mostly available, researchers must undertake time-consuming efforts to find datasets that are relevant to a given topic of interest and summarize the data. This review helps to fill this void by identifying gene-expression datasets related to breast cancer, IHC status, and race. We have systematically identified 20 distinct datasets representing a total of 10,214 patient samples that align with these themes. We have examined patterns that cut across 32 publications associated with these datasets, and we provide detailed summaries about the datasets.

Methods

We searched for gene-expression datasets in the public domain related to breast cancer. More specifically, we focused on datasets that included metadata for at least some patients indicating IHC status (ER, PR, HER-2), race, or both. We focused on data that had been generated using high-throughput methods (e.g., microarrays, RNA-Sequencing) and searched the largest available gene-expression repositories: Gene Expression Omnibus (GEO),Citation30 ArrayExpress,Citation31 and Oncomine.Citation32 We identified unique datasets within these repositories that matched specific search criteria. For GEO, our search parameters were ((((“her2”) OR “estrogen receptor”) OR “progesterone receptor”) AND “race”) AND “breast cancer”. In ArrayExpress, our search parameters were “Breast Cancer”, “her2”, “race”, “estrogen receptor”, and “progesterone receptor”. Until recently, ArrayExpress maintained a copy of all GEO datasets, so all the data returned from ArrayExpress overlapped with what we already had identified in GEO. Oncomine is a commercial repository, but researchers can perform dataset searches for free after creating an account. We used this resource as an additional search tool. In Oncomine, under the Primary Filters tab > Cancer Type > Breast Cancer, we selected “Breast Carcinomas.” Next, under Sample Filters > Demographics, we selected “Race/Ethnicity”. Finally, under Dataset Filters > Data Type, we selected “mRNA” as the data type. For each dataset that we found in Oncomine, we searched for a publicly available version of the data either in GEO or ArrayExpress. After aggregating our results across these repositories, we performed additional filtering. We excluded datasets that did not have a corresponding publication as well as datasets that lacked metadata on both race and IHC receptor status. We also excluded one dataset that contained IHC status for cell lines because we wished to focus on primary tumors only. We were only interested in race and IHC receptor status, so we did not consider other metadata as filtering criteria.
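
The exclusion rules described above can be sketched in a few lines of Python. This is only an illustration of the stated criteria, not the procedure we actually ran: the `DatasetRecord` type, its field names, and the example accessions are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Hypothetical summary of one candidate dataset (field names are assumptions)."""
    accession: str
    has_publication: bool
    has_race: bool
    has_ihc: bool
    cell_lines_only: bool


def keep_dataset(record: DatasetRecord) -> bool:
    """Apply the three exclusion rules stated in the Methods text."""
    if not record.has_publication:
        return False  # no corresponding publication
    if not (record.has_race or record.has_ihc):
        return False  # lacks metadata on both race and IHC status
    if record.cell_lines_only:
        return False  # focus on primary tumors only
    return True


# Made-up candidates, purely for illustration.
candidates = [
    DatasetRecord("EXAMPLE-1", True, True, True, False),
    DatasetRecord("EXAMPLE-2", True, False, False, False),
    DatasetRecord("EXAMPLE-3", False, True, True, False),
    DatasetRecord("EXAMPLE-4", True, False, True, True),
]
kept = [c.accession for c in candidates if keep_dataset(c)]
print(kept)
```

Note that a dataset is retained when it has either type of metadata; only datasets lacking both are dropped.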

Results

Datasets

We collected information about the journal article(s) associated with each dataset; this information includes author name, PubMed ID, data source, number of samples, availability of raw data, gene-expression profiling technology used, and the type(s) of metadata available for each dataset (Table 2). After filtering, we identified a total of 20 datasets comprising 10,214 samples. Fourteen of the datasets had raw data available. Fifteen datasets comprising 2231 samples were from microarrays (10 used Affymetrix technologies; 5 used Agilent technologies), while 5 were based on RNA sequencing (Illumina). Depending on the platform used, the raw data are CEL files (Affymetrix microarrays), XML or .txt files (Agilent microarrays), or FASTQ files with read sequences and quality scores from next-generation sequencing data.

Table 2. Gene-expression data sources. This table summarizes all gene-expression datasets that we found in the public domain related to breast cancer and immunohistochemistry status and/or race. Most identifiers in the data source column reference data series from Gene Expression Omnibus; exceptions are noted. The raw data column indicates whether raw data files (for example, CEL or FASTQ) were available for each dataset. The race status and IHC status columns indicate whether race and/or IHC status was available for at least some patients in each dataset. NA = Not Available.

The minimum number of samples per dataset was 24 (Chang et al.Citation36), while the maximum number was 3678 (Brueffer et al.Citation33). The median number of samples per dataset was 156. All 20 datasets provided IHC receptor status for at least some patients (Table 3). In total, there were 6635 ER+ samples, 4192 PR+ samples, and 1309 HER2+ samples. There were 2938 ER- samples, 3631 PR- samples, and 6038 HER2- samples. The process of determining IHC status is subjective; sometimes histopathologists are unable to clearly determine IHC status, or they disagree about the status of a sample. This was apparent for some patients for whom receptors had been examined but IHC status was labeled as “unknown”.Citation33,Citation34,Citation37–39,Citation53 In other cases, IHC status was provided for some markers but not for others. In total, ER status was missing for 591 samples, PR for 1978 samples, and HER-2 for 1720 samples.

Table 3. Datasets with immunohistochemistry information.

Fifteen datasets had race information for at least some samples. These datasets represented a total of 3007 samples (Table 4), although race status was unavailable for some samples in these datasets. We classified the datasets based on high-level, racial/ethnic categories: Asian, Black, Hispanic, Mixed, or White (not Hispanic). Patients with a race/ethnicity classification that did not fit into any of these categories were placed in an “Other” group, while samples with no information about race/ethnicity are listed as “NA”. In total, data were available for 225 samples from Asians, 383 from Blacks, 145 from Hispanics, 2032 from Whites, and 33 from Other. For 186 samples in these datasets, no race/ethnicity information was available. Race information was used differently across the studies. For example, in one study gene-expression data were used to compare races,Citation43 while in most studies, race information was collected as one of multiple patient characteristics.
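
As a quick arithmetic check, the per-category counts quoted above can be summed and compared with the stated total. The snippet below uses only numbers from the text; the interpretation of the remainder is an assumption.

```python
# Counts quoted in the paragraph above.
race_counts = {
    "Asian": 225,
    "Black": 383,
    "Hispanic": 145,
    "White": 2032,
    "Other": 33,
    "NA": 186,
}
reported_total = 3007

listed_total = sum(race_counts.values())
unlisted = reported_total - listed_total
print(listed_total, unlisted)  # 3004 3
```

The listed categories account for 3004 of the 3007 samples. The remaining 3 samples presumably belong to the "Mixed" category, for which no count is quoted in the sentence above; this reading is an assumption rather than a figure taken from the table.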

Table 4. Distribution of samples within datasets with race information.

Thirteen datasets provided both IHC receptor status and race information (). By examining datasets that have both types of metadata, researchers may be able to shed light on associations among gene-expression levels, IHC receptor status, and race, including relationships between TNBC and race. Hippen Citation113

Each of the datasets we identified has been analyzed in at least one primary research article. We examined these articles to identify research questions that have been addressed using these datasets and summarized the articles based on common methodologies used to analyze the data (Table 5). In some cases, we also cite articles not associated with these datasets to provide additional context. Our findings are described in the following sections.

Table 5. Themes across journal articles.

IHC receptor status and breast-cancer patient outcomes

It is widely accepted that the IHC status of breast tumors has prognostic significance;Citation60 these markers are commonly used in determining the choice of therapy, including endocrine therapies, chemotherapies, or combinations of drugs.Citation61–63 Generally, breast cancer patients with ER+/PR+ tumors have a lower risk of mortality when compared to women diagnosed with ER+/PR-, ER-/PR+ or ER-/PR- tumors (p < .05, hazard ratio 95% CI [2.2–2.4]).Citation64 Women with ER+ and PR+ tumors also have a greater chance of survival when treated with endocrine therapies such as letrozole or tamoxifen.Citation65 Similarly, Prat et al.Citation66 showed that HER2+ tumors benefit from being treated with trastuzumab-based chemotherapy, with 44% showing a pathological complete response (pCR), which is defined as the absence of residual invasive tumor in the breast and axillary lymph nodes.Citation66 Altogether, these findings illustrate the importance of IHC status in the treatment, management, and outcomes of breast cancer.

Race and breast-cancer diagnoses and outcomes

The influence of racial differences on the incidence and mortality rates of breast cancer is not well understood. Some studiesCitation67,Citation68 have highlighted potential multifactorial causes of these disparities, such as access to healthcare, socioeconomic status, treatment delays, and tobacco use, but adjusting for these factors does not fully explain these disparities.Citation69 Even in high-income countries, such as the US and the UK, disparities exist. In the US, AA women have a 41% higher breast cancer-related death rate than do EA women; the 5-year relative survival is 74% in AA women but 86% in EA women.Citation70 In the UK, 5-year breast cancer relapse-free survival is 62.8% for black women, compared to 77% for white women, despite equal access to healthcare.Citation71,Citation72 The Surveillance, Epidemiology and End Results Program Cancer Statistics Review (1975–2014) reports that up to the age of 45 years, population-based incidence rates of breast cancer are higher for AA compared to EA women; furthermore, the median age at breast cancer diagnosis is 63 years for EA patients compared to 59 for AA patients.Citation73 The median age of breast cancer patients in Africa tends to be lower (60 years) than that of patients in Europe and North America (63 years), although this younger age distribution may reflect the shorter life expectancy for populations in low and middle income countries.Citation74 Smith-Bindman et al.Citation75 argued that a possible explanation for these differences is that AA women are less likely to receive routine mammography screening and therefore tend to have larger tumors at a more advanced stage, potentially allowing the accumulation of deleterious mutations in the tumors. Among Asian populations, patients from Japan present with less advanced stages of breast cancer and have better survival rates than non-Hispanic whites.Citation76 Hispanic women also have a lower incidence of breast cancer than non-Hispanic white women.Citation71 Finally, populations of African descent are diagnosed disproportionately with TNBCs when compared to populations of European descent,Citation77 whose breast tumors are more frequently ER+ and/or PR+.

With these statistics in mind, it is imperative that scientists and researchers develop new strategies and approaches to improve outcomes in these populations that are more adversely affected. Survival rates may be increased and mortality may be reduced through epidemiological efforts to encourage habits that will help in breast cancer prevention, such as early detection. In addition, a deeper understanding of the molecular mechanisms that drive these differences can advance efforts to aid these populations.

IHC receptor status as a variable in gene-expression studies of breast cancer

Medical oncologists often determine treatment and management strategies for breast-cancer patients based on the expression of IHC markers. However, the methodologies used to identify these markers often vary between laboratories, are subjective, and are not always reproducible.Citation78,Citation79 Among the studies we examined, four evaluated the use of gene-expression data to guide IHC status assessment.Citation33,Citation48,Citation55,Citation56 Brueffer et al.Citation33 predicted IHC status using machine-learning classifiers and achieved an accuracy of 95.3% for ER, 90.4% for PR, and 88.5% for HER2. Lu et al.Citation48 were able to predict ER+ and HER2+ tumors with 98% and 93% accuracy, respectively, using the recursive support vector machine algorithm. Popovici et al.Citation55 and Shi et al.Citation56 also developed multi-gene predictors for ER+ status, and their accuracy ranged from 87% to 93%. In most other studies, IHC status was collected in addition to other clinical factors but was not directly correlated with gene-expression levels. Some studies used IHC status to assess similarity in the relative risk of breast cancer mortality among different patient subsets. For example, Dunnwald et al.Citation64 found that women who had ER+/PR+ tumors had lower risk of mortality compared to women who had ER+/PR-, ER-/PR+, or ER-/PR- tumors; this risk was independent of demographic characteristics. On the other hand, Creighton et al.Citation38 used reported IHC status as a way to estimate the most appropriate therapy to administer, with patients receiving endocrine therapy if they were ER+/PR+. The use of more objective and consistent methods for determining IHC status, such as via gene-expression profiling, could increase confidence in the determination of treatment and management strategies.

Race as a variable in gene-expression studies of breast cancer

Although many studies provided race information alongside gene-expression data, race was used as the primary variable only in the study by Lindner et al.Citation43 Focusing on gene-expression data from TNBC patients, their unsupervised analysis revealed a basal-like gene signature that was differentially expressed between AA and EA patients. AA tumors also had a higher genomic grade when compared to EA tumors.

There are conflicting reports on gene-expression differences between AA and EA women. Some studies have used gene-expression data (although not publicly available) to compare different racial populations. For example, Sturtz et al.Citation80 confirmed that the frequency of TNBC was significantly higher (P < .001) in African Americans (28%) when compared to Caucasians (12%), although principal component analysis (PCA) did not detect significant gene-expression differences between these populations.

There have also been documented differences in mortality rates between populations. For instance, Huo et al.Citation81 analyzed data from 154 AA patients and 776 EA patients in The Cancer Genome Atlas (TCGA). In addition to identifying 142 genes that were differentially expressed between these populations, they observed different survival outcomes between breast-cancer patients of European ancestry and African ancestry, confirming prior evidence that patients of African ancestry have worse overall outcomes. They also observed other molecular differences between these populations, with AA patients having more TP53 mutations and fewer PIK3CA mutations than EA patients.

Chavez-MacGregor et al. evaluated gene-expression data for 376 samples but reported no significant differences in RNA expression among tumors from black, white, and Hispanic individuals.Citation82 They reported that unsupervised clustering of significant probe sets showed differences according to race; yet their validation set did not confirm this pattern. A possible explanation for these conflicting reports is the small number of samples that were used in their analysis, thus emphasizing the need for large-scale, gene-expression analysis studies that compare different ancestral groups.

Analysis methodologies

Differential gene-expression analysis

Several studies have attempted to identify genes that are expressed differently between breast-cancer subgroups. Chang et al. compared expression profiles of patients who had experienced resistance or an incomplete response to docetaxel against patients who responded well to this drug.Citation36 Their data suggest that the expression of genes in the mTOR pathway may be responsible for a lack of response to docetaxel. Creighton et al.Citation38 examined the relationship between obesity and breast-cancer outcomes. In comparing obese and non-obese patients, they discovered gene-expression patterns associated with obesity that correlated with increased insulin-like growth factor (IGF) signaling. This was associated with worse outcomes in the obese patients.

Lindner et al.Citation43 sought to shed light on gene-expression differences between EA populations and populations of African descent in TNBC patients. They performed an unsupervised analysis and identified a basal-like gene-expression signature that was differentially expressed between these populations. This basal-like signature was strongly associated with increased expression of insulin-like growth factor 1 (IGF1). Thus, therapeutic regimens that target the IGF1 pathway may be beneficial to AA patients.

Overexpression of HER2 in breast tumors may lead to tamoxifen resistance,Citation83 though the mechanisms of this resistance are poorly understood. Loi et al.Citation44 performed a differential gene-expression analysis and concluded that, independently of HER2 overexpression, activation of growth factor (GF) signaling pathways in patients with ER+/HER2+ tumors contributes to poor prognosis for patients with this molecular subtype.

When identifying differentially expressed genes, sample sizes should be considered. Small sample sizes can lead to false negatives and inconsistent findings between studies. Ching et al. used a simulation analysis to estimate sample sizes that are needed for one-factor, two-sample designs such as those described above.Citation84 They showed that statistical power is conditional upon the dataset and experimental conditions. For example, when replicates in a single group were homogeneous and differences between the groups were large, a power value of 0.8 was attainable for sample sizes as small as 5 or 10. However, for more complex phenotypes, such as differences between human populations, sample sizes of at least 25 were necessary. With the exception of Chang et al., which used 24 samples, differential-expression analyses exceeded these thresholds. When statistical power is lacking, researchers may be able to combine datasets in gene-expression meta-analyses.Citation85
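
To make the dependence of power on sample size and effect size concrete, the following is a minimal Monte Carlo sketch for a single gene in a two-group design. It is not the simulation procedure of Ching et al.: it assumes Gaussian expression values with unit variance and substitutes a permutation test on the difference in means, and all function and parameter names are illustrative.

```python
import random
import statistics


def permutation_p_value(a, b, n_perm, rng):
    """Two-sided permutation p-value for a difference in group means."""
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of the samples
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_perm + 1)


def estimate_power(n_per_group, effect_size, n_sim=200, n_perm=200,
                   alpha=0.05, seed=0):
    """Fraction of simulated experiments in which the group difference is detected.

    effect_size is the true difference in means, in standard-deviation units.
    """
    rng = random.Random(seed)
    detected = 0
    for _ in range(n_sim):
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        if permutation_p_value(group_a, group_b, n_perm, rng) <= alpha:
            detected += 1
    return detected / n_sim


if __name__ == "__main__":
    for n, d in [(5, 3.0), (5, 0.5), (25, 0.5)]:
        print(f"n={n:2d} per group, effect={d}: power ~ {estimate_power(n, d):.2f}")
```

Under these assumptions, a very large effect is detectable with only a handful of samples per group, whereas a modest effect is not. A genome-wide analysis would additionally require a multiple-testing correction, which lowers the per-gene power and raises the sample sizes needed.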

Predictive biomarkers

Multiple studies have attempted to use gene-expression profiles to predict patient characteristics and outcomes. Brueffer et al.Citation33 predicted IHC receptor status for breast tumors using single- and multi-gene classifiers. They attempted to discriminate between tumors that exhibited either positive or negative expression of these markers. The accuracies of the models were as follows: ER, 95.3% ± 2.4%; PR, 90.4% ± 2.9%; and HER2, 88.5% ± 3.8%. In addition, they made predictions for Ki67 (84.9% ± 3.4%) and NHG (73.8% ± 3.9%),Citation86,Citation87 two other markers that are sometimes used to make clinical decisions for breast-cancer patients. Being able to predict IHC status from gene-expression patterns could help in cases where pathologists are unable to determine IHC status unequivocally, thus informing possible treatment courses. Predicting response to therapeutic regimens can also help inform physicians about which regimens a patient will respond to or be resistant to. Chang et al.Citation88 predicted responses to docetaxel using a linear classifier and achieved 88% accuracy (95% C.I. = 68–97%). Julka et al.Citation41 predicted responses to neoadjuvant drug combinations: gemcitabine + doxorubicin and gemcitabine + cisplatin. Using a k-nearest neighbors classification approach, they achieved accuracies between 73% and 78%. Ayers et al.Citation89 achieved 78% accuracy (95% C.I. = 52–94%) for predicting responses to paclitaxel followed by a combination of fluorouracil + doxorubicin + cyclophosphamide. Although the reported accuracies for these studies are somewhat comparable, it is difficult to make legitimate comparisons from one study to the next because validation strategies (e.g., cross validation or use of an independent test set) are not standardized across studies. Furthermore, different quantitative methods were used in these prediction attempts. These included diverse machine-learning algorithms, but there was no consensus method, as each group selected algorithm(s) based on what they deemed to be most suitable to their needs.
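
To illustrate what a single-gene classifier evaluated by cross validation involves, here is a minimal sketch. It does not reproduce any of the cited models: the data are simulated, the marker gene is hypothetical, and the classifier is a plain expression cutoff chosen on the training folds only.

```python
import random


def fit_threshold(values, labels):
    """Choose the cutoff that maximizes training accuracy for the rule
    'expression > cutoff means receptor-positive'."""
    candidates = [float("-inf")] + sorted(set(values))
    best_cutoff, best_correct = candidates[0], -1
    for cutoff in candidates:
        correct = sum((v > cutoff) == y for v, y in zip(values, labels))
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return best_cutoff


def cross_validated_accuracy(values, labels, k=5, seed=0):
    """k-fold cross validation: the cutoff is refit without each held-out fold."""
    indices = list(range(len(values)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    correct = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        cutoff = fit_threshold([values[i] for i in train],
                               [labels[i] for i in train])
        correct += sum((values[i] > cutoff) == labels[i] for i in held_out)
    return correct / len(values)


def simulate_marker(n_per_class, seed=1):
    """Simulated log-expression of a hypothetical marker gene (assumed values):
    higher on average in receptor-positive tumors."""
    rng = random.Random(seed)
    values = [rng.gauss(10.0, 1.0) for _ in range(n_per_class)]
    values += [rng.gauss(6.0, 1.0) for _ in range(n_per_class)]
    labels = [True] * n_per_class + [False] * n_per_class
    return values, labels


if __name__ == "__main__":
    expression, status = simulate_marker(100)
    print(f"5-fold accuracy: {cross_validated_accuracy(expression, status):.3f}")
```

Because the cutoff is never fit on the samples used to score it, the resulting accuracy is less optimistic than a training-set accuracy. Differences of this kind in validation design are one reason accuracies are hard to compare across studies.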

Among the studies we evaluated, two used molecular data integration to predict patient characteristics. Huang et al. and Chin et al. both integrated gene-expression measurements with genome copy-number data. Huang et al.Citation40 identified genes with copy number variations that were differentially expressed based on ER and HER2 status. These genes were then used to derive gene-expression signatures associated with disease-free survival. They built a survival prediction model based on 16 genes. Chin et al.Citation37 identified gene markers that predicted reduced survival as well as those that could act as possible therapeutic targets. In addition, their research showed that patterns associated with copy-number abnormalities differed among gene-expression based tumor subtypes. However, no gene markers overlapped between these two studies. Future studies that continue the trend of integrating data from multiple assay types may lead to a better understanding of how different types of molecular aberrations contribute jointly to the development of breast cancer and may improve the concordance of analysis results.

Bioinformatics (data processing methods)

Publicly available data have been used to evaluate processing methods for RNA-Sequencing data. Rahman et al.Citation53 observed in experimental replicates that a commonly used pipeline for quantifying gene-expression levels resulted in high variability among cell lines in which HER-2 was activated. This pipeline had been used to process data from TCGA, which is used widely for research analyses; these findings raised concerns that inferences made from TCGA data would be inaccurate. In addition, the pipeline produces normalized gene-expression values, whereas the DESeq2Citation90 algorithm commonly used for differential-expression analyses requires that expression levels be represented as raw read counts. To address these concerns, they used the Rsubread packageCitation91 to reprocess the data for 9,264 tumor samples across 24 cancer types, including 1119 breast tumors.Citation53 In this way, IHC status was used as an external validation measure, which may be a useful approach in other studies. Finally, the authors curated clinical data from TCGA, including IHC status and race information, as a way to make it easier for the scientific community to make inferences regarding these variables.

Breast cancer subtypes

The IHC classification scheme presently in use, while useful, is still too broad.Citation92 For example, individuals of the same IHC subtype may not benefit from the same regimens. Several studiesCitation54,Citation63 have used gene-expression data in an attempt to derive more specific breast cancer subtypes. The use of gene-expression data to define subtypes may allow individualized treatment by characterizing molecular profiles unique to these subtypes. Thus, specialized treatments could be tailored to these profiles in a more specific way than using IHC status alone.

Burstein et al.Citation34 focused their analysis on TNBCs. Using the Differential Expression via Distance Summary algorithm,Citation93 they identified a subset of genes associated with TNBC status and then used non-negative matrix factorizationCitation94 to aggregate TNBC patients into four clusters. These clusters formed the basis for discrete TNBC subtypes, each with a different prognosis. They labeled these subtypes as luminal androgen receptor, mesenchymal, basal-like immunosuppressed (BLIS), and basal-like immune-activated (BLIA). BLIS tumors had the worst prognosis, while BLIA tumors had the best prognosis. This classification differs from that of Curtis et al.,Citation39 who used joint clustering of copy-number and gene-expression data to define an ER+ subgroup, a luminal A subgroup, a subgroup that includes both ER+ and ER- cases, and several subgroups that consisted of mostly ER+ cancers as well as HER2-enriched (ER-) cases. In contrast, Loi et al.Citation45 focused on ER+ breast cancer cases. Using an algorithm called the “gene expression grade index” (GGI), which summarizes the similarity between gene-expression profiles and tumor grade, they attempted to separate ER+ tumors into two distinct and prognostic molecular subtypes. Even though the GGI method showed some similarity with previous studies,Citation95,Citation96 they suggested that it provides specific benefits: first, it is simply derived by averaging the expression levels of 97 genes; second, the biological functions of the selected genes are well understood due to their correlation with histologic grade.
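
The idea of an index built by averaging the expression of grade-associated genes can be sketched as follows. This is a simplified illustration rather than the published GGI: the gene names and expression values are hypothetical, the split into genes that are higher in high-grade versus low-grade tumors is an assumption of the sketch, and any scaling or offset is omitted.

```python
from statistics import fmean


def grade_index(expression, high_grade_genes, low_grade_genes):
    """Mean expression of genes elevated in high-grade tumors minus
    mean expression of genes elevated in low-grade tumors."""
    high = fmean(expression[g] for g in high_grade_genes)
    low = fmean(expression[g] for g in low_grade_genes)
    return high - low


# Hypothetical log-expression values for one tumor.
tumor = {"UP_GENE_1": 8.0, "UP_GENE_2": 10.0, "DOWN_GENE_1": 4.0}
score = grade_index(tumor, ["UP_GENE_1", "UP_GENE_2"], ["DOWN_GENE_1"])
print(score)  # 5.0
```

A higher score indicates an expression profile that more closely resembles high-grade tumors; in this sketch, splitting tumors into two groups would amount to choosing a cutoff on the score.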

Although these differences in classification may be due to the methodologies as well as the selected datasets used, there is some overlap between the different classification schemes. Identifying biomarkers in these ways may provide a basis for directed clinical research and make it easier to develop targets for more effective treatment.

Relationship between IHC status and intrinsic subtypes

This review has addressed what can be learned about breast cancer based on IHC subtypes. We also mentioned the “intrinsic” subtypes, which are based on gene-expression levels and have shown promise for use in clinical settings.Citation95,Citation97 Unfortunately, these two systems for subtyping breast tumors are sometimes inconsistent with each other, reaching agreement levels of 75% or lower for some subtypes.Citation98,Citation99 By integrating IHC status with gene-expression data, it may be possible to derive aggregate subtypes that combine evidence across these two systems. Indeed, ER status is already used to guide the gene-centering normalization approach used for assigning tumors to the intrinsic subtypes.Citation10,Citation100 However, this approach relies on a balance between ER+ and ER- tumors. Raj-Kumar et al.Citation101 applied principal component analysis to gene-expression data from Curtis et al.Citation39 and Koboldt et al.Citation53,Citation54 to evaluate overlap among the subtypes assigned by these two systems and to derive integrated subtypes that ensure an ER balance. Their approach shows promise as a way to reconcile these methods and derive more clinically relevant subtypes.
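
The reason gene centering depends on ER balance can be shown with a small sketch. It is not the procedure of any cited study: it assumes per-gene median centering, draws an equal number of ER+ and ER- tumors at random to form the reference set, and uses a hypothetical gene name and made-up values.

```python
import random
from statistics import median


def er_balanced_centering(samples, er_positive, seed=0):
    """Center each gene on its median within an ER-balanced reference subset.

    samples: list of dicts mapping gene name to expression value.
    er_positive: list of booleans, one per sample.
    """
    rng = random.Random(seed)
    pos = [i for i, flag in enumerate(er_positive) if flag]
    neg = [i for i, flag in enumerate(er_positive) if not flag]
    n = min(len(pos), len(neg))
    if n == 0:
        raise ValueError("need at least one ER+ and one ER- sample")
    # Equal numbers of ER+ and ER- tumors define the centering reference.
    reference = rng.sample(pos, n) + rng.sample(neg, n)
    genes = list(samples[0])
    centers = {g: median(samples[i][g] for i in reference) for g in genes}
    return [{g: s[g] - centers[g] for g in genes} for s in samples]


# Three ER+ tumors and one ER- tumor; GENE_X is a hypothetical ER-associated gene.
cohort = [{"GENE_X": 10.0}, {"GENE_X": 10.0}, {"GENE_X": 10.0}, {"GENE_X": 2.0}]
status = [True, True, True, False]
centered = er_balanced_centering(cohort, status)
print([s["GENE_X"] for s in centered])  # [4.0, 4.0, 4.0, -4.0]
```

If the median were taken over this unbalanced cohort as a whole, it would equal 10.0 and every ER+ tumor would be centered to zero, hiding the fact that the gene is highly expressed in those tumors. Centering on a balanced reference avoids that distortion.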

Discussion

We have summarized 32 studies that reported findings from 20 distinct datasets representing a total of 10,214 patient samples. We have illuminated diverse types of research that have been done with these data and emphasized findings about relationships among IHC receptor status, race, and gene-expression levels in breast cancer. To the best of our knowledge, no prior review has emphasized this combination of topics. Some research articles have reported findings from multiple datasets related to breast cancer and IHC receptor status. However, these have been restricted to relatively few datasets or have had a relatively narrow focus. For example, Roelands et al.Citation102 curated a compendium of 13 publicly available breast cancer datasets covering 2,142 cases, but they did not make inferences regarding race or IHC status. Karn et al.Citation103 extracted data from 15 publicly available datasets for 394 TNBC samples and an independent validation set of 261 samples; using these data, they derived prognostic gene signatures, but they did not take race into account when developing these signatures. Gong et al.Citation104 attempted to identify breast cancer targets using 5 datasets covering 257 breast cancer patients and 98 normal controls; they identified differentially expressed genes between these groups and then mapped them to genes in the Thomson Reuters Integrity Database (a database that hosts drugs, their gene targets, and diseases associated with them). Again, that study made no reference to race or IHC status.

The studies associated with datasets in this review differed widely in the research questions that they addressed. For the few studies that addressed the same research questions, we observed a mix of concordance and discordance. Brueffer et al.Citation33 and Lu et al.Citation48 achieved similar levels of accuracy when predicting IHC status using machine-learning classifiers, even though these samples were processed on different expression-profiling platforms. Huang et al.Citation40 and Chin et al.Citation37 combined gene-expression measurements with copy-number variations. Both identified instances in which copy number was correlated with gene-expression levels, and both studies associated these observations with patient survival; however, the genes that they identified were inconsistent with each other, perhaps in part because their analysis methodologies differed considerably.

Different studies use different gene-expression profiling platforms (e.g., microarray and RNA-Sequencing), data-processing methodologies, prediction algorithms, etc. These differences can present challenges, especially when researchers wish to combine datasets. Perhaps most notably, batch effects and other technical artifacts introduce noise.Citation105 Software tools are available for modeling and removing batch effects and other unwanted variation;Citation106 however, adjusting for platform differences remains a challenge. For differential gene-expression analyses, tools exist for combining datasets from multiple platforms; these include rank-based methods and p-value aggregation techniques.Citation85,Citation107 However, for other types of analyses, such as multi-study machine learning,Citation108 methods are less well developed.
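As one concrete instance of p-value aggregation, the sketch below implements Fisher's method, a classical technique for combining independent p-values for the same gene across studies. It is offered as an illustration and is not necessarily the technique used in the cited work. It assumes that the studies are independent and that every p-value lies in the interval (0, 1]; the gene labels and p-values are hypothetical.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis. For even degrees
    of freedom the upper-tail probability has the closed form
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!.
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i  # (X/2)^i / i!
        total += term
    return math.exp(-half_stat) * total


# Hypothetical per-study p-values for two genes across three datasets.
per_gene = {
    "GENE_A": [0.04, 0.06, 0.03],
    "GENE_B": [0.50, 0.70, 0.40],
}
combined = {gene: fisher_combined_p(ps) for gene, ps in per_gene.items()}
print(combined)
```

Fisher's method ignores the direction of each effect, so a gene that is up-regulated in one study and down-regulated in another can still receive a small combined p-value. Rank-based approaches avoid this by working with ordered, signed statistics.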

Publicly available datasets differ widely in the clinical metadata associated with them.Citation109 Curation efforts such as the MetaGxBreastCitation110 package can help to alleviate this problem, providing harmonized datasets that can be analyzed more readily. However, this resource is limited to a relatively small subset of available breast-cancer datasets and lacks metadata about race. Expansive, curated resources that place more emphasis on underrepresented populations could help to provide new insights into the interplay between race and breast-cancer subtypes. Furthermore, data collection and sharing could be improved – and better secondary research could be facilitated – if researchers provided a standard set of clinical variables alongside gene-expression data. For breast cancer, such variables might include IHC status, the patient’s race and ethnicity, age at diagnosis, tumor grade, and treatment responses. Better still would be to provide data about patients’ environmental contexts, but few studies collect or share environmental variables alongside gene-expression data. Environmental factors can alter gene-expression levels and thus should not be overlooked in disease studies.Citation111 Approaches used by agricultural and evolutionary geneticists can help shed light on how an individual’s environment influences gene-expression levels and how these factors interact with a person’s genetic background. For example, researchers might study individuals who are genetically similar but live in diverse geographical regions and have differing lifestyles.Citation111

Most studies in this review focused on gene-expression levels in tumor tissue. When evaluating differences in expression between two races, questions may arise as to whether those differences are specific to breast tumors or whether those differences are due to the patients’ genetic background alone. With the exception of two datasets, none included gene-expression measurements for normal tissues. Rahman et al. provided a reprocessed version of data from The Cancer Genome Atlas (TCGA). Curtis et al. described data from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC). In both cases, normal data were provided alongside tumor data, enabling researchers to perform diverse types of research with the data; however, to our knowledge, neither dataset has been used to evaluate gene-expression differences between races in non-tumor conditions. We suggest that, where possible, authors of future studies include non-cancer gene-expression profiles as a way to control for baseline differences between groups.

We live in an era when gene-expression data are widely available in the public domain. Exciting opportunities exist for reusing data to derive new insights. For example, by examining patterns that span multiple datasets, researchers can increase sample sizes and ask new questions that would otherwise be infeasible to ask. However, even the process of finding available datasets can be a challenge.Citation112 By summarizing publicly available gene-expression datasets that have IHC and/or race metadata and providing an overview of findings that have already been made, this review lays the groundwork for future studies that curate, combine, and analyze these datasets. For example, identifying genes that are differentially expressed across many studies between IHC-based subgroups or different populations could provide more insights about the mechanisms that drive these subtypes and guide efforts to develop targeted treatments. As we have noted, IHC and race/ethnicity status can be ambiguous to determine in some cases. IHC status often cannot be assigned a definitive positive or negative value, and individuals’ ancestries are often mixed or unknown. Using gene-expression data to impute missing values or clarify ambiguous values could make datasets more completeCitation113 and be useful in clinical settings. In addition, researchers may be able to develop more robust prognostic signatures – via larger sample sizes and taking covariates such as IHC status, race, and age into account – and ensure that the models generalize to independent datasets.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Correction Statement

This article has been corrected with minor changes. These changes do not impact the academic content of the article.

References

  • Sørlie T, Wang Y, Xiao C, Johnsen H, Naume B, Samaha RR, Børresen-Dale A-L. Distinct molecular mechanisms underlying clinically relevant subtypes of breast cancer: gene expression analyses across three different platforms. BMC Genomics. 2006;7(1):127. doi:10.1186/1471-2164-7-127.
  • Schnitt SJ. Classification and prognosis of invasive breast cancer: from morphology to molecular taxonomy. Modern Pathology: An Official Journal of the United States and Canadian Academy of Pathology, Inc. 2010;23(Suppl 1):S60–4. doi:10.1038/modpathol.2010.33.
  • Al-Ejeh F, Simpson PT, Sanus JM, Klein K, Kalimutho M, Shi W, Miranda M, Kutasovic J, Raghavendra A, Madore J, et al. Meta-analysis of the global gene expression profile of triple-negative breast cancer identifies genes for the prognostication and treatment of aggressive breast cancer. Oncogenesis. 2014;3(4):e100. doi:10.1038/oncsis.2014.14.
  • Howlader N, Altekruse SF, Li CI, Chen VW, Clarke CA, Ries LAG, Cronin KA. US incidence of breast cancer subtypes defined by joint hormone receptor and HER2 status. JNCI: Journal of the National Cancer Institute. 2014;106(5). doi:10.1093/jnci/dju055.
  • Davis M, Tripathi S, Hughley R, He Q, Bae S, Karanam B, Martini R, Newman L, Colomb W, Grizzle W, et al. 2018. AR negative triple negative or “quadruple negative” breast cancers in African American women have an enriched basal and immune signature. PLOS ONE. 13(6):e0196909. DOI:10.1371/journal.pone.0196909
  • Rubin I, Yarden Y. The basic biology of HER2. Annals of Oncology: Official Journal of the European Society for Medical Oncology/ESMO. 2001;12(Suppl 1):S3–8. doi:10.1093/annonc/12.suppl_1.S3.
  • Schiff R, Massarweh SA, Shou J, Bharwani L, Arpino G, Rimawi M, Osborne CK. Advanced concepts in estrogen receptor biology and breast cancer endocrine resistance: implicated role of growth factor signaling and estrogen receptor coregulators. Cancer Chemother Pharmacol. 2005;56(Suppl S1):10–20. doi:10.1007/s00280-005-0108-2.
  • Seton-Rogers S. Breast cancer: untangling the role of progesterone receptors. Nature reviews. Cancer. 2015;15(8):456.
  • Moasser MM. The oncogene HER2: its signaling and transforming functions and its role in human cancer pathogenesis. Oncogene. 2007;26(45):6469–6487. doi:10.1038/sj.onc.1210477.
  • Parker JS, Mullins M, Cheang MCU, Leung S, Voduc D, Vickery T, Davies S, Fauron C, He X, Hu Z, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. Journal of Clinical Oncology: Official Journal of the American Society of Clinical Oncology. 2009;27(8):1160–1167. doi:10.1200/JCO.2008.18.1370.
  • Sørlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S, et al. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci U S A. 2003;100(14):8418–8423. doi:10.1073/pnas.0932692100.
  • Dai X, Li T, Bai Z, Yang Y, Liu X, Zhan J, Shi B. Breast cancer intrinsic subtype classification, clinical use and future trends. Am J Cancer Res. 2015;5(10):2929–2943.
  • Dent R, Trudeau M, Pritchard KI, Hanna WM, Kahn HK, Sawka CA, Lickley LA, Rawlinson E, Sun P, Narod SA. Triple-negative breast cancer: clinical features and patterns of recurrence. Clinical Cancer Research: An Official Journal of the American Association for Cancer Research. 2007;13(15):4429–4434. doi:10.1158/1078-0432.CCR-06-3045.
  • Lehmann BD, Bauer JA, Chen X, Sanders ME, Chakravarthy AB, Shyr Y, Pietenpol JA. Identification of human triple-negative breast cancer subtypes and preclinical models for selection of targeted therapies. J Clin Invest. 2011;121(7):2750–2767. doi:10.1172/JCI45014.
  • Siddharth S, Sharma D. 2018. Racial disparity and triple-negative breast cancer in African-American Women: a multifaceted affair between obesity, biology, and socioeconomic determinants. Cancers. 10(12):514. 10.3390/cancers10120514.
  • Amos KD, Adamo B, Anders CK. Triple-negative breast cancer: an update on neoadjuvant clinical trials. Int J Breast Cancer. 2012;2012:385978. doi:10.1155/2012/385978.
  • von Minckwitz G, Untch M, Blohmer J-U, Costa SD, Eidtmann H, Fasching PA, Gerber B, Eiermann W, Hilfrich J, et al. 2012. Definition and Impact of pathologic complete response on prognosis after neoadjuvant chemotherapy in various intrinsic breast cancer subtypes. Journal of Clinical Oncology. 30(15):1796–1804. DOI:10.1200/jco.2011.38.8595
  • Liedtke C, Mazouni C, Hess KR, André F, Tordai A, Mejia JA, Fraser Symmans W, Gonzalez-Angulo AM, Hennessy B, Green M, et al. 2008. Response to neoadjuvant therapy and long-term survival in patients with triple-negative breast cancer. Journal of Clinical Oncology. 26(8):1275–1281. DOI:10.1200/jco.2007.14.4147
  • Carey LA, Dees EC, Sawyer L, Gatti L, Moore DT, Collichio F, Ollila DW, Sartor CI, Graham ML, Perou CM. The triple negative paradox: primary tumor chemosensitivity of breast cancer subtypes. Clinical Cancer Research: An Official Journal of the American Association for Cancer Research. 2007;13(8):2329–2334. doi:10.1158/1078-0432.CCR-06-1109.
  • Lehmann BD, Pietenpol JA. Identification and use of biomarkers in treatment strategies for triple-negative breast cancer subtypes. J Pathol. 2014;232(2):142–150. doi:10.1002/path.4280.
  • Penault-Llorca F, Viale G. Pathological and molecular diagnosis of triple-negative breast cancer: a clinical perspective. Annals of Oncology: Official Journal of the European Society for Medical Oncology/ESMO. 2012;23(Suppl 6):vi19–22. doi:10.1093/annonc/mds190.
  • Henderson IC, Patek AJ. The relationship between prognostic and predictive factors in the management of breast cancer. Breast Cancer Res Treat. 1998;52(1–3):261–288.
  • Haque R, Ahmed SA, Inzhakova G, Shi J, Avila C, Polikoff J, Bernstein L, Enger SM, Press MF. Impact of breast cancer subtypes and treatment on survival: an analysis spanning two decades. Cancer Epidemiology, Biomarkers & Prevention: A Publication of the American Association for Cancer Research, Cosponsored by the American Society of Preventive Oncology. 2012;21(10):1848–1855. doi:10.1158/1055-9965.EPI-12-0474.
  • Morris GJ, Naidu S, Topham AK, Guiles F, Xu Y, McCue P, Schwartz GF, Park PK, Rosenberg AL, Brill K, et al. Differences in breast carcinoma characteristics in newly diagnosed African-American and Caucasian patients: a single-institution compilation compared with the National Cancer Institute’s Surveillance, Epidemiology, and End Results database. Cancer: Interdisciplinary International Journal of the American Cancer Society. 2007;110(4):876–884. doi:10.1002/cncr.22836.
  • DeSantis C, Siegel R, Jemal A. Breast cancer facts & figures 2015-2016. Am Cancer Soc. 2015; 44.
  • Meng L, Maskarinec G, Wilkens L. Ethnic differences and factors related to breast cancer survival in Hawaii. Int J Epidemiol. 1997;26(6):1151–1158. doi:10.1093/ije/26.6.1151.
  • Meng L, Maskarinec G, Lee J. Ethnicity and conditional breast cancer survival in Hawaii. J Clin Epidemiol. 1997;50(11):1289–1296. doi:10.1016/S0895-4356(97)00183-2.
  • Loi S, Piccart M, Sotiriou C. The use of gene-expression profiling to better understand the clinical heterogeneity of estrogen receptor positive breast cancers and tamoxifen response. Crit Rev Oncol Hematol. 2007;61(3):187–194. doi:10.1016/j.critrevonc.2006.09.005.
  • Parada H, Sun X, Fleming JM, Williams-DeVane CR, Kirk EL, Olsson LT, Perou CM, Olshan AF, Troester MA. Race-associated biological differences among luminal A and basal-like breast cancers in the Carolina Breast Cancer Study. Breast Cancer Research. 2017;19(1). doi:10.1186/s13058-017-0914-6.
  • Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 2013;41(D1):D991–5. doi:10.1093/nar/gks1193.
  • Athar A, Füllgrabe A, George N, Iqbal H, Huerta L, Ali A, Snow C, Fonseca NA, Petryszak R, Papatheodorou I, et al. ArrayExpress update – from bulk to single-cell expression data. Nucleic Acids Res. 2019;47(D1):D711–D715. doi:10.1093/nar/gky964.
  • Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM. ONCOMINE: a cancer microarray database and integrated data-mining platform. Neoplasia. 2004;6(1):1–6. doi:10.1016/S1476-5586(04)80047-2.
  • Brueffer C, Vallon-Christersson J, Grabau D, Ehinger A, Häkkinen J, Hegardt C, Malina J, Chen Y, Bendahl P-O, Manjer J, et al. Clinical Value of RNA sequencing–based classifiers for prediction of the five conventional breast cancer biomarkers: a report from the population-based multicenter sweden cancerome analysis network—breast initiative. JCO Precision Oncology. 2018;(2):1–18. doi:10.1200/po.17.00135.
  • Burstein MD, Tsimelzon A, Poage GM, Covington KR, Contreras A, Fuqua SAW, Savage MI, Osborne CK, Hilsenbeck SG, Chang JC, et al. Comprehensive genomic analysis identifies novel subtypes and targets of triple-negative breast cancer. Clinical Cancer Research: An Official Journal of the American Association for Cancer Research. 2015;21(7):1688–1698. doi:10.1158/1078-0432.CCR-14-0432.
  • Den Hollander P, Rawls K, Tsimelzon A, Shepherd J, Mazumdar A, Hill J, Fuqua SAW, Chang JC, Osborne CK, Hilsenbeck SG, et al. Phosphatase PTP4A3 promotes triple-negative breast cancer growth and predicts poor patient survival. Cancer Res. 2016;76(7):1942–1953. doi:10.1158/0008-5472.CAN-14-0673.
  • Chang JC, Wooten EC, Tsimelzon A, Hilsenbeck SG, Gutierrez MC, Tham Y-L, Kalidas M, Elledge R, Mohsin S, Osborne CK, et al. Patterns of resistance and incomplete response to docetaxel by gene expression profiling in breast cancer patients. Journal of Clinical Oncology: Official Journal of the American Society of Clinical Oncology. 2005;23(6):1169–1177. doi:10.1200/JCO.2005.03.156.
  • Chin K, DeVries S, Fridlyand J, Spellman PT, Roydasgupta R, Kuo W-L, Lapuk A, Neve RM, Qian Z, Ryder T, et al. Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. Cancer Cell. 2006;10(6):529–541. doi:10.1016/j.ccr.2006.10.009.
  • Creighton CJ, Sada YH, Zhang Y, Tsimelzon A, Wong H, Dave B, Landis MD, Bear HD, Rodriguez A, Chang JC. A gene transcription signature of obesity in breast cancer. Breast Cancer Res Treat. 2012;132(3):993–1000. doi:10.1007/s10549-011-1595-y.
  • Curtis C, Shah SP, Chin S-F, Turashvili G, Rueda OM, Dunning MJ, Speed D, Lynch AG, Samarajiwa S, Yuan Y, et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature. 2012;486(7403):346–352. doi:10.1038/nature10983.
  • Huang C-C, Tu S-H, Lien H-H, Jeng J-Y, Huang C-S, Huang C-J, Lai L-C, Chuang EY. Concurrent gene signatures for Han Chinese breast cancers. PloS One. 2013;8(10):e76421. doi:10.1371/journal.pone.0076421.
  • Julka PK, Chacko RT, Nag S, Parshad R, Nair A, Oh DS, Hu Z, Koppiker CB, Nair S, Dawar R, et al. A phase II study of sequential neoadjuvant gemcitabine plus doxorubicin followed by gemcitabine plus cisplatin in patients with operable breast cancer: prediction of response using molecular profiling. Br J Cancer. 2008;98(8):1327–1335. doi:10.1038/sj.bjc.6604322.
  • Lin Y, Lin S, Watson M, Trinkaus KM, Kuo S, Naughton MJ, Weilbaecher K, Fleming TP, Aft RL. A gene expression signature that predicts the therapeutic response of the basal-like breast cancer to neoadjuvant chemotherapy. Breast Cancer Res Treat. 2010;123(3):691–699. doi:10.1007/s10549-009-0664-y.
  • Lindner R, Sullivan C, Offor O, Lezon-Geyda K, Halligan K, Fischbach N, Shah M, Bossuyt V, Schulz V, Tuck DP, et al. 2013. Molecular phenotypes in triple negative breast cancer from African American patients suggest targets for therapy. PloS One. 8(11):e71915. DOI:10.1371/journal.pone.0071915
  • Loi S, Sotiriou C, Haibe-Kains B, Lallemand F, Conus NM, Piccart MJ, Speed TP, McArthur GA. Gene expression profiling identifies activated growth factor signaling in poor prognosis (Luminal-B) estrogen receptor positive breast cancer. BMC Med Genomics. 2009;2(1):37. doi:10.1186/1755-8794-2-37.
  • Loi S, Haibe-Kains B, Desmedt C, Lallemand F, Tutt AM, Gillet C, Ellis P, Harris A, Bergh J, Foekens JA, et al. Definition of clinically distinct molecular subtypes in estrogen receptor–positive breast carcinomas through genomic grade. Journal of Clinical Oncology: Official Journal of the American Society of Clinical Oncology. 2007;25(10):1239–1246. doi:10.1200/JCO.2006.07.1522.
  • Loi S, Haibe-Kains B, Desmedt C, Wirapati P, Lallemand F, Tutt AM, Gillet C, Ellis P, Ryder K, Reid JF, et al. Predicting prognosis using molecular profiling in estrogen receptor-positive breast cancer treated with tamoxifen. BMC Genomics. 2008;9(1):239. doi:10.1186/1471-2164-9-239.
  • Loi S, Haibe-Kains B, Majjaj S, Lallemand F, Durbecq V, Larsimont D, Gonzalez-Angulo AM, Pusztai L, Fraser Symmans W, Bardelli A, et al. PIK3CA mutations associated with gene signature of low mTORC1 signaling and better outcomes in estrogen receptor-positive breast cancer. Proc Natl Acad Sci U S A. 2010;107(22):10208–10213. doi:10.1073/pnas.0907011107.
  • Lu X, Lu X, Wang ZC, Iglehart JD, Zhang X, Richardson AL. Predicting features of breast cancer with gene expression patterns. Breast Cancer Res Treat. 2008;108(2):191–201. doi:10.1007/s10549-007-9596-6.
  • Miyake T, Nakayama T, Naoi Y, Yamamoto N, Otani Y, Kim SJ, Shimazu K, Shimomura A, Maruyama N, Tamaki Y, et al. 2012. GSTP1 expression predicts poor pathological complete response to neoadjuvant chemotherapy in ER-negative breast cancer. Cancer Sci. 103(5):913–920. DOI:10.1111/j.1349-7006.2012.02231.x
  • Pawitan Y, Bjöhle J, Amler L, Borg A-L, Egyhazi S, Hall P, Han X, Holmberg L, Huang F, Klaar S, et al. Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Research: BCR. 2005;7(6):R953–64. doi:10.1186/bcr1325.
  • Hall P, Ploner A, Bjöhle J, Huang F, Lin C-Y, Liu ET, Miller LD, Nordgren H, Pawitan Y, Shaw P, et al. Hormone-replacement therapy influences gene expression profiles and is associated with breast-cancer prognosis: a cohort study. BMC Med. 2006;4(1):16. doi:10.1186/1741-7015-4-16.
  • Prat A, Bianchini G, Thomas M, Belousov A, Cheang MCU, Koehler A, Gomez P, Semiglazov V, Eiermann W, Tjulandin S, et al. 2014. Research-Based PAM50 subtype predictor identifies higher responses and improved survival outcomes in HER2-positive breast cancer in the NOAH study. Clinical Cancer Research. 20(2):511–521. DOI:10.1158/1078-0432.ccr-13-0239
  • Rahman M, Jackson LK, Johnson WE, Li DY, Bild AH, Piccolo SR. Alternative preprocessing of RNA-Sequencing data in the cancer genome atlas leads to improved analysis results. Bioinformatics. 2015;31(22):3666–3672. doi:10.1093/bioinformatics/btv377.
  • Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature. 2012;490(7418):61–70. doi:10.1038/nature11412.
  • Popovici V, Chen W, Gallas BG, Hatzis C, Shi W, Samuelson FW, Nikolsky Y, Tsyganova M, Ishkin A, Nikolskaya T, et al. Effect of training-sample size and classification difficulty on the accuracy of genomic predictors. Breast Cancer Research: BCR. 2010;12(1):R5. doi:10.1186/bcr2468.
  • Shi L, Campbell G, Jones WD, Campagne F, Wen Z, Walker SJ, Su Z, Chu T-M, Goodsaid FM, Pusztai L, et al. 2010. The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nat Biotechnol. 28(8):827–838.
  • Tabchy A, Valero V, Vidaurre T, Lluch A, Gomez H, Martin M, Qi Y, Barajas-Figueroa LJ, Souchon E, Coutant C, et al. 2010. Evaluation of a 30-Gene Paclitaxel, Fluorouracil, Doxorubicin, and Cyclophosphamide Chemotherapy response predictor in a multicenter randomized trial in breast cancer. Clinical Cancer Research. 16(21):5351–5361. DOI:10.1158/1078-0432.ccr-10-1265
  • Ulirsch J, Fan C, Knafl G, Wu MJ, Coleman B, Perou CM, Swift-Scanlan T. Vimentin DNA methylation predicts survival in breast cancer. Breast Cancer Res Treat. 2013;137(2):383–396. doi:10.1007/s10549-012-2353-5.
  • Wang J, Scholtens D, Holko M, Ivancic D, Lee O, Hu H, Chatterton RT Jr, Sullivan ME, Hansen N, Bethke K, et al. Lipid metabolism genes in contralateral unaffected breast and estrogen receptor status of breast cancer. Cancer Prevention Research. 2013;6(4):321–330. doi:10.1158/1940-6207.CAPR-12-0304.
  • Guarneri V, Broglio K, Kau S-W, Cristofanilli M, Buzdar AU, Valero V, Buchholz T, Meric F, Middleton L, Hortobagyi GN, et al. Prognostic value of pathologic complete response after primary chemotherapy in relation to hormone receptor status and other factors. Journal of Clinical Oncology: Official Journal of the American Society of Clinical Oncology. 2006;24(7):1037–1044. doi:10.1200/JCO.2005.02.6914.
  • Ali S, Mondal N, Choudhry H, Rasool M, Pushparaj PN, Khan MA, Mahfooz M, Sami GA, Jarullah J, Ali A, et al. Current management strategies in breast cancer by targeting key altered molecular players. Front Oncol. 2016;6:45. doi:10.3389/fonc.2016.00045.
  • Jacquet E, Lardy-Cléaud A, Pistilli B, Franck S, Cottu P, Delaloge S, Debled M, Vanlemmens L, Leheurteur M, Guizard AV, et al. Endocrine therapy or chemotherapy as first-line therapy in hormone receptor–positive HER2-negative metastatic breast cancer patients. Eur J Cancer. 2018;95:93–101. doi:10.1016/j.ejca.2018.03.013.
  • Prat A, Perou CM. Deconstructing the molecular portraits of breast cancer. Mol Oncol. 2011;5(1):5–23. doi:10.1016/j.molonc.2010.11.003.
  • Dunnwald LK, Rossing MA, Li CI. Hormone receptor status, tumor characteristics, and prognosis: a prospective cohort of breast cancer patients. Breast Cancer Research: BCR. 2007;9(1):R6. doi:10.1186/bcr1639.
  • Yersal O, Barutca S. Biological subtypes of breast cancer: prognostic and therapeutic implications. World J Clin Oncol. 2014;5(3):412–424. doi:10.5306/wjco.v5.i3.412.
  • Prat A, Bianchini G, Thomas M, Belousov A, Cheang MCU, Koehler A, Gómez P, Semiglazov V, Eiermann W, Tjulandin S, et al. Research-Based PAM50 subtype predictor identifies higher responses and improved survival outcomes in HER2-positive breast cancer in the NOAH Study. Clinical Cancer Research: An Official Journal of the American Association for Cancer Research. 2014;20(2):511–521. doi:10.1158/1078-0432.CCR-13-0239.
  • Yedjou CG, Sims JN, Miele L, Noubissi F, Lowe L, Fonseca DD, Alo RA, Payton M, Tchounwou PB. Health and racial disparity in breast cancer. Advances in Experimental Medicine and Biology. 2019;1152:31–49.
  • Foy KC, Fisher JL, Lustberg MB, Gray DM, DeGraffinreid CR, Paskett ED. Disparities in breast cancer tumor characteristics, treatment, time to treatment, and survival probability among African American and white women. NPJ Breast Cancer. 2018;4(1):7. doi:10.1038/s41523-018-0059-5.
  • Hill HE, Schiemann WP, Varadan V. Understanding breast cancer disparities—a multi-scale challenge. Annals of Translational Medicine. 2020;8(14):906. doi:10.21037/atm.2020.04.37.
  • DeSantis C, Ma J, Bryan L, Jemal A. 2014. Breast cancer statistics, 2013. CA Cancer J Clin. 64(1):52–62. 10.3322/caac.21203.
  • Power EJ, Chin ML, Haq MM. Breast cancer incidence and risk reduction in the hispanic population. Cureus. 2018;10(2):e2235.
  • Copson E, Maishman T, Gerty S, Eccles B, Stanton L, Cutress RI, Altman DG, Durcan L, Simmonds P, Jones L, et al. Ethnicity and outcome of young breast cancer patients in the United Kingdom: the POSH study. Br J Cancer. 2014;110(1):230–241. doi:10.1038/bjc.2013.650.
  • Howlader N, Noone AM, Krapcho M, Miller D, Bishop K, Kosary CL, Yu M, Ruhl J, Tatalovich Z, Mariotto A, et al. SEER cancer statistics review, 1975-2014. Bethesda (MD): National Cancer Institute. 2017. 2018.
  • Brinton LA, Figueroa JD, Awuah B, Yarney J, Wiafe S, Wood SN, Ansong D, Nyarko K, Wiafe-Addai B, Clegg-Lamptey JN. Breast cancer in Sub-Saharan Africa: opportunities for prevention. Breast Cancer Res Treat. 2014;144(3):467–478. doi:10.1007/s10549-014-2868-z.
  • Smith-Bindman R, Miglioretti DL, Lurie N, Abraham L, Barbash RB, Strzelczyk J, Dignan M, Barlow WE, Beasley CM, Kerlikowske K. Does utilization of screening mammography explain racial and ethnic differences in breast cancer? Ann Intern Med. 2006;144(8):541–553. doi:10.7326/0003-4819-144-8-200604180-00004.
  • Gomez SL, Clarke CA, Shema SJ, Chang ET, Keegan THM, Glaser SL. Disparities in breast cancer survival among Asian women by ethnicity and immigrant status: a population-based study. Am J Public Health. 2010;100(5):861–869. doi:10.2105/AJPH.2009.176651.
  • Dietze EC, Sistrunk C, Miranda-Carboni G, O’Regan R, Seewaldt VL. Triple-negative breast cancer in African-American women: disparities versus biology. Nature reviews. Cancer. 2015;15(4):248–254.
  • Rhodes A, Jasani B, Barnes DM, Bobrow LG, Miller KD. Reliability of immunohistochemical demonstration of oestrogen receptors in routine practice: interlaboratory variance in the sensitivity of detection and evaluation of scoring systems. J Clin Pathol. 2000;53(2):125–130. doi:10.1136/jcp.53.2.125.
  • Chebil G, Bendahl P-O, Ferno M. Estrogen and progesterone receptor assay in paraffin-embedded breast cancer. Acta oncologica. 2003;42(1):43–47. doi:10.1080/02841860300672.
  • Sturtz LA, Melley J, Mamula K, Shriver CD, Ellsworth RE. Outcome disparities in African American women with triple negative breast cancer: a comparison of epidemiological and molecular factors between African American and Caucasian women with triple negative breast cancer. BMC Cancer. 2014;14(1). doi:10.1186/1471-2407-14-62.
  • Huo D, Hu H, Rhie SK, Gamazon ER, Cherniack AD, Liu J, Yoshimatsu TF, Pitt JJ, Hoadley KA, Troester M, et al. 2017. Comparison of breast cancer molecular features and survival by African and European Ancestry in the cancer genome atlas. JAMA Oncology. 3(12):1654. DOI:10.1001/jamaoncol.2017.0595
  • Chavez-MacGregor M, Liu S, De Melo-Gagliato D, Chen H, Do KA, Pusztai L, Symmans WF, Nair L, Hortobagyi GN, Mills GB, et al. Differences in gene and protein expression and the effects of race/ethnicity on breast cancer subtypes. Cancer Epidemiology, Biomarkers & Prevention: A Publication of the American Association for Cancer Research, Cosponsored by the American Society of Preventive Oncology. 2014;23(2):316–323. doi:10.1158/1055-9965.EPI-13-0929.
  • Shou J, Massarweh S, Osborne CK, Wakeling AE, Ali S, Weiss H, Schiff R. Mechanisms of tamoxifen resistance: increased estrogen receptor-HER2/neu cross-talk in ER/HER2-positive breast cancer. JNCI Journal of the National Cancer Institute. 2004;96(12):926–935. doi:10.1093/jnci/djh166.
  • Ching T, Huang S, Garmire LX. Power analysis and sample size estimation for RNA-Seq differential expression. RNA. 2014;20(11):1684–1696. doi:10.1261/rna.046011.114.
  • Toro-Domínguez D, Villatoro-García JA, Martorell-Marugán J, Román-Montoya Y, Alarcón-Riquelme ME, Carmona-Sáez P. A survey of gene expression meta-analysis: methods and applications. Brief Bioinform. 2020 Feb 25. 10.1093/bib/bbaa019.
  • Dowsett M, Smith IE, Ebbs SR, Dixon JM, Skene A, A’Hern R, Salter J, Detre S, Hills M, Walsh G, et al. Prognostic value of Ki67 expression after short-term presurgical endocrine therapy for primary breast cancer. JNCI: Journal of the National Cancer Institute. 2007;99(2):167–170. doi:10.1093/jnci/djk020.
  • Goldhirsch A, Winer EP, Coates AS, Gelber RD, Piccart-Gebhart M, Thürlimann B, Senn H-J, Albain KS, André F, Bergh J. Panel members. Personalizing the treatment of women with early breast cancer: highlights of the St Gallen International expert consensus on the primary therapy of early breast cancer 2013. Annals of Oncology: Official Journal of the European Society for Medical Oncology/ESMO. 2013;24(9):2206–2223. doi:10.1093/annonc/mdt303.
  • Chang JC, Wooten EC, Tsimelzon A, Hilsenbeck SG, Gutierrez MC, Elledge R, Mohsin S, Osborne CK, Chamness GC, Allred DC, et al. Gene expression profiling for the prediction of therapeutic response to docetaxel in patients with breast cancer. The Lancet. 2003;362(9381):362–369. doi:10.1016/S0140-6736(03)14023-8.
  • Ayers M, Symmans WF, Stec J, Damokosh AI, Clark E, Hess K, Lecocke M, Metivier J, Booser D, Ibrahim N, et al. Gene expression profiles predict complete pathologic response to neoadjuvant paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide chemotherapy in breast cancer. Journal of Clinical Oncology. 2004;22(12):2284–2293. doi:10.1200/jco.2004.05.166.
  • Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. doi:10.1186/s13059-014-0550-8.
  • Liao Y, Smyth GK, Shi W. The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Res. 2019;47(8):e47. doi:10.1093/nar/gkz114.
  • Wesolowski R, Ramaswamy B. Gene expression profiling: changing face of breast cancer classification and management. Gene Expr. 2011;15(3):105–115. doi:10.3727/105221611X13176664479241.
  • Yang YH, Xiao Y, Segal MR. Identifying differentially expressed genes from microarray experiments via statistic synthesis. Bioinformatics. 2005;21(7):1084–1093. doi:10.1093/bioinformatics/bti108.
  • Gaujoux R, Seoighe C. A flexible R package for nonnegative matrix factorization. BMC Bioinform. 2010;11(1):367. doi:10.1186/1471-2105-11-367.
  • Sorlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, van de Rijn M, Jeffrey SS, et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences. 2001;98(19):10869–10874. doi:10.1073/pnas.191367098.
  • Sotiriou C, Neo S-Y, McShane LM, Korn EL, Long PM, Jazaeri A, Martiat P, Fox SB, Harris AL, Liu ET. Breast cancer classification and prognosis based on gene expression profiles from a population-based study. Proc Natl Acad Sci U S A. 2003;100(18):10393–10398. doi:10.1073/pnas.1732912100.
  • Perou CM, Sørlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, et al. Molecular portraits of human breast tumours. Nature. 2000;406(6797):747–752. doi:10.1038/35021093.
  • Prat A, Pineda E, Adamo B, Galván P, Fernández A, Gaba L, Díez M, Viladot M, Arance A, Muñoz M. Clinical implications of the intrinsic molecular subtypes of breast cancer. Breast. 2015;24(Suppl 2):S26–35. doi:10.1016/j.breast.2015.07.008.
  • Bertucci F, Finetti P, Cervera N, Esterni B. How basal are triple negative breast cancers? Int J Cancer. 2008. doi:10.1002/ijc.23518.
  • Sørlie T, Borgan E, Myhre S, Vollan HK, Russnes H, Zhao X, Nilsen G, Lingjaerde OC, Børresen-Dale A-L, Rødland E. The importance of gene-centring microarray data. Lancet Oncol. 2010;11(8):719–720; author reply 720–721. doi:10.1016/S1470-2045(10)70174-1.
  • Raj-Kumar P-K, Liu J, Hooke JA, Kovatich AJ, Kvecher L, Shriver CD, Hu H. PCA-PAM50 improves consistency between breast cancer intrinsic and clinical subtyping reclassifying a subset of luminal A tumors as luminal B. Sci Rep. 2019;9(1):7956. doi:10.1038/s41598-019-44339-4.
  • Roelands J, Decock J, Boughorbel S, Rinchai D, Maccalli C, Ceccarelli M, Black M, Print C, Chou J, Presnell S, et al. A collection of annotated and harmonized human breast cancer transcriptome datasets, including immunologic classification. F1000Research. 2017;6:296. doi:10.12688/f1000research.10960.1.
  • Karn T, Pusztai L, Holtrich U, Iwamoto T, Shiang CY, Schmidt M, Müller V, Solbach C, Gaetje R, Hanker L, et al. Homogeneous datasets of triple negative breast cancers enable the identification of novel prognostic and predictive signatures. PloS One. 2011;6(12):e28403. doi:10.1371/journal.pone.0028403.
  • Gong M, Ye S, Lv W, He K, Li W. Comprehensive integrated analysis of gene expression datasets identifies key anti cancer targets in different stages of breast cancer. Exp Ther Med. 2018;16(2):802–810.
  • Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, Geman D, Baggerly K, Irizarry RA. Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics. 2010;11(10):733–739. doi:10.1038/nrg2825.
  • Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics. 2012;28(6):882–883. doi:10.1093/bioinformatics/bts034.
  • Waldron L, Riester M. Meta-analysis in gene expression studies. Methods in Molecular Biology. 2016;161–176. doi:10.1007/978-1-4939-3578-9_8.
  • Patil P, Parmigiani G. Training replicable predictors in multiple studies. Proc Natl Acad Sci U S A. 2018;115(11):2578–2583. doi:10.1073/pnas.1708283115.
  • Gonçalves RS, Musen MA. The variable quality of metadata about biological samples used in biomedical experiments. Scientific Data. 2019;6(1):190021. doi:10.1038/sdata.2019.21.
  • Gendoo DMA, Zon M, Sandhu V, Manem VSK, Ratanasirigulchai N, Chen GM, Waldron L, Haibe-Kains B. MetaGxData: clinically annotated breast, ovarian and pancreatic cancer datasets and their use in generating a multi-cancer gene signature. Sci Rep. 2019;9(1):8770. doi:10.1038/s41598-019-45165-4.
  • Gibson G. The environmental contribution to gene expression profiles. Nature Reviews Genetics. 2008;9(8):575–581. doi:10.1038/nrg2383.
  • Wilkinson MD, Dumontier M, Aalbersberg IJJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten J-W, da Silva Santos LB, Bourne PE, et al. The FAIR guiding principles for scientific data management and stewardship. Scientific Data. 2016;3(1):160018. doi:10.1038/sdata.2016.18.
  • Hippen AA, Greene CS. Expanding and remixing the metadata landscape. Trends Cancer Res. 2021;7(4):276–278. doi:10.1016/j.trecan.2020.10.011.
