1,216
Views
0
CrossRef citations to date
0
Altmetric
Research Articles

Genetic differentiation between and within Northern Native American language groups: an argument for the expansion of the Native American CODIS database

, , , , , & ORCID Icon show all
Pages 662-672 | Received 12 Dec 2020, Accepted 11 Jun 2021, Published online: 20 Sep 2021

Abstract

The National Research Council recommends that genetic differentiation among subgroups of ethnic samples be lower than 3% of the total genetic differentiation within the ethnic sample to be used for estimating reliable random match probabilities for forensic use. Native American samples in the United States’ Combined DNA Index System (CODIS) database represent four language families: Algonquian, Na-Dene, Eskimo-Aleut, and Salishan. However, a minimum of 27 Native American language families exists in the US, not including language isolates. Our goal was to ascertain whether genetic differences are correlated with language groupings and, if so, whether additional language families would provide a more accurate representation of current genetic diversity among tribal populations. The 21 short tandem repeat (STR) loci included in the Globalfiler® PCR Amplification Kit were used to characterize six indigenous language families, including three of the four represented in the CODIS database (i.e. Algonquian, Na-Dene, and Eskimo-Aleut), and two language isolates (Miwok and Seri) using major population genetic diversity metrics such as F statistics and Bayesian clustering analysis of genotype frequencies. Most of the genetic variation (97%) was found to be within language families instead of among them (3%). In contrast, when only the three of the four language families represented in both the CODIS database and the present study were considered, 4% of the genetic variation occurred among the language groups. Bayesian clustering resulted in a maximum posterior probability indicating three genetically distinct groups among the eight language families and isolates: (1) Eskimo, (2) Seri, and (3) all other language groups and isolates, thus confirming genetic subdivision among subgroups of the CODIS Native American database. This genetic structure indicates the need for an increased number of Native American populations based on language affiliation in the CODIS database as well as more robust sample sets for those language families.

Supplemental data for this article is available online at https://doi.org/10.1080/20961790.2021.1963088 .

Introduction

The Combined DNA Index System (CODIS) refers to the Federal Bureau of Investigation’s (FBI) software associated with forensic DNA databases maintained at the local, state, and federal levels. CODIS has three primary databases: the offender/arrestee, missing persons, and forensic casework databases. The offender reference database consists of profiles belonging to convicted individuals and arrestees depending on local and federal laws. When dealing with an offender profile, there are two primary ways in which identification is made: (1) the profile of an unknown offender is searched against the CODIS offender/arrestee database to either identify a suspect from known offenders or make connections between crimes and (2) generating a random match probability, the probability that a random individual from a given population exhibits the particular combination of alleles exhibited by a known, suspected individual. Because some alleles and allele combinations have a higher frequency in some populations than in others, separate databases for the different ethnic populations found in the US are required to account for population-specific allele frequencies when estimating random match probabilities [Citation1]. Obtaining the random match probability for a suspected individual profile against each of the different ethnic databases provides a more complete and accurate result than that for the combined database. The National Research Council recommends that genetic differentiation among subgroups (i.e. genetic subdivision) of such databases be lower than 3% of the total genetic variation within that database to provide reliable random match probabilities.

Language affiliation has often been used to define membership in genetically distinct populations. While as many as 7 097 distinct languages are currently spoken worldwide, only 23 are spoken by more than half the world’s populations (www.ethnologue.com, accessed 13 May 2018). As a language changes over time, its divergence can be characteri­zed much the same way as gene divergence is illustrated using a phylogenetic tree [Citation2–4]. While genes and languages may appear to exhibit parallel histories if they have dispersed together as populations expressing them diversify, their histories may lack congruence due to differences in genetic and linguistic histories or uncertainties in the language classification [Citation5]. Geographic isolation and drift could drive local genetic and language differentiation; however, while geographic isolation of a popu­lation typically decreases its genetic diversity, evidence-based phonemic data suggest that such isolation may not necessarily lead to the erosion of language diversity [Citation6]. Parental or ancestral languages produce daughter languages over time through cultural and language evolution. Language evolution can occur in several different ways, most notably through geographic separation and isolation, population growth and expansion, as well as language shift or the usurpation of one language by another more prestigious one as different populations interact [Citation4]. Language families constitute languages that originate from the same parental or ancestral language [Citation7]. For this study, a language family refers to the parental language from which several daughter languages and/or dialects derive. Additionally, a language isolate has two accepted definitions: (1) a language with no identifiable relationship with any other known language, appearing as an anomalous outlier, and (2) a language belonging to a family of which no other member languages survive [Citation8].

While a minimum of 56 language families and isolates are represented among Native Americans in the continental US, only four are currently represented in the Native American CODIS database, i.e. the Algonquian, Eskimo-Aleut, Na-Dene, and Salishan language families. In the present study, a minimum of 25 samples representing each of six language families (including three of the four, albeit not necessarily the same member languages, represented in the CODIS database) and two language isolates were analyzed using the 21 autosomal loci included in the Globalfiler• PCR Amplification kit to characterize their genetic structure, composition and differentiation. That level of differentiation, represented by FST, was compared with values of 0.03 that are recommended by the National Research Council for generating reliable random match probabilities for Native American tribal samples [Citation1].

Mitochondrial DNA (mtDNA) and Y chromosome-linked short tandem repeat (Y-STR) evidence reveals a greater genetic diversification among (albeit less diversity within) Native American populations speaking different languages than among populations speaking different languages in other parts of the world [Citation9–12]. There are currently 29 language families and 27 language isolates widely recognized in North America, of which six language families (Algonquian, Na-Dene, Eskimo-Aleut, Iroquoian, Uto-Aztecan, and Yuman) and two language isolates (Miwok and Seri) were included in this study. The “farming language dispersal hypo­thesis” postulates that languages of the Uto-Aztecan language family differentiated via the spread of farming practices, but not necessarily genes, from Mexico into the American Southwest [Citation13]. This is supported by a significant correlation between language families and genetic dispersal in which 90.7% of mtDNA haplotypes within a language family are population-specific while fewer than 10% are shared between or among populations [Citation14]. However, notable mtDNA differences among some geographically separated daughter languages of the Uto-Aztecan language family suggest the possibility of significant levels of genetic variation within language families possibly due to adoption of the more prestigious Uto-Aztecan language by genetically different popu­lations that also adopted the new farming practices [Citation15].

Native American populations have become further subdivided linguistically due to physical isolation [Citation5] as they dispersed into the American continent at different times via the Beringian Strait beginning approximately 20–25 kya [Citation16]. The isolation from Asia created a severe genetic bottleneck exacerbated by the Beringian Standstill (also known as the Beringian Incubation), lasting several thousands of years, and subsequent dispersals throughout the American continent [Citation16] resulted in only five extant mtDNA haplogroups [Citation12] (). Language and other cultural boundaries have created highly effective barriers to gene flow, even among populations sharing relatively small geographic areas of residence [Citation17]. This isolation is reflected by geographically restricted alleles such as the albumin variants Naskapi and Mexico and the B2a sub-haplogroup defined by the mtDNA mutation 16483A [Citation18].

Table 1. Tribal and sample distributions among the eight language families/isolates and the primary mtDNA haplogroups associated with each tribe represented in the sample set (N=502).

While numerous studies have been conducted on the correlation between language and mitochondrial genetic differentiation among Native American popu­lations [Citation3, Citation16, Citation19–21], mtDNA, like Y-STRs, provides a limited perspective on genetic differentiation within and among populations due to its exclusively uni-parental (i.e. maternal) inheritance and the consequent absence of genetic recombination [Citation12, Citation15, Citation22]. Conversely, nuclear genetic markers are inherited from both parents and undergo recombination producing a higher effective population size than mtDNA, thereby reducing the effects of genetic drift [Citation23]. Furthermore, the use of several different nuclear loci allows the comparison of gene flow estimates that are not confounded by the inherent differences between maternally and biparentally inherited genes. As such, nuclear markers provide a more comprehensive understanding of genetic structure and composition within and among Native American language families and isolates.

Several studies focused on nuclear genetic differentiation among Native American populations have reported a nuclear genetic bottleneck congruent with mitochondrial evidence [Citation5, Citation24–26]. Additionally, nuclear data have detected a greater inter-population genetic differentiation among Native American populations than that observed among other North American populations, including Caucasians, African-Americans, Asian-Americans, and Hispanics [Citation1]. More recent studies reflect the presence of even greater levels of genetic variability among tribal populations than those based on CODIS-STRs [Citation1], likely due to greater tribal representation and a more comprehensive geographic sampling strategy than in the former studies [Citation27, Citation28].

The present study aims to determine the extent to which language differentiation among six Native American language families and two language isolates in North America is correlated with their nuclear genetic differentiation and whether or not the genetic composition of the current Native American CODIS database is truly representative of that of Native Americans in the US. This study used 21 autosomal STR loci in the Globalfiler• PCR Amplification Kit (Thermo Fisher Scientific, Waltham, MA, USA) to evaluate nuclear genetic diversity, differentiation, and structure within and among samples of six Native American language families and two language isolates that represent a wide geographic range covering the Arctic, the Midwestern, Southwestern, and Southeastern US, California/Great Basin and Mexico, including Baja California (), as previously done [Citation27, Citation28]. While other studies have used many more STR markers [Citation5, Citation29], the Globalfiler• panel consists of the most utilized human STR markers in the world for individual differentiation and identification; therefore, results based on these markers not only have significant value in forensics and human gene­tics research but also provide a better comparative basis for evaluating the genetic structure and composition of human populations around the world.

Materials and methods

Five hundred and two samples representing members of six language families and two language isolates (n ≥ 25 samples per language family or isolate; ; ) met the minimum optimal concentration for amplification (0.1 ng/µL) and were subsequently genotyped for this study. These samples were selected post-quantification from approximately 1 935 DNA samples that were collected and extracted from whole blood using conventional techniques for various genetic studies [Citation12, Citation14, Citation15, Citation18, Citation21, Citation24, Citation30, Citation31] and an additional 152 samples extracted from blood, buffy coat, serum, and plasma using the QIAamp DNA Blood Mini Kit (QIAGEN, Redwood City, CA, USA) following manufacturer protocols as described in previous studies [Citation27, Citation28]. DNA quantities from the combined 2 087 samples were determined in duplicate on an Applied Biosystems 7500 Fast Real-Time PCR System (Applied Biosystems, Foster City, CA, USA) using the Quantifiler• Duo DNA Quantification Kit (Thermo Fisher Scientific) following manufacturer protocols. The low proportion of samples quantified at above 0.1 ng/µL DNA is likely due to their age and variable storage conditions.

Figure 1. Map of tribes represented in Combined DNA Index System (CODIS) (solid blue squares) and the present study (solid circles). Note that all language groups included in the CODIS database, with the single exception of Salishan, are also represented in the present study. The legend identifies the language families of tribes included in this study.

Figure 1. Map of tribes represented in Combined DNA Index System (CODIS) (solid blue squares) and the present study (solid circles). Note that all language groups included in the CODIS database, with the single exception of Salishan, are also represented in the present study. The legend identifies the language families of tribes included in this study.

The concentrations of the 502 extracted DNA samples, as well as the Standard Reference Material (SRM) 2391c reference DNA (National Institute of Standards and Technology, Gaithersburg, MD, USA), were normalized to approximately 1.0 ng/µL and amplified using the Globalfiler• PCR Amplification Kit following manufacturer protocols. The Globalfiler• Kit consists of the following autosomal STR loci that were selected to maximize discrimination potential for human identification and kinship analyses [Citation32]: CSF1PO, FGA, TH01, TPOX, vWA, D3S1358, D5S818, D7S820, D8S1179, D13S317, D16S539, D18S51, D21S11 D1S1656, D2S441, D2S1338, D10S1248, D12S391, D19S433, D22S1045, and SE33. The amplified samples were diluted in Hi-Di Formamide (Applied Biosystems) and run on a 3130xl Genetic Analyzer (Applied Biosystems) with POP-4 polymer (Applied Biosystems) according to run conditions and protocols recommended by the manufacturer. Profiles were analyzed using GeneMapper• ID-X v 1.4 (Applied Biosystems) and the Local Southern sizing method. Samples with saturated off-scale alleles were diluted based on the severity of the pull-up and re-injected.

The allele frequency distribution, average number of alleles (NA), observed heterozygosity (HO), and expected heterozygosity (HE) were estimated using ARLEQUIN v3.5.1.2 [Citation33] to characterize language-specific allele distribution patterns and the extent of genetic diversification within and among the language families and isolates. To ensure that differing sample sizes did not bias allelic diversity, estimates of average allelic richness (NR) were calculated across all 21 loci for each language group and isolate after their sample sizes were normalized to the smallest population size using the Allelic Diversity Analyzer (ADZE) software programme [Citation34]. ARLEQUIN was also used to calculate the F-statistics FIS (local) and FIT (total) inbreeding coefficients, and FST, the proportion of genetic variance in a population that is due to differences among subdivisions within that language sample; pairwise FST, the degree of differentiation between pairs of language samples, was also calculated to provide insight into the historical connections among these languages [Citation35]. The statistical significance of the pairwise FST computations was determined with a probability distribution constructed from permutation tests (N = 1 000) with Bonferroni corrections for multiple comparisons. The P-values for deviations from Hardy-Weinberg Equilibrium (HWE) for each locus in each language group were estimated with GENEPOP 4.2 [Citation36] and ARLEQUIN, respectively.

An analysis of molecular variance, AMOVA [Citation33], was performed using ARLEQUIN based on all 21 STR loci to ascertain if the language affiliation of Native American populations contributed to the extant genetic structure. Statistical significance of the AMOVA values was estimated by a permutation test (10 000 permutations) at the 0.05 level of proba­bility. CONVERT v1.31 [37] was used to determine the number and frequency of private alleles present in each population sample. A model-based clustering method featured in the STRUCTURE v2.3.4 software programme [Citation38] was used to infer genetic groups using the 21 STR loci genotype frequencies. All runs were based on the admixture model that assumes each person has ancestry in multiple genetic clusters. The model assumes there are K genetically distinct clusters (where the true value of K is typically unknown), each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample were assigned probabilistically to a specific genetic group or jointly to two or more genetic groups if their genotypes indicate they are admixed. Each run used 500 000 estimation iterations for K = 1 to 8 after a 100 000 burn-in length. Each run was carried out five times for each value of K. The log probability of the data (Ln P(D)) and delta K (i.e. the K value with the highest Ln P(D) and lowest standard deviation) were used to determine the true number of genetically distinct populations [Citation39] represented by the data.

Results

NA estimates for the eight language families or isolates ranged from 5.52 (Eskimo) to 9.43 (Uto-Aztecan). No deviation in NA estimates attribu­table to variation in sample sizes of language groups was detected (). For instance, the Yuman (n = 130) and Algonquian (n = 25) language families had the largest and smallest sample sizes, respectively, but the NA estimates from both these groups were comparable, i.e. 7.62 and 7.95, respectively. Moreover, the highest NA estimate represented the Uto-Aztecan language family (9.43), only the second largest language sample in this study (n = 112), and the smallest NA estimate represented the Eskimo-Aleut language family (5.52), whose sample size (n = 44) was almost twice that for the Algonquian language family (n = 25) which represented the smallest sample size. All language groups and isolates included in this study contained NR values between 4.68 (Na-Dene) and 6.61 (Algonquian), reflecting a minimal impact of sample size (). The values of HO and HE in also appeared to be uninfluenced by sample size with HO values ranging from 0.66 (Seri n = 29) to 0.78 (Algonquian n = 25 and Miwok n = 33) and EH values ranging from 0.64 (Seri n = 29) to 0.77 (Algonquian n = 25 and Iroquoian n = 33). All language groups exhibited either no marker or only a single marker violating HWE at the 0.01 level of probability with the exception of the Yuman, which had four loci not in HWE (). None of the language groups were in HWE at the P = 0.01 level of probability (), and when all groups were combined, seven loci violated HWE (Supplementary Table S9).

Table 2. Allele number (NA) and observed (HO) and expected (HE) heterozygosity for each language.

Pairwise and population-specific FST estimates and average FIS are presented in . The pairwise FST values ranged from 0.011 (between the Na-Dene and Yuman speakers) to 0.113 (between the Eskimo-Aleut and Seri speakers), and all were statistically significant at the 0.05 level of probability. Both the Eskimo-Aleut language family and the Seri language isolate exhibited the greatest average genetic divergence from the other populations with mean pairwise FST values of 0.069 and 0.072, respectively, while the Yuman language family had the lowest population-specific FST (0.028). FIS values () were highest for the Uto-Aztecan (0.039), followed by the Iroquoian language family (0.035). Miwok, Seri, and Algonquian language speakers exhibited a lack of inbreeding reflected by their negative FIS values. Overall, FIS, FST, and FIT estimates across all language samples were 0.019, 0.032, and 0.051, respectively.

Table 3. Pairwise (below diagonal) and population-specific FST, and FIS estimates based on the 21 autosomal STR loci analyzed for the eight language families/isolates.

The AMOVA showed that genetic differences among individuals speaking different languages represent 4%–5% of the total genetic variation observed in this study, and these differences were statistically significant at the 0.05 level of probability. Based on the Ln P(D) and delta K results, there are three distinct genetic clusters (K = 3) indicating genetic subdivision among the eight language families or isolates (). Assuming the most probabilistic model of three “true genetic groups” (K = 3), the STRUCTURE analysis reflected a strong correlation between genotypic distributions and membership in each of the three groups (). While many of the individuals display partial membership in more than a single one of the three “true genetic groups”, suggesting high levels of genetic admixture, Eskimo-Aleut and Seri speakers were predominantly assigned to two different of the three clusters, represented by red and blue, respectively, in , with members of all remaining six language families or isolates () being predominantly assigned to the third “true genetic cluster”, represented by green in . provides the average and ranges of assignment probabilities of all individuals included in this study to the predominant “true genetic group” representing their respective language families/isolates based on the STRUCTURE analysis (). Only five Eskimo-Aleut and four Seri individuals exhibited average assignment probabilities lower than 50% (and one individual of each language sample exhibited a probability lower than 10%) to their respective “true genetic group” (). Three additional Eskimo-Aleut samples and 13 Uto-Aztecan samples exhibited assignment probabilities between 60% and 70% to the predominant “true genetic group” representing their respective language groups (). While the samples exhibiting assignment probabilities lower than 10% might represent misassigned ancestry, they were not removed from our analyses because it cannot be known for sure that this is the case. An alternative explanation for their lower average probabilities of assignment to their language group’s predominant “true genetic group” is unreported admixture with individuals belonging to the other unrelated language groups.

Figure 2. Delta K plot illustrating likelihood of all K values ranging from one to eight true genetic groups required to incorporate all language families and isolates in the dataset accurately; K = 3 was found to be the most probabilistic scenario.

Figure 2. Delta K plot illustrating likelihood of all K values ranging from one to eight true genetic groups required to incorporate all language families and isolates in the dataset accurately; K = 3 was found to be the most probabilistic scenario.

Figure 3. STRUCTURE plot comprising individuals from the six language groups and two language isolates, assuming the most probabilistic model of three true genetic groups (K = 3). The scale at the left margin reflects assignment probabilities to each of the three groups. The x-axis represents individuals from populations sorted according to their language groups/isolates. Each individual is represented by a vertical stacked column of colour-coded admixture proportions that reflect genetic contributions from each of the three true genetic groups.

Figure 3. STRUCTURE plot comprising individuals from the six language groups and two language isolates, assuming the most probabilistic model of three true genetic groups (K = 3). The scale at the left margin reflects assignment probabilities to each of the three groups. The x-axis represents individuals from populations sorted according to their language groups/isolates. Each individual is represented by a vertical stacked column of colour-coded admixture proportions that reflect genetic contributions from each of the three true genetic groups.

Figure 4. Language and genetic correlation among study individuals reveals three separate clusters: Clusters 2 and 3 consist exclusively of speakers of Seri and Eskimo-Aleut, respectively, while Cluster 1 comprises speakers of all other languages included in this study.

Figure 4. Language and genetic correlation among study individuals reveals three separate clusters: Clusters 2 and 3 consist exclusively of speakers of Seri and Eskimo-Aleut, respectively, while Cluster 1 comprises speakers of all other languages included in this study.

Table 4. The mean, range, and standard deviation (SD) of average assignment probabilities (averaged over five runs for each language family/isolate) of individuals to the predominant “true genetic group” of their language family or language isolate based on the STRUCTURE analysis illustrated in .

Discussion

This study aimed to ascertain if genetic diversity correlates with language group differentiation in Native American populations genotyped at the CODIS loci using the Globalfiler• PCR kit and, if so, whether samples representing Native Americans in the CODIS database are truly representative of Native Americans in the US. While classifications of Native American languages into language families by comparative linguists often differ, the composition and representativeness of language families employed in this study is recognized by more conservative comparative linguists [Citation40]. The Algonquian, Na-Dene, Eskimo-Aleut and Salishan language families, which alone represent Native Americans in the CODIS database, display several unique alleles not found in most other North American groups, including the mtDNA 16192 variant [Citation31] and the Albumin Naskapi allele AL*Naskapi [Citation18], possibly as the result of admixture among the ancestors of these three northern language families whose dispersal patterns overlapped [Citation41]. Sampling biases resulting from combining individuals from different distinct subpopulations into larger language samples for this study likely resulted in all the language groups not conforming to HWE conditions. FST and FIS values () also point to varying levels of genetic drift and subdivision across language groups. The average of approximately eight alleles per locus observed in this study is consistent with previous descriptions of other population samples [Citation1, Citation42]. Comparable estimates of HO and HE in the Alaskan tribes have been reported [Citation1] with an average value of 0.70, while the Eskimo-Aleut samples in the present study exhibited mean values of 0.68 and 0.69, respectively. Subdividing samples by geography, without regard to language affiliation, has previously produced slightly higher estimates of HO and HE (0.72 and 0.73, respectively) [Citation27].

The AMOVA shows that approximately 95% of genetic variation within the language groups is broadly concordant with other reports of worldwide interpopulation variation [Citation43]. The lack of agreement in HO and HE estimates between language and geographic samples is consistent with the AMOVA results in this study, which failed to support the strict linguistic grouping. This may result from geographic contiguity among most of these groups, possibly favouring gene flow and genetic admixture among their speakers. Conversely, the STRUCTURE analyses revealed a lack of evidence of general clustering among six of the eight language affiliations, broadly attesting to the migration and subsequent assimilation of speakers of North American indigenous languages into earlier arriving immigrants. The existence of stratification based on STRUCTURE is due to the separation of most Eskimo-Aleut and Seri speakers from the other language samples, pointing to the pronounced influence of geographic distance or historical isolation on genetic and language differentiation as reported previously in the Seri [Citation44]. This is in stark contrast to previous assessments [Citation5] that suggested that the distribution of genetic variation among Native Americans reveals closer parallels between genetic and language patterns of differentiation; populations sampled from disparate and unevenly distributed geographic regions that represented more divergent major language reservoirs throughout North America (3 localities), Central America (8 localities), and South America (18 localities) would likely reinforce stronger correlations between language and genetic diversification. Integrating up to 678 STR loci may have provided a finer resolution of clinal variation among language samples, resulting from geographic isolation by distance [Citation5].

The F-statistics obtained in this study across all individuals (FIS=0.019, FST=0.032, FIT=0.051) provide evidence to support that the loss of heterozygosity levels, which contributed to each group’s violation of HWE is caused by non-random mating, genetic subdivision, and genetic drift among the language samples. The use of these estimates to ascertain genetic structure and gene flow overcomes sampling issues because it considers resampling over replicate populations [Citation35, Citation36]. This method of estimating FST performs well in cases with moderate amounts of gene flow and large population sizes as it does not tend to overestimate genetic subdivision [Citation35, Citation45].

Language group-specific FST values indicate varying allele frequencies within language families/isolates and average allele frequencies across all language samples included in this study. The low-to-moderate pairwise FST values suggest that not all language families and isolates examined are differentiated, likely resulting from increased gene flow or recent common ancestry. Pairs of language groups that are geographic neighbors tend to be more genetically similar than groups that are more geographically distant, regardless of whether the languages are closely related [Citation6]. In conjunction with the STRUCTURE analysis, the pairwise FST comparisons show that the Eskimo-Aleut and Seri language samples, which, incidentally, are fixed for mtDNA haplogroups A and C, respectively – comprise the most genetically distinct individuals in the study, likely due to their extant geographic isolation from the other samples. It has previously been shown that the Seri population experienced a genetic bottleneck due to reduced gene flow via historical isolation and is distinct from populations in the Uto-Aztecan language group [Citation44] that live in close geographic proximity.

Ignoring group structure, an overall FIT of 0.051 points to a general deviation from HWE frequencies due to an excess of homozygotes even when all samples are combined. As FIS estimates indicate any deviation of the average allele frequency from HWE is probably due to consanguineous mating, the positive FIS values for Na-Dene, Eskimo-Aleut, Iroquoian, Uto-Aztecan, and Yuman speaking groups may reflect an increased relatedness among individuals who speak those languages rather than geographic isolation. In contrast, while inbreeding and geographic isolation are related phenomena, the Miwok and Seri language samples exhibit negative FIS values due to the absence of significant inbreeding despite being isolated linguistically or geographically. This finding suggests that these isolated language groups are more susceptible to genetic distinction due to loss of genetic variation from isolation and drift and not due to consanguineous mating.

Interestingly, while the overall level of genetic diversification among the language samples in this study revealed by the FST analyses (0.032) was very similar to earlier values reported (FST=0.03) [Citation1], based on a small collection of tribal samples, it was somewhat lower than the 0.04 value previously reported [Citation27] based on geographical samples and the same value for only those same three language groups (Algonquian, Na-Dene, and Eskimo-Aleut) representing Native Americans in both the CODIS database and our own study. The slightly lower degree (∼1%) of genetic differentiation in our study supports the notion that language groups are less differentiated than geographic groups because indigenous languages in North America spread more readily than the migration of people (i.e. gene flow) [Citation27]. This is also in agreement with earlier comprehensive analyses that used large datasets to find that genetic and language distances between populations are less correlated with geographic distance in North America compared to most regions worldwide [Citation6].

The geographic distribution of Native Americans has changed dramatically over the past five centuries owing to unique historical events such as early Spanish colonization, European immigration, and forced resettlement [Citation46]. Migration events, isolation by distance, and cultural diffusion are among the factors that can contribute to variation in rates of genetic, geographic, and language differentiation. Because languages are not only inherited linearly but can diffuse laterally across geographic space [Citation47], they change at a faster rate than genes resulting in discordances between genetic and language evolution over distances greater than 10 000 km [Citation48]. It has been maintained that a broad association between language and genetic differentiation exists [Citation20], but the lack of connections among language families and the absence of internal structures in them prevent languages and genes from evolving in synchrony [Citation48]. Additionally, recent shared history in human populations can further reduce the correlation between language and genetic diversification [Citation5, Citation6]. A weak association between genetic and language variation is particularly true of Native American populations, which have a relatively short genetic history in the Americas and have been very recently relocated to reservations or small geographic locations inhabited by speakers of multiple language families [Citation49, Citation50].

The Apache and Navajo tribes in the CODIS database are southwestern US speakers of the Athabaskan languages of the Na-Dene language family [Citation51, Citation52] and closely genetically resemble Athabaskans in Alaska and Canada. The Michigan, Minnesota, and northern Ontario tribes in the CODIS database all belong to the Algonquian language family and also closely resemble each other genetically, but only represent a minority of Native Americans in the US. As such, the current CODIS database comprises only members of the Algonquian (four languages), Na-Dene (the Athabascan, Navaho, and Apache languages), Eskimo-Aleut (the Inupiat and Yupic languages), and the Salishan language families while omitting the vast majority of tribes speaking other languages in North America and Mexico. This study’s genetic assessment of Native American language groups makes it evident that geographic barriers among distantly located populations result in both genetic and language isolation. As Hunley and Long [Citation53] observed, there may be pressure to maintain a language at particular geographical locations due to the cultural complex to which it belongs; however, genes may be free to flow in these situations. As such, this study has highlighted the lack of genetic correlation among most language families and isolates.

When the FST value for the three language fami­lies shared between our dataset and the CODIS database (i.e. Algonquian, Na-Dene, and Eskimo-Aleut) are combined, their FST value is 0.04, 33% higher than the 0.03 FST value recommended by the National Research Council for random match probability generation [Citation1]. When the five additional language families or isolates are combined with these three language families present in the CODIS database, FST=0.032, a value much closer to the recommended value, indicating the necessity to increase the number of language families included in the Native American CODIS database. The addition of a still greater number of language families to the CODIS database could undoubtedly provide a Native American database more representative of all Native Americans in the US and whose FST value is lower than the recommended value of 0.03. The use of frequency databases comprising additional language groups spoken by Native American profile contributors would provide more accurate random match probabilities, even if it does not account for the more rapid and complicated spread of languages than genes in North America between geographically disparate populations [Citation53]. The CODIS database should be expanded to cover other geographically and linguistically heterogeneous groups of Native Americans to reflect the levels of extant genetic diversity among them more precisely so that more accurate match probability estimates can be computed with adjustments for population structure.

Authors’ contributions

Jessica A. Weise collected the data and wrote the preliminary drafts. Jillian Ng analyzed the data and contributed to the revisions of the manuscript. Robert F. Oldt analyzed the data and contributed to the revisions of the manuscript. Kelly L. McCulloh assisted in the data collection Viray assisted in the data collection. Joy and revisions of the manuscript. David Glenn Smith edited the manuscript. Sreetharan Kanthaswamy conceived the idea as well as edited and wrote sections of the manuscript.

Compliance with ethical standards

The article involved work with archived human DNA samples. The study was conducted under the ethical approval of the Institutional Review Board of the University of California, Davis (IRB protocol no. 430207).

Acknowledgments

We would also like to thank Dr. Cecilia von Beroldingen for her insightful comments, which have helped improve this manuscript.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

This study was funded by a National Institute of Justice grant [grant number 2014-DN-BX-K024] to Sreetharan Kanthaswamy, and a research grant from the UC Davis Forensic Science Graduate Program to Jessica A. Weise.

References

  • Budowle B, Shea B, Niezgoda S, et al. CODIS STR loci data from 41 sample populations. J Forensic Sci. 2001;46:453–489.
  • Gray RD, Atkinson QD, Greenhill SJ. Language evolution and human history: what a difference a date makes. Philos Trans R Soc B Biol Sci. 2011;366:1090–1100.
  • Zegura SL, Karafet TM, Zhivotovsky LA, et al. High-resolution SNPs and microsatellite haplotypes point to a single, recent entry of Native American Y chromosomes into the Americas. Mol Biol Evol. 2004;21:164–175.
  • Dimmendaal GJ. Historical linguistics and the comparative study of African languages. Amsterdam (The Netherlands): John Benjamins; 2011.
  • Wang S, Lewis CM, Jakobsson M, et al. Genetic variation and population structure in Native Americans. PLoS Genet. 2007;3:e185.
  • Creanza N, Ruhlen M, Pemberton TJ, et al. A comparison of worldwide phonemic and genetic variation in human populations. Proc Natl Acad Sci U S A. 2015;112:1265–1272.
  • Rowe BM, Levine DP. A concise introduction to linguistics. USA: Routledge; 2015.
  • Campbell L. American Indian languages: the historical linguistics of Native America. Oxford (UK): Oxford University Press; 1997.
  • Hammer MF, Redd AJ. Forensic applications of Y chromosome STRs and SNPs, national criminal justice re­ference service NCJ publication number 211979, 2006.
  • Redd AJ, Agellon AB, Kearney VA, et al. Forensic value of 14 novel STRs on the human Y chromosome. Forensic Sci Int. 2002;130:97–111.
  • Redd AJ, Chamberlain VF, Kearney VF, et al. Genetic structure among 38 populations from the United States based on 11 U.S. core Y chromosome STRs. J Forensic Sci. 2006;51:580–585.
  • Lorenz JG, Smith DG. Distribution of sequence variation in the mtDNA control region of Native North Americans. Hum Biol. 1997;69:749–776.
  • Smith DG. Examining the farming/language dispersal hypothesis. Am J of Hum Geneti. 2005;76:190–192.
  • Kemp BM, Gonzalez-Oliver A, Malhi RS, et al. Evaluating the farming/language dispersal hypothesis with genetic variation exhibited by populations in the Southwest and Mesoamerica. Proc Natl Acad Sci USA. 2010;107:6759–6764.
  • Lorenz JG, Smith DG. Distribution of four founding mtDNA haplogroups among Native North Americans. Am J Phys Anthropol. 1996;101:307–323.
  • Tamm E, Kivisild T, Reidla M, et al. Beringian standstill and spread of Native American founders. PLoS One. 2007;2:e829.
  • Barbujani G. DNA variation and language affinities. Am J Hum Genet. 1997;61:1011–1014.
  • Smith DG, Lorenz J, Rolfs BK, et al. Implications of the distribution of albumin Naskapi and albumin Mexico for new world prehistory. Am J Phys Anthropol. 2000;111:557–572.
  • Greenberg JH, Turner CG, Zegura SL. The settlement of the Americas: a comparison of the linguistic, dental, and genetic evidence. Current Anthropology. 1986 doi:10.1086/203472.
  • Cavalli-Sforza LL, Piazza A, Menozzi P, et al. Reconstruction of human evolution: bringing toge­ther genetic, archaeological, and linguistic data. Proc Natl Acad Sci USA. 1988;85:6002–6006.
  • Eshleman JA, Malhi RS, Johnson JR, et al. Mitochondrial DNA and prehistoric settlements: native migrations on the Western edge of North America. Hum Biol. 2004;76:55–75.
  • Forster P, Harding R, Torroni A, et al. Origin and evolution of Native American mtDNA variation: a reappraisal. Am J Hum Genet. 1996;59:935–945.
  • Wilson AC, Cann RL, Carr SM, et al. Mitochondrial DNA and two perspectives on evolutionary genetics. Biol J Linnean Soc. 1985;26:375–400.
  • Schroeder KB, Schurr TG, Long JC, et al. A private allele ubiquitous in the Americas. Biol Lett. 2007;3:218–223.
  • Roewer L, Nothnagel M, Gusmao L, et al. Continent-wide decoupling of Y-chromosomal gene­tic variation from language and geography in Native South Americans. PLoS Genet. 2013;9:e1003460.
  • Szathmary EJ. Peopling of Northern North America: clues from genetic studies. Acta Anthropogenet. 1984;8:79–109.
  • McCulloh KL, Ng J, Oldt RF, et al. The genetic structure of Native Americans in North America based on the Globalfiler• STRs. Leg Med (Tokyo). 2016;23:49–54.
  • Ng J, Oldt RF, McCulloh KL, et al. Native american population data based on the Globalfiler• autosomal STR loci. Forensic Sci Int Genet. 2016;24:e12–e13.
  • Rosenberg NA, Mahajan S, Gonzalez-Quevedo C, et al. Low levels of genetic divergence across geographically and linguistically diverse populations from India. PLoS Genet. 2006;2:e215.
  • Malhi RS, Schultz BA, Smith DG. Distribution of mitochondrial DNA lineages among Native American tribes of northeastern North America. Hum Biol. 2001;73:17–55.
  • Malhi RS, Breece KE, Shook BA, et al. Patterns of mtDNA diversity in northwestern North America. Hum Biol. 2004;76:33–54.
  • Moretti TR, Moreno LI, Smerick JB, et al. Population data on the expanded CODIS core STR loci for eleven populations of significance for forensic DNA analyses in the United States. Forensic Sci Int Genet. 2016;25:175–181.
  • Excoffier L, Laval G, Schneider S. Arlequin (version 3.0): an integrated software package for population genetics data analysis. Evol Bioinform Online. 2007;1:47–50.
  • Szpiech ZA, Jakobsson M, Rosenberg NA. ADZE: a rarefaction approach for counting alleles private to combinations of populations. Bioinformatics. 2008;24:2498–2504.
  • Weir BS, Cockerham CC. Estimating F-statistics for the analysis of population structure. Evolution. 1984;38:1358–1370.
  • Raymond MaR, Rousset F. GENEPOP (version 1.2): population genetics software for exact tests and ecu­menicism. J Heredity. 1995;86:248–249.
  • Glaubitz JC. CONVERT: a user-friendly program to reformat diploid genotypic data for commonly used population genetic software packages. Mol Ecol. 2004;4:309–310.
  • Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945–959.
  • Evanno G, Regnaut S, Goudet J. Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Mol Ecol. 2005;14:2611–2620.
  • Campbell J. The hero with a thousand faces. New York (NY): Bollingen Foundation Incorporated; 1949.
  • Reich D, Patterson N, Campbell D, et al. Reconstructing Native American population history. Nature. 2012;488:370–374.
  • Budowle B, Chidambaram A, Strickland L, et al. Population studies on three Native Alaska population groups using STR loci. Forensic Sci Int. 2002;129:51–57.
  • Rosenberg NA, Pritchard JK, Weber JL, et al. Genetic structure of human populations. Science. 2002;298:2381–2385.
  • Rangel-Villalobos H, Martinez-Sevilla VM, Martinez-Cortes G, et al. Importance of the geographic barriers to promote gene drift and avoid pre- and post-Columbian gene flow in Mexican native groups: evidence from forensic STR loci. Am J Phys Anthropol. 2016;160:298–316.
  • Slatkin M, Barton NH. A comparison of three indirect methods for estimating average levels of gene flow. Evolution. 1989;43:1349–1368.
  • Bryc K, Durand EY, Macpherson JM, et al. The genetic ancestry of African Americans, Latinos, and European Americans across the United States. Am J Hum Genet. 2015;96:37–53.
  • Haynie HJ. Geography and spatial analysis in historical linguistics. Lang Linguistics Compass. 2014;8:344–357.
  • Hunley K. Reassessment of global gene-language coevolution. Proc Natl Acad Sci USA. 2015;112:1919–1920.
  • Green L. Choctaw removal was really a “trail of tears”. Bishinik. 1978;Nov:8–9.
  • Tubby R, Wells S. After removal: the Choctaw in Mississippi. Jackson (MS): University Press of Mississippi; 2004.
  • Smith DG, Malhi RS, Eshleman J, et al. Distribution of mtDNA haplogroup X among Native North Americans. Am J Phys Anthropol. 1999;110:271–284.
  • Flickinger GM, Yarbrough KM. Dermatoglyphics of Apache and Navajo Indians. Am J Phys Anthropol. 1976;45:117–121.
  • Hunley K, Long JC. Gene flow across linguistic boundaries in Native North American populations. Proc Natl Acad Sci USA. 2005;102:1312–1317.