Research Article

Has demography witnessed a data revolution? Promises and pitfalls of a changing data ecosystem

Abstract

Over the past 25 years, technological improvements that have made the collection, transmission, storage, and analysis of data significantly easier and more cost efficient have ushered in what has been described as the ‘big data’ era or the ‘data revolution’. In the social sciences context, the data revolution has often been characterized in terms of increased volume and variety of data, and much excitement has focused on the growing opportunity to repurpose data that are the by-products of the digitalization of social life for research. However, many features of the data revolution are not new for demographers, who have long used large-scale population data and been accustomed to repurposing imperfect data not originally collected for research. Nevertheless, I argue that demography, too, has been affected by the data revolution, and the data ecosystem for demographic research has been significantly enriched. These developments have occurred across two dimensions. The first involves the augmented granularity, variety, and opportunities for linkage that have bolstered the capabilities of ‘old’ big population data sources, such as censuses, administrative data, and surveys. The second involves the growing interest in and use of ‘new’ big data sources, such as ‘digital traces’ generated through internet and mobile phone use, and related to this, the emergence of ‘digital demography’. These developments have enabled new opportunities and offer much promise moving forward, but they also raise important ethical, technical, and conceptual challenges for the field.

Introduction

Writing on the 30th anniversary of the Population Association of America’s flagship journal Demography, Eileen Crimmins remarked that ‘the world of information processing in 1994 differs dramatically from that of 1964’ (Crimmins 1993, p. 579). This largely exogenous factor, she claimed, had shaped ‘changes in demographic analysis over the past 30 years’ by affecting both the kinds of data that were available to demographers and the modes of analysis they used (Crimmins 1993, p. 579). At the time of Crimmins’s piece, desktop computers (which had decentralized information storage, data processing, and analysis for demographers away from mainframe computing) had become widespread, but the (commercial) internet was still a nascent technology and only about 3 per cent of the world’s capacity to store information was digital (Hilbert and López 2011). The 1990s, however, marked the onset of the digital revolution, which saw radical transformations in information storage, transmission, and computational power: by 2007, the world’s information storage capacity was over 15 times greater than in the early 1990s, with 94 per cent of information storage in digital form. The most staggering increases were in computational power: compared with 1993, an average computer in 2007 was over 1,200 times faster (Hilbert and López 2011).

The digital revolution and its accompanying improvements in technological capacities ushered in what has been called the era of ‘big data’ or, in other formulations, a ‘data revolution’. Definitions of big data are often ambiguous, and the terms big data and data revolution had both been used prior to the late 1990s (e.g. Google Ngrams, which tracks the relative prevalence of specific words or combinations of words in English-language books since 1800, shows two peaks for the terms big data and data revolution, the first in the 1980s and the second in the 2000s; see Google Ngram viewer 2021). However, numerous commentaries and papers in the new millennium heralded these developments as the beginning of a new era for the social sciences (e.g. Laney 2001; Lazer et al. 2009; OECD 2013; Sagiroglu and Sinanc 2013; Einav and Levin 2014; Kitchin 2014; Connelly et al. 2016; Billari and Zagheni 2017; Alburez-Gutierrez et al. 2019). While different commentaries emphasize different features of the data revolution, a useful formulation is provided in the United Nations (UN) report A World that Counts, written in the context of setting the post-2015 international development agenda; this defines the data revolution as ‘an explosion in the volume of data, the speed with which data are produced, the number of producers of data, the dissemination of data, and the range of things on which there is data, coming from new technologies such as mobile phones and the “internet of things”, and from other sources, such as qualitative data, citizen-generated data and perceptions data’ (IAEG 2014, p. 6).

This description highlights how data in this era are more widely available and timelier, with greater variety in terms of their producers but also the kinds of information they provide. The report envisions tremendous potential in the ability to advance our understanding of all populations (‘no one should be invisible’, p. 22) and to use this knowledge to inform policymaking in the service of global sustainable development.

At the outset it would seem that demographers, who have always been quite data savvy and whose professional outputs have long been shaped by available information processing and analysis technology, would have much to gain from the data revolution. But to what extent has the data revolution been a new revolution for demographers, in the way that has been heralded for other social sciences? This is a question worth asking with at least some degree of critical reflection, because many of the features used to define the data revolution were arguably not novel for demographers, even in the 1990s. On one hand, if we think of the data revolution in terms of the proliferation of big data, as defined in terms of their volume or scale, demographers, in their quest for population measurement, have been using big data based on full enumeration (e.g. censuses or complete-coverage vital registers) for a long time, even for centuries in some European countries. Even demographic sample surveys have tended to have large samples, given the need for population generalizability. On the other hand, we may think of the data revolution in terms of the increase in volume and variety of ‘found’ (Connelly et al. 2016) or ‘ready-made’ data (Salganik 2019), such as those emerging as by-products of government, administrative, or digital transactions; social scientists seeking a unifying definition of big data have increasingly tended towards this view. From this perspective the idea of repurposing data originally meant for something other than research is not new to demographers, who have long been avid consumers of administrative data, such as population registers or, before the state became the monopolist of record-keeping, parish registers.
By the middle of the twentieth century, demographers working with these and other types of often-imperfect data had already developed skilful techniques for repurposing them, whether in the tradition of indirect estimation following on from the work of William Brass (Brass 1996) or in the meticulous linking of parish registers by Henry in France (Henry 1967) or by Wrigley, Schofield, and the Cambridge Group for the History of Population and Social Structure in England (Wrigley and Schofield 1983). These examples may lead us to think that demographers were using big data before it became fashionable, and that the data revolution in demography perhaps happened well before it did in the other social sciences.

So—did the data revolution happen in demography before it happened elsewhere? Or looking at the pages of Population Studies and other demographic journals, do we see signs of a ‘new’ data revolution in the field in the past 25 years? This paper attempts to address this question by reflecting on how the data ecosystem for demographic research has developed and how these developments have affected what demographers are able to measure, describe, and explain. My discussion is anchored around two dimensions of these developments. First, I outline changes in the kinds of data collected in and available from long-established (‘old’) sources that have conventionally informed demographic research: censuses, population registers, surveys, and civil registration systems. Next, I describe how demographers are beginning to use ‘new’ big data sources—enabled through the use of the internet and mobile phones and the growing digitalization of social life—and the emergence of the area of ‘digital demography’. I conclude by discussing the opportunities and challenges that the changing data ecosystem raises.

Data paradigms in demography

Before reflecting on how the landscape of demographic data has changed since the 1990s, it is helpful to understand where the field was in the mid-1990s, on the occasion of the 50th anniversary edition of Population Studies, and the changes that had occurred in the preceding decades. Demography has experienced key shifts in data paradigms over time (Billari and Zagheni 2017). These shifts are visible on the pages of Population Studies through the years. Prior to the mid-1970s, most papers in the journal relied on aggregate-level data, generally drawing on data sources such as censuses or other administrative records, and were focused on measuring and describing population-level processes and parameters. Often data (e.g. from censuses) were only available to demographers in tabulated form, and even if individual-level data were available, the data were deployed to characterize population-level patterns and, in many cases, examined with careful scrutiny to check their plausibility and quality. While description is often derided in other social sciences, a lot of the work published in the journal was precisely that: good description of population-level indicators and processes. Yet, as Samuel Preston noted in his contribution reviewing changes in mortality research in the journal’s 50th anniversary edition, the distinction ‘between description and analysis was in any event imprecise’ (Preston 1996, p. 529). Where data permitted, differentials or heterogeneity were studied, but opportunities for multivariate analyses were limited because of the aggregated nature of available data.

By the last quarter of the twentieth century, however, a second type of data paradigm, one that relied on individual-level data from sample surveys, had become well established, having benefited from the development of statistical sampling techniques in the post-war period. The World Fertility Survey (WFS), the predecessor to the ongoing and widely used Demographic and Health Survey (DHS), was conducted across 62 countries between 1974 and 1986 and was a first-of-its-kind attempt at fielding a comparable, cross-national survey to study fertility and its determinants (Cleland 1996). With the availability of surveys such as the WFS in the 1970s and 1980s, individual-level (statistical) analyses of survey data, including event history analyses of retrospective birth or life histories within them, became increasingly common. This shift in the demographic data paradigm was duly noted in a number of contributions in the 50th anniversary edition (e.g. Cleland 1996; Coale and Trussell 1996; Preston 1996).

Although the greater availability of sample surveys, especially for low-income countries where data were otherwise lacking or defective, was seen as a favourable development for demographic research, I get a sense reading the 1996 issue that the authors also viewed the rise of the survey-based micro-level paradigm within demography with some degree of caution. Cleland, in his paper on demographic data collection, acknowledged that the WFS/DHS model had made ‘immense contributions to our understanding of the processes of family formation, proximate determinants of fertility and socio-economic correlates of child mortality and childbearing’, but wondered whether something less expensive, ‘a simple household survey with larger samples’, might have better served the provision of demographic measures (Cleland 1996, p. 445). While this was a question about data, ultimately it was also a question about what the aims of demographic research are or ought to be (measurement and description at the macro level, the pursuit of explanation at the individual or micro level, or something else), a debate that has continued within the field (Billari 2015). Preston (1996, p. 535) envisioned an imminent decline in ‘the type of individual-level analyses of childhood mortality, so prominent in the 1980s’, as most of the relationships had been uncovered and found to be ‘remarkably consistent across time and space’. He saw diminishing returns to the standard regression-type statistical approaches unless some other research designs were applied. Coale and Trussell (1996), in their contribution on demographic models, wondered if the shift to micro-level analyses with survey data meant that issues of data quality and lessons learned from the use of macro-models for checking the validity and consistency of data were being neglected.

Fast forward 25 years and the reliance on individual-level survey data analysis within the demographic data ecosystem has definitely persisted but, as I outline in what follows, surveys have also changed in significant ways. The incorporation of different types of data (e.g. biosocial data) within surveys, along with the ability to link surveys with contextual data sources (e.g. geospatial and administrative data), have generated opportunities to address research questions with novel research designs. The increasing complexity of surveys and growing interest in new types of digital data sources have meant that issues such as selection, representation, and bias have yet again come to the forefront, and linked to this, questions of data quality, validity, and consistency are of renewed significance.

‘Old’ population data, new features

Censuses and large-scale administrative microdata

From 2000 onwards, the public availability of individual-level records from ‘old’ big data sources, such as censuses, administrative registers, and large-scale surveys, increased exponentially. Ruggles has described this in terms of an explosion in the availability of ‘big microdata for population research’ (Ruggles 2014). While about 100 million individual-level records were publicly accessible to the research community in 2000, by 2018 this number was estimated to exceed 2 billion, covering over 100 countries (Ruggles 2014). The availability of large-scale, even complete-enumeration, census samples has clearly benefited from the improvements in information storage, data processing, and extraction provided by the digital revolution. These technological improvements facilitated the development of several key data infrastructure projects that emerged in the late 1990s, including the Integrated Public Use Microdata Series (IPUMS) at the University of Minnesota, the North Atlantic Population Project (NAPP) for historical eighteenth- to early-twentieth-century European and North American census samples (Ruggles et al. 2011), and the Integrated Census Microdata (I-CeM) project for historical Britain (Schürer and Higgs 2014).

The idea of census records being released at the individual level, rather than in aggregated or tabulated form, was itself not new in the 1990s. From 1962 the United States (US) Census Bureau provided a 1-in-1,000 sample of long-form records from the 1960 Census to researchers, on computer tapes, and as costs of information storage and processing fell in the 1960s and 1970s, the scale of census microdata from the US Census Bureau continued to increase. In 1974 Statistics Canada made public use microdata files available for the Canadian 1971 Census, and a sample of anonymized records from the UK’s 1991 Census was made publicly available in 1993. Other statistical agencies in a handful of countries also made internal microdata available to academic researchers by special arrangement (Ruggles 2014). Before 2000, however, statistical agencies in most countries ‘had no systematic program for preservation or reuse of census microdata’, and as a result, ‘most machine-readable census data from the 1960s and 1970s had already disappeared by the 1990s’ (Ruggles 2014, p. 289).

In light of this, it is a remarkable achievement that over the past two decades IPUMS has built a public web-accessible repository that now provides the research community access to anonymized census and/or large-scale survey samples for nearly 100 countries, many of which are low- and middle-income countries (LMICs) where data availability was previously much more limited. Through the IPUMS data storage, processing, and extraction systems, data are interoperable and harmonized across time and space, and data sets can be downloaded swiftly from the internet. The availability of census samples has fuelled an increase in publications for parts of the world (e.g. Latin America, Africa) where previously the DHS was the only available demographic data source. Another category of microdata that has experienced tremendous growth is historical census microdata, which have become available through collaborations between researchers and internet-based genealogical organizations such as ancestry.com or findmypast.com. As historical samples face few confidentiality restrictions—often including information on names and dates of birth—attempts at linking individuals across census samples have also been made. In both historical and contemporary samples, individuals can be placed and related to others within their co-residential household context.

The proliferation of individual-level, high-density census samples has facilitated at least three types of opportunities for demographic research, examples of each of which are now visible in journals, including Population Studies.

First, the large, often complete, coverage of census samples has been helpful in analysing subpopulations with greater precision and has provided researchers with the flexibility to define and explore heterogeneity on their own terms for specific subgroups, thereby capturing previously under-studied differentials that could not be analysed with aggregated group-level data or using survey-based samples. For example, studies using census data have allowed for better statistical visibility and analyses of non-traditional family forms, such as same-sex couples (e.g. Festy 2007; Rosenfeld 2010) and non-cohabiting marriages (e.g. Ferrari and Macmillan 2019). The wider availability of individual-level samples for non-Western countries and historical populations has enabled multivariate analyses for populations where such analyses were much less widespread. At the same time, the greater availability of census samples has also led to the return and more detailed applications of indirect estimation methods originally developed for macro-level measurement, such as the own-children method (OCM) for estimating fertility. For example, Reid et al. used historical census data and OCM to estimate age-specific fertility and analyse differentials in age-specific fertility by place of residence and social class in the 1911 Census in England and Wales (Reid et al. 2020). Guilmoto used microdata from census samples to apply OCM to analyse gender bias in reproductive behaviour by reconstructing family relationships (e.g. sibling order) and sex from household information provided in census records (Guilmoto 2017). Confidence intervals around indicators commonly used to study son preference in reproductive behaviours (e.g. conditional sex ratios) tend to be large due to random variation arising from small sample sizes in surveys; this complicates the meaningful detection and interpretation of gender bias. Larger census samples offer ways to overcome these limitations.
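The sample-size point can be made concrete with a minimal sketch. It assumes a simple normal approximation on the log of the male-to-female ratio, which is an illustrative choice rather than the method used in any of the studies cited above; the function name sex_ratio_ci and the birth counts are hypothetical. The commonly cited benchmark of roughly 105 boys per 100 girls is used only as a reference point.

```python
import math


def sex_ratio_ci(male_births, female_births, z=1.96):
    """Approximate 95% CI for a sex ratio (males per 100 females).

    Assumption: normal approximation on log(M/F), whose variance is
    roughly 1/M + 1/F. Illustrative only.
    """
    ratio = male_births / female_births
    se_log = math.sqrt(1 / male_births + 1 / female_births)
    lower = ratio * math.exp(-z * se_log)
    upper = ratio * math.exp(z * se_log)
    return 100 * ratio, 100 * lower, 100 * upper


# Hypothetical counts with the same observed ratio (110 boys per 100 girls):
# a survey-sized subgroup versus a census-sized one.
survey_like = sex_ratio_ci(550, 500)
census_like = sex_ratio_ci(55_000, 50_000)

for label, (ratio, lower, upper) in [("survey", survey_like), ("census", census_like)]:
    print(f"{label}: {ratio:.1f} ({lower:.1f} to {upper:.1f})")
```

With the survey-sized counts the interval is wide enough to include 105, so the elevated ratio cannot be distinguished from chance; with the census-sized counts the interval lies entirely above it.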

Second, census data have enabled the incorporation of contextual, or more macro-level, characteristics (either at the household level, such as through information on co-residential members, or at different geographical levels) to understand and explain variability in demographic outcomes. They have also provided ample opportunities for the study of living arrangements and household contexts, thus becoming a primary resource for family demographers. They have helped to explore interactions between individual and geographical characteristics for understanding variations, for example in fertility over the course of historical fertility transitions through the use of multilevel models (e.g. Dribe et al. 2014; Klüsener et al. 2019). In addition, they have facilitated more geographically precise estimates of indicators linked to fertility (e.g. Schmertmann et al. 2013), internal migration (e.g. Rodríguez-Vignoli and Rowe 2018), and human development (e.g. Permanyer 2013) at subnational levels, thus helping to shift the conventionally national focus of demographic analysis. In general, the use of data with lower levels of aggregation that can be aggregated to higher levels allows for the adoption of a multilevel paradigm for demographic analysis and provides an opportunity to integrate micro- and macro-level traditions (Courgeau et al. 2007). Inferences at one level of aggregation may not generalize to other levels, but to be able to understand and explain these differences, we first need to be able to detect them.

Third, and perhaps most strikingly, the big population microdata made available through data infrastructures such as IPUMS are harmonized and comparable across time and space, and significantly easier to link with other spatio-temporally referenced data. In the case of some historical census samples, linking individuals and their children across censuses has also been carried out (largely for fathers, however, as linkage typically involves names, and women tended to change their name on marriage); this has led to the emergence of multigenerational research (Ruggles et al. 2018). The growing accumulation of cross-national, harmonized, and standardized individual-level data is a development that is not restricted to census microdata, but also applies to widely used survey data sets, such as the DHS and several health and ageing studies. Comparative cross-national research and studies covering changes over decades have flourished across demographic journals as a result of these developments. The accumulation of multiple samples across time and space has led researchers in some cases to drop the time dimension altogether and explore, for example, how demographic patterns linked to internal migration (e.g. Bell, Charles-Edwards, Ueffing et al. 2015) or family forms vary as a function of generalized macro-level social changes, such as economic and human development (e.g. Ferrari and Macmillan 2019; Pesando 2019) or educational expansion (e.g. Esteve et al. 2016). Demographers have long been interested in characterizing population change over time and space in the pursuit of empirical regularities in population dynamics and have been drawn to ideas of convergence across countries or regions in demographic variables as encapsulated by the idea of the demographic transition. With the accumulation of these data sources, investigations of these theories of demographic convergence are now empirically informed by data with significant global coverage. Empirical projects that have leveraged the accumulation of comparative cross-national data sets have often found that significant diversity remains unexplained by either economic or social variables, that convergence in demographic indicators is far from straightforward, and that heterogeneity, often between regions, is persistent (e.g. Esteve et al. 2012; Moultrie et al. 2012; Pesando 2019).

Although the harmonization of ‘old’ big data has been an exciting development and has injected new resources to sustain demographers’ ambitions to describe empirical regularities in populations, the value of this harmonization has been questioned in some cases. The ability to harmonize across census or survey samples has emerged from a sustained push towards the standardization of census questionnaires within the international statistical system, as well as a wider push for cross-national comparability within the global sustainable development agenda. However, whether this standardization yields measures and comparisons that are socially meaningful and reflect lived experiences is not always clear-cut. This point is illustrated in debates surrounding the utility and meaning of the concept of the household as used in censuses and cross-national standardized surveys, especially in the context of sub-Saharan Africa, where family living arrangements and union statuses are much more diverse and nuanced than those captured in these instruments (Randall et al. 2011; Randall and Coast 2015; Hertrich et al. 2020). In contrast to these national population-wide data, longitudinal data collection endeavours in smaller geographical areas, as exemplified by the Health and Demographic Surveillance System (HDSS) sites spanning several LMICs within the International Network for the Demographic Evaluation of Populations and their Health (INDEPTH) (Sankoh and Byass 2012), have more successfully incorporated culturally specific measures of household and family context while also capturing changes in households over time, for example due to migration (e.g. Townsend et al. 2002; Hosegood and Timæus 2006). However, the intensive resources and follow-up needed to run HDSS sites imply that these are generally limited to localized communities, and their wider generalizability to surrounding areas remains unclear.

The discussion so far has focused on the greater availability and harmonization of census microdata samples in the public domain. However, a separate question is whether the traditional full-enumeration decennial census model will continue as a method of administrative data collection and, relatedly, remain relevant as a data source for demographic research moving forward. Census-taking is resource-intensive and logistically complex, and it is also a political activity. Concerns about escalating costs, declining response rates, and privacy issues, as well as the need for more timely data and the excitement surrounding ‘new’ big data sources (such as mobile phones, satellite data, or web data), have regularly sparked political contestations and scientific discussions around the utility and sustainability of traditional censuses in recent decades, leading some demographers to speculate about an imminent ‘twilight’ of the census (e.g. Coleman 2013). Nevertheless, based on censuses undertaken in the 2010 census round (censuses covering 2005–14), the traditional census model of full-field enumeration has so far persisted. Although several countries have postponed their census dates in response to Covid-19 (UNECE 2021), this model seems likely to remain the dominant one for the 2020 census round (covering 2015–24). While the 2000 census round saw 28 countries unable to undertake any census due to political instability or lack of human resources, this number declined to 11 in the 2010 census round, in which 227 out of 241 countries (94 per cent) undertook a census of some form (Kukutai et al. 2015).

This expansion of censuses across the world in the 2010 round has nonetheless occurred alongside a diversification of census methods towards the combination of full enumerations with administrative registers or sample surveys, or the replacement of census-taking altogether with population registers, as accomplished by the Nordic countries. Kukutai et al. (2015) classified 39 countries out of the 227 that undertook a population census in the 2010 round as having conducted one using an alternative method. This experimentation with alternative modes of population enumeration is most advanced in Europe, where ‘a history of maintaining population registers and a broader public acceptance of personal data for statistical purposes’ has made this possible (Kukutai et al. 2015, p. 5). This diversification highlights a stark inequality between high- and low-income countries in the tools available for population enumeration. In contexts such as the Nordic countries, the replacement of censuses with population registers has enabled the production of high-quality, regularly updated population data; however, in low-income countries (particularly in sub-Saharan Africa), accurate and complete censuses are crucially needed but ‘likely to be least well-resourced or well-equipped’ (Moultrie 2016, p. 259). As Moultrie noted, in addition to significant lags between censuses in sub-Saharan Africa, the estimated completeness of census enumeration (a key indicator of data quality) remains unknown, as few countries conduct post-enumeration surveys and, for those that do, results are not reported transparently. This unreliability has ramifications for nationally representative surveys such as the DHS, for which the census serves as the sampling frame. Further, even as the core of census collection is affected by data quality concerns due to burdens facing national statistical offices (NSOs), paradoxically the demands on censuses to collect further information (e.g. on mortality) have only increased, due to continued deficiencies in vital registration systems.

To some extent, new technologies offer considerable possibilities to improve the cost efficiency of census operations, although their adoption remains uneven and affected by available resources and technical capacity. The most significant development in the use of technology in recent census rounds has been the use of cartographic and geospatial tools (GPS, GIS, satellite imagery, and aerial photography) for planning and for improving the quality of mapping (United Nations Statistics Division 2013). While this mapping has helped census planning operations, the use of remotely sensed data from satellite images in combination with administrative-unit based census data has also enabled the production of spatially disaggregated or gridded population estimates as outputs which more accurately represent the distribution of populations in space (Stevens et al. 2015; Wardrop et al. 2018). Ultimately the success of these ‘top-down’ techniques for producing high-resolution population estimates depends significantly on the quality of the census inputs and the overlap of census counts with mapped administrative units, which may be outdated in many low-income countries. In such settings, where full-enumeration census counts may be outdated or unavailable, ‘bottom-up’ approaches have been developed that use geospatial covariates extracted from satellite data in combination with census-like, population survey data available for a sample of areas to predict high-resolution population counts (Lloyd et al. 2017; Wardrop et al. 2018; Leasure et al. 2020).
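The core step of a ‘top-down’ approach can be sketched in a few lines. This is a deliberately simplified illustration under stated assumptions, not the procedure of the studies cited above: the function name disaggregate_top_down is hypothetical, and the cell weights are made-up numbers standing in for whatever a model fitted to satellite-derived covariates would predict.

```python
def disaggregate_top_down(unit_total, cell_weights):
    """Spread one administrative unit's census count across its grid cells.

    Each cell receives a share proportional to its non-negative weight, so
    the cell estimates always sum back to the census count for the unit.
    """
    if any(w < 0 for w in cell_weights):
        raise ValueError("weights must be non-negative")
    total_weight = sum(cell_weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return [unit_total * w / total_weight for w in cell_weights]


# Hypothetical unit: 1,000 enumerated people and four grid cells, with
# weights that might reflect, say, built-up area detected in imagery.
census_count = 1000
weights = [0.0, 1.0, 3.0, 6.0]
cells = disaggregate_top_down(census_count, weights)
print(cells)
```

Because the cell values are rescaled shares of the census total, any error in that total or in the unit boundaries carries straight through to the gridded output, which is why the quality of the census inputs matters so much for this family of methods.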

In the context of enumeration, technological innovations adopted in the 2010 round included the use of handheld devices and internet-based questionnaires, as well as the use of mobile phones for monitoring of field operations. The use of handheld and mobile devices with inbuilt georeferencing capabilities and the incorporation of auxiliary information such as time and date provided opportunities for improved data quality and validation checks. Optical data capture and web-based data dissemination were among other types of technological innovations adopted by NSOs. Face-to-face interviews using paper questionnaires remained the most common mode of enumeration in censuses, and internet-based enumeration became the second most common mode, although this varied strongly by region (United Nations Statistics Division 2013). While 44 per cent of countries in Europe and one-third of those in Asia offered internet-based enumeration, this option was not offered in South America or Africa. However, no country has so far used the internet as the sole mode of enumeration. The use of internet-based enumeration seems likely to be higher in the 2020 census round, and while this will have positive implications for the efficiency and timeliness of data collection, the use of multiple parallel modes of data collection also raises challenges. These include issues linked to the development of separate strategies, planning methods, and different skills and expertise for each mode, as well as data comparability and quality issues linked to mode effects. Although the intensification of the use of technological innovations in census-taking holds great promise, ‘such use also requires additional efforts to ensure that the planning, development, testing and implementation of these different applications is successfully achieved’ (United Nations Statistics Division 2020, p. 10).

While censuses, albeit in incrementally diversifying forms, remain the dominant mode of population enumeration in most of the world, population registers in Nordic countries and in other European settings, such as the Netherlands (e.g. Bakker et al. Citation2014), highlight the significant value of government administrative data. Through personal identification numbers unique to individuals, these longitudinal population registers allow for deterministic linkage of multiple administrative registers, thereby enabling analysis of multiple domains of the life course or linkage of individuals across generations. These data sources remain, in some sense, the gold standard for administrative data and one for other countries to aspire to, as the availability and use of individual-level administrative data continues to expand. The Nordic registers highlight the tremendous opportunities afforded by complete-population administrative registers, and they have facilitated the use of novel research designs including those involving multiple generations (e.g. Kolk Citation2014) or extended kin (e.g. Barclay et al. Citation2020), and the creation and linkage of unique kinds of contextual variables linked to neighbourhoods or workplaces (e.g. Lyngstad Citation2011; Holmlund et al. Citation2013). Even these gold-standard data, however, are not without their flaws, and over-coverage—arising from migration dynamics in which individuals who are not resident remain registered—is an issue that can introduce bias (Monti et al. Citation2019).

Ultimately, administrative registers—while ‘big’ and ‘deep’, to borrow Ruggles’ characterization of big population data, in terms of their coverage of people, events, or transactions—are shallow at capturing motivations, intentions, attitudes, and other subjective characteristics, something that surveys do better. A promising new direction for reinforcing the complementarities between these two types of data and enhancing the scope of administrative registers is their linkage with survey data sets used in demographic research, especially in Europe: for example in the Netherlands (e.g. Bakker et al. Citation2014), in the UK, with the linkage of the Millennium Cohort Study (Tate et al. Citation2006) and the British Household Panel Study (Sala et al. Citation2012), and in the Nordic countries, through linkage with the Generations and Gender Survey (Gauthier et al. Citation2018). As this linkage requires consent from survey participants, non-consent is likely to be an important source of bias in this approach. This selection bias, however, is likely to be negligible in contexts with a longer history of open, research-accessible administrative data (e.g. the Nordic countries) and more salient in other contexts where these developments are still new and public acceptance weaker (e.g. the UK) (Sala et al. Citation2012; Mostafa Citation2016).

Surveys

By the 1990s sample surveys had become the most widely used method of demographic data collection in high-income countries as well as LMICs, especially in the context of fertility and family research. In Europe and North America, surveys had also moved to longitudinal data collection in the 1970s, with the use of prospective cohort studies and other instruments that shifted the focus of analyses from demographic status at one time point to changes over time through the collection of repeated measurements for the same individuals (Crimmins Citation1993). The kinds of surveys used by demographers have generally been broad, often multipurpose, high-quality probability samples with carefully designed sampling frames that are nationally representative, with detailed questionnaires and complex designs. This type of data collection is highly resource intensive, as in addition to the efforts needed for appropriate sampling and planning, the recruitment of respondents for such surveys requires repeated callbacks and refusal conversion to ensure satisfactory response rates (Groves Citation2006). Standardized, nationally representative surveys widely used by demographers to study LMICs (e.g. the DHS) have also served multiple purposes, not just in terms of the topics that they cover but also in their use for both measurement and explanation of demographic outcomes. The DHS was already being used in the 1990s for estimating child mortality and fertility from retrospective birth histories, as well as to analyse theory-driven determinants of variation in these outcomes. The use of surveys for the measurement of demographic indicators was the consequence of weak or absent civil registration systems or inadequate census-based measures, along with the realization that indirect methods of demographic estimation applied to individual-level survey samples yielded reasonably good results (Cleland Citation1996).

Perusing the pages of Population Studies, it is evident that the use of sample surveys (both cross-sectional and longitudinal studies) has remained widespread in demographic research over the past 25 years. Nevertheless, concerns about declining response rates that are often non-random (especially in high-income countries), increasing costs associated with surveys due to additional recruitment efforts, worsening attrition in panel surveys, and lags in both data generation and reporting have raised challenges for the survey model of social research (Tolonen et al. Citation2006; Groves Citation2011; Mostafa and Wiggins Citation2015; Gauthier et al. Citation2018). Some of these discussions are quite similar to those described in the context of the relevance of censuses. Response rates in surveys in LMICs (e.g. the widely used DHS) remain high, but significant costs—and continued external technical assistance from the global North to the global South through aid agencies such as USAID—are needed to sustain them (Corsi et al. Citation2012; Short Fabic et al. Citation2012). Even in LMICs, urban residents may be less willing to respond, and panel attrition is often high and non-random, as seen in the longitudinal Study on Global Ageing and Adult Health (SAGE) (Kowal et al. Citation2012) and other household surveys (e.g. Alderman et al. Citation2001), although the impacts of attrition bias for coefficients in multivariate analyses are generally minimal (Alderman et al. Citation2001). Furthermore, longitudinal studies in LMICs, such as those collected by HDSS sites in the INDEPTH network, face significant challenges linked to the recruitment and retention of skilled personnel for data collection and management (Sankoh and Byass Citation2012).

Even as respondents have become less willing to respond, researchers are asking for and collecting more types of information from them. Since the 1990s, surveys designed and used by demographers have increased in scope and become more varied in the kinds of data they collect, for example through the integration of biological or psychological measures at the individual level, richer contextual features such as geographical measures, or information from other household members or peers. The growing detail and complexity of survey instruments has meant that the coordination of these studies requires greater collaboration and cooperation in large interdisciplinary teams. The inclusion of new kinds of data in surveys, extending beyond self-reported indicators—for example geographical coordinates or biospecimens (e.g. blood, saliva samples, or genetic data)—has also made procedures linked to seeking informed consent and protecting confidentiality of participants more demanding.

Changes occurring to the DHS exemplify many of the trends affecting surveys used in demographic research more broadly. Since the mid-1990s, the DHS programme has added new questions into the standard and household questionnaires, introduced new modules on behaviour (e.g. gender-based violence, women’s status, alcohol consumption), and integrated biomarkers covering a range of different health conditions (e.g. STIs, malaria, measles, and chronic conditions such as diabetes) across successive phases, moving beyond the collection of anthropometrics alone. The number of countries with sibling histories in the DHS for indirect measurement of maternal mortality and adult mortality has also grown, although concerns about data quality and mortality underestimation with these data remain (Bicego Citation1997; Timæus and Jasseh Citation2004; Masquelier Citation2013). In the first phase of the DHS, the household questionnaire consisted of 25 questions, and by the seventh phase it included 131 questions, after a peak of 226 in phase five. The increased length and complexity in the questionnaire has raised recurring discussions and concerns about deteriorating data quality (Pullum et al. Citation2013). Nevertheless, trends from more recent phases of data collection indicate that the general quality of the data in DHS surveys remains high, albeit with higher levels of missingness in some variables (e.g. use of antenatal healthcare) (Pullum Citation2019). While the quality of age and date information varies across regions, and issues of backward displacement of births or omission of recent births remain, these are relatively small and do not show systematic trends over time (Pullum and Becker Citation2014; Pullum and Staveteig Citation2017). Moving forward and looking more broadly across different surveys, it is quite possible that questions of data quality may become more significant and maintenance of quality standards may require more intensive efforts in light of the continued complexity of survey questionnaires.

Reflecting these changes in DHS questionnaires, the topics of studies using the DHS in demographic journals have also diversified. Of 114 papers using the DHS published in four demographic journals (Population Studies, Demography, Population and Development Review, and Demographic Research) between 1998 and mid-2020, 62 (54.4 per cent) were in the area of fertility, with child health the next most common category. Gender (e.g. Kishor and Johnson Citation2006), reproductive health (e.g. Magadi et al. Citation2003), and HIV and sexual behaviours (e.g. Bongaarts Citation2007) have also emerged as significant themes of study. In 1996 the DHS began collecting GPS coordinates of cluster locations (primary sampling units) and in 2003, georeferenced data sets became available, which made these coordinates available to researchers (with random coordinate displacement added to the data sets to protect respondent privacy). This enabled the layering of contextual or geographical features to the individual characteristics collected in the surveys, for example those linked to climate and the environment, proximity to health infrastructure and institutions, and other aspects of the built environment (e.g. Hathi et al. Citation2017; Østby et al. Citation2018; Andriano and Behrman Citation2020; Grace et al. Citation2021). Some of this geospatial augmentation of survey data has led to the integration of surveys with new types of remote sensing data, such as the use of night-time lights observed via satellites (e.g. Dorélien et al. Citation2013; Rotondi et al. Citation2020).
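The idea of random coordinate displacement mentioned above can be sketched as follows. This is an illustration of the general technique only, not the DHS procedure: the 5 km cap, the uniform draws of angle and distance, the example coordinates, and the function names are all assumptions made for the example.

```python
import math
import random

KM_PER_DEGREE_LAT = 111.32  # approximate length of one degree of latitude


def displace(lat, lon, max_km, rng):
    """Move a point in a random direction by a random distance up to max_km."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    distance = rng.uniform(0.0, max_km)
    dlat = distance * math.cos(angle) / KM_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of latitude.
    dlon = distance * math.sin(angle) / (
        KM_PER_DEGREE_LAT * math.cos(math.radians(lat))
    )
    return lat + dlat, lon + dlon


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    radius = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


rng = random.Random(42)                 # seeded so the example is repeatable
true_lat, true_lon = -1.2900, 36.8200   # hypothetical cluster centroid
released = displace(true_lat, true_lon, max_km=5.0, rng=rng)
shift = haversine_km(true_lat, true_lon, *released)
print(released, round(shift, 2))
```

For analysis, the practical consequence is that any contextual variable attached to a released coordinate (e.g. distance to a health facility) carries measurement error of up to the displacement cap, which is one reason such linkages often use buffers around the cluster location instead of exact point values.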

The accumulation of large bodies of individual-level data with the potential for augmentation through different types of linkage has also bolstered the application of individual-level causal analysis approaches in demography, drawing from the potential outcomes framework (Neyman–Rubin–Holland model) that has seen widespread adoption in economics. Although the applicability and relevance of this model of causation—which relies extensively on ideals of experimentation for the interpretation of causal effects and is grounded at the individual level—has been challenged by demographers (e.g. Bhrolcháin and Dyson Citation2007), the pursuit of quasi-experimental or natural experiment approaches for exploiting exogenous variation has clearly increased across demographic journals (e.g. Torche Citation2011; Andriano and Monden Citation2019; Polos and Fletcher Citation2019). The availability of data has played an important part in this rise, as the detection of natural experiments or identification of specific subpopulations exposed to a particular type of treatment or intervention (e.g. school reform, health policy, economic shock) in otherwise broadly multipurpose data sets requires enough data to be able to implement these research designs. Research designs paying closer attention to issues of selection, such as within-family designs, have increased, with the advent of survey data sets where multiple observations are available and individuals can be linked to other family or household members (e.g. siblings) (e.g. Barclay and Myrskylä Citation2016; Rana et al. Citation2021). Similarly, the growing availability of prospective longitudinal studies, especially outside Europe and North America, has allowed for designs to enable analysis of changes in individual-level outcomes over time and for time-invariant fixed characteristics to be controlled while examining the effects of time-varying exogenous factors (e.g. Frankenberg et al. Citation2005; Song and Burgard Citation2008; Saha and van Soest Citation2011). Longitudinal data collections in LMICs—such as the 49 INDEPTH HDSS sites across 19 countries and also collections in China (e.g. China Family Panel Studies, China Health and Retirement Longitudinal Study), Indonesia (e.g. Indonesia Family Life Survey), and India (e.g. India Human Development Survey)—have provided new opportunities for understanding health, ageing, and family processes in settings where data have otherwise been sparse or limited to cross-sectional surveys.
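How repeated observations allow time-invariant characteristics to be controlled can be shown with a small worked example of the ‘within’ (fixed-effects) transformation. The toy panel below is invented for illustration, with outcomes constructed as a person-specific constant plus twice the exposure; the function name `within_slope` is likewise hypothetical. The same logic applies to within-family designs if the grouping identifier is a family instead of a person.

```python
from collections import defaultdict


def within_slope(rows):
    """OLS slope of y on x after demeaning both variables within each group.

    rows is an iterable of (group_id, x, y). Subtracting group means removes
    anything that is constant within a group, observed or not.
    """
    groups = defaultdict(list)
    for group_id, x, y in rows:
        groups[group_id].append((x, y))
    sxy = sxx = 0.0
    for obs in groups.values():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            sxy += (x - mean_x) * (y - mean_y)
            sxx += (x - mean_x) ** 2
    return sxy / sxx


# Invented panel: y = person constant + 2 * x.
# Person A has a high constant (10) and low exposure; B the reverse (-5).
panel = [
    ("A", 0, 10), ("A", 1, 12), ("A", 2, 14),
    ("B", 3, 1), ("B", 4, 3), ("B", 5, 5),
]

fixed_effects = within_slope(panel)
# Ignoring the person identifier treats all rows as one group (pooled OLS).
pooled = within_slope([("all", x, y) for _, x, y in panel])
print(fixed_effects, pooled)
```

In this constructed case the pooled slope is negative because the person with higher exposure happens to have a lower baseline, while the within-person slope recovers the value of 2 built into the data. The estimator only removes confounding from characteristics that do not change over the observation window.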

The growing variety of data collected in surveys has also paved the way for new opportunities in interdisciplinary research, for example in the realm of biodemography. These developments are perhaps most prominently illustrated by research on health and ageing, where the integration of biomarkers linked to metabolic, cardiovascular, immune, and physical function in several population surveys has provided a deeper understanding of the physiological mechanisms and pathways by which social and demographic factors ‘get under the skin’ to affect health (Crimmins et al. Citation2010). Furthermore, these measures have provided an empirical basis for measuring and understanding biological risk and frailty, a key component of formal demographic models of mortality (Vaupel et al. Citation1979), and have helped to model the ‘morbidity process’ leading to mortality (e.g. Turra et al. Citation2005; Crimmins et al. Citation2010). A significant development for further strengthening the interface between demography and biology, with the potential to extend far beyond health research, has been the inclusion of genome-wide genetic data in large-scale longitudinal population surveys, such as the US Health and Retirement Study (Hauser and Weir Citation2010) and British cohort studies (O’Neill et al. Citation2019). In contrast to earlier studies analysing genetic influences on demographic behaviours (which relied on specialized samples, such as twins), genome-wide molecular genetic data offer the potential to examine heritability of traits in larger samples of unrelated individuals and to incorporate genetic measures (e.g. polygenic scores) into research designs to explore fundamental questions about the interaction between genes and the social environment (Mills and Tropf Citation2020). While still incipient, papers using genetic measures and exploring gene–environment interactions are now visible on the pages of demographic journals (e.g. Gaydosh et al. Citation2018; Fletcher Citation2019), and demographers have also led the discovery of the genetic architecture of traits through genome-wide association studies (GWAS) of reproductive outcomes (e.g. age at first birth), contributing their findings directly to journals in genetics (Barban et al. Citation2016). Demographic insights, such as those linked to population heterogeneity across cohorts (Tropf et al. Citation2017) and the importance of population representativeness (Mills and Rahal Citation2019), have potential to make vital contributions to the further growth of this interdisciplinary enterprise.

Many features of survey data have been designed to fill some of the gaps identified by demographers in the past. For example, two types of concerns often noted with the use of survey data, especially in the literature on fertility and family, have been the focus on women and the analysis of determinants solely at the individual level without consideration of wider contextual features (e.g. Watkins Citation1993; Greene and Biddlecom Citation2000). Although men’s questionnaires were incorporated into the DHS from 1987, the use of these surveys to study fertility and fertility decisions from men’s perspectives was limited (e.g. Dodoo Citation1998; Schoumaker Citation2017). Standardized surveys, such as the DHS, do not collect data on other household members, although the importance of intergenerational influence is being recognized and information collected in some survey infrastructures, such as the Generations and Gender Programme (Dykstra et al. Citation2006; Gauthier et al. Citation2018).

Although theories on fertility behaviour and transitions have emphasized the importance of social networks for social learning and information diffusion (e.g. Bongaarts and Watkins Citation1996; Bernardi and Klärner Citation2014), the data available to study these processes empirically in existing multipurpose surveys are still quite limited. Some promising attempts at analysing networks have been made in the context of more localized and specialized studies and samples, for example in Kenya (Kohler et al. Citation2001) and Malawi (Helleringer et al. Citation2009). Looking forward, digital technologies offer potential for more detailed and cost-efficient data collection on social networks and interactions within existing surveys (e.g. through measurement via mobile apps and sensors) and small-scale prototypes of such efforts already exist (see ‘New big data and digital demography’ section). In the meantime, in the absence of empirical data, efforts at exploring the effects of social networks on demographic processes have relied on agent-based simulation models, for example in the study of reproductive behaviours (e.g. Diaz et al. Citation2011; González-Bailón and Murphy Citation2013), marriage (e.g. Billari et al. Citation2007; Bijak et al. Citation2013), and migration (Entwisle et al. Citation2016; Klabunde et al. Citation2017). More broadly, agent-based modelling offers a novel opportunity to integrate social learning and macro–micro feedback mechanisms and to test the implications of micro-level theories at the macro level and integrate multiple data types (Grow and Bavel Citation2016; Kashyap and Villavicencio Citation2016; Willekens et al. Citation2017). Although the development of this kind of ‘system-based’ approach to population modelling (Courgeau et al. Citation2017) has seen momentum develop around it in the past decade, it is still not mainstream. An increasingly salient mechanism of diffusion is the use of technologies such as mobile phones and the internet. These are channels for information diffusion, social learning and feedback, and exposure to the life of others. Demographic surveys, in contrast, have been slow to incorporate information on the use of these technologies, but are now beginning to do so (e.g. Rotondi et al. Citation2020).

Civil registration and vital statistics systems

Civil registration and vital statistics (CRVS) systems remain a valuable source for demographic research on mortality and fertility in high-income countries where coverage of vital events in these systems is complete. For these countries, the launches of two online data collections—the Human Mortality Database (see HMD Citation2021) in 2002, and the Human Fertility Database (see HFD Citation2021) in 2009—have been important milestones for the provision of high-quality, internationally comparable data on vital events and demographic measures derived from them. The HMD, a collaboration between the University of California, Berkeley and the Max Planck Institute for Demographic Research, has focused on providing detailed data over long time periods, including the provision of both cohort and period measures and reliable data at advanced ages, an increasingly important consideration with continued improvements in longevity (Barbieri et al. Citation2015). Although HMD data are normally provided for annual counts, in response to the needs for monitoring the mortality impacts of the Covid-19 pandemic, the HMD team developed the Short-term Mortality Fluctuations (STMF) data series within the HMD in 2020, providing harmonized weekly mortality data for several countries. The HFD, a collaboration between the Max Planck Institute for Demographic Research and the Vienna Institute of Demography, followed on from the success of the HMD and emerged in a context of increasing interest in later childbearing and low fertility in industrialized countries (Jasilioniene et al. Citation2016). The analysis of these trends required a closer understanding of both period and cohort changes, along with changes in timing and parity-specific trends, all of which are areas where the HFD provides high-quality and cross-nationally comparable data to facilitate demographic research.

CRVS systems remain significantly underdeveloped in LMICs. A review by Mahapatra et al. (Citation2007) estimated that in the period 1995–2004, only 26 per cent of the world’s population lived in countries with complete registration of deaths and 30 per cent in countries with complete registration of births (defined as at least 90 per cent of events registered). While coverage was generally high for populations in Europe and the Americas, the worst coverage was in Africa and Asia. Improvements in civil registration systems have been slow or stagnant since the mid-1960s, although some countries with poor systems in the 1980s witnessed substantial progress in subsequent decades (e.g. Baltic states, South Korea, and Latin America including Brazil, Mexico, and El Salvador). Others more recently, in the 2000s, have shown improvements over time periods of less than a decade (e.g. Bahrain, Cyprus, Egypt, and Malaysia). These latter examples indicate that progress over shorter time spans is possible, with ‘purposeful policies’ and when ‘new ICT technologies are applied’ (Mikkelsen et al. Citation2015, p. 1405).

The deficiency of vital registration systems in LMICs has resulted in continued reliance on different strategies of ‘interim substitutes’ (Hill et al. Citation2007, p. 1726) for the estimation of mortality and fertility. The most widely used have been sample surveys such as the DHS and UNICEF’s Multiple Indicator Cluster Surveys (MICS), followed by censuses and, in some cases, sample registration systems (India, China) and HDSS sites (for small geographic areas in a handful of countries). While little change has occurred in the basic data used for the generation of fertility and child mortality indicators from retrospective birth histories, methodological refinements drawing on the use of more sophisticated statistical models for estimation have been made (Schoumaker and Hayford Citation2004; Schoumaker Citation2013; Alkema et al. Citation2014). Efforts have also been directed at addressing selection biases in the use of sibling histories for adult mortality estimation (Gakidou and King Citation2006; Obermeyer et al. Citation2010; Masquelier Citation2013). An important development in this area has been the shift towards Bayesian statistical approaches that take a more integrated and nuanced approach to the quantification of uncertainty associated with demographic indicators computed using one or multiple sources of imperfect data and modelling of trends from them (Alkema et al. Citation2012, Citation2014, Citation2016; Bijak and Bryant Citation2016; Wheldon et al. Citation2016; Alexander and Alkema Citation2018).

Sample surveys have generally been more successful at estimating child mortality and fertility than at addressing data gaps in adult mortality. This data gap has meant that other population data collection exercises conducted by NSOs (e.g. censuses) have faced increased demands and burdens, for example through the inclusion of additional modules to monitor mortality. The collection of mortality information by age and sex in censuses for a reference period preceding the census has increased considerably, particularly after the UN working group for the 2010 round of population censuses recommended including these questions for those countries without alternative sources of adult mortality. By the 2010 census round, 72 countries incorporated mortality questions (a number that included countries across Africa, Asia, and Latin America), up from 53 in 2000 and 37 in the 1990 round (Hill et al. Citation2018). In addition to adult mortality, attempts have been made to estimate maternal mortality from census data through the inclusion of questions on the timing of deaths of women of reproductive age relative to pregnancies, thereby providing a way to measure pregnancy-related deaths. The number of countries with coverage of these questions expanded from 17 in the 2000 census round to 61 in the 2010 census round (Hill et al. Citation2018). Assessments of the effectiveness of census mortality questions have shown that estimates vary by type of method used with these sources (e.g. direct or indirect) (Odimegwu et al. Citation2018). Ultimately, while these approaches may be cost effective, and methodological developments have improved the insights gleaned from them, they are far from perfect substitutes for the development of civil registration systems and are best viewed ‘as complementary with powerful synergistic potential’ (Hill et al. Citation2007, p. 1733).

Compared with mortality and fertility, measurement of the third component of population change—migration—remains much more elusive, although progress has been made. The UN database on international migration, developed originally from research at the University of Sussex (Parsons et al. Citation2007) and subsequently extended and backdated by the UN and World Bank to 1990 (Skeldon Citation2018), provides biannual data available on international migrant stock by age, sex, and origin and destination countries and has been a significant achievement (UN Department of Economic and Social Affairs Citation2019). These tables of migrant stocks have been used to derive sequential flows (Abel and Sander Citation2014). Although the generation of data repositories is valuable, issues linked to data quality arising from different definitions of migration across states, different systems to enumerate migrants, and changing temporal dimensions of mobility all affect the interpretation of the numbers they provide (Skeldon Citation2018). The generation of cross-nationally comparable indicators of internal migration has also benefited from the improved availability of census microdata (Bell, Charles-Edwards, Kupiszewska et al. Citation2015; Bell, Charles-Edwards, Ueffing et al. Citation2015) and administrative registers linked to census data may further improve prospects for research in this area (Ernsten et al. Citation2018).

‘New’ big data and digital demography

The previous sections have outlined how, with the digital revolution, technological improvements in the past 25 years have helped to expand the capabilities of ‘old’ big population data sources, such as censuses and surveys. The spread and use of digital technologies, such as the internet, mobile phones, sensors, and cameras, as well as the increasing digitalization of different domains of social life, have themselves resulted in large volumes of new types of data on human activities, interactions, and behaviours—a category of data that has come to be known as ‘digital trace’ data.

The accumulation of digital traces has occurred as a result of two processes. First, the adoption of internet and mobile technologies implies that social life is increasingly digitally mediated. For example, mobile phones and email are commonly used for communication, web search engines (e.g. Google) are used for information-seeking, and social media platforms (e.g. Facebook, Twitter) are used for social interaction and exchange. These are essentially digital spaces where use of and engagement with these platforms and technologies generates digital records and data streams, which are regularly captured because these data are intrinsic to the business models of the private companies that provide these services, for example, to target advertisements. Second, the digitalization of data has resulted in the storage of diverse types of information—including about non-digital or offline life—as digital records of human activities. This development has clearly benefited the creation of repositories of long-standing population data sources, including previously paper-based records such as those from historical censuses. However, vast amounts of information linked to everyday activities (e.g. image and video recording of city life, consumer transactions) are also digitally stored and accessible, as are repositories of culturally or scientifically relevant materials, such as books and scientific journals.

Recent years have seen an emerging and increasing use of digital trace data sources for research on demographic topics or using demographic approaches, paving the way for the development of digital demography (Cesare et al. Citation2018). It is important to emphasize a key shift: that a lot of this work is now being done by demographers, often in collaboration with computer or information scientists, and is targeted to an audience of demographers, as well as others working broadly in the area of computational social science. Demographers are also joining discussions with international agencies and NSOs, who are increasingly interested in population measurement from non-traditional data sources (e.g. IUSSP Citation2015; Letouzé and Jutting Citation2015). The growth of this area is signalled by the regular presence of sessions on big data at international population conferences and the convening of two International Union for the Scientific Study of Population (IUSSP) panels, the first on ‘Big Data and Population Processes’ (2015–18) followed by the panel on ‘Digital Demography’ (2018–21) (see IUSSP Citation2021a). While this work has now begun to appear in demographic journals, a lot has been published in peer-reviewed proceedings of computer science conferences, where work in the areas of social computing, computational social science, and social informatics has grown rapidly and has a longer precedent.

Defining features of digital traces

What makes digital trace data a type of ‘new’ big data for population research that is different from ‘old’ big population data sources? Two features distinguish them: the first pertains to the properties of the data and the second to the data-generating process or provenance of the data. Many definitions of digital big data emphasize their novelty in terms of their properties: volume, variety, and velocity (Laney Citation2001; Sagiroglu and Sinanc Citation2013; Lazer and Radford Citation2017). While volume and variety are relevant, the more notable distinctions between population data sources (such as censuses and surveys) and digital trace big data lie in their velocity and the fact that these data are not systematically collected for research. Salganik has characterized these data sources as ‘ready-mades’ which, in contrast to ‘custom-made’ data such as censuses and surveys, are ‘always on’ and ‘non-reactive’ (Salganik Citation2019). The fact that these data are often by-products of the use of digital platforms provides opportunity for a more dynamic or continuous measurement in real time as events occur (‘always on’), unlike that of data collection models such as decennial censuses or surveys, which involve asking questions at discrete points in time and require time for planning and fieldwork followed by additional lags for data production. Digital trace data also offer the potential to observe without asking (‘non-reactive’), a feature that is promising but also raises important challenges for the use of data sources from an ethical standpoint. It is worth noting though that the non-reactive nature of data generation is not unique to digital traces and also holds for several (non-census) administrative data sources (including population registers, tax records, and electronic health records) where the data are not systematically collected for research but generated as by-products of administrative transactions.

These differences in properties and provenance imply a number of features and issues that are unique to these data and relevant when considering their applications for demographic research. The first is their usability. Unlike rectangular data frames with rows and columns, many digital trace data are unstructured, messy, and come in formats unfamiliar to many demographers (e.g. JavaScript Object Notation (JSON)). They span media such as images, text, and time- or geographically stamped records of activity (e.g. metadata associated with calls, geographical location captured by apps, or geotags associated with specific tweets or posts). Constructs or covariates to enable statistical analysis (e.g. regression) can be deduced from these data, but these constructs must first be operationalized based on what is observed, in contrast to data collection that occurs after the definition of a construct, as is usually the case in survey research (Cesare et al. Citation2018). The variety of formats, units of analyses, and also sizes of these data sets, which may contain millions of records, often require computational approaches for data management, retrieval, and analysis that are not yet a part of mainstream demographic training.
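To make the usability point concrete, the following is a minimal sketch of how nested JSON records might be reshaped into the rectangular form familiar to demographers. The two records and all of their field names are hypothetical and do not reflect the schema of any real platform; the sketch uses only the Python standard library.

```python
import json

# Hypothetical geotagged post records, loosely imitating the nested JSON
# that a platform might return; every field name here is an assumption.
raw = """
[
  {"id": 1, "user": {"id": "u1", "lang": "en"},
   "created_at": "2019-03-01T10:15:00Z",
   "geo": {"country": "IT", "coordinates": [12.49, 41.90]}},
  {"id": 2, "user": {"id": "u2", "lang": "es"},
   "created_at": "2019-03-01T11:02:00Z"}
]
"""


def flatten(record, parent_key="", sep="."):
    """Turn a nested dict into a flat dict with dotted column names."""
    row = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            row.update(flatten(value, name, sep))
        else:
            row[name] = value  # lists and scalars are kept as single cells
    return row


rows = [flatten(r) for r in json.loads(raw)]

# Records differ in which fields they carry, so take the union of columns
# and fill the gaps with None to obtain a rectangular table.
columns = sorted({col for row in rows for col in row})
table = [[row.get(col) for col in columns] for row in rows]

print(columns)
for line in table:
    print(line)
```

Even in this toy case, the second record lacks any location information, so the analyst has to decide how to treat missing fields before any covariate can be constructed.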

The second issue is that of bias and representativeness. Although the use of digital technologies has increased significantly, there are likely to be selection biases in who uses specific technologies, devices, or platforms, which limit the ability of these data sources to be population generalizable, a condition that has been a sine qua non for the demographic enterprise. These biases are likely to be even more severe in LMICs, where, despite data gaps being more significant, technological penetration remains uneven and digital divides are larger (ITU Citation2020). This limitation may help to explain why demographers have often been sceptical of digital data sources, although as discussed in the previous section, issues of selection bias or non-response bias are not absent from more traditional demographic data sources (e.g. incomplete vital registers or surveys). Indeed, issues of data quality and measurement are ones that demographers have a long history of actively seeking to understand and address, and as I discuss later, emerging work by demographers working with digital traces indicates the extension of these insights to these new sources. Moreover, the lack of population representativeness of these data is not in itself a limitation; whether it matters depends on how the data are used in the context of a given research design. While selection bias may threaten the population generalizability of the estimation of levels of a quantity of interest, trend analysis is still feasible with biased data, using difference-in-difference type approaches (Zagheni and Weber Citation2015). Digital trace data, however, are affected by their own specific types of errors and bias. For example, they may be affected by algorithmic bias, whereby algorithms that are implemented on online platforms can shape behaviours, such that it may be difficult to assess whether an observed phenomenon is driven by human tendencies or algorithms (Lazer et al. Citation2014; Salganik Citation2019).
A notable example of algorithmic bias is from one of the first and most famous applications of digital trace data for public health surveillance, Google Flu Trends, a platform that relied on aggregate web search queries for tracking flu outbreaks (Ginsberg et al. Citation2009). This platform was ultimately shut down in 2015, in response to the system beginning to overpredict flu prevalence systematically. A key factor underlying this was that flu-related searches increased significantly due to changes in Google Search’s recommendation algorithm that promoted related search keywords to users (Lazer et al. Citation2014). A related issue to changing user behaviours is changing user composition—for example, given a landscape of changing service providers, migration between platforms may occur such that observed correlations from a given period may not persist.
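The claim that trends can be recovered from biased data, and the caveat about changing user composition, can be illustrated with a deliberately simplified sketch. It is not the method of any study cited here: it assumes a hypothetical platform that captures a fixed, unknown share of the true population in every year, and all numbers are invented.

```python
# Illustrative numbers only: none of these values come from a real platform.
true_migrants = {2015: 100_000, 2016: 110_000, 2017: 132_000}

# Assumption: the platform captures a fixed, unknown share of the true
# population in every year (time-invariant selection bias).
coverage = 0.35
observed = {year: n * coverage for year, n in true_migrants.items()}


def growth_rates(series):
    """Year-on-year relative change, keyed by the later year."""
    years = sorted(series)
    return {
        later: series[later] / series[earlier] - 1
        for earlier, later in zip(years, years[1:])
    }


true_growth = growth_rates(true_migrants)
observed_growth = growth_rates(observed)

for year in true_growth:
    print(year, round(true_growth[year], 3), round(observed_growth[year], 3))

# Levels are badly biased, but the constant coverage factor cancels in ratios.
level_error = observed[2017] / true_migrants[2017] - 1
print("relative error in 2017 level:", round(level_error, 2))
```

The cancellation depends entirely on the coverage share staying constant. If the share drifts over time, for instance because users migrate between platforms or an algorithm changes, the observed growth rates no longer match the true ones.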

Third, these data often come from and are owned by private companies, which has implications for their access, for the information available about them due to the often-restricted nature of proprietary algorithms that shape them, and by extension for issues such as research reproducibility. An example of this in the context of demographic research is provided by the Facebook advertisement platform, which consists essentially of marketing data about Facebook audience counts (users) provided to potential advertisers. These data have been repurposed for demographic studies of migration (e.g. Zagheni et al. Citation2017; Alexander et al. Citation2020; Rampazzo et al. Citation2021) based on the promising correlations observed between Facebook ‘expats’ and estimates of migrant stocks measured in large population surveys. While general descriptions of the expat category are provided by Facebook, the individual attributes determining whether a user is labelled an expat—whether based on geolocation or individual profile information, for example—remain unknown. Further, the numbers of expats as defined by Facebook also dropped significantly from March 2019 with little warning, likely due to changes in the underlying algorithm (Rampazzo et al. Citation2021). More broadly, the landscape of access to and use of these sources from a research perspective is unpredictable and uneven, and has arguably become even more restrictive in recent years in the face of legal changes, such as the introduction of the General Data Protection Regulation (GDPR) in Europe and privacy concerns (Freelon Citation2018; Bruns Citation2019). While some companies make their data (e.g. Twitter) or some form of aggregated data (e.g. Google Trends) available widely as per their terms of use, others have policies to restrict use to specific cases. 
Developing more democratic modes of access to these data sources for research, which focus on privacy preservation while also recognizing the data’s scientific value and which move beyond ad-hoc, private non-disclosure agreements, is desirable, but large-scale prototypes of such mechanisms for access do not exist for the moment. While some prototypes are emerging (e.g. the Open Algorithms Project (OPAL Citation2021) that seeks to provide mobile phone data to researchers) and the Covid-19 pandemic accelerated the development of several initiatives for data sharing between private companies and university partners (e.g. Facebook’s Covid-19 Symptom Survey and other Facebook ‘Data for Good’ outputs (Facebook Citation2021); also the COVID-19 Mobility Data Network (Citation2021)), the development of frameworks for wider data sharing and even co-production is still much needed.

A fourth issue linked to the use of digital traces is that they are generated by users of specific services or technologies, who have not provided informed consent for the purposes of research. The absence of informed consent, a fundamental principle of survey research, in the generation of these data raises important ethical issues for researchers to consider when using them. On these grounds, it is argued that standards of privacy protection applied to them should be higher than those applied to data collected within the parameters of informed consent (Oberski and Kreuter Citation2020).

Demographic research with digital trace data

Two broad types of uses of digital trace data have emerged in the area of digital demography. The first strand of work has examined how digital trace data can be repurposed for measuring population indicators and processes, as well as understanding the contexts of demographic behaviours, particularly in domains where there are gaps in traditional sources of demographic data or measurement may be difficult. Given the challenges of migration measurement, considerable work in this vein has explored the potential of digital traces generated from the internet for estimating migration patterns. Nevertheless, examples of studies on mortality and fertility have also emerged. Crowdsourced online genealogies have been used for analysing longevity, including research on intergenerational correlations (Fire and Elovici Citation2015; Kaplanis et al. Citation2018). Aggregate Google Search queries conceptualized to proxy birth intentions have been used to predict fertility patterns (e.g. Billari et al. Citation2016; Wilde et al. Citation2020) and estimates of new parents on Facebook used to predict men’s fertility (Rampazzo et al. Citation2018).

For migration, work was started by Zagheni and Weber, who used data on IP geolocation of a large population of Yahoo email users to estimate trends in international migration rates (Zagheni and Weber Citation2012). Other efforts have drawn on professional histories on networking sites—such as LinkedIn (State et al. Citation2014), bibliometric databases of scholarly publications such as Web of Science (Aref et al. Citation2019), air traffic flows (Gabrielli et al. Citation2019), social media such as Twitter (Fiorio et al. Citation2017; Yildiz et al. Citation2017), and Facebook’s advertising platform (Zagheni et al. Citation2017; Alexander et al. Citation2019; Rampazzo et al. Citation2021)—to generate measures of flows and stocks of international or internal migrants and, in some cases, to integrate analyses of internal and international migration. While much of this work has been motivated by a need to fill data gaps in conventional sources, there have also been efforts to explore theoretical ideas by leveraging some of the flexible properties of digital traces. For example, Fiorio et al. Citation2021 explored how (re)defining temporal or geographical intervals of measurement can affect measured migration intensities. Aside from internet data, changes in the spatio-temporal distribution of mobile phone users detected using timestamped call detail records have been used to map population mobility, to capture seasonal internal mobility patterns that would otherwise be missed in conventional ‘slower’ data sources such as censuses (Blumenstock Citation2012; Deville et al. Citation2014; Lai et al. Citation2019), and to identify mobility changes in response to a crisis (Wilson et al. Citation2016). In contrast to the studies based on internet data sources, which largely focus on high-income settings, studies using mobile phones are more likely to cover LMICs, given the wider diffusion of mobile phones relative to the internet. 
The Covid-19 pandemic has further illustrated the value of using mobile phones for monitoring mobility changes as a broader tool for pandemic response strategies (Grantz et al. Citation2020; Oliver et al. Citation2020) and has seen a renewed call for more open access to aggregated mobility data from mobile phones (Buckee et al. Citation2020).

A distinct feature of demographic work using digital trace data has been its attention to examining the viability of these sources for population-generalizable measurement (e.g. Yildiz et al. Citation2017; Zagheni et al. Citation2017; Alexander et al. Citation2020). This has involved modelling and understanding biases in the quantities computed using different digital platforms by calibrating them against ‘ground truth’ measures, for example, derived from large, representative surveys or censuses. This type of approach, where parameters are estimated by linking faulty or imperfect data to more reliable data through a model, is similar to the logic of model life tables, which have been widely used to improve coverage of mortality estimates in low-income countries with deficient vital registration (Coale and Trussell Citation1996). In a similar way, some studies have provided examples of how digital trace data could improve country coverage of indicators linked, for example, to men’s fertility (Rampazzo et al. Citation2018) or gender inequality (Kashyap et al. Citation2020) in LMICs, by estimating models that link signals from social media to ground-truth indicators. Data on consumer transactions (Longley et al. Citation2018; Lansley et al. Citation2019) and satellite images (Lloyd et al. Citation2017; Leasure et al. Citation2020) can provide finer spatial resolution to subnational population estimates when combined with census and administrative data sets. More commonly, though, studies have articulated the value of digital traces in terms of their higher-frequency temporal resolution for nowcasting current patterns of demographic indicators before these appear in official statistics (e.g. Billari et al. Citation2016; Fiorio et al. Citation2017; Kashyap et al. Citation2017; Alexander et al. Citation2020; Wilde et al. Citation2020) or for monitoring changes during crises when traditional forms of data collection may be infeasible (e.g. Alexander et al. Citation2019; Palotti et al. Citation2020). A number of nowcasting efforts have found that the best predictions are generated by combining signals derived from digital traces with sources such as large-scale surveys and administrative data sets, emphasizing the former’s value as complements, not substitutes. This has paved the way for efforts developing statistical frameworks that integrate both types of sources and quantify uncertainty, capitalizing on their relative strengths and weaknesses (e.g. Lansley et al. Citation2019; Alexander et al. Citation2020; Rampazzo et al. Citation2021).
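The logic of calibrating a digital signal against ‘ground truth’ can be sketched with a simple log-linear regression. This is only a schematic illustration under strong assumptions: the five country pairs are invented, the functional form is chosen for convenience, and the studies cited above use richer models that also quantify uncertainty.

```python
import math

# Made-up (platform signal, survey benchmark) pairs for countries where both
# exist; the numbers are purely illustrative.
calibration = {
    "A": (1_200, 4_100),
    "B": (5_000, 15_500),
    "C": (20_000, 52_000),
    "D": (800, 3_000),
    "E": (9_500, 27_000),
}

x = [math.log(signal) for signal, _ in calibration.values()]
y = [math.log(truth) for _, truth in calibration.values()]

# Ordinary least squares for log(truth) = a + b * log(signal).
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar


def predict(signal):
    """Model-based estimate for a country with a platform signal but no survey."""
    return math.exp(a + b * math.log(signal))


print(f"intercept={a:.3f}, slope={b:.3f}")
print(f"predicted benchmark for signal 3,000: {predict(3_000):,.0f}")
```

The fitted relationship is then applied to a country that has the platform signal but no benchmark, which is the sense in which such models can extend the coverage of an indicator.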

Digital traces of different types can provide complementary measurement of contexts and environments, expressions of identities and sentiments, and information-seeking behaviours that are relevant for understanding demographic behaviours and outcomes. In cases where there may be social desirability biases, for example in the case of abortion (Reis and Brownstein Citation2010; Leone et al. Citation2021) or sex-selective abortion (Kashyap et al. Citation2017), data from aggregated Google searches have been shown to capture information-seeking behaviours that might not be readily measured in surveys. Analyses of sentiment around parenthood, as expressed by what people tweet on social media, might shed light on normative responses but also on concerns and contexts surrounding parenthood (Mencarini et al. Citation2019). Photos have provided a novel opportunity to analyse interracial friendships (Berry Citation2006) or, in other cases, an opportunity to improve age measurement in contexts where age reporting is missing or faulty (Helleringer et al. Citation2019). Images of cars in a neighbourhood captured by Google Street View have also been used to infer the socio-economic characteristics of US cities (Gebru et al. Citation2017).

While the examples so far refer to cases where digital traces already exist and require repurposing, a hybrid approach, where digital traces are actively collected by researchers within survey instruments, provides a promising strategy for linkage of different types of data (Stier et al. Citation2020) that could also be fruitfully applied for demographic research. Palmer et al. (Citation2013) provided an example of how mobile phones can be used to collect survey data with location-based information, thereby examining individuals in their activity spaces in a more dynamic way that is more aligned with how they actually experience space, rather than in discrete census blocks. Other examples that could be applied to demographic research include the use of sensors to capture social interactions and networks by measuring spatial proximity (Cattuto et al. Citation2010; Kiti et al. Citation2016) and the use of metadata from call records (Kreuter et al. Citation2020). In addition to overcoming measurement challenges linked to recall bias in self-reports, these passive data collection approaches could also help ameliorate respondent burden—a growing issue with increasingly lengthy survey instruments—while providing avenues to obtain informed consent.

The second strand of work using digital trace data has sought to apply demographic approaches to study who is online or to understand demographic behaviours (e.g. dating and mate search) in digital spaces. This work is motivated by the fact that the digital revolution has itself created new online populations. Understanding the demographic composition and biases of digitally connected populations globally is likely to become increasingly important, given continued interest in using internet and mobile technologies for either passive observation or active recruitment (e.g. for survey data collection). While the use of online panels and social media recruitment for surveys in high-income countries has increased significantly due to their cost efficiency (Groves Citation2011; Schneider and Harknett Citation2019; Kalimeri et al. Citation2020), there are also emerging examples of data collection via mobile phones and social media in LMICs (e.g. Tamgno et al. Citation2013; Diamond-Smith et al. Citation2020; Coffey et al. Citation2021). This trend towards leveraging mobile, internet, and social media technologies for data collection significantly accelerated during the Covid-19 pandemic, when rapid data collection was clearly needed but traditional face-to-face approaches for collection infeasible (e.g. Adjiwanou et al. Citation2020; Grow et al. Citation2020; Battiston et al. Citation2021; Feehan and Mahmud Citation2021), stimulating wider discussions among population scientists about such approaches (e.g. IUSSP Citation2021b). Population-generalizable uses of these approaches require a deeper understanding of who is using these technologies or specific platforms. For example, significant gender inequalities in internet and mobile phone access exist in South Asia and sub-Saharan Africa, but many social surveys provide limited data on information technology use by sex and other characteristics (Fatehkia et al. Citation2018). 
Understanding demographic differentials in the use of specific social media platforms may require us to turn to digital traces (Gil-Clavel and Zagheni Citation2019) or collect data from specific platforms (e.g. Facebook or Google) and then generalize to a wider population of internet users, using models (Fatehkia et al. Citation2018; Feehan and Cobb Citation2019; Kashyap et al. Citation2020). For example, Feehan and Cobb highlighted how network sampling approaches—similar to indirect mortality estimation approaches that involve asking others, such as with sibling histories—can also be used to deduce internet adoption from a limited online sample.

The increasing digitalization of different life domains implies that digital inequalities also have implications for demographic processes and social inequalities. Moreover, studies of online life can illuminate whether inequalities from the offline world are reinforced or transformed online. They may also provide a lens to study social interactions and behaviours that were previously difficult to study. An example of this is provided by the literature on internet dating. Although demographers have often written about the marriage market, they have rarely observed mate search dynamics. Preferences are often inferred from outcomes (e.g. marriage rates) but never directly observed. Studies of online dating provide a unique opportunity to understand these behaviours (e.g. Potârcă and Mills Citation2015; Bruch and Newman Citation2018).

Demography and the data revolution: Emerging opportunities and challenges

To conclude, I return to the question posed at the beginning: has demography witnessed a data revolution? Yes. Over the past 25 years, the data ecosystem for demographic research has been significantly enriched, as demonstrated by the augmented granularity and new features available for ‘old’ big population data sources (such as censuses, administrative data, and surveys) and the growing use of ‘new’ big data sources. This is not to say that progress has been even, and a vision of the data revolution that strives to leave no one ‘invisible’ (IAEG Citation2014) must confront the striking inequalities that persist between the data-rich Global North and data-poor Global South. Bridging these gaps requires financial, political, and intellectual investments towards developing robust data infrastructures for different types of data (including censuses, administrative data, vital statistics, and new forms of data), as well as sustained efforts to strengthen scientific training and capacity in LMICs to sustain the development of long-term studies. While methodological ingenuity through the use of statistical, including Bayesian, models has helped to extract value from the data that are available, these approaches have starkly illuminated the high levels of uncertainty that affect population estimation and vital rates based on deficient data.

Although new data opportunities have helped to address some lingering concerns, they have also raised new ones. First, the expansion in volume, variety, and opportunity for linkage across multiple different types of data has occurred against a backdrop of growing concern about non-response, greater threat of respondent reidentification, and need for privacy protection, because with more detailed data also comes the potential for misuse. Moving forward, maintaining sensitivity and foresight towards the potential for data misuse, while mitigating the scientific loss incurred by missed use, will become an increasingly necessary balancing act, where population scientists will need to make an active contribution to public debates to articulate the trade-offs. These tensions have already come to the forefront in the charged debates surrounding the implementation of differential privacy (DP) as a method of disclosure control in the 2020 US Census (Ruggles et al. Citation2019). The introduction of statistical noise entailed by DP to prevent respondent reidentification threatens to have an adverse effect on block-level estimates and population denominators for the calculation of disaggregated rates derived from census microdata (Santos-Lozada et al. Citation2020; Hauer and Santos-Lozada Citation2021) and could significantly compromise the scientific and policy impacts that more granular data have achieved.
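Why added noise weighs more heavily on small-area counts than on large ones can be seen in a generic sketch of the Laplace mechanism, a textbook form of differential privacy. This is not the Census Bureau’s actual disclosure avoidance system; the privacy-loss budget, the population sizes, and the function names are all assumptions made for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable


def laplace_noise(scale):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)


def noisy_count(true_count, epsilon, sensitivity=1.0):
    """Generic Laplace mechanism for a counting query (illustration only)."""
    return true_count + laplace_noise(sensitivity / epsilon)


def mean_relative_error(true_count, epsilon, draws=10_000):
    """Average absolute noise as a share of the true count."""
    errors = (
        abs(noisy_count(true_count, epsilon) - true_count) / true_count
        for _ in range(draws)
    )
    return sum(errors) / draws


epsilon = 0.5  # hypothetical privacy-loss budget for this single count
block_error = mean_relative_error(true_count=20, epsilon=epsilon)
county_error = mean_relative_error(true_count=200_000, epsilon=epsilon)

print(f"block of 20 people:       {block_error:.2%}")
print(f"county of 200,000 people: {county_error:.4%}")
```

Because the noise has the same absolute scale whatever the size of the count, its relative impact is far larger for a small block than for a large county, which is why small denominators for disaggregated rates are the main concern.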

Second, the different kinds of data sources enabled by the data revolution have provided momentum for both the macro-level discovery stage of demographic research that aims to describe empirical regularities in populations and the explanation stage that seeks to understand why and how they occur (Billari Citation2015), but more work is needed to integrate the two approaches. The availability of integrated cross-national databases for demographic research has sustained important work that focuses on good population-level description. There may indeed be even more opportunity to revitalize the discovery of empirical regularities in population-level dynamics by drawing on machine learning techniques applied to census microdata and large-scale surveys (e.g. De Maria et al. Citation2019; Hauer and Bohon Citation2020; Salganik et al. Citation2020) or to discover new regularities in demographic processes when observed at finer temporal or spatial resolutions, as provided through digital traces (e.g. Fiorio et al. Citation2021). This macro-level orientation has coexisted with research focused on explaining individual-level variations and understanding pathways, in which a greater emphasis and understanding of issues of selection, heterogeneity, and causal interpretation is also now visible. There is clearly a role for thinking more expansively about causality in demography and for considering methods of demographic accounting, including techniques such as standardization and decomposition focused on the macro level as types of causal analysis, as argued by Bhrolcháin and Dyson (Citation2007). These approaches may yield new insights when applied to different online populations in the context of digital demography, for example to understand differences between platforms that are attributable to behaviours vs demographic compositions (e.g. Cesare et al. Citation2018). 
The deeper combination of different types of data including both the new and old, as well as the development of approaches such as empirically informed agent-based and microsimulation models that conceptualize populations as systems, offer promising opportunities for the integration of micro and macro levels of analysis.

Third, and linked to the second issue, we need to rethink which methods are used with the data opportunities we now have. With the increasing use of individual-level survey data over the past five decades, demographers have come to rely extensively on the tools of inferential statistics, which were developed with small, not large, data sets in mind. Bohon has described this in terms of a culture in which demographers have been collecting and analysing ‘big data in a small way’ (Bohon Citation2018, p. 323). Discussions on de-emphasizing p-values in scientific analyses within the broader research community are highly relevant for demographers, as our data sets are already large and have become larger. A recognition of these issues within the demographic community has emerged clearly, as shown, for example, by the stance adopted by the editorial board of Demographic Research on p-values (Bijak Citation2019). We need to move further towards emphasizing different aspects of our models, for example the size of effects, and towards a deeper assessment of the magnitude or meaning of these effects. It will also be increasingly important to recall that bigger data with richer features do not necessarily equate to better or unbiased data. While these issues have been discussed in relation to ongoing ‘new’ big data sources, for which demographic techniques for assessing data quality and examining biases have the potential to be reconfigured, these lessons should not be forgotten with more traditional data sources either. In any case, demographers are uniquely positioned to make a vital contribution to how ‘new’ big data sources can be used for careful population-generalizable measurement and to play a pivotal role in the broader development of the field of computational social science. 
Demographic insights linked to understanding and validating population representativeness are also salient for the careful integration of biological measures within surveys, for example polygenic scores drawing on genome-wide association studies (Mills and Tropf Citation2020). The data revolution, with its incumbent data opportunities, has forged the pathway for interdisciplinary research at the interface of demography with other disciplines, whether biology, economics, behavioural science, or computer/information science, but there is still much more room to grow.
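The argument about p-values and large data sets can be made concrete with a short worked example. The sketch assumes a two-sample z-test with equal group sizes and a known standard deviation, and the effect size is hypothetical; the point is only that a substantively negligible difference becomes ‘significant’ once the sample is large enough.

```python
import math


def two_sample_z_test(diff, sd, n_per_group):
    """Two-sided p-value for a difference in means, equal known sd and group size."""
    standard_error = sd * math.sqrt(2 / n_per_group)
    z = diff / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical effect: 0.01 standard deviations, i.e. substantively negligible.
diff, sd = 0.01, 1.0

for n in (1_000, 100_000, 1_000_000):
    z, p = two_sample_z_test(diff, sd, n)
    print(f"n per group = {n:>9,}  effect = {diff / sd:.2f} sd  z = {z:5.2f}  p = {p:.2g}")
```

The effect size is identical in all three rows; only the sample size changes. This is why reporting and interpreting the magnitude of effects matters more than the p-value when data sets are large.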

Fourth, a key advantage of demographic research is that it often draws on data that are available in the public domain and accessible through web-based repositories. This is important from the perspectives of researcher access and reproducibility, both areas of increasing scientific discussion. This mode of access means that some of the concerns about reproducibility of findings (e.g. the ‘replication crisis’ in psychology), whether due to data that are not shared or in the public domain or, as with some emerging forms of big data, arising from restricted or protected data from companies, are less immediately pertinent within the field. However, as demographers come to rely increasingly on more detailed or restricted-access data or to draw on data from commercial providers, we will need to generate solutions for maintaining open, transparent models of research, while aligning with regulations on confidentiality and licensing. The use of pre-analysis or preregistration plans, which have come to be used in experimental research in fields including psychology and political science, could offer a model for greater transparency, although how this model would extend to the analysis of secondary observational data sets is unclear. Standards for research reproducibility are still evolving, however, and although positive signs of change are visible (e.g. inclusion of code with publications), open questions remain about how best to develop these approaches, given the changing data ecosystem and also in light of the non-trivial computational power (and technical skill) that replicating analyses increasingly demands.

Capitalizing on the opportunities presented by the data revolution requires that we retain some of the old, while adapting flexibly to incorporate the new. Some of the new will require (re)training in methods of data collection, management, and analyses, but also, importantly, data ethics. It will also require us to draw on theories and ideas from other disciplines. While the data revolution in demography has clearly started, it has only just begun.

Notes

1 Ridhi Kashyap is based in the Department of Sociology, Nuffield College, and the Leverhulme Centre for Demographic Science, all at the University of Oxford.

2 Please direct all correspondence to Ridhi Kashyap, Nuffield College, University of Oxford, New Road, Oxford OX1 1NF.

3 Funding: Leverhulme Trust, Leverhulme Centre for Demographic Science.

4 I am grateful to John Casterline, Hannaliis Jaadla, Rebecca Sear, Wendy Sigle, and two anonymous reviewers for their helpful comments and feedback on the paper. Liliana Andriano, Jennifer Beam Dowd, Julia Behrman, and Albert Esteve generously shared their expertise and pointed me to useful references. Hampton Gaddy provided excellent research assistance for the section on surveys and the DHS.

References