Research Articles

Using born-digital archives for business history: EMCODIST and the case of E-mail

David A. Kirsch, Stephanie Decker, Adam Nix, Shubhangkar Girish Jain & Santhilata Kuppili Venkata
Pages 16-23 | Received 22 Dec 2022, Accepted 17 Feb 2023, Published online: 26 Feb 2023

ABSTRACT

Historians of business and management increasingly conduct research in digital archives. This article reviews some of the challenges and opportunities associated with the use of born-digital archives. As an example, we focus on scholarly use of large-scale organizational e-mail collections. In addition to allowing researchers to answer traditional questions about innovation, strategy and organizational development, e-mail also permits more granular investigation of new questions, such as those relating to the timing and flow of information inside organizational networks. Knowledge discovery in e-mail requires new search tools. We describe EMCODIST, a prototype tool that we have developed to support search and discovery in e-mail. Scholars interested in learning more are directed to a grant-funded website where versions of the EMCODIST tool support different types of searches.

This article is part of the following collections:
Methods and Madness in Management and Organizational History

Introduction

Historians of business and management increasingly conduct research in digital archives. Prior research has led to several taxonomies of digital archives (Nix and Decker Citation2021; Nix et al. Citation2023). Regardless of whether the archival content at issue is born-digital or has been digitized by an established archive, access to and use of these source materials challenge traditional historical research practices and require new approaches. Historians have been talking about the uncertain future of digital archives for many years, but in many settings and for many questions, that future has now arrived (Harvey and Jones Citation1990; Rosenzweig Citation2003).

Building upon work that the authors have conducted over several years, this brief article introduces business historians to the subfield of e-mail archives, a type of digital artifact that holds particular promise for business historians and other organization-oriented researchers (Decker et al. Citation2022; Nix et al. Citation2023). E-mail archives are born digital in the sense that they preserve artifacts that were created digitally. In the language of Nix and Decker (Citation2021), e-mail archives are ‘intrinsically digital’; it is possible to imagine a paper archive of one or ten million printed e-mail messages, but in that format, such a collection would have limited value.

We also include an overview of EMCODIST, an experimental, context-sensitive search tool that the authors have developed to support content discovery in complex organizational e-mail corpora (Kuppili Venkata et al. Citation2021).

Digital archives

Almost all institutional archives now offer some form of remote, digital access, and many traditional social science data repositories can be productively accessed by historians too. Although there will always be a subset of sources that cannot be accessed remotely (e.g. ancient manuscripts and other fragile or proprietary materials, or informal sources that do not meet a threshold for digitization), the typical business history study can now proceed from the assumption that at least some relevant digital records will be available.

Since at least the turn of the 21st century, business activities have generated digital traces. However, the digital enterprise does not necessarily generate digital archives. Far from it. Scholars should not assume that every phone transcript, Slack channel, gChat, e-mail, or other digital trace has been preserved. As has been the case throughout history, most such traces have not been and will not be preserved, and there is some reason to believe that recent trends have led to weaker persistence of the ‘record of business’ (Kirsch Citation2009). Even well-known digital archives like the Internet Archive’s ‘Wayback Machine’ only capture a subset of publicly available digital materials produced by actors who choose to let their website be sampled by the Internet Archive’s crawlers. Short-lived, private or non-crawlable content is not generally preserved and will only become available for scholars if organizations ‘opt in’ to an active preservation regime. Therefore, as a level-setting exercise, the prospective business historian should not expect an archive to accession the full range of an organization’s history-relevant digital records. Rather, as with any pre-digital historical project, new questions will continue to drive efforts to acquire and interpret new sources, and the prospective researcher should expect that considerable time and energy will be required to identify and access desired and relevant digital archival materials.

To orient scholars to the range of issues facing a business historian seeking to conduct primary research in digital archives, we first review some of the general issues associated with born-digital archives. However, the bulk of the article focuses upon scholarly use of large-scale e-mail corpora, as these reflect some of the most cutting-edge opportunities and challenges for the field. Readers interested in learning more about other forms of digital archives are encouraged to consult Nix et al. (Citation2023) which presents a broader taxonomy of digital repositories.

Opportunities and challenges associated with “born-digital” archival sources

There are several features of born digital archives that offer enhanced value to the scholarly user. First, born digital collections can be more inclusive, more representative and more exhaustive than a collection assembled after the fact from non-digital materials. Sampling biases are reduced in a born digital collection. In a large-scale e-mail corpus, for instance, messages from senior leadership and staff are equally likely to be retained, and their respective records can be preserved and studied in parallel in the same collection. In addition, such corpora usually have not been pre-selected either to include particularly positive materials or to exclude materials that might present the organization in a negative light. Researchers can see, for example, that the Enron e-mail corpus – though incomplete and small relative to the total number of e-mails generated inside such a large, diversified company – contains a wide range of different types of materials, including casual e-mail threads, unfiltered newsletters and reports, as well as evidence of consequential management decisions (Benke Citation2018).

Second, born digital materials often retain metadata – data about the provenance and subsequent use of a digital artifact – that a scholar can use to learn more about individuals and the organizations in which they were situated. E-mails, for example, have timestamps, subject lines, and ‘cc’ and ‘bcc’ fields from which a scholar can construct network maps to interpret patterns of interaction. Multiple versions of important documents may be compared to each other to look for how a contract or a marketing plan was updated or shared over time. Finally, born-digital archival materials can usually be searched with text (and increasingly image) searching tools that are much more precise than an archival index, a finding aid or even a concordance. Taken together, when born-digital materials are preserved and available, they can allow historians to ask and answer new questions. In this digital historical future, we can, for example, ask more microhistorical questions, as e-mails capture the mundane alongside the meaningful. We can also address issues of exact timing and sequence through accurate time stamps. Research questions regarding who read a piece of information, as well as how that information subsequently traveled through an organization, can also be investigated at a more granular level than was possible in a pre-digital archive.
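
As a minimal sketch of how header metadata can feed a network map and a timeline, the following Python example parses two made-up messages with the standard library. The addresses, dates and function names are hypothetical illustrations, not drawn from any real corpus or from the EMCODIST code.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Two made-up messages; addresses, dates and subjects are purely illustrative.
RAW_MESSAGES = [
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Cc: carol@example.com\n"
    "Date: Mon, 05 Mar 2001 09:15:00 -0600\n"
    "Subject: Draft contract\n"
    "\n"
    "Please review the attached draft.\n",
    "From: bob@example.com\n"
    "To: alice@example.com\n"
    "Date: Mon, 05 Mar 2001 11:40:00 -0600\n"
    "Subject: Re: Draft contract\n"
    "\n"
    "Looks fine to me.\n",
]


def sender_recipient_edges(raw_messages):
    """Count directed sender -> recipient pairs across raw e-mail texts."""
    edges = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        senders = [addr for _, addr in getaddresses(msg.get_all("From", []))]
        recipients = []
        for field in ("To", "Cc", "Bcc"):
            recipients += [addr for _, addr in getaddresses(msg.get_all(field, []))]
        for sender in senders:
            for recipient in recipients:
                edges[(sender.lower(), recipient.lower())] += 1
    return edges


def message_timeline(raw_messages):
    """Return (sent time, subject) pairs in chronological order."""
    timeline = []
    for raw in raw_messages:
        msg = message_from_string(raw)
        timeline.append((parsedate_to_datetime(msg["Date"]), msg["Subject"]))
    return sorted(timeline)


edges = sender_recipient_edges(RAW_MESSAGES)
timeline = message_timeline(RAW_MESSAGES)
for (sender, recipient), count in sorted(edges.items()):
    print(sender, "->", recipient, count)
```

The weighted edge list could then be loaded into any network analysis package, while the sorted timeline supports questions about timing and sequence.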

Not surprisingly, born digital archives also pose challenges for researchers. As noted above, even when born digital materials have been preserved, access often requires special permissions and tools. The scale, for instance, of even a modest-sized, organizational e-mail collection (e.g. 1 million messages) far exceeds a single scholar’s ability to read the entire collection. Moreover, even if there existed a super-researcher who committed to read an entire e-mail collection, the average e-mail message is probably not useful. In practice, discovery – the process by which a scholar learns from an archival collection or object – is mediated by some of the same digital attributes that characterize a given born digital collection. Good metadata allow interesting visualizations, but may also obscure the ability of the scholar to interpret text. As we describe in more detail below, the tools by which the historian interacts with a digital archive inevitably shape both the research experience and the resulting interpretations.

Research process differs between traditional and digital archives

Under the traditional archival research process, the historian relies upon professional archive staff to identify, accession and process important collections and then make them available for scholarly access. A processed collection is organized based upon the context knowledge and normative goals of the archivist(s) who deemed it worthwhile to acquire the collection. Usually, this process also includes the production of a ‘finding aid’ or other guide to the collection. The finding aid orients the researcher to the collection, identifies key individuals and other named entities, describes the provenance of the materials (including cross-references to other, relevant collections), and provides a ‘folder-level’ description of the contents of the archive that allows a researcher to select the subset of the archive that is relevant to the researcher’s interests. For instance, a visitor to the Hagley Library may review multiple finding aids to determine which collections are even worth examining.Footnote1 Once immersed in a subset of a given collection, the researcher will often read in an iterative process, close reading certain folders and materials and coming up with new questions that point to additional folders. This iterative process can take months, occurs across multiple in-person visits, and constitutes a core component of historical interpretation.

By contrast, a researcher accessing a digital archive faces a very different search and discovery challenge. Even assuming that the collection has been accessioned into a formal archive, the digital or electronic version of the traditional finding aid does not yet exist as a stable genre. Where the source materials themselves exist in digital form – either because they were digitized or were born-digital – direct, remote access to archival content may be possible. However, because digital collections differ from paper-based ones (for reasons and in ways set forth above), no standardized discovery protocol has yet been established. Sometimes, a researcher will be able to gather contextual knowledge about a given setting from third-parties such as news articles and other informal sources; but even having some contextual knowledge may prove insufficient to guide the researcher to relevant and responsive source materials.

Scholars working in the digital-only context face an added challenge that we call ‘the blank search window’ problem. A scholar navigating to the local search window of a digital archive is accustomed to searching in a general search window like Google. Search behavior on the open web is known to be subject to particular biases, in part due to deference to the search engine (Lorigo et al. Citation2008). As scholars, we bring these biases to the problem of local discovery in digital archives, compounding problems inherent to the research and knowledge discovery process. Not only do we not know what we are looking for; due to the constraints of individual digital archives, we do not know if the information-we-cannot-name is present in the collection in which we are searching. Many repositories try to mitigate this problem by providing access to pre-selected sets of archival materials. For instance, the Edison papers website has links to ‘item sets’ that reflect the archivists’ efforts to categorize the underlying digital materials.Footnote2 However, by definition, new historical questions come from researchers and cannot be anticipated by even the most skilled archivist.

E-mail as an emergent research context

As suggested above, e-mail is a unique type of born digital archival resource that typifies both the opportunities and challenges associated with doing research in born digital settings. E-mail as a form of communication is both highly personal – even within organizational contexts – and surprisingly messy (Prom et al. Citation2018). Several attributes of e-mail stand out. First, in prior work, we have observed that e-mail is a hybrid artifact; on the one hand it describes a single e-mail message between correspondents, but it also refers to a collection of e-mail messages exchanged among a group of correspondents that might include, for instance, senior managers, coworkers, customers, investors, and even personal friends and family who might randomly appear in the e-mail mailbox of a given e-mail user. In this way, we like to say that e-mail ‘is’ and e-mail ‘are’ (Decker et al. Citation2022). In addition, an e-mail corpus is both content and context. Individual messages are exchanged between people, but an organizational e-mail corpus also often includes accounts for functional departments and information about other business-level activities that can help the scholar learn about the setting in which communication was occurring. This hybrid nature of e-mail underscores the challenge of using it as a historical resource.

Second, as with other types of sources, scholars who study e-mail usually engage with it to answer a particular research question. To date, management researchers have used e-mail collections to study topics as varied as informal social networks (Aven Citation2015; Jacobs and Watts Citation2021), organizational timing norms (Byun and Kirsch Citation2021), and intraorganizational niches (Liu et al. Citation2016). Meanwhile, business historians like Nix et al. (Citation2021) and Benke (Citation2018) have recently used similar collections to develop historical contributions. Specifically, both studies use the well-known and publicly available Enron e-mail corpus to provide different historical accounts of the company. Using born-digital sources was necessary here as the latter years of the company’s existence were characterized by normalization of e-mail as the default mode of written organizational communication. In this sense, though this period is only just coming into the arc of historical interest and many e-mail archives remain closed to researchers, e-mail represents an example of the type of data available for research into the post-analogue era (Jaillant Citation2019; Nix and Decker Citation2021).

To date, research use of large-scale e-mail corpora remains limited. Few collections have been opened to the public, and those that have been accessed exist as archival sources for idiosyncratic reasons. In this sense, access to e-mail is likely to remain – for the moment at least – the exception rather than the rule, resulting from events like massive failure in the case of Enron, venture workout in the case of AuroraTec, or personal permissions acquired by researchers on a case-by-case basis.Footnote3 Nonetheless, there is a significant role for researchers in shaping the landscape of access to e-mail archives, and the process will likely demand that researchers play a more active role in the archival process. Indeed, Lise and Aaron (Citation2022) argue that for digital archives, trust and collaboration among researchers, archivists and other stakeholders are vital for meaningful and sustainable access.

EMCODIST (E-Mail COntext DIScovery Tool)Footnote4

Given the challenges associated with using born digital archives – and to help historians and other scholars interested in knowledge discovery in e-mail – we have developed EMCODIST (E-Mail COntext DIScovery Tool), a novel, context-sensitive discovery tool. EMCODIST aims to assist scholars asking both traditional and novel research questions: how did individuals within the organization deal with internal conflict, strategic change and the rapid advance of technology? At a granular level, who was communicating with whom in which time frames? Can we identify specific decisions and other turning points using a combination of technical and interpretive approaches? Notwithstanding the risk of user deference to the search engine, EMCODIST utilizes an empty search window where a user can submit their own search terms to retrieve responsive e-mail content.

EMCODIST has two distinct AI-based models: EMCODIST Basic and EMCODIST Plus. EMCODIST Basic is a simple, phrase-based search model that matches the phrases in the query to the content of the e-mails (Kuppili Venkata et al. Citation2021). The keywords and phrases from the query are identified with the help of NLP, and the model returns all e-mails and threads that contain the phrase as a single unit. This model is an improvement over basic keyword search, but requires users to know what they are looking for. For instance, if a researcher knows that certain individuals were involved in a given decision or that a transaction of interest occurred on a specific date, EMCODIST Basic can find all e-mails sent or received by an individual or sent within a fixed time window.
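
To illustrate the kind of query described above, the following is a minimal Python sketch of phrase matching combined with person and date filters. It is not the EMCODIST Basic implementation: a plain regular expression stands in for the NLP-based phrase identification, and the `Email` record, the `phrase_search` function and the sample messages are all hypothetical.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class Email:
    # Hypothetical, simplified record; real corpora carry many more fields.
    sender: str
    recipients: tuple
    sent: date
    body: str


EMAILS = [
    Email("alice@example.com", ("bob@example.com",), date(2001, 3, 5),
          "The price cap proposal is attached."),
    Email("bob@example.com", ("carol@example.com",), date(2001, 3, 6),
          "We should cap the price of the new service."),
    Email("carol@example.com", ("alice@example.com",), date(2001, 4, 2),
          "Any news on the Price  Cap vote?"),
]


def phrase_search(emails, phrase=None, person=None, start=None, end=None):
    """Return e-mails matching every filter that was supplied.

    phrase: words that must appear together, in order, as a single unit
    person: address that must be the sender or one of the recipients
    start, end: inclusive bounds on the sent date
    """
    pattern = None
    if phrase:
        words = [re.escape(word) for word in phrase.split()]
        # Allow any whitespace between words; ignore case.
        pattern = re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)
    hits = []
    for message in emails:
        if pattern and not pattern.search(message.body):
            continue
        if person and person != message.sender and person not in message.recipients:
            continue
        if start and message.sent < start:
            continue
        if end and message.sent > end:
            continue
        hits.append(message)
    return hits


phrase_hits = phrase_search(EMAILS, phrase="price cap")
march_hits = phrase_search(EMAILS, phrase="price cap", end=date(2001, 3, 31))
bob_hits = phrase_search(EMAILS, person="bob@example.com")
```

Note that the second sample message contains both words but not the phrase as a unit, so a phrase search skips it, whereas a simple keyword search would return it.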

EMCODIST Plus makes use of BERT embeddings to enable concept-based search.Footnote5 EMCODIST Plus finds similarities between the topics discussed in the corpus and user queries and ranks e-mails according to their relevance to the concepts in the query. The Plus model is technically more complex than EMCODIST Basic; using BERT embeddings allows the tool to better understand the meaning of words in context of the neighboring words in any given e-mail or thread. In this way, EMCODIST Plus may prove more useful to scholars who are not sure exactly what they are looking for, either because they are new to the collection or are uncertain if the collection holds materials that are responsive to their interests. If a researcher suspects that an e-mail collection might hold responsive materials, but knows little about the specific individuals and entities or when the events in question took place, EMCODIST Plus may be a better starting place than EMCODIST Basic because it will return e-mails that are conceptually related to the searched concepts, even if the exact keywords and phrases do not appear in the message.
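
The ranking step of a concept-based search can be sketched in a few lines of Python. The example below is only an illustration under strong assumptions: a tiny hand-written lexicon (`TOY_VECTORS`) stands in for a learned embedding model such as BERT, and the three-dimensional "concept space", the sample messages and the function names are all hypothetical. It shows how cosine similarity between query and message vectors can surface a message that shares no keyword with the query.

```python
import math
import re

# Hypothetical lexicon standing in for a learned embedding model.
# Dimensions of the toy concept space: (finance, legal, scheduling).
TOY_VECTORS = {
    "quarterly": (1.0, 0.0, 0.0),
    "losses": (1.0, 0.0, 0.0),
    "forecast": (1.0, 0.0, 0.0),
    "debt": (1.0, 0.0, 0.0),
    "revenue": (1.0, 0.0, 0.0),
    "lawyers": (0.0, 1.0, 0.0),
    "contract": (0.3, 0.9, 0.0),
    "meeting": (0.0, 0.0, 1.0),
    "friday": (0.0, 0.0, 1.0),
}

EMAILS = [
    "Quarterly losses were larger than the forecast.",
    "The lawyers want to review the contract.",
    "Can we move the meeting to Friday?",
]


def embed(text):
    """Average the toy vectors of all known words; zero vector if none."""
    tokens = re.findall(r"[a-z]+", text.lower())
    vectors = [TOY_VECTORS[token] for token in tokens if token in TOY_VECTORS]
    if not vectors:
        return (0.0, 0.0, 0.0)
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))


def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rank_by_concept(query, emails):
    """Return (score, email) pairs, most similar to the query first."""
    query_vector = embed(query)
    scored = [(cosine(query_vector, embed(text)), text) for text in emails]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


ranking = rank_by_concept("debt and revenue", EMAILS)
for score, text in ranking:
    print(round(score, 2), text)
```

Here the top-ranked message mentions neither "debt" nor "revenue"; it ranks first only because its words sit near the query in the toy concept space. A contextual model differs in that the vector for a word depends on its neighboring words, which this fixed lexicon does not capture.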

In practice, we anticipate that researchers will probably need to search iteratively, switching back and forth between the two models. Each offers different features which, when combined, will allow the researcher to achieve a desired balance of precision and recall. Depending upon the specific research question, a combination of these two search modalities can not only help scholars adapt their practice to accommodate born-digital sources, but also provide novel historical insights into organizations that only e-mail can deliver.
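
For readers less familiar with the terms, precision is the share of retrieved messages that are relevant, and recall is the share of relevant messages that were retrieved. The short Python sketch below computes both for two hypothetical result sets; the message IDs and the set of messages judged relevant are invented for illustration and do not come from any evaluation of EMCODIST.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved message IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical message IDs that a researcher has judged relevant.
RELEVANT = {1, 2, 3, 4}

# A narrow phrase search: few results, all of them relevant.
narrow = precision_recall({1, 2}, RELEVANT)

# A broad concept search: more relevant messages found, but more noise.
broad = precision_recall({1, 2, 3, 7, 8, 9}, RELEVANT)

print("narrow search:", narrow)
print("broad search:", broad)
```

In this invented case the narrow search is fully precise but finds only half of the relevant messages, while the broad search finds more of them at the cost of reading through irrelevant results, which is the trade-off that iterative switching between search modes is meant to manage.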

Conclusion

This brief article has reviewed the challenges and opportunities associated with conducting historical research in born-digital archives. Given our own recent interests, we have focused on the development of e-mail archives, with further emphasis upon EMCODIST, a foundation-supported prototype tool that helps researchers discover new information when searching large, organizational e-mail corpora.

Assuming that more collections are built from born-digital content like large-scale e-mail corpora, it is reasonable to conclude that future scholars will have access to an expanding set of digital resources within which they will be able to conduct business history research. We have laid out some of the types of questions that such materials may be able to help historians answer; however, we see no easy path to resolve the paradox of the blank search window. ‘Item sets’ and other pre-curated subsets of sources will not alter the nature of the scholarly enterprise which pushes us to ask new questions and explore new contexts. In such settings, by definition, context knowledge is lacking. Because discovery will always initially take place in the absence of deep contextual knowledge, scholars seeking to interpret born-digital collections will need to use discovery tools to establish beachheads of knowledge, and it would be unwise to expect a single tool or technical package to provide a one-size-fits-all solution. Our own efforts have highlighted the challenges arising from dealing with e-mail, a particularly ubiquitous and idiosyncratic type of complex digital artifact.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Notes on contributors

David A. Kirsch

David A. Kirsch is Associate Professor at the Robert H. Smith School of Business and the College of Information Studies (i-School) at the University of Maryland, College Park. His research focuses on the intersection of problems of innovation and entrepreneurship, technological and business failure, and industry emergence and evolution.

Stephanie Decker

Stephanie Decker is Professor of Strategy at the University of Birmingham Business School, UK, and Visiting Professor in African Business History at the University of Gothenburg, Sweden. She is joint editor-in-chief of Business History and co-Vice-Chair Research & Publications at the British Academy of Management. Her research focuses on historical approaches in management research.

Adam Nix

Adam Nix is a Lecturer in Responsible Business at the University of Birmingham and an Associate of the Lloyd’s Centre for Responsible Business. His research interests include historical research into digital-era business and understanding wrongdoing and irresponsibility as organizational phenomena.

Shubhangkar Girish Jain

Shubhangkar Girish Jain graduated with a Master’s degree in Information Systems from the University of Maryland, College Park in 2022.

Santhilata Kuppili Venkata

Santhilata Kuppili Venkata is an academic researcher and Lead Data Scientist at Animal Friends insurance company, UK. She researches digital archives’ preservation and access methods and wrote the EMCODIST tool.

Notes

1. Many finding aids are searchable online; see https://www.hagley.org/research/search-hagley-collections.

3. Illustrating the challenge of scholarly access to e-mail, AuroraTec is a pseudonym; see https://dotcomarchive.bristol.ac.uk.

4. Thanks to the support of the EA:BCC grant program, as of Fall 2022, the EMCODIST prototype is available here: https://emcodist.com. The source code (Jupyter notebooks) has been made available on GitHub and can be accessed at https://github.com/Contextualising-Email-Archives/discovery-tool.

5. Released by Google AI in 2018, BERT (Bidirectional Encoder Representations from Transformers) is a large-scale, pre-trained language model that is used by developers and information retrieval researchers to build custom language models.

References

  • Aven, B. L. 2015. “The Paradox of Corrupt Networks: An Analysis of Organizational Crime at Enron.” Organization Science 26 (4): 980–996. doi:10.1287/orsc.2015.0983.
  • Benke, G. 2018. Risk and Ruin: Enron and the Culture of American Capitalism. Philadelphia, PA: University of Pennsylvania Press.
  • Byun, H., and D. A. Kirsch. 2021. “The Morning Inbox Problem: Email Reply Priorities and Organizational Timing Norms.” Academy of Management Discoveries 7 (2): 180–202. doi:10.5465/amd.2018.0210.
  • Decker, S., D. A. Kirsch, S. Kuppili Venkata, and A. Nix. 2022. “Finding Light in Dark Archives: Using AI to Connect Context and Content in Email.” AI & SOCIETY 37 (3): 859–872. doi:10.1007/s00146-021-01369-9.
  • Harvey, C., and G. Jones. 1990. “Business History in Britain into the 1990s.” Business History 32 (1): 1–16.
  • Jacobs, A. Z., and D. J. Watts. 2021. “A Large-Scale Comparative Study of Informal Social Networks in Firms.” Management Science 67 (9): 5489–5509. doi:10.1287/mnsc.2021.3997.
  • Jaillant, L. 2019. “After the Digital Revolution: Working with Emails and Born-Digital Records in Literary and Publishers’ Archives.” Archives and Manuscripts 47 (3): 285–304. doi:10.1080/01576895.2019.1640555.
  • Kirsch, D. A. 2009. “The Record of Business and the Future of Business History: Establishing a Public Interest in Private Business Records.” Library Trends 57 (3): 352–370. doi:10.1353/lib.0.0041.
  • Kuppili Venkata, S., S. Decker, D. A. Kirsch, and A. Nix. 2021. “EMCODIST: A Context-Based Search Tool for Email Archives.” IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, 2281–2290. doi:10.1109/BigData52589.2021.9671832.
  • Lise, J., and R. Aaron. 2022. “Applying AI to Digital Archives: Trust, Collaboration and Shared Professional Ethics.” Digital Scholarship in the Humanities fqac073. doi:10.1093/llc/fqac073.
  • Liu, C. C., S. B. Srivastava, and T. E. Stuart. 2016. “An Intraorganizational Ecology of Individual Attainment.” Organization Science 27 (1): 90–105.
  • Lorigo, L., M. Haridasan, H. Brynjarsdóttir, L. Xia, T. Joachims, G. Gay, L. Granka, F. Pellacini, and B. Pan. 2008. “Eye Tracking and Online Search: Lessons Learned and Challenges Ahead.” Journal of the American Society for Information Science and Technology 59 (7): 1041–1052. doi:10.1002/asi.20794.
  • Nix, A., and S. Decker. 2021. “Using Digital Sources: The Future of Business History?” Business History 31 (Mar): 1–24. doi:10.1080/00076791.2021.1909572.
  • Nix, A., S. Decker, D. Kirsch, and S. Kuppili Venkata. 2023. “Archival Research in the Digital Era.” In Handbook of Historical Methods in Management, edited by S. Decker, W. Foster, and E. Giovannoni, UK: Edward Elgar Publishing Ltd.
  • Nix, A., S. Decker, and C. Wolf. 2021. “Enron and the California Energy Crisis: The Role of Networks in Enabling Organizational Corruption.” Business History Review 95 (4): 765–802.
  • Prom, C., K. Murray, F. Baker, M. Connelly, and W. Gogel. 2018. The Future of Email Archives: A Report from the Task Force on Technical Approaches for Email Archives. https://www.clir.org/pubs/reports/pub175/
  • Rosenzweig, R. 2003. “Scarcity or Abundance? Preserving the Past in a Digital Era.” The American Historical Review 108 (3) (June): 735–762. doi:10.1086/529596.
