Original Articles

A third road to the past? Historical scholarship in the age of big data

Pages 227-244 | Published online: 31 Aug 2017
 

ABSTRACT

Is a third passage to the past possible, beyond Elton's and Fogel's two roads of narrative history and scientific/quantitative history? One that would combine narrative history's focus on the event, on individuals and their actions at a particular time and place, with scientific/quantitative history's emphasis on explicit behavioral models based on social-science theories? That is the question this article addresses. It illustrates a computer-assisted methodology for the study of narrative—quantitative narrative analysis (QNA)—that does just that. Based on the 5 Ws + H of narrative—Who, What, When, Where, Why, and How—QNA quantifies events without losing the event itself, without losing people behind numbers or diachronic time behind synchronic statistical coefficients. When used in conjunction with dynamic and interactive data visualization tools (and new natural language processing tools), QNA may provide a third, unforeseen road to the past.

Acknowledgements

Prof. E.M. (Woody) Beck generously provided the initial list of lynching events and newspaper references. I am grateful to Walter Adamson, Gianluca De Fazio, Clive Griffin, Chris Gunn, Alex Hicks, Richard Lachmann, Alberto Purpura, Becky Sherman, and three Historical Methods anonymous reviewers for help with the project and manuscript.

Funding

The Georgia lynchings project was supported by grants from Emory University, University Research Committee (2008) and the Mellon Foundation, Digital Humanities Scholarship (2011).

Notes

1. Charivari (“rough music” in England, scampagnate in Italy) is the term used in France “to denote a rude cacophony, with or without more elaborate ritual, which usually directed mockery or hostility against individuals who offended against certain community norms” (Thompson Citation1991, 467).

2. The Vendée (Tilly 1964) and The Contentious French (Tilly 1986) are less quantitative than Tilly's other work (e.g., Shorter and Tilly 1974; Tilly 1995).

3. On “digital humanities” and “big data” see Burdick et al. (Citation2012), Berry (Citation2011), Liu (Citation2012, Citation2013), Kitchin (Citation2014), and McCarty (Citation2014).

4. On the press as a source of socio-historical data, see Franzosi (Citation1987).

5. The data were generously provided to me by Beck. The complete inventory is now available at http://lynching.csde.washington.edu/#/search.

6. Two types of newspaper databases are available: commercial (for pay) and open access (free). Commercial archives: GenealogyBank.com (owned by Readex) draws its newspaper data from two main Readex sources, Early American Newspapers and American Historical Newspapers; the ProQuest Historical Newspapers database covers early twentieth-century US newspapers including The Atlanta Constitution, Boston Globe, Chicago Tribune, New York Times, New York Tribune, Wall Street Journal, and Washington Post. Open-access archives: the Library of Congress collection of historical newspapers, Chronicling America (http://chroniclingamerica.loc.gov/newspapers/), and the University of Georgia collection of Georgia historical newspapers (http://www.libs.uga.edu/gnp/).

7. Unfortunately, this work had to be done by hand. The column format of the newspaper documents and the poor quality of the available PDF files proved too much of a challenge for even the most sophisticated commercial OCR software available to us.

8. One hundred sixty-seven women were involved as victims of the 442 black men lynched in Georgia between 1875 and 1930. For 77 of these women we have no information: no name, no age. Newspapers do give the names of the other 90 women, and for 20 of them we have relatively rich information (first and last name; age, including such terms as young, little girl, old; race; residence).

9. The need for varied skills (from those of the historian or the literary critic to those of the computer scientist) in the new era of “digital humanities” scholarship is one of the caveats of digital humanities, one that should not be underestimated and that has just as varied implications (on these issues see, among others, Berry Citation2011; Liu Citation2013; Burdick et al. Citation2012). First, the range of required skills points to the need, perhaps, to move away from the humanist's “solo” research approach toward teamwork, on the model of the sciences. Second, while NLP and data visualization offer many free, open-source tools, these tools often have to be patched together, which requires specialized computer programming, and in any case understood.

10. All NLP software works with text-formatted files only.

11. “Distant reading” refers to the automatic computer analysis of a corpus of text data using NLP tools, in contrast to the “close reading” with which humans would approach those same texts (on these concepts see Moretti Citation2005, Citation2013). For all the limitations of “distant reading,” with large corpora, like the millions of documents one increasingly finds on the web, there may be no other option. In any case, NLP tools are becoming ever more accurate in their handling of various aspects of language.

12. The Stanford CoreNLP, written in Java, http://stanfordnlp.github.io/CoreNLP/, provides a set of tools for the automatic analysis of natural language. CoreNLP is not the only NLP software. Apache OpenNLP, also written in Java, https://opennlp.apache.org/, provides a similar set of tools, and so does NLTK (Natural Language Toolkit), http://www.nltk.org/, written in Python. CoreNLP, OpenNLP, and NLTK are free and open source. They all perform the most common NLP tasks, such as sentence segmentation (breaking up a text into its constitutive sentences), tokenization (breaking a sentence into words, symbols, and punctuation), lemmatization (finding a common single word, or lemma, for the inflected forms of a word, such as the singular noun for declined nouns or the infinitive for conjugated verbs), part-of-speech tagging (or POSTAG, identifying the grammatical function of words as nouns, verbs, adjectives, adverbs, …), and dependency parsing (or DEPREL, identifying the syntactic role of each word and the dependencies between words). They also perform more sophisticated tasks, such as named-entity recognition (NER), sentiment analysis, coreference resolution, semantic role labeling (SRL), and more.
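The first two of these pipeline stages can be illustrated with a deliberately naive Python sketch (regular expressions only, no NLP library), if only to show why real toolkits are needed: abbreviations, quotes, and hyphenation quickly defeat such simple rules. The example text is invented.

```python
import re

def segment_sentences(text):
    # Naive segmentation: split after ., !, or ? followed by whitespace.
    # Real toolkits (CoreNLP, NLTK) also handle abbreviations, quotes, etc.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # Naive tokenization: runs of word characters, or single punctuation symbols.
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "A mob seized the prisoner. The sheriff did not resist!"
sentences = segment_sentences(text)
tokens = [tokenize(s) for s in sentences]
```

A sentence such as “Mr. Smith arrived.” would already break the segmenter, which is precisely the kind of case the toolkits' trained models are built to handle.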

13. There are different CoNLL formats since CoNLL has had several updates over the years. In general, in a CoNLL table, each line represents a single word (token) with a series of tab-separated fields (columns). CoNLL-U is becoming the new standard with its 10 fields: ID: Word index, integer starting at 1 for each new sentence; FORM: Word form or punctuation symbol (the very word found in the input document); LEMMA: Lemma or stem of word form; CPOSTAG: Universal part-of-speech tag drawn from CoreNLP revised version of the Google universal POS tags; POSTAG: Language-specific part-of-speech tag; NER: Named-Entity Recognition (a small set of pre-defined categories automatically recognized by CoreNLP, such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.); FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; HEAD: Head of the current token, which is either a value of ID or zero (0); DEPREL: Universal Stanford dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one; DEPS: List of secondary dependencies (head-deprel pairs).
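Reading such a table back into a program is straightforward; a minimal Python sketch follows, with field names taken from the note's own list (which mirrors CoreNLP's CoNLL output rather than the official CoNLL-U inventory). The example line is invented.

```python
# Field names as listed in the note (one token per line, tab-separated).
FIELDS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG", "NER",
          "FEATS", "HEAD", "DEPREL", "DEPS"]

def parse_conll_line(line):
    # Map the tab-separated values onto the field names.
    return dict(zip(FIELDS, line.rstrip("\n").split("\t")))

row = parse_conll_line("2\tsheriff\tsheriff\tNOUN\tNN\tO\t_\t3\tnsubj\t_")
```

Each parsed row then gives direct access to, say, the token's FORM, its DEPREL, and the ID of its HEAD.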

14. The DEPREL field is found in the CoNLL table produced by the Stanford CoreNLP. It provides a representation of all sentence relationships as typed dependency relations.

15. The sentence index refers to the position of the sentence in the story: first sentence, second, third, …

16. On the process of data aggregation, see Franzosi (Citation2004, 98–100, 292–93, 394 endnotes 40–42, 395 endnotes 43–46; 2010, 93, 96, 110).

17. Corrigan, in his argument for a qualitative GIS, highlights the role of visual, “descriptive” representations for analysis, in suggesting new interpretations and new unanticipated relations and “hypotheses” (2010, 85–87).

18. The PC-ACE NGram viewer counts the occurrences of an ordered set of words (an n-gram) by year in a user-selected corpus. Like the Google Ngram Viewer, it produces a line chart that plots the frequency of use of the searched words over time.
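The PC-ACE routine itself is not reproduced in the article; the counting it describes can be sketched as follows, assuming a toy corpus of (year, text) pairs and case-insensitive, whitespace-delimited matching. The corpus is invented.

```python
from collections import Counter

def ngram_counts_by_year(corpus, ngram):
    # corpus: iterable of (year, text) pairs; ngram: tuple of lowercase words.
    # Returns {year: number of occurrences of the n-gram in that year's texts}.
    n = len(ngram)
    counts = Counter()
    for year, text in corpus:
        words = text.lower().split()
        counts[year] += sum(
            1 for i in range(len(words) - n + 1)
            if tuple(words[i:i + n]) == ngram
        )
    return dict(counts)

corpus = [
    (1890, "the mob lynched him and the mob dispersed"),
    (1899, "the sheriff gave up the prisoner"),
]
by_year = ngram_counts_by_year(corpus, ("the", "mob"))
```

Plotting the resulting year-to-count mapping as a line chart gives the kind of frequency curve the viewer displays.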

19. The word “outrage” shows a similar steady downward trend in Google Ngram Viewer.

20. To avoid having to reproduce the same identical map three times (the map as first displayed in Google Earth Pro, the map after the first click on the pin for Monticello, and the map after the second click on Eula Baker's pin), I pasted the three representations together.

21. On the macro/micro link in QNA, see Franzosi (Citation2010b, 132; 2014).

22. No doubt stretching the notion of hypothesis testing from visual correlations … See Corrigan (Citation2010) and Peirce Lewis's impassioned presidential address to the Association of American Geographers (1985).

23. I used the Stanford CoreNLP tool to compute the CoNLL table of the complete narrative of the lynching of Hardy Grady (Columbus Daily Enquirer-Sun 5/16/1884, 1). A PC-ACE routine uses the CoNLL table to compute the sentence length by sentence index data and visualizes the information in an interactive Excel plot.
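The PC-ACE routine is not shown; the computation it performs, a token count per sentence index, can be sketched from a CoNLL-style table in which sentences are separated by blank lines and each token occupies one non-blank line. The table fragment is invented.

```python
def sentence_lengths(conll_text):
    # Token count per sentence index (1-based); in a CoNLL table, sentences
    # are separated by blank lines, one token per non-blank line.
    lengths, current = [], 0
    for line in conll_text.splitlines():
        if line.strip():
            current += 1
        elif current:
            lengths.append(current)
            current = 0
    if current:
        lengths.append(current)
    return {i + 1: n for i, n in enumerate(lengths)}

# Invented two-sentence table fragment (ID and FORM fields only).
table = "1\tA\n2\tmob\n3\tgathered\n\n1\tIt\n2\tdispersed\n"
lengths = sentence_lengths(table)
```

The resulting mapping of sentence index to length is exactly what an interactive plot of sentence length by position in the story requires.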

24. For an outline of the various reasons for charivaris, see Zemon Davis (Citation1971, 45).

25. Dock Posey, a white man, was lynched in Dalton, GA, on July 1, 1907, for raping his nine-year-old stepdaughter.

26. Zemon Davis had used that same question-and-answer approach to scholarship in her previous work on charivaris, making an explicit claim to analysis: “Most books on ‘everyday life’ in the late Middle Ages and early modern period … merely describe the curious charivaris and carnivals and stop short of analysis” (Zemon Davis Citation1971, 46).

27. The search looks for a specific word (e.g., “sheriff”) in the FORM field of the CoNLL table that Stanford CoreNLP produced for all the available newspaper articles (the field contains every word found in the source document, one record per word, per sentence) and returns all the actions (more precisely, verbs, as listed in the POSTAG field) performed by the sheriff. The search further specifies whether the word “sheriff” should be a syntactic subject or object, thus allowing us to see both what a sheriff did and what was done to him (to this purpose the routine uses the DEPREL field, which gives the dependency relation of the word to the other words in the sentence). The unit of analysis of this KWIC routine is the individual sentence.
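A minimal sketch of such a lookup, assuming the CoNLL table has already been parsed into one dict per token (all values strings, keys named after the CoNLL fields). The parsed sentence is invented.

```python
def actor_verbs(sentence_rows, word, role):
    # sentence_rows: one dict per token with (at least) ID, FORM, LEMMA,
    # POSTAG, HEAD, DEPREL keys, all string-valued.
    # role: "nsubj" for what the actor did; "dobj" for what was done to him.
    by_id = {row["ID"]: row for row in sentence_rows}
    verbs = []
    for row in sentence_rows:
        if row["FORM"].lower() == word and row["DEPREL"] == role:
            head = by_id.get(row["HEAD"])
            # Keep only heads that are verbs (Penn Treebank tags: VB, VBD, ...).
            if head and head["POSTAG"].startswith("VB"):
                verbs.append(head["LEMMA"])
    return verbs

# Invented parse of "The sheriff resisted."
sentence = [
    {"ID": "1", "FORM": "The", "LEMMA": "the", "POSTAG": "DT",
     "HEAD": "2", "DEPREL": "det"},
    {"ID": "2", "FORM": "sheriff", "LEMMA": "sheriff", "POSTAG": "NN",
     "HEAD": "3", "DEPREL": "nsubj"},
    {"ID": "3", "FORM": "resisted", "LEMMA": "resist", "POSTAG": "VBD",
     "HEAD": "0", "DEPREL": "root"},
]
did = actor_verbs(sentence, "sheriff", "nsubj")
```

Running the same search with role "dobj" on the same sentence returns nothing, since here the sheriff is the subject, not the object, of the action.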

28. The KKK's willingness to engage in violence may have been more pronounced in the South, but “the Klan everywhere had within it the potential for the vigilante violence which it carried out so proudly in Georgia” (Coben Citation1994, 157).

29. The wife of a member of the US House of Representatives and of the Georgia House of Representatives, she would later be remembered as Rebecca Latimer Felton, the first woman to serve in the US Senate, albeit for one day only (November 21, 1922).

30. Thomas F. Dixon's novels The Leopard's Spots: A Romance of the White Man's Burden—1865–1900 (1902) or The Clansman: A Historical Romance of the Ku Klux Klan (1905); David W. Griffith's silent film The Birth of a Nation (1915), based on Dixon's The Clansman.

31. See Allen's collection of photographs in Without Sanctuary at http://withoutsanctuary.org/.

32. See also White (Citation1929, 28–29).

33. Of Braudel's La Méditerranée et le monde méditerranéen à l'époque de Philippe II Elton wrote: “The … book offers some splendid understanding of the circumstances which contributed to the shaping of policy and action; the only things missing are policy and action” (Elton Citation1967, 122).

34. For instance, 98% accuracy for basic parsing and 65% for coreference resolution.

35. Computer-assisted QNA, with its hierarchical and relational data organizational structure, could not be carried out in traditional CAQDAS content-analysis programs (see Franzosi et al. Citation2013).

36. NER is one of the standard fields of the CoNLL table produced, for instance, by Stanford CoreNLP.

37. The replacement in a text of such pronouns as he, she, they … with the referenced entity (see the operation flowchart in Sudhahar, Veltri, and Cristianini Citation2015, 18).
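The substitution step itself (as distinct from the hard part, deciding which entity each pronoun refers to, which the coreference module performs) can be sketched as follows, assuming the resolver has already produced a mapping from token positions to entity names. The sentence and the mapping are invented.

```python
def substitute_pronouns(tokens, antecedents):
    # antecedents: {token_index: entity_name}, as produced by a coreference
    # resolver; all other tokens are returned unchanged.
    return [antecedents.get(i, tok) for i, tok in enumerate(tokens)]

tokens = ["The", "mob", "seized", "him", "before", "he", "reached", "jail"]
resolved = substitute_pronouns(tokens, {3: "Hardy Grady", 5: "Hardy Grady"})
```

After substitution, frequency counts and actor-centered searches pick up mentions that pronouns would otherwise hide.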

38. At least using the Stanford CoreNLP. The humanist or social scientist working with a few hundred or even a few thousand documents can always use an ad-hoc GUI (Graphical User Interface) to quickly resolve by hand the 35% of unprocessed cases.

39. For some examples, see Zervanou et al. (2015); Cristianini's work (Sudhahar, Veltri, and Cristianini Citation2015; Lansdall-Welfare et al. Citation2017); the seminal work by Moretti (Citation2005) and by Goldstone and Underwood (Citation2014); and Reagan et al. (Citation2016).

40. Michel et al. (Citation2011), in their seminal work on “culturomics,” illustrate the use of NLP with the several million books digitized by Google. The newspaper corpora analyzed by Cristianini's teams vary from “130,213 articles … from January to November 2012” (Sudhahar, Veltri, and Cristianini Citation2015, 4) to “millions of articles, representing 14% of all British regional outlets” for the last 150 years (Lansdall-Welfare et al. Citation2017, 457).

41. On sentence complexity see Pakhomov et al. Citation2011; on document readability, Schumacher and Eskenazi Citation2016; on concreteness Brysbaert et al. Citation2013; Hills and Adelman Citation2015; measures of voice and modality can be easily derived from the CoNLL table; sentiment analysis is one of the CoreNLP modules.

42. In any case, encroaching upon other disciplinary territories is no easy task. Moretti and Pestre's paper on “Bankspeak” (2015) is a good case in point. Their analysis of the language of World Bank reports is based not on new, sophisticated NLP tools but on word frequencies. Yet drawing sophisticated conclusions from even simple word frequencies required of the authors a deep understanding of such linguistic issues as abstraction, nominalization, singularization, noun modifiers, and gerundive verb forms, issues that are not part of the standard training of a computer scientist.

43. More user-friendly NLP tools (e.g., Voyant, Tacit, R, PC-ACE) are starting to make life easier for non-computer experts who want to carry out NLP tasks; but cutting-edge tools (e.g., the “shape of stories” by Reagan et al. Citation2016) remain well beyond the reach of average, even computer-savvy, users.

44. See Alves and Queiroz's article title (Citation2015); the varied disciplinary background of Cristianini's co-authors in several of his publications is a good case in point.
