Original Articles

Binary trees? Automatically identifying the links between born-digital records

Pages 77-99 | Published online: 06 Aug 2017
 

Abstract

The sheer volume of records that government organisations, and thus government archives, work with daily means that relationships between individual records may not easily be captured and recorded. This paper begins by suggesting that the relationships described in archival catalogues will remain at the highest levels of abstraction unless they can be extracted using automated methods. It then describes relationships that can be generated automatically; these will likely be less established than those archivists are traditionally used to working with. For example, a so-called ‘fuzzy matching’ technique is discussed that may reveal the ‘points’ of similarity between two records. Extensible databases will be needed to store new links, and flexible interfaces will be required to display them. The paper discusses some of the techniques that may currently be available for automatically identifying links between born-digital records by looking at what can be found in the data stream and at the relationships that digital formats inherently describe. The mechanisms described may be useful for sentencing as well as for cataloguing and description. While one size will not fit all, some collections may benefit. The paper concludes by briefly discussing what this work will mean to the end user.

Notes

1. Domosphere, ‘Data Never Sleeps 3.0’, 2015, available at <https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/>, accessed 15 November 2016.

2. The National Archives, UK, ‘The Application of Technology-assisted Review to Born-Digital Records Transfer, Inquiries and Beyond: Research Report’, available at <http://www.nationalarchives.gov.uk/documents/technology-assisted-review-to-born-digital-records-transfer.pdf>, accessed 17 January 2016.

3. Archives New Zealand, ‘Disposal–Sentencing, October 2016’, available at <http://records.archives.govt.nz/assets/Guidance-new-standard/Disposal-Sentencing-16-G10.pdf>, accessed 11 April 2017.

4. WM Duff and V Harris, ‘Stories and Names: Archival Description as Narrating Records and Constructing Meanings’, Archival Science, vol. 2, no. 3, September 2002, pp. 263–85.

5. T Nesmith, ‘Reopening Archives: Bringing New Contextualities into Archival Theory and Practice’, Archivaria, vol. 60, Fall 2005, pp. 259–74.

6. Personal communication with Talei Masters, October 2016.

7. International Council on Archives (ICA) Experts Group on Archival Description, ‘Records in Contexts – A Conceptual Model for Archival Description, Consultation Draft v0.1’, 2016, available at <http://www.ica.org/sites/default/files/RiC-CM-0.1.pdf>, accessed 15 November 2016.

8. EDiscovery is defined as: ‘The discovery or disclosure of electronic information for the purposes of litigation. This phrase is used in the United States but is also the common descriptor for software tools that assist with eDiscovery–eDisclosure in the United Kingdom’, The National Archives, UK.

9. L Masterman, ‘From Digital Literacy to Digital Capabilities’, available at <https://blogs.it.ox.ac.uk/acit-news/2016/05/18/dig-lit-and-dig-cap/>, accessed 17 January 2017.

10. Wikipedia.org, ‘Cryptographic Hash Function’, available at <https://en.wikipedia.org/w/index.php?title=Cryptographic_hash_function&oldid=749155568>, accessed 15 November 2016.

11. Society of American Archivists, ‘Glossary’, available at <http://www2.archivists.org/glossary/terms/f/fixity>, accessed 15 November 2016.

12. J Kornblum, ‘Identifying Almost Identical Files Using Context Triggered Piecewise Hashing’, 2006, available at <https://www.dfrws.org/sites/default/files/session-files/pres-identifying_almost_identical_files_using_context_triggered_piecewise_hashing.pdf>, accessed 15 November 2016.

13. J Oliver, C Cheng and Y Chen, ‘TLSH – A Locality Sensitive Hash’, 2014, available at <https://github.com/trendmicro/tlsh/blob/master/TLSH_CTC_final.pdf>, accessed 15 November 2016.

14. Wikimedia Commons, ‘File: A Corridor of Files at The National Archives UK.jpg’, available at <https://commons.wikimedia.org/w/index.php?title=File:A_corridor_of_files_at_The_National_Archives_UK.jpg&oldid=213314497&uselang=en-gb>, accessed 15 November 2016.

15. J Kornblum, ‘ssdeep – Latest Version 2.13’, available at <http://ssdeep.sourceforge.net/>, accessed 15 November 2016.

16. J Oliver, S Forman and C Cheng, ‘Using Randomization to Attack Similarity Digests’, 2015, available at <https://github.com/trendmicro/tlsh/blob/master/Attacking_LSH_and_Sim_Dig.pdf>, accessed 15 November 2016.

17. Apache.org, ‘Apache Tika’, available at <https://tika.apache.org/>, accessed 15 November 2016.

18. A batch, or shell script, is an automation script, its type specific to the operating system, that literally ‘scripts’, in order, activities for the system to perform, examples of which are used later in the paper.

19. P Burnhill, M Mewissen and R Wincewicz, ‘Reference Rot in Scholarly Statement: Threat and Remedy’, 2015, available at <http://hiberlink.org/Insight.htm>, accessed 15 November 2016.

20. P Warden, ‘catdoc’, available at <https://github.com/petewarden/catdoc>, accessed 15 November 2016.

21. R Spencer, ‘ASA Binary Trees: E-accession Hyperlinks Rudimentary Extract’, 2016, available at <https://gist.github.com/ross-spencer/a6411a021afb7de7e3dc6dd713f7b520/aa3f40dd48def93ad900e4d025ab15ab11da044d>, accessed 9 May 2017.

22. R Spencer, ‘tikalinkextract’, available at <https://github.com/httpreserve/tikalinkextract>, accessed 9 May 2017.

23. R Spencer, ‘httpreserve/eaccession-research: eAccessions Hyperlinks Version 1.0.0’, 2017, available at <http://doi.org/10.5281/zenodo.495809>, accessed 9 May 2017.

24. K Zhou, R Tobin and C Grover, ‘Extraction and Analysis of Referenced Web Links in Large-scale Scholarly Articles’, 2014, available at <http://homepages.inf.ed.ac.uk/kzhou2/papers/dl2014-zhou.pdf>, accessed 15 November 2016; Burnhill, Mewissen and Wincewicz.

25. D Noonberg, ‘PDFTOHTML’, available at <http://pdftohtml.sourceforge.net/>, accessed 15 November 2016.

26. J Goyvaerts, ‘Detecting URLs in a Block of Text’, 2008, available at <http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/>, accessed 15 November 2016.

27. J Zittrain, K Albert and L Lessig, ‘Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations’, 2014, available at <http://harvardlawreview.org/2014/03/perma-scoping-and-addressing-the-problem-of-link-and-reference-rot-in-legal-citations/>, accessed 15 November 2016.

28. Internet Archive, available at <http://archive.org>, accessed 15 November 2016.

29. T Berners-Lee, ‘Information Management: A Proposal’, 1990, available at <https://www.w3.org/History/1989/proposal.html>, accessed 15 November 2016.

30. Wikipedia.org, ‘Enterprise Content Management (ECM)’, 2017, available at <https://en.wikipedia.org/w/index.php?title=Enterprise_content_management&oldid=759565057>, accessed 18 January 2017.

31. Objective.com, ‘Enterprise Content Management’, available at <http://www.objective.com/products/enterprise-content-management>, accessed 15 November 2016.

32. Regex101.com, ‘Untitled’, 2017, available at <https://regex101.com/r/ry8aTb/1>, accessed 19 January 2017.

33. International Standards Organisation, ‘Information and Documentation – Records Management – Part 1: Concepts and Principles’, 2016, available at <https://www.iso.org/obp/ui/-iso:std:iso:15489:-1:ed-2:v1:en>, accessed 15 November 2016.

34. J Spolsky, ‘Why Are the Microsoft Office File Formats So Complicated?’, 2008, available at <http://www.joelonsoftware.com/items/2008/02/19.html>, accessed 15 November 2016.

35. ICA Experts Group on Archival Description, p. 13.

36. Archives New Zealand, ‘Digital Future Summit-video’, 2007, available at <https://www.archway.archives.govt.nz/ViewFullItem.do?code=24991813&digital=yes>, accessed 15 November 2016.

37. W3Schools, ‘HTML <img> src Attribute’, 2017, available at <http://www.w3schools.com/tags/att_source_src.asp>, accessed 18 January 2017.

38. Archives New Zealand, ‘Digital Future Summit-video – Direct Download Link’, 2007, available at <http://ndhadeliver.natlib.govt.nz/delivery/DeliveryManagerServlet?dps_pid=IE25298510>, accessed 15 November 2016.

39. Microsoft Development Network, ‘[MS-PPT]: PowerPoint (.ppt) Binary File Format’, available at <https://msdn.microsoft.com/en-us/library/office/cc313106(v=office.12).aspx>, accessed 15 November 2016.

40. Dublin Core Metadata Initiative, ‘DCMI Metadata Terms: hasPart’, 2012, available at <http://dublincore.org/documents/dcmi-terms/-terms-hasPart>, accessed 15 November 2016.

41. Development and MARC Standards Office: Library of Congress, ‘MARC to Dublin Core Crosswalk’, 2008, available at <https://www.loc.gov/marc/marc2dc.html>, accessed 18 January 2017.

42. ICA Experts Group on Archival Description, p. 40.

43. Wikipedia.org, ‘Flatiron Building’, 2016, available at <https://en.wikipedia.org/w/index.php?title=Flatiron_Building&oldid=742046805>, accessed 15 November 2016; The Stanford Natural Language Processing Group, ‘Stanford Named Entity Recognizer (NER)’, available at <http://nlp.stanford.edu/software/CRF-NER.shtml>, accessed 15 November 2016.

44. A Van Os, ‘Antiword: A Free MS Word Document Reader’, available at <http://www.winfield.demon.nl/>, accessed 15 November 2016.

45. NLTK Project, ‘Natural Language Toolkit’, available at <http://www.nltk.org/>, accessed 15 November 2016.

46. Tika Wiki, ‘Named Entity Recognition (NER) with Tika’, available at <https://wiki.apache.org/tika/TikaAndNER>, accessed 15 November 2016.

47. Twitter Help Center, ‘Using Hashtags on Twitter’, available at <https://support.twitter.com/articles/49309>, accessed 15 November 2016.

48. Duff and Harris.

49. The National Archives, ‘Digital Strategy 2017’, available at <http://www.nationalarchives.gov.uk/documents/the-national-archives-digital-strategy-2017-19.pdf>, accessed 11 April 2017.

50. D Verhoeven, ‘As Luck Would Have It: Serendipity and Solace in Digital Research Infrastructure’, 2016, available at <http://www.academia.edu/21802414/As_Luck_Would_Have_It_Serendipity_and_Solace_in_Digital_Research_Infrastructure>, accessed 15 November 2016.

51. T Sherratt, ‘Life on the Outside: Collections, Contexts and the Wild, Wild Web’, 2014, available at <https://medium.com/@wragge/life-on-the-outside-collections-contexts-and-the-wild-wild-web-4d334ccddee2#.fvpib06h0>, accessed 18 January 2017.

52. D Cohen, ‘CC0 (+BY)’, 2013, available at <http://www.dancohen.org/2013/11/26/cc0-by/>, accessed 18 January 2017.

53. T Gollins and E Bayne, ‘Finding Archived Records in a Digital Age’, in M Moss, B Endicott-Popovsky and MJ Dupuis (eds), Is Digital Different? Facet Publishing, London, 2015, pp. 128–48.

54. ibid., p. 145.

55. Verhoeven.

56. ICA Experts Group on Archival Description, p. 9.

57. Approximate number of born-digital items in the collection at the time of writing.

58. Kornblum, ‘Identifying Almost Identical Files’.
