The Serials Librarian
From the Printed Page to the Digital Age
Volume 80, 2021 - Issue 1-4: NASIG 2020
Live Session

Practical Approaches to Linked Data

ABSTRACT

In this article, Jeannie Hartley presents research regarding linked data and its originator, and discusses how linked data can be used. Some benefits and challenges discovered in the research are discussed, as well as some requirements for the process. Focus is given to the Resource Description Framework (RDF) and Resource Description and Access (RDA) for use in linked data records. Hartley also offers an examination of some linked data projects. The article then transitions to Heidy Berthoud’s discussion of various linked data projects in process at the Smithsonian Libraries (SIL). She addresses the development of the linked data macro, SIL’s participation in the Program for Cooperative Cataloging’s (PCC) Uniform Resource Identifiers (URIs) in Machine Readable Cataloging (MARC) Pilot Project, and early explorations into Wikidata.

Introduction to linked data

Linked data does not have a single, narrowly bounded definition. Guerrini and Possemato define linked data as a “set of best practices required for publishing and connecting structured data on the web for use by a machine.”Footnote1 Linked data was developed by Tim Berners-Lee, the inventor of the World Wide Web. Berners-Lee’s TED talk, The New Web of Open, Linked Data, provides an excellent introduction to the subject.Footnote2

Making data connectable through relationships is the key to the functionality of linked data. For example, there is a relationship between a book and its author, the fact that this author lives in Seattle is a second relationship, and the title of another book written by the same author is a third relationship. Connecting data in this way relies on the semantic web, which allows these relationships to be structured so that computers can read them.Footnote3 A short code sketch following the list below illustrates such relationships expressed as triples. The semantic web has four rules for making relationships between data:

  1. Uniform Resource Identifiers (URIs) should be used for naming things.

  2. Each item should have its own HTTP URI, meaning that there is a Web link that can be clicked on.

  3. There should be useful information provided upon clicking an HTTP URI.

  4. Links to other URIs should be included to provide additional information.Footnote4

Note that the Library of Congress has helpful URI FAQs on their website.Footnote5
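
As a concrete illustration of these rules, the short sketch below uses Python’s rdflib package to state the book and author relationships described above as triples named by HTTP URIs. The example.org URIs, the livesIn predicate, and the book and author names are placeholders invented for this sketch, not identifiers from any real dataset.

```python
# A minimal sketch of linked data triples using rdflib (assumed installed).
# All URIs and names below are illustrative placeholders, not real identifiers.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS

EX = Namespace("http://example.org/")   # hypothetical namespace for this sketch

book = URIRef("http://example.org/book/example-book")        # rules 1-2: things are named with HTTP URIs
author = URIRef("http://example.org/person/example-author")
other_book = URIRef("http://example.org/book/another-book")  # rule 4: link out to other URIs

g = Graph()
g.add((book, DCTERMS.creator, author))           # relationship 1: the book has an author
g.add((author, EX.livesIn, Literal("Seattle")))  # relationship 2: the author lives in Seattle
g.add((other_book, DCTERMS.creator, author))     # relationship 3: the same author wrote another book
g.add((other_book, DCTERMS.title, Literal("Another Example Title")))

# Each triple is a machine-readable statement of one relationship.
for subject, predicate, obj in g:
    print(subject, predicate, obj)
```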

The literature on linked data focuses on two standards that work well with linked data: the Resource Description Framework (RDF) and Resource Description and Access (RDA). RDA appeared far less often in the literature, but a few projects used it for linked data because of the ease of using relationship designators, which are already part of the RDA vocabulary.Footnote6 For those who are not avid catalogers, it is important to note that RDA is a standard for descriptive cataloging, and Machine-Readable Cataloging (MARC) is a metadata transmission standard. RDA records encoded in MARC are the most recent and most widely used method for cataloging library records, and MARC was originally developed so that machines could read information and place it onto a catalog card.Footnote7 RDF is a more recently developed standard that melds well with the semantic web and thus is a great standard to use on the web.Footnote8 RDF can be serialized in Extensible Markup Language (XML), making it easier for a computer to process.Footnote9
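
To show what an XML serialization of RDF looks like in practice, the brief sketch below serializes a single illustrative triple with rdflib. RDF/XML is only one of several serializations (Turtle and JSON-LD are others), and the URI shown is again a placeholder.

```python
# Serializing one illustrative triple as RDF/XML and as Turtle with rdflib.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS

g = Graph()
g.add((URIRef("http://example.org/book/example-book"),
       DCTERMS.title,
       Literal("An Example Title")))

print(g.serialize(format="xml"))     # RDF/XML, the serialization discussed above
print(g.serialize(format="turtle"))  # the same triple in the more compact Turtle syntax
```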

If the semantic web rules are followed and a coding standard compatible with linked data is used, linked data connects information into something that looks like a web of information bubbles. Tim Berners-Lee imagined that linked data would emerge as a representation of real things in web format, much as the Linked Open Data Cloud illustrates.Footnote10 DBpedia is another great example of a site that currently uses linked data in real time, with data uploaded by numerous entities.Footnote11
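
Because DBpedia exposes its data through a public SPARQL endpoint, it can also be queried directly. The sketch below, which assumes the SPARQLWrapper package is installed and the endpoint is reachable, asks DBpedia for a handful of writers; the query shape is an illustrative choice, not anything prescribed by DBpedia.

```python
# Querying DBpedia's public SPARQL endpoint (assumes the SPARQLWrapper package
# is installed and that https://dbpedia.org/sparql is reachable).
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    SELECT ?writer ?name WHERE {
        ?writer a dbo:Writer ;
                rdfs:label ?name .
        FILTER (lang(?name) = "en")
    }
    LIMIT 5
""")

results = sparql.queryAndConvert()
for row in results["results"]["bindings"]:
    print(row["name"]["value"], "-", row["writer"]["value"])
```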

Interoperability between applications is a great benefit of linked data.Footnote12 A good example of interoperability is Tim Berners-Lee’s discussion of social media in his TED talk: if you friend a person on one social media site and want your other sites to recognize that person as a friend as well, it does not work, because there is no data interoperability between the applications.Footnote13 Another benefit, increased richness of data, comes from the amount of relevant information provided at once, which can be sifted through more easily than the results of, for example, a Google search that produces over 200,000 hits, of which perhaps 100 are relevant to the query.Footnote14 With linked data, there may be fewer results, but all of them will be relevant to the query.

Another useful benefit of linked data is geolocation data.Footnote15 Geolocation data is used primarily on mobile devices and can be especially useful for libraries. Mobile devices populate results based on physical proximity to the device in use; when library linked data is made available online, each library’s data about available materials can surface every time one of its patrons searches for something that library offers.Footnote16 This means that when a patron searches for a book, the nearest library’s catalog listing for that book will appear in the search results. It also provides library access to those who were not thinking of the library, or were unaware that the library might have what they are looking for, when they do a Google search for materials.Footnote17
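
As a purely illustrative sketch of the proximity logic behind geolocation-aware results, the code below ranks a few invented library locations by great-circle distance from a device’s reported coordinates. The names and coordinates are made up, and a real discovery service would combine this kind of ranking with the linked data about each library’s holdings.

```python
# Illustrative only: ranking hypothetical libraries by distance from a device's
# reported location using the haversine formula.
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

device = (38.8913, -77.0261)  # hypothetical patron location
libraries = {
    "Library A": (38.8887, -77.0260),
    "Library B": (38.9097, -77.0654),
    "Library C": (39.2904, -76.6122),
}

# The nearest library's holdings would be surfaced first in a geolocation-aware search.
for name, coords in sorted(libraries.items(), key=lambda kv: haversine_km(*device, *kv[1])):
    print(f"{name}: {haversine_km(*device, *coords):.1f} km away")
```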

A number of tools are available to help with the challenges of linked data. Conversion tools are available through the Library of Congress Bibliographic Framework Initiative (BIBFRAME Initiative); these are particularly helpful for transitioning away from MARC to RDF.Footnote18 Another useful tool is MarcEdit 7, a free downloadable program with a section called MARCNext, which is specifically designed for updating records and experimenting in a sort of sandbox to see what can be done. It can be used to edit existing catalog records en masse; all it takes is determining what each set of records needs to have updated.Footnote19
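
For those who prefer scripting to a graphical tool, the sketch below shows, in Python with the pymarc package, the kind of batch inspection that precedes such edits: it scans a file of MARC records and reports which headings still lack $0 identifiers. The file name and the set of heading tags checked are illustrative choices, and this is not a substitute for MarcEdit or the BIBFRAME conversion tools themselves.

```python
# A small batch-inspection sketch using the pymarc package (assumed installed).
# The file name and the set of heading tags checked are illustrative choices.
from pymarc import MARCReader

HEADING_TAGS = ("100", "110", "111", "600", "610", "650", "700", "710")

with open("records.mrc", "rb") as fh:                   # hypothetical input file
    for position, record in enumerate(MARCReader(fh), start=1):
        if record is None:                              # skip records pymarc could not parse
            continue
        tags_needing_uris = [
            field.tag
            for field in record.get_fields(*HEADING_TAGS)
            if not field.get_subfields("0")             # no $0 identifier on this heading yet
        ]
        if tags_needing_uris:
            print(f"Record {position}: headings without $0 in fields {tags_needing_uris}")
```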

Turning to the challenges of linked data, proper relationship designators are a big deal because all of the data will be connected using these designators. Relationship designators were the reason RDA was favored in the literature on linked data: there is already a set vocabulary of relationship designators in RDA.Footnote20 The conversion process also uncovers errors in the data as it changes over, and these errors must be corrected before the data can be published.Footnote21 Transitioning away from MARC is a huge change, and libraries are wary of doing something different from what they are used to, especially when moving from MARC to RDF, which also requires a basic understanding of XML.Footnote22

As illustrated by the discussion of RDF and RDA, the literature indicated that no one format for library resources is going to be perfect for linked data. Each has elements that pose issues as well as elements that benefit the process.Footnote23 Library relevance is a major discussion that many libraries are having, and linked data, if implemented thoroughly, could solidify library relevance in a very visible way for public and academic patrons because of libraries’ “attention to the quality of information they produce.”Footnote24

An unexpected challenge to implementing linked data is that copyright may become an issue as linked data is implemented by more institutions.Footnote25 Legal issues like this are big enough that some libraries may forgo the process entirely out of nervousness or concern about doing it wrong. For that reason, having examples of what other libraries and entities have done can make a huge difference in combating these challenges. The research revealed several linked data projects, either currently in progress or fully functioning, that can serve as great examples of what to do and how to do it,Footnote26 such as the projects underway at the Smithsonian Libraries described below.

Linked data projects at the Smithsonian Libraries

The Discovery Services Division (DISC) of the Smithsonian Libraries (SIL) had long devised plans to move library data out of MARC.Footnote27 However, before DISC could proceed on these projects, it needed to recruit colleagues with appropriate expertise. In September 2018, Heidy Berthoud was hired as head of Resource Description, followed in October 2018 by descriptive data management librarian, Jackie Shieh. These positions represented new roles within DISC and were immediately charged with rethinking existing workflows and developing new ones. Work soon came up against the limitations of the existing technological infrastructure. SIL uses the SirsiDynix Integrated Library System (ILS) Horizon, which has been in place since December 1999. This technology problem is further compounded by knowledge siloes and system implementation decisions unique to the Smithsonian Institution (SI). Because we are in a federal library environment with heightened security concerns, some of the large batch exports and imports of data from our catalog have historically been handled by staff belonging to the Office of the Chief Information Officer (OCIO), SI’s central computing and telecommunications department. This division of labor resulted in its own set of challenges. Another hurdle was lack of available budget; though there was – and remains – great interest in participating in library linked data initiatives, such as the Share Virtual Discovery Environment (SHARE-VDE) project, SIL has been unable to secure the necessary funding as of this writing. Still, we found other ways to forge ahead.

Linked data elements in MARC records

DISC began actively exploring the addition of linked data elements to MARC records in October 2018. SIL had long used macros built using Macro Express to handle tasks such as record imports, adding donor information, and adding internal notes, so it was decided that a second import macro incorporating linked data building tasks should be designed. Shieh had a version of this macro ready by December 2018, and practical testing was handled by Berthoud and Erik Bergstrom, the head of Electronic Resources and Serials. In this early iteration of the macro, the record would be sent to MarcEdit to perform the following MARC data transformations: 1) generate data using RDA helper; 2) add URIs in $0 and/or $1 to headings (including tracings, subject fields, and the OCLC work identifier); and finally 3) add a relationship in $4 to MARC field 758 Resource Identifier. However, the government shutdown and furlough of January 2019 effectively halted progress.
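
SIL’s Macro Express and MarcEdit macro is not reproduced here, but the sketch below illustrates in Python, using the pymarc package, the general shape of step 2: adding a URI in $0 to a matching heading. The file names, the lookup table, and the URI value are placeholders; in practice the identifiers come from authority data and services such as id.loc.gov.

```python
# Illustrative only: this is not SIL's macro, just a sketch of adding a $0 URI
# to matching subject headings with pymarc. The lookup table and URI are placeholders.
from pymarc import MARCReader, MARCWriter

# Hypothetical mapping from a heading string to its authority URI.
URI_LOOKUP = {
    "Linked data": "http://id.loc.gov/authorities/subjects/example",  # placeholder URI
}

with open("in.mrc", "rb") as src, open("out.mrc", "wb") as dst:
    writer = MARCWriter(dst)
    for record in MARCReader(src):
        if record is None:
            continue
        for field in record.get_fields("650"):
            heading = " ".join(field.get_subfields("a")).strip()
            uri = URI_LOOKUP.get(heading)
            if uri and not field.get_subfields("0"):
                field.add_subfield("0", uri)   # the URI-in-$0 step described above
        writer.write(record)
    writer.close()
```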

Refining the linked data macro

DISC regrouped and was ready to revisit the macro by spring 2019. This coincided with Phase 1 of the Program for Cooperative Cataloging’s (PCC) new policy on the limited use of International Standard Bibliographic Description (ISBD) punctuation in MARC records.Footnote28 Beginning on April 8, 2019, Phase 1 gave PCC member libraries the option to either include or drop terminal punctuation when authenticating bibliographic records. Though the announcement did not specify a precise timeline, it did note that at a later date, Phase 2 would allow the omission of not only terminal but also medial ISBD punctuation; that is, the ISBD punctuation found at the end of MARC subfields.

After holding some strategy meetings, DISC decided to redesign the current macro to include a task that would strip terminal and medial ISBD punctuation from bibliographic records upon import; we assumed that by the time the macro was ready for implementation, we would have entered the second phase of the PCC’s punctuation policy. Shieh created a task for this punctuation stripping using the PCC’s March 29, 2019 document Guidelines for Implementing New PCC Policy Regarding Terminal Periods in MARC Bibliographic Records; we later relied on the report PCC Guidelines for Minimally Punctuated MARC Bibliographic Records.Footnote29 We commenced testing the new iteration of the macro in the summer and fall of 2019. Some troubleshooting was involved in making sure that only terminal and medial punctuation was being stripped. Shieh also discovered in the course of designing the macro that the tasks needed to occur in a precise order for the macro to work.
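
To give a rough sense of what a punctuation-stripping task does, the sketch below removes a terminal period from a string unless it appears to close an initial or a common abbreviation. This is a deliberately simplified stand-in for the PCC guidelines, which enumerate many more rules and exceptions, and the sample values are invented.

```python
# A deliberately simplified sketch of terminal-punctuation stripping; the actual
# PCC guidelines contain many more rules and exceptions than this.
import re

# Do not strip the period when it ends an initial or a common abbreviation.
KEEP_PERIOD = re.compile(r"(\b[A-Z]|\betc|\bJr|\bSr|\bed)\.$")

def strip_terminal_period(value: str) -> str:
    """Remove a terminal period unless it looks like part of an abbreviation."""
    value = value.rstrip()
    if value.endswith(".") and not KEEP_PERIOD.search(value):
        return value[:-1]
    return value

# Invented sample subfield values.
for sample in ["A history of cataloging.", "edited by John Q.", "Smith, Jane, 1970-"]:
    print(repr(strip_terminal_period(sample)))
```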

The deployed version combined Windows shell and MarcEdit tasks. The macro calls PowerShell to execute the following tasks with MarcEdit: 1) add URIs in $0/$1; 2) convert MARC8 .mrc to UTF8 .mrk to perform RDA transformations and strip punctuation; 3) add a relationship in $4 to the 758 field; and finally 4) convert the UTF8 .mrk file back to MARC8 .mrc for loading into Horizon. When this sequence was finalized, we again tested the macro among a small group, this time expanding the test pool to include the Special Collections cataloger and three catalogers at the Freer Gallery of Art and Arthur M. Sackler Gallery Library (FSG Library).

Participation in the PCC URIs in MARC pilot

In September 2019, the PCC announced the launch of its URIs in MARC Pilot. The purpose of the pilot is “ … to engage metadata practitioners in formally applying techniques to further the PCC’s linked data transition … The pilot activities will chiefly involve adding identifiers to bibliographic records and/or to NACO authority records.”Footnote30 The pilot project engages volunteer participants – from both PCC and non-PCC libraries – in the practical work of adding linked data elements to MARC records. This practical work is a continuation of PCC’s efforts in crafting policy around the library transition from MARC records to linked data, as seen in the efforts of various task groups, such as the Task Group on URIs in MARC (2018) and the Task Group on Linked Data Best Practices (2018).Footnote31 Volunteers for the pilot were given the opportunity to design a course of action that would fit with the goals of their institution, and could decide whether they would want to work with bibliographic data, authority data, or both.

The timing was fortuitous. The linked data macro had been a long time in development, and the PCC pilot project gave us impetus to move it from test to production. The pilot was officially launched by the PCC in two virtual introductory meetings, held December 3 and December 6, 2019.Footnote32 Team members in Resource Description switched over to the linked data macro for their daily work on January 27, 2020. This expanded the pool of those using the macro by an additional three team members. For the purposes of the PCC pilot project, Resource Description and the Special Collections cataloger are adding linked data elements to bibliographic records only via the linked data macro. Catalogers at the FSG Library are both using the macro to enhance bibliographic data and are working on a separate project involving authority data.

Implementation and changes

The roll-out of the macro meant training team members to approach their cataloging work in a new way. Staff involved in cataloging were accustomed to searching each bibliographic record in OCLC, exporting all related name authority records, exporting the bibliographic records, and importing those records as a set. However, authority records could no longer be imported using the new linked data macro; during testing, we found that RDA helper attempted to insert 33X fields for content, media, and carrier. Therefore, we needed to instruct staff to stop exporting authorities during the copy and original cataloging process. This work would now be done as a weekly batch process post record import.

The introduction of this macro to staff at the technician and non-supervisory librarian level also resulted in a major cultural shift. Prior to this, staff members involved in cataloging activities below the supervisory librarian level had never been expected to work on data in batches or to manipulate data in MarcEdit (or any other tool), and many of them did not even have MarcEdit installed on their machines. This led to some installation challenges leading up to the launch of the macro, as there was a push to install software and task files on various machines. We installed all the necessary files not only on individual computers but also on the computers in our shared conference spaces, to support group training and problem solving.

There needed to be a shift in mindset: where in the past, manipulating data had fallen within the purview of managers only (and even then, only occasionally), it would now be an activity that was normalized, emphasized, and developed. The linked data macro handled much of the heavy lifting in this particular project, but future projects will demand a different array of skills. Concurrent with the development of the linked data macro was an effort within DISC and beyond to look at cataloging in a different way. In the months since development of the linked data macro first began, there have been group webinars, opportunities for team members to pursue professional development in MarcEdit, and chances to learn new programs such as OpenRefine, an open source tool for manipulating and cleaning data.

Wikidata

This article will not delve too deeply into the history, development, or workings of Wikidata (a free online repository storing structured data in the form of simple relationship statements between items and properties) or Wikibase (a local installation of a Wikidata repository). Rather, it will describe how SIL is moving forward in developing staff in this area and building momentum and support for the project across SI. Right now, we are thinking chiefly of Wikidata as a way to manage identities, create relationships between people and objects across the disparate parts of SI, and perhaps describe some of our unique collections that are not well served by MARC. However, we recognize that this is an opportunity to think creatively about our data, and we want to remain open to different possibilities.Footnote33

Learning together

SIL is lucky to have an experienced Wikimedian on staff: Diane Shaw, Special Collections cataloger, has been contributing to Wikipedia regularly since 2012 and has been editing Wikidata (with over 33,000 edits to date) since 2017. In addition to attending numerous Wikimania and invitation-only Wikimedia Conferences, Shaw is active in the local Wikimedian community and has served on the Board of Directors of Wikimedia DC since 2014. Others at SIL are much newer to these efforts; Berthoud and Shieh participated in a Wikidata workshop hosted by the PCC at the May 2019 Operations Committee meeting, and Shieh did further online training with WikiEdu from October to December 2019.Footnote34 The first Wikidata event hosted by SIL was a Wikidata workshop and edit-a-thon held on November 20, 2019. The workshop was led by Andrew Lih, a researcher and active Wikimedian who authored The Wikipedia Revolution. The edit-a-thon was a daylong event meant to introduce interested library staff to Wikidata from the ground up. The original group from the edit-a-thon continued to meet monthly for mini Wikidata edit-a-thons geared toward learning new aspects of Wikidata together. Some sessions focused on improving Wikidata records for established individuals, such as Secretary of the Smithsonian Lonnie Bunch and Theodore Roosevelt. Other sessions began to introduce team members to the concept of data modeling and to brainstorming the properties we would need for a local installation of Wikibase. This group became the core of the SIL Wikidata Team.

Concurrent with this, a smaller group including Berthoud, Shaw, Shieh, and Suzanne Pilsk, head of the Metadata Department, was meeting to craft a proposal to create a property in Wikidata denoting an identifier for a resource held by the SI. If one were to map the structure of Wikidata onto the structure of a triple statement of “subject, predicate, object,” the Wikidata property is the predicate that defines the relationship between two Wikidata items.Footnote35 The property we proposed would allow us to find an item in Wikidata that was held by the SI and augment the existing record with a new statement defining its SI creative work identifier and providing its Smithsonian-assigned Globally Unique Identifier (GUID). The proposal was workshopped in November and December of 2019 and accepted by Wikidata soon after. “Identifier for a resource held by the Smithsonian Institution (P7851)” now appears in Wikidata’s list of properties.Footnote36
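
Items carrying the new property can be retrieved from the Wikidata Query Service with a short SPARQL query. The sketch below assumes the SPARQLWrapper package is installed and the endpoint is reachable; the query shape, user agent string, and result handling are illustrative choices.

```python
# Querying the Wikidata Query Service for items that use property P7851,
# the identifier for a resource held by the Smithsonian Institution.
# Assumes the SPARQLWrapper package is installed and the endpoint is reachable.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper(
    "https://query.wikidata.org/sparql",
    agent="example-linked-data-script/0.1",   # a descriptive user agent is expected by the service
)
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    SELECT ?item ?itemLabel ?identifier WHERE {
      ?item wdt:P7851 ?identifier .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 10
""")

for row in sparql.queryAndConvert()["results"]["bindings"]:
    print(row["itemLabel"]["value"], "-", row["identifier"]["value"])
```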

The move to an enhanced telework environment in mid-March 2020 had the unintended consequence of reenergizing interest in our Wikidata work. A much larger library team is able to meet now that we have shifted to a virtual environment. An online team was created in our Microsoft Teams space, and meetings have continued at regular intervals. From March 2020 to May 2020, we focused a great deal of attention on discussing the properties we would need for our local Wikibase. That Wikibase has now been installed, though not without some difficulties, and we have started the process of creating these properties. On June 10, 2020, the SIL Wikidata Team and metadata practitioners across SI museums and archival units had a productive workshop with trainers from Wikimedia Deutschland (WMDE). Participants worked on data modeling, data import, and batch editing and importing. The workshop resulted in four focused projects to pursue. These activities also raised WMDE’s interest in potential further collaboration with SI. Wikidata team members have been spending time in July 2020 attending virtual sessions of the LD4 2020 conference.

Future plans

This is an area that seems primed for exponential growth. After our June workshop, we were contacted by representatives from WMDE proposing an ongoing collaboration geared toward building a prototype or production Wikibase; a meeting has been set for August 12, 2020 to discuss this project further. We have continued to work on data modeling and have now created several small pilots to better understand local needs. Two projects focus on the properties needed to describe individuals: one for portraits held in our Dibner Library of the History of Science and Technology, and the other for a Chinese ancestor project currently ongoing at the FSG Library. Another project will try to define the properties needed to represent topical headings that have been developed locally over several decades to describe items held in the Warren M. Robbins Library of the National Museum of African Art. The final project is working to translate Smithsonian Research Online (SRO) into Wikidata. Finally, on July 17, 2020, the PCC sent out a call for another pilot project, this one on Wikidata. SIL will be participating.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Notes on contributors

Heidy Berthoud

Heidy Berthoud is the Head of Resource Description, Smithsonian Libraries, Washington, D.C.

Jeannie Hartley

Jeannie Hartley is the Cataloging and Serials Librarian at Friends University’s Edmund Stanley Library in Wichita, KS.

Notes

1. Mauro Guerrini and Tiziana Possemato, “Linked Data: A New Alphabet for the Semantic Web,” Italian Journal of Library, Archives & Information Science 4, no. 1 (2013): 67.

2. Tim Berners-Lee, “The New Web of Open, Linked Data,” TED video, 16:51. Posted, March 13, 2009, https://www.youtube.com/watch?v=OM6XIICm_qo (accessed September 7, 2020).

3. Guerrini and Possemato, “Linked Data,” 80.

4. Eric Mitchell, “Building Blocks of Linked Open Data in Libraries,” in Library Linked Data: Research and Adoption, ed. Patrick Hogan (Chicago: American Library Association, 2014), 17.

5. Program for Cooperative Cataloging, “URI FAQs, Library of Congress Program for Cooperative Cataloging,” September 26, 2018, https://www.loc.gov/aba/pcc/bibframe/TaskGroups/URIFAQs.pdf (accessed January 28, 2020).

6. Anita Goldberga et al., “Identification of Entities in the Linked Data Collection ‘Rainis and Aspazija’ (RunA),” Italian Journal of Library, Archives & Information Science 9, no. 1 (2018): 99.

7. Kimmy Szeto, “The Mystery of the Schubert Song: The Linked Data Promise,” Music Library Association 74, no. 1 (2017): 16.

8. Ibid.

9. “XML RDF,” W3Schools, June 2018, https://www.w3schools.com/xml/xml_rdf.asp (accessed September 7, 2020).

10. Tim Berners-Lee, “The New Web”; and John P. McCrae, “The Linked Open Data Cloud,” The Linked Open Data Cloud, https://lod-cloud.net/ (accessed April 29, 2019).

11. “DBpedia,” DBpedia, https://wiki.dbpedia.org/ (accessed April 29, 2020).

12. Tim Berners-Lee, “The New Web.”

13. Ibid.

14. Ibid.

15. Ibid.

16. Oliver Pesch and Eric Miller, “Using BIBFRAME and Library Linked Data to Solve Real Problems: An Interview with Eric Miller of Zepheira,” Serials Librarian 71, no. 1 (2016): 6.

17. Ibid., 2.

18. “Bibliographic Framework Initiative,” Library of Congress, https://www.loc.gov/bibframe/ (accessed April 30, 2020).

19. Terry Reese and Eric Bergstrom, “The Now and Future of MarcEdit: A Day-Long Workshop,” Serials Librarian 76, no. 1-4 (2019): 7.

20. Goldberga et al., “Identification of Entities in the Linked Data Collection ‘Rainis and Aspazija’ (RunA),” 99.

21. Guerrini and Possemato, “Linked Data,” 87.

22. Brighid M. Gonzales, “Linking Libraries to the Web: Linked Data and the Future of the Bibliographic Record,” Information Technology & Libraries 33, no. 4 (2014): 10-22.

23. Goldberga et al., “Identification of Entities in the Linked Data Collection ‘Rainis and Aspazija’ (RunA),” 99; and Philip Schreur, “RDA, Linked Data, and The End of Average,” Italian Journal of Library, Archives & Information Science 9, no. 1 (2018): 122-4.

24. Guerrini and Possemato, “Linked Data,” 76.

25. Gonzales, “Linking Libraries to the Web,” 16.

26. Goldberga et al., “Identification of Entities,” 84.

27. DISC is part of SIL’s central services, meaning it handles work across all our branch libraries. We are not the only Division involved in linked data work. The Digital Programs and Initiatives Division (DPI), another SIL central service, also has a long history of linked data work.

28. Xiaoli Li to [email protected], April 2, 2019 “New policy regarding limited use of ISBD punctuation in bibliographic records,” https://listserv.loc.gov/cgi-bin/wa?A2=ind1904&L=PCCLIST&P=4574.

29. The March 29, 2019 document was received as an email attachment to Li. For information on the January 2020 implementation, see Jennifer W. Baxmeyer to [email protected], January 10, 2020, “PCC guidelines for minimally punctuated MARC bibliographic records,” https://listserv.loc.gov/cgi-bin/wa?A2=ind2001&L=PCCLIST&P=25099. See also PCC Standing Committee on Applications, PCC Guidelines for Minimally Punctuated MARC Bibliographic Records, September 2019, revised March 2020, https://www.loc.gov/aba/pcc/documents/PCC%20Guidelines%20for%20Minimally%20Punctuated%20MARC%20Data%20v.1.1.docx.

30. See John Riemer to [email protected], September 27, 2019, “URIs in MARC pilot: call for participation,” https://listserv.loc.gov/cgi-bin/wa?A2=ind1909&L=PCCLIST&P=56439.

31. See PCC Task Group on Linked Data Best Practices, Final Report, September 12, 2019, https://www.loc.gov/aba/pcc/taskgroup/linked-data-best-practices-final-report.pdf; PCC Task Group on URIs in MARC, Formulating and Obtaining URIs: A Guide to Commonly Used Vocabularies and Reference Sources, version date January 15, 2020, https://www.loc.gov/aba/pcc/bibframe/TaskGroups/formulate_obtain_URI_guide.pdf; and PCC Task Group on URIs in MARC, URI FAQs, September 26, 2018. https://www.loc.gov/aba/pcc/bibframe/TaskGroups/URI%20FAQs.pdf

32. See Michelle Durocher, “PCC URIs in MARC Pilot: Welcome!,” November 25, 2019. As of July 2020, the work of the pilot is ongoing, and virtual meetings have continued.

33. SIL is not unique in thinking about Wikidata this way. The LD4P group has been thinking about this since before April 2019 and the inception of the LD4 Wikidata Affinity Group. The Library of Congress has its own Wikibase instance to test out authority control, as do the Deutsche Nationalbibliothek and the Bibliothèque Nationale de France. There is also a concerted effort within the PCC libraries to experiment with Wikidata in general.

35. Wikidata assigns numbers to each item and property in its repository; items are given Q numbers, while properties are given P numbers.

36. For those who are curious, the list of existing properties can be searched at https://www.wikidata.org/wiki/Wikidata:List_of_properties. Information about creating a new property is available at https://www.wikidata.org/wiki/Wikidata:Property_proposal.