
ABSTRACT

This report summarizes an all-day workshop on linked data, with emphasis on its application to the description and discovery of serials. The presenters reviewed the history, lexicon, and technological underpinnings of linked data. They discussed the basics of ontologies, including an exercise with Protégé ontology-building software. They introduced Resource Description Framework (RDF), the basic structure for linked data, and discussed various syntaxes used with it, with an emphasis on Turtle (Terse RDF Triple Language). They examined individual Resource Description and Access elements for serials expressed in the Bibliographic Framework Transition Initiative (BIBFRAME) 2.0 ontology, and gave a brief overview of BIBFRAME testing and experimentation projects by the Library of Congress, Cooperative Online Serials, and the Linked Data for Production project.

A few days before the conference began, presenters Amber Billey (a systems and metadata librarian) and Robert Rendall (a serials cataloger) e-mailed participants signed up for this preconference. They encouraged us to bring laptops to the workshop, since it would involve hands-on activities, and asked those bringing a laptop to install Protégé, an open source ontology editor, and a text editor such as Notepad++ (for Windows) or Atom (for Mac or Windows).Footnote1 Finally, they recommended reading three introductory texts on the foundational concepts of linked data.Footnote2

Linked data 101

Rendall opened the workshop with a general introduction to linked data and why it is of interest to libraries. Tim Berners-Lee, inventor of the World Wide Web and founder of the World Wide Web Consortium (W3C), the web standards organization, coined the term “linked data” in 2006. He described it as a way to structure data so that it can be linked in meaningful ways to create the Semantic Web. He set out four rules for linked data:

  1. Use Uniform Resource Identifiers (URIs) as names for things

  2. Use Hyper Text Transfer Protocol (HTTP) URIs so that people can look up those names

  3. When someone looks up a URI, provide useful information, using standards

  4. Include links to other URIs, so they can discover more things.Footnote3

To support the creation of linked open data, in 2010 Berners-Lee articulated a five-star standard for its design:

★ Available on the web (whatever format) but with an open licence, to be Open Data

★★ Available as machine-readable structured data (e.g., Excel instead of image scan of a table)

★★★ As (2) plus non-proprietary format (e.g., Comma Separated Values [CSV] instead of Excel)

★★★★ All of the above, plus use open standards from W3C … to identify things, so that people can point at your stuff

★★★★★ All of the above, plus link your data to other people’s data to provide contextFootnote4

Linked data are built on a group of technologies: the URI; HTTP; Resource Description Framework (RDF); RDF Schema (RDFS) and Web Ontology Language (OWL), both languages used to create ontologies; and SPARQL Protocol and RDF Query Language (SPARQL), a language used to retrieve data stored in RDF.

Rendall explained that libraries want to transition from a machine-readable cataloging (MARC) environment to linked data so that their data can interact with the rest of the information world. He cited several examples of entities that are using linked data, including DBpedia and Wikidata, the European Union, and the GeoNames geographical database.Footnote5 Rendall also referred to the Linked Open Data Cloud to show the growing number of entities using linked data.Footnote6 He remarked, “This is the world that libraries are trying to become part of.”

Ontology basics

Billey then explained why and how libraries are beginning to move away from the information silo of MARC to the open standards of the web. The year 2018 marks the 50th anniversary of Henriette Avram’s development of MARC, which became an International Organization for Standardization standard in 1974 and served as the format for encoding libraries’ bibliographic data for the next several decades. Over the last thirty years, however, the Internet was invented and a set of worldwide, open information standards began to be developed. Billey concluded, “If we want to publish our metadata as documents, so our users can find our resources once they have reached our website and found our catalog, we want to work with standards the web can do something with.”

To clarify differences between MARC and linked data further, Billey described the parts and functions of both systems. In the former, MARC is the data syntax, the integrated library system provides data storage and delivery, Z39.50 is the vehicle for query and transmission, and, in recent years, a discovery layer can provide additional data delivery. With a linked data system, RDF provides the data model, but there is not one prescribed syntax, so it is much more flexible in that regard. Languages commonly used include Extensible Markup Language (XML), JavaScript Object Notation for Linked Data (JSON-LD), Notation3 (N3), and Terse RDF Triple Language (Turtle), but others can be used. Triplestores provide data storage, the SPARQL language is used for querying, and the ubiquitous web is used for data delivery.

To begin moving the library world from a MARC-based data model to linked data, the Library of Congress began the Bibliographic Framework Transition Initiative (BIBFRAME) in 2012.Footnote7 Billey explained with admirable clarity what BIBFRAME is, and how it functions in the linked data world:

BIBFRAME is a linked data vocabulary. In the data world, the term “vocabulary” is used to mean any published set of classes and properties. In this context, classes are things that can be described and properties are relationships that can be described between classes. A vocabulary that is encoded for machine processing using a W3C standard markup metalanguage such as OWL or RDFS is an ontology. Encoding the vocabulary in a markup metalanguage tells the computer what to do with it. A vocabulary is an idea of how you want to structure data, and an ontology is a formal markup of the vocabulary that makes it machine processable. BIBFRAME draws heavily on MARC to prevent the loss of legacy data. It has been encoded in OWL as a formal ontology. BIBFRAME is intended to accommodate Functional Requirements for Bibliographic Records (FRBR) and Resource Description and Access (RDA), but can accommodate other structure and content standards. This will allow flexibility in a continuously evolving standards landscape.

To explain how all these working parts mesh, and to compare linked data with the historical context of library cataloging, Billey next described the functions of different information standards and gave examples of each. There are standards for how data are structured; examples include MARC, FRBR, BIBFRAME, and others. Other standards address content; examples include RDA; Anglo-American Cataloging Rules, Second Edition; Describing Archives: A Content Standard; and others. Still other standards address values; examples of these include the Library of Congress Subject Headings, RDA Vocabularies, the Library of Congress Name Authority File, and others. Finally, there are encoding standards; examples include the 3 × 5 catalog card, MARC, CSV, XML, RDF-XML, JSON, Turtle, and others. Billey noted that an important difference between MARC and BIBFRAME is that MARC is both a structure and an encoding standard; BIBFRAME is a structure standard only, allowing for the use of multiple encoding standards. This makes BIBFRAME much more flexible in a rapidly evolving information environment. In addition, Billey suggested that content standards will become increasingly important in this environment, and that information professionals will be able to focus less on managing syntax and more on creating rich descriptions.

A final advantage of linked data that Billey discussed is that it allows for customization in the use of vocabularies. One criticism of BIBFRAME, she noted, is that it does not reuse elements of existing vocabularies, because the Library of Congress project wanted to control its own vocabulary. While this is a valid approach, it does not meet the five-star standard. However, Billey continued, “You can make it into five star by creating relationships between elements of different vocabularies.” For example, it can be specified that “title” in BIBFRAME is the same as “title” in Dublin Core. In the Linked Data for Production (LD4P) project in which Billey and Rendall were both involved, they promoted the idea of reusing existing vocabularies to create new ways to describe objects.Footnote8 She predicted that use of multiple vocabularies will be common in whatever future bibliographic utilities evolve. She noted that similarly named elements do not always mean the same thing in different vocabularies, so using multiple vocabularies must be done with care. Billey closed out this segment of the workshop by showing the group the website Linked Open Vocabularies, which includes descriptions and links to more than 650 vocabularies.Footnote9 She described it as a good place to begin getting familiar with different vocabularies.
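Such cross-vocabulary relationships can themselves be expressed as triples. The following sketch (the alignment is illustrative, not an assertion published by either vocabulary’s maintainer) uses the OWL equivalence property to state that the BIBFRAME and Dublin Core “title” properties mean the same thing:

```turtle
@prefix bf:  <http://id.loc.gov/ontologies/bibframe/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# An illustrative alignment: declare the two "title" properties equivalent,
# linking the BIBFRAME vocabulary to Dublin Core.
bf:title owl:equivalentProperty dct:title .
```

As Billey cautioned, such assertions must be made with care, since similarly named elements do not always mean the same thing in different vocabularies.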

RDF: The structure standard for linked data

RDF is a W3C standard for data interchange on the web. It is a model for structuring data and relationships. While MARC and relational databases are structured around an element or key, and then a list of things associated with that key, RDF is organized in sets of data called triples. Triples consist of:

< Subject> <Predicate> <Object>

For example:

<Virginia Libraries> <Publisher> <Virginia Library Association>

The entities and relationships in triples can be represented by URIs. In this example:

<https://ejournals.lib.vt.edu/valib/index> <http://purl.org/dc/terms/publisher> <http://id.loc.gov/authorities/names/n80023576>

That is:

<URL of the resource> <URI from the Dublin Core Metadata Schema> <URI from the LC authority file>

A Uniform Resource Locator (URL) is a type of URI, although a URI actually represents an entity or relationship, while a URL merely specifies a file location. The example above fully encoded as an RDF triple in XML with Dublin Core looks like:

<?xml version="1.0"?>
<!DOCTYPE rdf:RDF PUBLIC "-//DUBLIN CORE//DCMES DTD 2002/07/31//EN"
    "http://dublincore.org/documents/2002/07/31/dcmes-xml/dcmes-xml-dtd.dtd">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="https://ejournals.lib.vt.edu/valib/index">
    <dc:title>Virginia Libraries</dc:title>
    <dc:publisher rdf:resource="http://id.loc.gov/authorities/names/n80023576"/>
  </rdf:Description>
</rdf:RDF>

Billey again emphasized that librarians are familiar with relational databases (such as library catalog databases) and hierarchical databases (such as thesauri), where some elements or nodes are more important than others. In contrast, databases created with RDF are called graph databases, and the elements contained in them do not have any intrinsic hierarchy imposed by the data structure. While it is possible, Billey noted, to impose a hierarchy in how relationships are described, “all objects have an equal position with regard to computer processing.” Using linked data, we can indicate relationships between entities and encode them in URIs. We can demonstrate what she called “intellectual linkages” between data beyond what is possible in the MARC environment.

Further, linked data can add contextual information about entities, such as the kind of information found in MARC authority records, and much more. Most of this information can be represented in the form of a stable URI and become machine-actionable. As an example of the benefit, Billey noted that when a MARC authority record is updated, the bibliographic file is not updated automatically; in a linked data environment, URIs refer to the updated authority information automatically.

Linked data lexicon

Billey next proceeded to define and explain basic terminology used with linked data. URIs are useful because they can be de-referenced (i.e., when you point to a URI, meaningful information can be looked up by a machine).

In an ontology, a class is a kind of thing. The subjects and objects of triples belong to classes. Classes are represented by capitalized terms; for example, the BIBFRAME class “Work” is represented as bf:Work. A sub-class is a sub-type of a thing. For example, the BIBFRAME class “Identifier” has a sub-class “ISSN,” represented as bf:Issn. A sub-class is also a class in its own right.

A property is a relationship between classes. Properties are the predicates in linked data triples, and are represented by terms beginning with lower-case letters; for example, the BIBFRAME property “title” is represented as bf:title. Billey added that some ontologies use verb forms for properties (e.g., “hasTitle”), but BIBFRAME does not. A sub-property is a sub-type of a property. For example, the BIBFRAME property bf:date has sub-properties bf:copyrightDate and bf:creationDate.

A domain specifies which classes a property may be used with. Domain is referred to in ontologies with the label “used with” and applies to the subject of a triple. Range refers to the values that a property can have and may be thought of as “expected value.” It applies to the objects of predicates. The type of property determines what the valid range of values might be. A datatype property relates only to data and literals. In this context, a “literal” is a text string, generally something that cannot be represented by a URI. Literals are enclosed in quotes. An example is a title, such as:

rdfs:label "The Joy of Cataloging"

An object type property relates only to objects (i.e., things that are represented by a URI). For example, the Library of Congress Authority File URI for Sanford Berman would be a valid value for the BIBFRAME property “contributor”:

<http://example.com/Work_1> bf:contributor

<http://id.loc.gov/authorities/names/n80107118>
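Domain and range are declared in the ontology itself, as triples about the property. A minimal sketch using RDFS (the declarations are simplified and may not match the published BIBFRAME ontology exactly):

```turtle
@prefix bf:   <http://id.loc.gov/ontologies/bibframe/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Range ("expected value"): the object of bf:title should be a bf:Title.
bf:title rdfs:range bf:Title .

# Domain ("used with"): bf:mainTitle is used with bf:Title resources,
# and its expected value is a literal rather than a URI.
bf:mainTitle rdfs:domain bf:Title ;
    rdfs:range rdfs:Literal .
```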

An individual is any kind of thing that can be represented by a URI or an Internationalized Resource Identifier (IRI). The URI above representing the author Sanford Berman is one example. A blank node, or bnode, is a way to represent a resource or information for which no URI or literal exists. Billey and Rendall both indicated that this concept is often difficult for people to grasp. A blank node can only be used as a subject or an object of an RDF triple. Rendall explained that blank nodes are used frequently in BIBFRAME to capture legacy information from MARC records that cannot be represented with URIs, such as notes like “Includes bibliographical references (p. 240–244).”

Billey moved on to define several relevant terms for theoretical concepts. An open world assumption, which is a term from logic, is “the assumption that the truth value of a statement may be true irrespective of whether or not it is known to be true. It is the opposite of the closed-world assumption, which holds that any statement that is true is also known to be true.”Footnote10 An assertion, which is a computer programming term, is “a statement that a predicate … is expected to always be true at that point in code execution,” or, as Billey interpreted, “Saying something about a thing and expecting it to be true.”Footnote11 Inference is “the process of deriving logical conclusions from a set of starting assumptions. Using Linked Data, existing relationships are modeled as a set of (named) relationships between resources.”Footnote12 One of the most exciting possibilities of linked data is the ability it gives to discern new relationships based on existing data, enabling machine learning. Finally, entailment is a concept from linguistics. It describes a relationship between sentences or phrases that is always true and can be represented as “if A, then B.”Footnote13
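Inference and entailment can be seen in miniature with the sub-property example given earlier. In the sketch below (the work URI is the same hypothetical one used above), a reasoner that knows the sub-property relationship can derive the final statement without it ever being written:

```turtle
@prefix bf:   <http://id.loc.gov/ontologies/bibframe/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Stated in the ontology:
bf:copyrightDate rdfs:subPropertyOf bf:date .

# Stated in the data:
<http://example.com/Work_1> bf:copyrightDate "2010" .

# Entailed ("if A, then B"), though never asserted directly:
# <http://example.com/Work_1> bf:date "2010" .
```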

Ontology-building exercise

Following a break, Billey and Rendall had participants divide up into small groups of three or four to create ontologies based on families (parents, children, mothers, fathers, sisters, brothers, etc.). They asked participants to consider what kinds of classes they would create, how they would describe relationships, and how they would express those in the form of triples.

Some groups created flatter ontologies with fewer classes and many properties; others created more hierarchical ontologies with more classes and numerous sub-classes. Billey emphasized that both approaches are valid, and just represent different ways of modeling. In fact, she noted that the RDA ontology has fewer classes with many properties, while the BIBFRAME ontology has more classes and sub-classes, with relatively fewer properties. Guided by Billey’s questions, the groups spent some time discussing the classes and properties they chose for their ontologies, what they included, how they organized them, and how they described the relationships between classes. Billey commented that this was the kind of process they engaged in at Columbia University while developing an ontology for art objects for the LD4P project.
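A fragment of such a family ontology might be sketched in Turtle as follows (all class and property names here are invented for the exercise):

```turtle
@prefix ex:   <http://example.com/family#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# A hierarchical approach: sub-classes under a general Person class.
ex:Person a owl:Class .
ex:Parent rdfs:subClassOf ex:Person .
ex:Mother rdfs:subClassOf ex:Parent .

# A flatter approach relies instead on properties to express relationships.
ex:hasChild a owl:ObjectProperty ;
    rdfs:domain ex:Parent ;
    rdfs:range  ex:Person .
```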

The ontology-building exercise wrapped up with a brief introduction to Protégé, a software package for ontology modeling and development. The group first used Protégé to enter their family ontologies. Afterward, Billey demonstrated how the software can be used to examine existing ontologies and led the group in exploring the BIBFRAME ontology.

RDF syntax

Rendall and Billey then showed how elements of ontologies can be used in RDF to describe information resources. RDF can be expressed in different syntaxes, which increases its usefulness. They showed examples of bibliographic information about the journal Virginia Libraries using RDF expressed in RDF/XML, JSON-LD, N-Triples, N3, and Turtle.Footnote14 The following is the example in Turtle:

@prefix dc11: <http://purl.org/dc/elements/1.1/> .

<https://ejournals.lib.vt.edu/valib/index>
    dc11:title "Virginia Libraries";
    dc11:publisher <http://id.loc.gov/authorities/names/n80023576>;
    dc11:identifier "ISSN 1086-9751";
    dc11:subject <http://id.loc.gov/authorities/subjects/sh85076502>,
        <http://id.loc.gov/authorities/names/n79022909>;
    dc11:format <http://id.loc.gov/vocabulary/marcgt/per>;
    dc11:type <http://purl.org/dc/dcmitype/Text> .

Turtle tutorial

Billey next began a basic tutorial in Turtle. She noted that most linked data from the Library of Congress will be expressed in XML or Turtle. A simple triple is expressed in the form of <subject><predicate><object>, or alternately, <class><property><class>. For example, The Joy of Cataloging by Sanford Berman has a subject, “cataloging.” This can be expressed in a Turtle triple as:

<http://www.worldcat.org/oclc/6626957>
    <http://id.loc.gov/ontologies/bibframe/subject>
        <http://id.loc.gov/authorities/subjects/sh85020816> .

The first URI links to the WorldCat record for The Joy of Cataloging. The second links to the BIBFRAME ontology property “subject.” The final URI links to the Library of Congress subject authority file class “Cataloging.”

With Turtle, a single subject can have lists of predicate–object statements if there is more than one relationship to express. This avoids the inefficiency of having to repeat the subject for each predicate. When there is more than one predicate for a subject in a Turtle triple, the predicates are separated by semicolons. The statement ends with a period.

<http://www.worldcat.org/oclc/6626957>
    <http://id.loc.gov/ontologies/bibframe/subject>
        <http://id.loc.gov/authorities/subjects/sh85020816>;
    <http://id.loc.gov/ontologies/bibframe/subject>
        <http://id.loc.gov/authorities/subjects/sh85129425>;
    <http://id.loc.gov/ontologies/bibframe/classification>
        <http://id.loc.gov/authorities/classification/Z693.A3-Z693.Z> .

This statement says that The Joy of Cataloging has subject “Cataloging,” has subject “Subject cataloging,” and has classification Z693.A3-Z693.Z.

It is also possible to use multiple objects with the same predicate, in which case the objects are separated by commas: <subject> <predicate> <object>, <object>.
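The Virginia Libraries example above already uses this pattern; isolated, the comma-separated objects look like this:

```turtle
@prefix dc11: <http://purl.org/dc/elements/1.1/> .

# One subject and one predicate with two objects, separated by a comma.
<https://ejournals.lib.vt.edu/valib/index>
    dc11:subject <http://id.loc.gov/authorities/subjects/sh85076502>,
        <http://id.loc.gov/authorities/names/n79022909> .
```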

URIs in Turtle are enclosed in angle brackets. They can be absolute (i.e., including the entire file location). They can also be relative (i.e., resolved against a specified base URI). In the latter case, the base URI is declared in a statement beginning with “@base” or “BASE,” and any URI under that base can be written in shorthand relative to it. For example, if:

@base <http://library.edu/> .

then:

<book_0912700513> is equivalent to <http://library.edu/book_0912700513>

Prefixes are a way to express ontology namespaces in shorthand form within a document. They are placed at the head of the document. For example:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

This means that within the document, the prefix “rdf:” points to the RDF ontology. The following example of The Joy of Cataloging title and subtitle expressed in Turtle illustrates the use of prefixes:

@base <http://library.edu/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix bf: <http://id.loc.gov/ontologies/bibframe/> .

<Berman_work1>
    a bf:Work;
    bf:hasInstance <Berman_inst1>;
    bf:adminMetadata <Berman_admin1>;
    bf:title [
        a bf:Title, bf:WorkTitle;
        bf:mainTitle "Joy of cataloging";
        bf:subtitle "essays, letters, and other explosions";
    ];

In the example, “bf:Work” is a shortcut for expressing the URI of the BIBFRAME ontology class “Work,” http://id.loc.gov/ontologies/bibframe/Work. The notation “a” preceding “bf:Work” signals that it is an RDF type. Values that consist of a specific sequence of characters, such as a title, are literals. Literals are enclosed in quotes. A datatype for a literal may be specified, appended after the “^^” notation, for example:

bf:date "2007/2010"^^<http://id.loc.gov/datatypes/edtf>;

Square brackets are used to nest blank nodes together. Billey explained that BIBFRAME uses this device to provide legacy information that is in the form of literals and needs to be grouped together. It is commonly used with the property provisionActivity, which includes publication information.

Rendall and Billey wrapped up the Turtle tutorial with a few pointers on grammar. White space is ignored, except within a literal. When a literal contains quotes, the literal itself can be set off with three single quotes at the beginning and end. Finally, comments can be added to code by preceding them with a hash sign (#).
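These points can be illustrated in a few lines (the note text here is invented):

```turtle
@prefix bf:   <http://id.loc.gov/ontologies/bibframe/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# A comment: everything after the hash sign is ignored.
<http://example.com/Work_1>
    bf:note [
        a bf:Note ;
        # Triple quotes let the literal itself contain quotation marks.
        rdfs:label '''Cover title: "The joy of cataloging."'''
    ] .
```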

The entire example of The Joy of Cataloging encoded in RDF with Turtle can be viewed at Billey’s GitHub site: https://github.com/amberbilley/PCC_BIBFRAME/blob/master/BF_Berman_Example.ttl.

Linked data tools

Billey introduced participants to a number of software tools that facilitate working with linked data. Text editors are important to use rather than word processing programs because they do not insert extraneous hidden codes. In addition to Notepad++ and Atom, both of which are free and open source, Billey mentioned a text editor called Sublime Text, which can be downloaded free of charge but requires a paid license for continued use.Footnote15 These text editors include specialized features that can assist in coding.

Data validators and converters are other important tools. Validators are used to check coding for errors, and converters can change code from one language to another. Billey demonstrated four: the W3C RDF Validation Service, EasyRDF Converter, JSON-LD Playground, and IDLab Turtle Validator.Footnote16 However, she said that more are being developed every day.

In addition, a number of library linked data initiatives have developed custom editors for their projects: the Library of Congress (LC)’s BIBFRAME Editor; the Linked Data for Libraries Labs’ VitroLib Editor; Stanford University’s Center for Expanded Data Annotation and Retrieval (CEDAR) Editor and BioPortal; and the University of California Davis’s BIBFLOW Editor. Some library vendors are also beginning to create linked data editors to accommodate BIBFRAME, including Casalini, Ex Libris, Innovative Interfaces, OCLC Online Computer Library Center, Inc. (OCLC) Worldshare Management, and FOLIO, EBSCO’s open source project.

Billey wrapped up her tour of linked data tools with a brief look at ontology creation software, including Protégé and Karma, and OpenRefine, a data manipulation tool.Footnote17

Serials in BIBFRAME

Beginning the afternoon session, Rendall led participants through a detailed examination of individual RDA elements for serials expressed in the BIBFRAME 2.0 ontology. Selected examples will be illustrated below. He explained that we would look at the mapping used by the Library of Congress to convert their MARC data to BIBFRAME, supplemented by the Cooperative Online Serials (CONSER) mapping, which had not yet been published at the time of the preconference.Footnote18 Because CONSER’s mapping is based on RDA, while LC’s is based on MARC, Rendall noted that there are areas where the two do not mesh perfectly. In addition, he cautioned, LC’s practice in converting existing MARC data may not exactly reflect future practice.

Rendall reminded the group that an ontology includes rules on how classes and properties can be used, and that classes and properties are used to create triples in the form of:

<subject (Class)> <predicate (property)> <object (Class or Literal)>

For classes, the ontology specifies what they can be the object of, that is, “used with,” and what subclasses they have. For properties, it tells us what classes they can have as a subject, and what sub-properties they have.

Rendall drew many of his examples from The International Journal of Korean Art and Archaeology, WorldCat record #15297773, viewed in LC’s “BIBFRAME Comparison Tool.”Footnote19 It is first identified as a BIBFRAME Instance. The URI was created in LC’s MARC to BIBFRAME conversion, and is not an active URI. Rendall emphasized that, in a live environment, this would be a real URI.

<http://bibframe.example.org/15297773#Instance>

a bf:Instance

The URI represents the Instance derived from MARC record #15297773. The token “a” is shorthand for rdf:type, and is used to state that the resource is an instance of a class.

Because there is not an authority file to provide URIs for titles, those are expressed with blank nodes using literals. For example, the title for display purposes is encoded as:

<http://bibframe.example.org/15297773#Instance>
    bf:title
        [a bf:Title;
        rdfs:label "The international journal of Korean art and archaeology."]

This says that the instance has a title, the human readable version (rdfs:label) of which is the text string enclosed in quotes (literal). Predicates giving more title information, such as sorting title (property bflc:titleSortKey) and title proper (property bf:mainTitle) can be added under the same subject, separated by semicolons:

<http://bibframe.example.org/15297773#Instance>
    bf:title
        [a bf:Title;
        rdfs:label "The international journal of Korean art and archaeology.";
        bflc:titleSortKey "international journal of Korean art and archaeology.";
        bf:mainTitle "The international journal of Korean art and archaeology"]

In cases where titles proper include part name and part number, those can similarly be expressed using the properties bf:partName and bf:partNumber with accompanying literals. While subtitles are usually not transcribed in current serials cataloging, they may be expressed with the property bf:subtitle and accompanying literal if needed. Variant titles, including minor title changes, use the class bf:VariantTitle, which is a subclass of bf:Title:

<http://bibframe.example.org/15297773#Instance>
    bf:title
        [a bf:Title, bf:VariantTitle;
        rdfs:label "Korean art and archaeology";
        bflc:titleSortKey "Korean art and archaeology";
        bf:mainTitle "Korean art and archaeology"]

When more than one object is used (bf:Title and bf:VariantTitle), they are separated by a comma. Rendall said that variant titles can have all the same properties as titles. Minor title changes can be qualified with the property bf:date.

Serial numbering has been handled in the MARC environment in more than one way. Records with formatted numbering (MARC tag 362, first indicator 0) are encoded with properties bf:firstIssue and bf:lastIssue:

<http://bibframe.example.org/11315714#Instance>
    bf:firstIssue "Vol. 187 (1896)";
    bf:lastIssue "v. 233"

While enumeration and chronology are combined into one literal, Rendall noted that CONSER is interested in finding a “more granular way of handling this information.” He also pointed out that RDA calls for separate recording of enumeration and chronology for first and last issues. Current CONSER practice prescribes using notes (MARC tag 362, first indicator 1) to record numbering. That practice can be encoded using the properties bf:note and bf:noteType:

<http://bibframe.example.org/15297773#Instance>
    bf:note
        [a bf:Note;
        rdfs:label "Began with v. 01 (2007); ceased with v. 04 (2010).";
        bf:noteType "Numbering"]

Publication information may be handled in various ways. At the minimal level, the publication statement can be transcribed as a single string:

<http://bibframe.example.org/15297773#Instance>
    bf:provisionActivityStatement "Seoul: National Museum of Korea, 2007-©2010."

More usefully, the different elements may be separated and identified (bf:agent, bf:date, and bf:place):

<http://bibframe.example.org/15297773#Instance>
    bf:provisionActivity
        [a bf:ProvisionActivity, bf:Publication;
        bf:agent [a bf:Agent;
            rdfs:label "National Museum of Korea"];
        bf:date "2007-©2010";
        bf:place [a bf:Place;
            rdfs:label "Seoul"]]

However, some of these data could be conveyed with URIs. The following is converted from MARC 008 date and country fields, and illustrates how some publication data can be expressed in a machine-actionable way:

<http://bibframe.example.org/15297773#Instance>
    bf:provisionActivity
        [a bf:ProvisionActivity, bf:Publication;
        bf:date "2007/2010"^^<http://id.loc.gov/datatypes/edtf>;
        bf:place <http://id.loc.gov/vocabulary/countries/ko>]

In this example, the date type URI links to the “Extended Date/Time Format Datatype Scheme” at LC’s Linked Data Service. The place URI links to “Korea (South)” in the MARC List of Countries. With all of the publication information examples, Rendall noted that the property bf:provisionActivity can be used for publication, production, and distribution information.

All serials will have mode of issuance indicated, with the URI linking to the “MARC Issuance List” term at LC’s Linked Data Service:

<http://bibframe.example.org/15297773#Instance>

bf:issuance <http://id.loc.gov/vocabulary/issuance/serl>

Rendall indicated that serial type (MARC 008 fixed field element) can be indicated with property bf:genreForm, with URI linking to LC’s MARC Genre Terms List.
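Following the pattern of the other converted elements, such a statement might be sketched as follows (the triple is illustrative, not actual conversion output; the genre URI is the “periodical” term used earlier in the Virginia Libraries example):

```turtle
@prefix bf: <http://id.loc.gov/ontologies/bibframe/> .

# Serial type expressed as a genre/form term from LC's MARC Genre Terms List.
<http://bibframe.example.org/15297773#Work>
    bf:genreForm <http://id.loc.gov/vocabulary/marcgt/per> .
```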

Frequency may be expressed as a blank node, or with an URI, with the URI linking to LC’s list of publication frequencies:

<http://bibframe.example.org/15297773#Instance>
    bf:frequency
        [a bf:Frequency;
        rdfs:label "Annual"],
        <http://id.loc.gov/vocabulary/frequencies/ann>

Rendall added that the MARC 008 regularity value has been converted into a frequency label (e.g., rdfs:label “regular”).

ISSN is expressed with bf:Issn, a subclass of bf:Identifier. Rendall noted that only the subclass is included in LC’s conversion. Currently, it is coded as a blank node with an accompanying value, although in the future URIs may be available for ISSN from the ISSN International Centre.

<http://bibframe.example.org/15297773#Instance>
    bf:identifiedBy
        [a bf:Identifier, bf:Issn;
        rdf:value "2005-1115"] .
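If ISSN URIs do become available, the blank node could give way to a direct link. The following sketch assumes the ISSN Portal’s URI pattern; it is illustrative, not current LC conversion practice:

```turtle
<http://bibframe.example.org/15297773#Instance>
    # hypothetical URI-based ISSN, assuming the ISSN Portal pattern
    bf:identifiedBy <https://portal.issn.org/resource/ISSN/2005-1115> .
```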

Relationships may be expressed from instances to other instances, or to the related work, or to a related item:

<http://bibframe.example.org/15297773#Instance>
    bf:otherPhysicalFormat <http://bibframe.example.org/15297773#Instance776-41> .

<http://bibframe.example.org/15297773#Instance>
    bf:instanceOf <http://bibframe.example.org/15297773#Work> .

<http://bibframe.example.org/15297773#Instance>
    bf:hasItem <http://bibframe.example.org/15297773#Item050-13> .

Some elements of description refer to the work level, rather than the instance (i.e., manifestation). Records are not created for works in the MARC environment, but elements relating to the work can be included in linked data. The following example shows a work title:

<http://bibframe.example.org/15297773#Work>
    bf:title
        [a bf:Title;
        rdfs:label "The international journal of Korean art and archaeology.";
        bflc:titleSortKey "international journal of Korean art and archaeology.";
        bf:mainTitle "The international journal of Korean art and archaeology"] .

It is structured the same as an instance title, except that subtitles are not used with works in RDA. If the authorized access point for the work includes a qualifier, the qualifier is included in the literal.
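For example, a work whose authorized access point carries a qualifier might be titled as follows; the title and qualifier here are invented for illustration:

```turtle
<http://bibframe.example.org/15297773#Work>
    bf:title
        # hypothetical qualified title; the qualifier stays inside the literal
        [a bf:Title;
        rdfs:label "Bulletin (Example Society)";
        bf:mainTitle "Bulletin (Example Society)"] .
```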

Content type is an element that refers to the work level, while media type and carrier type are used with the instance. The URI in this example links to “Content Types” in LC’s Linked Data Service:

<http://bibframe.example.org/15297773#Work>
    bf:content <http://id.loc.gov/vocabulary/contentTypes/txt> .

Subjects also apply to the work level. Linked data have the potential to use URIs to link out to subject schemas published online. While the Library of Congress Subject Headings have been published as linked data, records converted from MARC include pre-coordinated strings of subject terms for which URIs cannot be created. The following example illustrates how a triple linking with URIs to subjects for a work would be structured. The objects are separated by commas:

<http://bibframe.example.org/15297773#Work>
    bf:subject <http://bibframe.example.org/15297773#Topic650-28>,
        <http://bibframe.example.org/15297773#Topic650-30>,
        <http://bibframe.example.org/15297773#Topic650-31> .

Testing and experimentation

After examining the RDA elements illustrated above and numerous others, Rendall turned to a discussion of various projects that have been carried out to test linked data and related tools.

LC conducted the initial BIBFRAME Pilot Phase One from August 2015 to March 2016. In this project, forty LC catalogers led by instructors from the Cooperative and Instructional Programs Division used the BIBFRAME Editor to create bibliographic descriptions for multiple formats in multiple languages. BIBFRAME Pilot Phase Two began in June 2017 and is ongoing. Twenty-three catalogers were added, and further testing involved non-Roman scripts and work with authority descriptions for agents. Phase Two catalogers are working in a live database.Footnote20

CONSER formed a BIBFRAME Task Group in December 2015, which subsequently became a subgroup of the Program for Cooperative Cataloging (PCC) BIBFRAME Task Group in August 2016. Products of the Task Group include a CONSER Standard Record to BIBFRAME mapping, and a final report with a set of recommendations, both issued in July 2017.Footnote21 The final report found that BIBFRAME has the functionality to describe serials adequately; moreover, it has greater potential than the MARC environment for showing relationships among serials. It did find, however, that there were still issues to address.

BIBFRAME is not good at handling changes in the description required by changes in the serial. Examples include changes in frequency, the availability of new information about the title, and errors that need to be corrected. For the first, it would be helpful to include starting and ending dates of frequencies. New information and correction of errors would require deletion or deprecation of existing triples and creation of new ones. Rendall noted that it is not yet clear how this would be done.

Another issue is with literal versus machine-actionable data. Current modeling uses many literals in order to convert legacy data from MARC records fully. In addition, cataloging rules still require exact transcription in many cases. However, the same information captured in URIs is much more useful. The question remains of how long both approaches should be used.

There are also questions about how enumeration and chronology of first and last issues are expressed. Current practice uses literals with both, but there is the potential to express the numeric/alphabetic designation separately from the chronological designation of an issue, as called for in RDA. There is even the possibility of having URIs for first and last issues, “description based on” and “latest issue consulted” notes, and other issue-specific information.

Perhaps more importantly, there are significant differences among the major bibliographic conceptual models and the relationships they define. FRBR has four levels (work, expression, manifestation, item), while BIBFRAME has three (work, instance, item). The Library Reference Model recently produced by the International Federation of Library Associations has only two for serials (work/expression/manifestation and item). Moreover, Rendall noted, there is a lingering perception among serials catalogers that all of these models were developed based on monographs, and serials are an imperfect fit. Forcing serials to fit the models does not always seem to bring significant benefits.

The Task Group report discussed a number of other specific issues, including how to handle administrative-type metadata, such as “description based on” and “latest issue consulted” notes, and whether these types of information would even be meaningful in a linked data environment; use of specific RDA vocabularies for values; treatment of series and their numbering; and others.

The report also included specific recommendations, eight to CONSER/PCC, and four aimed at BIBFRAME development. Recommendations to CONSER/PCC included: (1) explore ways to accommodate the need to make changes in serial descriptive data; (2) present dates as “typed literals” that are machine-actionable; (3) provide machine-actionable data in addition to RDA-mandated transcriptions whenever possible; (4) CONSER and BIBFRAME collaborate to develop a common structure of enumeration and chronology data that can be used in a variety of contexts within serial descriptions; (5) explore PRESSoo and other linked data vocabularies that may provide better methods to model changes in bibliographic information and enumeration and chronology; (6) continuously monitor and analyze the serials landscape, and create a task group to carry this out; (7) identify necessary administrative and provenance metadata and develop methods and best practices to record it at the assertion level; and (8) use value vocabularies from the RDA registry for the frequency, notes, and content, media, and carrier types.Footnote22

Recommendations for BIBFRAME development included: (1) explicitly model start and end dates for descriptive elements to work toward accommodating the need to reflect changing descriptive information; (2) CONSER and BIBFRAME collaborate to develop a common structure for representing enumeration and chronology data that can be used in a variety of contexts (same as #4 above); (3) identify necessary administrative and provenance metadata and develop methods and best practices to record it at the assertion level (same as #7 above); and (4) define properties to express the relationships “augmented by (work)” and “complemented by (work).”Footnote23

Rendall wrapped up this session with a brief introduction to the LD4P, a cooperative venture conducted by Columbia, Cornell, Harvard, Princeton, and Stanford universities, along with LC.Footnote24 This project, which ran from 2016 to 2018, sought to find ways to bring linked data into practical library use. They worked to develop standards, guidelines, and infrastructure, as well as practical workflows for use in a technical services production environment. In addition, they worked to extend the BIBFRAME ontology to nontextual materials such as art objects, cartographic and geospatial resources, moving images, music, and rare materials. Finally, they sought greater engagement in linked data development in the wider library community.

In the final session of the day, Billey and Rendall led participants through hands-on exercises in which we examined BIBFRAME tools and created RDF statements for serials. First, Rendall led the group in looking at LC’s BIBFRAME Comparison Tool. As previously noted, most of the examples of serial elements in RDF in the previous session were derived using this tool. It allows lookup of any LC MARC record, which can then be viewed in RDF serialized in either Turtle or RDF/XML. Rendall used this as a starting point to ask participants to think about the ways that future catalogers will work. While working interfaces may be developed, both Rendall and Billey expressed the belief that the future linked data environment may well require catalogers to interact with data at the code level more than the current environment does.

Next, Billey led an exercise in which participants created BIBFRAME metadata for a serial title. The title The Journal of Problem Solving was selected from a list of Open Access titles, and Billey provided a CONSER BIBFRAME template from her GitHub site. When the exercise was complete, Billey used Turtle Validator to test the code and make corrections.

The day wrapped up with a discussion about BIBFRAME editors and future directions for linked data in libraries. Rendall asked what features would be desirable in editors. He noted that there are about sixteen elements that are always used, but vastly more that could be added in a given situation. With an editor that has a limited number of preset text boxes for entering data, how can extra data elements be added? One participant, who works with data in digital repositories, said that the editors she has used in those environments seem to be much easier to work with than current BIBFRAME editors. Billey agreed that she had encountered more powerful and flexible editors as well, including CEDAR, which was developed at Stanford for the biomedical community.

Finally, Rendall asked what participants thought of it all, after seeing some background, and hearing about some library-led projects. What, he asked, needs to be done to move forward, “from a practical point of view, a cataloging point of view, a serials point of view?” One participant responded that because so much of the work of catalogers is shaped by the tools we use—OCLC, library systems, and so on—it is difficult to see how linked data will work on a practical level without seeing an interface view. Looking only at the code, it is hard to imagine how production will work for the ordinary cataloger. Rendall responded that there are finally some first-generation sandboxes with which to begin to work. Billey strongly agreed that linked data will not take off for libraries until there are tools with which we can work. More importantly, she emphasized, the library community should look for opportunities to use linked data to do more than we can do in the MARC environment. Rendall added that MARC cannot go on forever, so something will have to be implemented to replace it. He granted that the current period has been difficult, characterized by starts and stops, with high expectations but slow progress. He counseled patience, and getting involved with linked data projects to help move the library community forward. Billey urged participants to go forward with the concept that “if you own a car, it’s important to know how to change your own oil.”

The workshop was dense with technical information—from generic programming concepts to ontology construction to analyzing specific assertions about a serial title in RDF/Turtle. It was clarifying, taking a topic that had previously been completely abstract for many participants and making it much more concrete. In addition, the presenters gave participants a good environmental scan of the state of linked data in the library world.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Notes on contributors

Amber Billey

Amber Billey is Systems and Metadata Librarian, Bard College, Annandale-on-Hudson, New York.

Robert Rendall

Robert Rendall is Principal Serials Cataloger, Columbia University, New York, New York.

Kathryn Wesley

Kathryn Wesley is Continuing Resources and Government Documents Librarian, Clemson University, Clemson, South Carolina.

Notes

1. Stanford Center for Biomedical Informatics Research, Protégé (Stanford University, 2016), http://protege.stanford.edu (accessed July 17, 2018); Don Ho, Notepad++, https://notepad-plus-plus.org/ (accessed March 8, 2019); Atom, https://atom.io/ (accessed July 17, 2018).

2. Natalya F. Noy and Deborah L. McGuinness, Ontology Development 101: A Guide to Creating Your First Ontology (Stanford, CA: Stanford University, 2001), http://protege.stanford.edu/publications/ontology_development/ontology101.pdf (accessed July 17, 2018); Guus Schreiber and Yves Raimond, eds., RDF 1.1 Primer (W3C, 2014), https://www.w3.org/TR/2014/NOTE-rdf11-primer-20140624/ (accessed July 17, 2018); David Beckett et al., RDF 1.1 Turtle: Terse RDF Triple Language (W3C, 2014), https://www.w3.org/TR/turtle/ (accessed July 17, 2018).

3. Tim Berners-Lee, “Linked Data,” (2006), last modified 2009, https://www.w3.org/DesignIssues/LinkedData.html (accessed July 17, 2018).

4. Ibid.

5. DBpedia, https://wiki.dbpedia.org/ (accessed March 8, 2019); Wikidata, https://www.wikidata.org/ (accessed July 17, 2018); GeoNames, http://www.geonames.org/ (accessed July 17, 2018).

6. Insight Centre for Data Analytics, “The Linked Open Data Cloud,” https://lod-cloud.net/ (accessed July 17, 2018).

7. Library of Congress, “BIBFRAME,” https://www.loc.gov/bibframe/ (accessed July 28, 2018).

8. Linked Data for Production (LD4P), last modified October 21, 2018, https://wiki.duraspace.org/pages/viewpage.action?pageId=74515029 (accessed July 28, 2018).

9. Linked Open Vocabularies (LOV), https://lov.linkeddata.es/dataset/lov (accessed July 28, 2018).

10. Wikipedia, “Open-World Assumption,” last modified March 5, 2019, https://en.wikipedia.org/wiki/Open-world_assumption (accessed July 28, 2018).

11. Wikipedia, “Assertion (Software Development),” last modified February 11, 2019, https://en.wikipedia.org/wiki/Assertion_(software_development) (accessed July 28, 2018).

12. Bernadette Hyland et al., eds., “Linked Data Glossary” (W3C, 2013), https://www.w3.org/TR/ld-glossary/ (accessed July 28, 2018).

13. Wikipedia, “Entailment (Linguistics),” last modified January 10, 2019, https://en.wikipedia.org/wiki/Entailment_(linguistics) (accessed July 28, 2018).

14. Wikipedia, “RDF/XML,” last modified February 13, 2019, https://en.wikipedia.org/wiki/RDF/XML (accessed August 5, 2018); JSON-LD, “JSON for Linking Data,” https://json-ld.org/ (accessed August 5, 2018); Wikipedia, “N-Triples,” last modified October 27, 2018, https://en.wikipedia.org/wiki/N-Triples (accessed August 5, 2018); Tim Berners-Lee and Dan Connolly, Notation3 (N3): a Readable RDF Syntax (W3C, 2011), https://www.w3.org/TeamSubmission/n3/ (accessed August 5, 2018); David Beckett et al., RDF 1.1 Turtle.

15. Sublime Text, https://www.sublimetext.com/ (accessed August 15, 2018).

16. Eric Prud’hommeaux, “W3C RDF Validation Service,” W3C, last modified February 28, 2006, https://www.w3.org/RDF/Validator/ (accessed August 15, 2018); Nicholas Humfrey, “EasyRDF Converter,” http://www.easyrdf.org/converter (accessed August 16, 2018); JSON-LD, “JSON-LD Playground,” https://json-ld.org/playground/ (accessed August 15, 2018); IDLab Turtle Validator, http://ttl.summerofcode.be/ (accessed August 15, 2018).

17. University of Southern California, Information Sciences Institute, Center on Knowledge Graphs, “Karma: A Data Integration Tool,” (2016), http://usc-isi-i2.github.io/karma/ (accessed August 15, 2018); OpenRefine, http://openrefine.org/ (accessed August 15, 2018).

18. Library of Congress, “MARC 21 to BIBFRAME 2.0 Conversion Specifications,” https://www.loc.gov/bibframe/mtbf/ (accessed January 27, 2018).

19. Library of Congress, “BIBFRAME Comparison Tool,” http://id.loc.gov/tools/bibframe/compare-id/full-ttl?find=15297773 (accessed January 27, 2018).

20. Library of Congress, “BIBFRAME Training at the Library of Congress,” https://www.loc.gov/catworkshop/bibframe/ (accessed January 27, 2019).

21. Program for Cooperative Cataloging, CSR to BIBFRAME Mapping (Library of Congress, 2017), http://www.loc.gov/aba/pcc/bibframe/TaskGroups/CSR-PDF/CSRtoBIBFRAMEMapping.pdf (accessed February 10, 2019); Program for Cooperative Cataloging, Report to the PCC BIBFRAME Task Group (Library of Congress, 2017), http://www.loc.gov/aba/pcc/bibframe/TaskGroups/CSR-PDF/FinalReportCONSERToPCCBIBFRAMETaskGroup.pdf (accessed February 10, 2019).

22. Ibid.

23. Ibid.

24. “Linked Data for Production (LD4P).”