Technical service reports

Report of the CORE Faceted Subject Access Interest Group, ALA CORE Virtual Interest Group Week, March 2024

The CORE Faceted Subject Access Interest Group (FSAIG) organized three presentations during the ALA CORE Virtual Interest Group Week on March 5, 2024. These presentations delved into topics such as Library of Congress Demographic Group Terms (LCDGT), Library of Congress Genre/Form Terms (LCGFT), and a case study highlighting the conversion to faceted subjects for a digital collection during system migration. The meeting garnered positive feedback, drawing 365 attendees with active Q&A sessions.

Converting our digital collections’ legacy non-faceted subjects; or, how we learned to stop worrying and love the facets

Our first speaker, Rebecca Saunders, Cataloging and Metadata Librarian at Western Carolina University’s Hunter Library, introduced a recently launched digital platform, Southern Appalachian Digital Collections (SADC), which hosts collections from two institutions: Western Carolina University (WCU) and the University of North Carolina Asheville (UNCA). As reflected in the title, “Converting our digital collections’ legacy non-faceted subjects; or, how we learned to stop worrying and love the facets,” her presentation focused on subject clean-up, moving from pre-coordinated subject headings to faceted subjects. Saunders described the project’s mission to evolve into a regional digital collections hub.

To achieve this goal, the digital collections platform was conceived as a repository for numerous institutions within the region. Essential to this endeavor was a consistent, standardized set of vocabularies to describe the combined digital collections of participating institutions, both present and future. The platform was launched in April 2022, following the migration of legacy collections from CONTENTdm to Qi, a novel content management system developed by Keepthinking that represents a departure from traditional library systems. Collaboration among the digital scholarship librarian and the special and digital collections, cataloging, and metadata teams facilitated vocabulary remediation projects conducted in three phases: (1) cleaning up topical subjects to conform to Library of Congress Subject Headings (LCSH) formats; (2) cleaning up local controlled vocabularies and expanding them to include terms from the Art and Architecture Thesaurus (AAT) and the Thesaurus for Graphic Materials (TGM); and (3) cleaning up name headings according to the Library of Congress Name Authority File (LCNAF) format.

These phases reduced the legacy vocabularies from 12,876 terms to 7,534 terms. Both institutions have historically used pre-coordinated and subdivided subject vocabularies. As they began exploring the benefits of filters for faceted searching within the new system, they recognized that subdivisions could impede discovery, particularly in systems reliant on search filters. Consequently, their longstanding dedication to non-faceted subjects started to wane. In 2023, they embarked on a significant project to convert their extensive non-faceted vocabularies into faceted ones. One of their primary concerns during the initial phases of the transition was the potential loss of the information carried by the subdivisions of their non-faceted vocabularies. Designated as phase four, this project commenced in the late summer of 2023 with the conversion of subjects that have geographic and chronological subdivisions. The conversion of topical subdivisions has yet to be fully integrated, as these demand a more nuanced approach. The process includes detailed record-keeping for each term, which improves workflow efficiency. The Qi term reduction tool facilitates the merging of like terms, streamlining the conversion process. This approach improves discovery by aggregating related objects under unified faceted terms while maintaining access to specific information through filters.
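
As a rough illustration of the phase-four conversion, the sketch below splits a pre-coordinated heading into separate facets. It is not the SADC workflow or a Qi feature: the `--` delimiter, the `KNOWN_PLACES` set, the chronological heuristic, and the function name `split_heading` are all assumptions made for the example. Subdivisions that are neither geographic nor chronological are set aside for manual review, mirroring the point that topical subdivisions need a more nuanced approach.

```python
import re

# Hypothetical place list; a real project would consult an authority file.
KNOWN_PLACES = {"North Carolina", "Asheville (N.C.)"}

# Assumed heuristic: a year or the word "century" marks a chronological subdivision.
CHRONOLOGICAL = re.compile(r"\d{3,4}|\bcentury\b", re.IGNORECASE)


def split_heading(heading):
    """Split a '--' delimited heading into topic, place, period, and review facets."""
    facets = {"topic": [], "place": [], "period": [], "review": []}
    parts = [p.strip() for p in heading.split("--") if p.strip()]
    if not parts:
        return facets
    # The first element is treated as the main topical term.
    facets["topic"].append(parts[0])
    for sub in parts[1:]:
        if sub in KNOWN_PLACES:
            facets["place"].append(sub)
        elif CHRONOLOGICAL.search(sub):
            facets["period"].append(sub)
        else:
            # Topical and form subdivisions are left for a person to decide.
            facets["review"].append(sub)
    return facets


result = split_heading("Logging--North Carolina--History--20th century")
```

Keeping the unclassified subdivisions in a `review` list, rather than discarding them, is one way to address the concern about losing information carried by the original string.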

Feedback from special and digital collections colleagues has been positive, highlighting improved usability and discovery. Plans for a faceted search guide aim to further assist users in navigating the digital collections platform, ensuring that the conversion from a non-faceted to a faceted vocabulary retains the depth of information while enhancing user experience.

Conducting a pilot for Library of Congress Demographic Group Terms

The second speaker was Elizabeth Hobart, Interim Head of Cataloging and Metadata Services at Penn State University Libraries, who discussed a pilot study of Library of Congress Demographic Group Terms (LCDGT) at her library. The purpose of the pilot was to gauge how LCDGT could improve the discovery of resources created by historically marginalized groups.

She began with a brief history of LCDGT and how the vocabulary is being used in the MARC 386 field. Hobart stated that LCDGT could help highlight attributes of authors otherwise buried in subject headings such as “Poetry, American – Women authors.” She then talked about the pros and cons of bringing out attributes of authors through LCDGT. Benefits include easier discovery of materials by diverse creators (the focus of this study) and support for linked data because of LCDGT’s faceted nature. On the other hand, highlighting attributes such as sexual orientation and gender identity may have unintended consequences, including outing, othering, privacy violations, and misidentification due to the fluidity of identity.

Since LCDGT is relatively new (created in 2015) and use of the MARC 386 field is spotty, the limited application of LCDGT in bibliographic records may also create a false impression of a library’s holdings by certain demographic groups. After the background information, Hobart went into the details of the pilot study. She identified a sample set of 500 records from the Penn State Libraries catalog based on several criteria. The sample represented creators from a wide range of demographic groups, with 20% of the selected records being anthologies, and included materials from collections devoted to works by certain demographic groups as well as works by writers who self-identify as members of a particular group. Catalogers at Penn State then assigned LCDGT to this sample set of records using the MARC 386 field. In total, they assigned 1,615 terms from LCDGT, of which 264 were unique. Because Penn State’s catalog already held copy records with LCDGT assigned, Hobart expanded the scope to include 2,033 copy records, bringing the total sample size to 2,533. In this expanded sample set, she found that 6,942 terms were used, 630 of them unique. Eight terms were used more than 100 times, while 296 terms were used only once.

To assess how these terms can affect discovery, the technology team at Penn State built a “Contributor demographic” search box with a dropdown list of demographic group terms in the test instance of their catalog. The dropdown only included terms that were used at least five times to avoid overwhelming users with all 630 terms. With the search box added, users could search for titles written by a particular group of authors by selecting the desired term in the dropdown and optionally combining that with other search criteria like branch location. The search box also allowed users to select multiple terms from the dropdown and perform an “OR” Boolean search. Moreover, users could further filter the search results with facets available in Penn State’s catalog. For example, one could find science fiction written by either “Black people,” “Indigenous people,” or “Transgender people” by selecting the “Science fiction” option in the “Genre” facet after searching for the above three demographic group terms. Through this new search option, users could discover titles that were previously unknown to them. However, recall was not perfect due to inconsistent demographic group term data in this record set. Hobart found that extra trailing periods, inconsistent use of plurals and diacritics, and revisions of the authorized form of individual LCDGT terms had caused split files. Catalogers might also have misapplied terms due to confusion about the scope of similar terms, incorrect assumptions about an author’s gender, and uncertainty over the specificity of geographic terms used. That said, Hobart still considers the addition of LCDGT to have generally had a positive impact on discovery. She planned to begin a new phase of the study and might expand to other vocabularies like Homosaurus. Data soon to be collected would serve as a basis to determine if and how Penn State Libraries would make the assignment of demographic group terms a permanent practice.
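
The split files and the five-use dropdown threshold described above can be illustrated with a small sketch. This is not Penn State’s implementation; the function names `match_key` and `dropdown_terms` are hypothetical, and the normalization covers only trailing periods, diacritics, case, and spacing. Plural variants and revised authorized forms would still need a lookup against the current LCDGT file, and stripping trailing periods assumes no authorized term legitimately ends in one.

```python
import unicodedata
from collections import Counter, defaultdict


def match_key(term):
    """Build a comparison key that ignores trailing periods, diacritics, case, and spacing."""
    text = term.strip().rstrip(".").strip()
    # Decompose characters, then drop combining marks (the diacritics).
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return " ".join(text.casefold().split())


def dropdown_terms(assigned_terms, min_uses=5):
    """Group variant spellings and keep groups used at least min_uses times."""
    variants = defaultdict(Counter)
    for term in assigned_terms:
        variants[match_key(term)][term.strip()] += 1
    options = []
    for forms in variants.values():
        total = sum(forms.values())
        if total >= min_uses:
            # Display the most frequent spelling as the label (an assumption).
            label = forms.most_common(1)[0][0]
            options.append((label, total))
    return sorted(options)


sample = ["Women"] * 3 + ["Women."] * 2 + ["Poets"] * 4
options = dropdown_terms(sample)
```

In this sample, the two spellings of “Women” are counted together and reach the threshold, while “Poets” falls one use short and is left out of the dropdown.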

Looking for literature in the library

The last speaker was Kelley McGrath, Metadata Management Librarian at the University of Oregon. She talked about an ongoing project by the American Library Association Core Subject Access Committee Subcommittee on Faceted Vocabularies (SSFV) to retrospectively add LCGFT to literary titles.

Since LCGFT is a faceted vocabulary, McGrath briefly went over the benefits of facets for known-item searches as well as browsing and exploratory searches. To support a faceted searching experience for library users, she commented that new faceted vocabularies like LCDGT would need to go hand in hand with discovery interfaces that could use new and existing MARC fields designed to accommodate these vocabularies. Because faceted vocabularies were implemented relatively recently, their inconsistent application in existing records creates obstacles to recall. Libraries could tackle this by increasing the assignment of faceted vocabularies in current cataloging and by retrospectively adding these terms to legacy records. Due to the sheer quantity of legacy records, retrospective assignment must be as automated as possible. Per McGrath, there are three crucial elements for the success of this kind of work: 1) a well-defined scope of records that need enhancement, 2) a mapping of existing data points to target faceted terms, and 3) technological tools like macros and scripts that can automate the process. She then illustrated the process by describing an initiative in the music cataloging community to map LCSH to LCGFT and the Library of Congress Medium of Performance Thesaurus (LCMPT). Candidate records for enhancement were identified by the record type value encoded in MARC Leader position 06 (LDR/06) and a relevant LCSH heading. The assignment of LCGFT and LCMPT terms was based on a 52-page mapping document developed by the community. To carry out the assignment in an automated fashion, they used the Music Toolkit developed by Gary Strawn in collaboration with the Music Library Association. The tool analyzed the components of LCSH headings in the record and assigned new LCGFT/LCMPT terms based on the mapping.
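
The three elements McGrath listed (scope, mapping, and automation) can be sketched in a few lines. This is not the Music Toolkit’s logic. Records are represented as plain dictionaries rather than parsed MARC, the two mapping entries are invented for illustration and are not taken from the community mapping document, and `propose_terms` is a hypothetical name. The record type codes `c`, `d`, and `j` are the MARC 21 Leader/06 values for notated music, manuscript notated music, and musical sound recordings.

```python
# Scope: Leader/06 values for music materials.
MUSIC_RECORD_TYPES = {"c", "d", "j"}

# Mapping: illustrative entries only, not the community's actual mapping.
LCSH_TO_FACETS = {
    "Sonatas (Piano)": {"lcgft": ["Sonatas"], "lcmpt": ["piano"]},
    "Piano music": {"lcgft": [], "lcmpt": ["piano"]},
}


def propose_terms(record):
    """Return proposed LCGFT/LCMPT terms for an in-scope record, else None."""
    leader = record.get("leader", "")
    if len(leader) < 7 or leader[6] not in MUSIC_RECORD_TYPES:
        return None  # out of scope: not a music record
    proposal = {"lcgft": [], "lcmpt": []}
    for heading in record.get("subjects", []):
        mapped = LCSH_TO_FACETS.get(heading.rstrip("."))
        if not mapped:
            continue
        for vocabulary, terms in mapped.items():
            for term in terms:
                if term not in proposal[vocabulary]:
                    proposal[vocabulary].append(term)
    # A record with no mappable heading yields no proposal.
    return proposal if any(proposal.values()) else None


score = {"leader": "00000ccm a2200000 a 4500",
         "subjects": ["Sonatas (Piano).", "Piano music."]}
book = {"leader": "00000cam a2200000 a 4500", "subjects": ["Piano music."]}
score_terms = propose_terms(score)
book_terms = propose_terms(book)
```

Here the score receives a proposal, while the book is skipped because its Leader/06 value falls outside the defined scope even though it carries a mappable heading.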

In a similar vein, a task group of SSFV, of which McGrath is a member, is developing logic to support tools that will recommend or assign LCGFT and other faceted terms according to existing LCSH and other relevant data points in legacy records for literature. McGrath talked about her efforts to identify LCSH that could potentially be mapped to LCGFT. She first downloaded a copy of LCSH from LC’s Linked Data Service website (id.loc.gov). Then, she used a Python script and several libraries like BeautifulSoup and Pandas to parse, evaluate, and format the downloaded LCSH XML file into tabular format. She tried to identify possible LCSH literature headings by keywords as well as broader and narrower terms in LCSH’s syndetic structure. For example, a keyword search of “poetry” could bring up both “African poetry” and “Magic and poetry.” It is obvious that the former is relevant to this mapping project, while the latter is not. McGrath, therefore, employed the “inclusion and exclusion” technique to filter out false hits. On the other hand, “Poetry” has numerous narrower terms, and a lot of them do not have the word “poetry” in them. By tracing up and down the hierarchy of a relevant LCSH term, she could find other relevant terms that would otherwise be missed by a keyword search. Given that an LCSH term can belong to multiple tree structures, one cannot unquestioningly include all terms in the hierarchy as relevant. For example, “Ballads” and “Songs” are narrower terms of both “Poetry” and “Vocal music.” It would be inappropriate to accept any other narrower terms under “Vocal music” without scrutiny. Currently, the SSFV task group is working on disentangling various parts of LCSH terms identified in the above process. For example, “Literature, Ancient” has a chronological modifier “Ancient” applied to the base genre/form “Literature.” Demographic group, language, and place are the most common types of modifiers. Modifiers will need to be mapped separately from the genre/form part down the road. Besides identifying the genre/form part and its modifier, the task group needs to determine how these two parts are put together to allow the tools to parse the LCSH terms.
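
The techniques described above (keyword inclusion and exclusion, tracing narrower terms, and separating a modifier from its base genre/form) might look roughly like the following. This is a sketch under stated assumptions, not McGrath’s script: it uses a tiny hand-built excerpt of the hierarchy instead of the downloaded LCSH file, all function names are hypothetical, and `split_inverted` handles only the inverted “Base, Modifier” pattern.

```python
import re
from collections import deque

# Hypothetical excerpt: heading -> narrower terms.
NARROWER = {
    "Poetry": ["African poetry", "Ballads", "Songs", "Haiku"],
    "Vocal music": ["Ballads", "Songs", "Cantatas"],
}


def keyword_candidates(headings, include, exclude):
    """Keep headings matching an inclusion pattern and no exclusion pattern."""
    inc = [re.compile(p, re.IGNORECASE) for p in include]
    exc = [re.compile(p, re.IGNORECASE) for p in exclude]
    return [h for h in headings
            if any(p.search(h) for p in inc)
            and not any(p.search(h) for p in exc)]


def narrower_terms(root, narrower):
    """Collect every term below root, breadth first."""
    found, queue = [], deque([root])
    while queue:
        for child in narrower.get(queue.popleft(), []):
            if child not in found:
                found.append(child)
                queue.append(child)
    return found


def multi_parent_terms(narrower):
    """Terms with more than one broader term, which warrant manual scrutiny."""
    parents = {}
    for broader, children in narrower.items():
        for child in children:
            parents.setdefault(child, set()).add(broader)
    return {term for term, found in parents.items() if len(found) > 1}


def split_inverted(heading):
    """Split 'Base, Modifier' into its two parts; modifier is None if absent."""
    base, separator, modifier = heading.partition(", ")
    return (base, modifier if separator else None)


hits = keyword_candidates(["African poetry", "Magic and poetry", "Haiku"],
                          include=[r"\bpoetry\b"], exclude=[r"\band poetry\b"])
below_poetry = narrower_terms("Poetry", NARROWER)
flagged = multi_parent_terms(NARROWER)
```

The keyword filter alone misses “Haiku,” which the hierarchy walk recovers, while `multi_parent_terms` flags “Ballads” and “Songs” as belonging to more than one tree.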

Besides using LCSH as the source data for mapping, other data points like the “Literary form” fixed field, uniform title, subtitle, note, classification number, and form subdivisions in subject strings are potential candidates. However, due to changes in cataloging rules and practices over time and input errors by catalogers, mappings based on these data points tend to be error-prone. Instead of being limited by metadata with questionable quality, McGrath suggested that training artificial intelligence to identify literary forms and genres from full text might be a viable solution in the future.