Category Archives: Standards Subcommittee

TARO community feedback about EAD Best Practice Guidelines

Hello TARO Community,

Earlier this year the TARO Standards Subcommittee shared with you that we have been tasked with assessing the level of adoption of the TARO Best Practice Guidelines, among other things. You can check out the Standards wiki page to learn about what our group has been doing this past year.

This message is to ask you to share your thoughts with us. Let us know what you think of the TARO Best Practice Guidelines, both what you like and what you don’t like. We want to know if you are using the best practice guidelines document, and if you are not, to learn why.

You can provide your feedback by completing this brief survey.

You don’t have to be your repository’s designated TARO representative. If you create finding aids that are added to TARO, we want to hear from you!

Please complete the survey by Friday, November 3.

TARO Index Terms survey results

Dear TARO community,

Thank you to all who participated in the Index Terms survey we sent out in May. We are writing to share the results of the survey and to let you know that the TARO Steering Committee will further investigate/pursue the Metadata Hopper software option based on feedback received from the survey. See below for survey results.

Question 1: Is your repository able to check all Index Terms (Names, Subjects, Document Types, and Titles) in the <controlaccess> section in your EAD files before uploading to TARO to verify that the terms match the authorized version of the term and that the encodinganalog and source attribute values are assigned correctly?

The majority of responders indicated that they do check all their terms or at least try to verify them. Some are not able to check them and others are not sure whether this is done before upload to TARO.

 

Question 2: Do you believe your repository has the necessary resources (staff time and expertise) to retrospectively review your EAD files’ <controlaccess> terms and edit them as needed to use the authorized vocabulary terms if TARO provided a report of what terms need to be updated?

Several people indicated that they believe their institution has the resources, or can plan for the next fiscal year, to embark on a data clean-up project. The comments indicate that people could do a clean-up project (after TARO provided a report) as long as there was ample time (no tight deadlines); responders asked for flexibility.

 

Question 3: In addition to controlled lists of local terms would it cause practical/logistical problems for your institution if TARO decided to require that EAD files use specific controlled vocabularies, such as Library of Congress Name Authority File and Subject Headings, and Getty Art & Architecture Thesaurus, for <controlaccess> terms going forward?

Most people replied that they do not foresee any problems if TARO required a controlled vocabulary but asked for the flexibility of still being able to use some local terms that follow controlled vocabulary conventions (e.g., following LCNAF conventions to create an entry). Again, people asked for flexibility and/or training (specifically regarding AAT).

 

Question 4:  Would your repository be willing to have its TARO finding aids sorted into broad TARO subject categories to enhance user experience in browsing? For an example, see the Chicago Collections site (http://explore.chicagocollections.org/)

Good news! Nearly everyone who responded said their institution would be willing to have broad subject terms applied to their finding aids. There was one blank response, and one person said that this would be okay as long as it created no additional work for their staff.

 

Question 5: Do you have additional comments or questions?

Most people had no comments. Five comments were submitted (mainly from Steering Committee members).

 

Best,

TARO Steering Committee

Authority Control at TARO: Common Encoding Issues

Last week I posted the first (summary) section of the report I wrote about the use of EAD <controlaccess> index terms by TARO’s forty-plus contributing repositories. The second section of the report, below, outlines some of the more frequent encoding inconsistencies and problems, issues that make difficult the automated aggregation of terms necessary for faceted browsing/navigation. —Tim Kindseth


 

No values

In over 400 instances, a <controlaccess> element was used with null values. In other cases, the value is populated with placeholder text resembling encoder comments, which is likely residue from an EAD template.

  • <persname></persname>
  • <persname>NAME (SPECIFY SOURCE, ADD MORE AS NEEDED)</persname>
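A check for these two cases can be sketched in a few lines of Python. This is only an illustration, not the method used for the report: the function name and the placeholder pattern (capitalized text containing "SPECIFY", as in the example above) are assumptions, and the files are assumed to be un-namespaced EAD 2002.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: template residue contains the word "SPECIFY", as in the
# example quoted above. Adjust the pattern to match local templates.
PLACEHOLDER = re.compile(r"\bSPECIFY\b")


def find_empty_or_placeholder(xml_text):
    """Return (tag, text) pairs for index terms that are empty or placeholders."""
    root = ET.fromstring(xml_text)
    problems = []
    # iter() visits <controlaccess> elements at every nesting depth.
    for controlaccess in root.iter("controlaccess"):
        for term in controlaccess:
            # Headings and nested wrappers are not index terms themselves.
            if term.tag in ("head", "controlaccess"):
                continue
            text = "".join(term.itertext()).strip()
            if not text or PLACEHOLDER.search(text):
                problems.append((term.tag, text))
    return problems
```

Flagged terms would still need a person to decide whether to delete the element or supply a real value.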

 

Syntax (of attributes)

EAD does not require either the @encodinganalog or @source attribute to appear before or after the other. Inconsistent syntax, though, makes it extremely difficult to extract data for analysis and normalization.

  • <persname encodinganalog="600" source="lcnaf">Ferguson, Miriam Amanda, 1875-1961.</persname>
  • <persname source="lcnaf" encodinganalog="600">Ferguson, Miriam Amanda, 1875-1961.</persname>
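Attribute order only matters when the raw text is matched as a string. A minimal sketch, assuming the terms are read with an XML parser instead (here Python's standard library, which is not what the report used), shows both forms reducing to the same record. The `term_record` helper is hypothetical.

```python
import xml.etree.ElementTree as ET


def term_record(xml_fragment):
    """Read one index term into a dict, regardless of attribute order."""
    element = ET.fromstring(xml_fragment)
    return {
        "element": element.tag,
        "encodinganalog": element.get("encodinganalog"),
        "source": element.get("source"),
        "value": (element.text or "").strip(),
    }


a = term_record(
    '<persname encodinganalog="600" source="lcnaf">'
    "Ferguson, Miriam Amanda, 1875-1961.</persname>"
)
b = term_record(
    '<persname source="lcnaf" encodinganalog="600">'
    "Ferguson, Miriam Amanda, 1875-1961.</persname>"
)
print(a == b)
```

Consistent attribute order is still worth having for anyone who searches or edits the files as plain text.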

 

Periods

LCSH and LCNAF values, when properly written, end in a period. Whether or not TARO wishes to retain this convention, terms should be constructed either with or without an ending period, not both ways.

  • <persname encodinganalog="600" source="lcnaf">Ferguson, Miriam Amanda, 1875-1961.</persname>
  • <persname encodinganalog="600" source="lcnaf">Ferguson, Miriam Amanda, 1875-1961</persname>
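One way to find such pairs, sketched here under the assumption that the extracted values are available as a plain list of strings, is to group values by a comparison key that ignores a single trailing period. The key is for grouping only. It should not be written back as the corrected value, because some headings legitimately end in a period (an initial, for example). Both function names are hypothetical.

```python
from collections import defaultdict


def comparison_key(value):
    """Strip surrounding whitespace and one trailing period, for grouping only."""
    value = value.strip()
    return value[:-1].rstrip() if value.endswith(".") else value


def group_period_variants(values):
    """Return keys whose values appear both with and without a final period."""
    groups = defaultdict(set)
    for value in values:
        groups[comparison_key(value)].add(value.strip())
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}
```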

 

Dashes & spaces

Value subdivisions are sometimes separated by two dashes with no spaces between the dashes and values, or two dashes with a space between the dashes and values; at other times the subdivisions are delineated by an em dash with (or without) spaces between the dash and values.

  • <subject>Mexican Americans––Civil rights––Texas.</subject>
  • <subject>Mexican Americans –– Civil rights –– Texas.</subject>
  • <subject>Mexican Americans—Civil rights—Texas.</subject>
  • <subject>Mexican Americans — Civil rights — Texas.</subject>
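The four variants above can be collapsed to one form with a single regular expression. The sketch below is an assumption about how that might be done, not a description of the report's method. It treats two or more hyphens, two or more en dashes, or any run of em dashes as a subdivision separator, and rewrites each as two hyphens with no surrounding spaces. The choice of target form is arbitrary. Single hyphens and single en dashes are left alone so that date ranges survive.

```python
import re

# Two or more hyphens, two or more en dashes (U+2013), or one or more
# em dashes (U+2014), with any surrounding whitespace.
SUBDIVISION_SEPARATOR = re.compile(r"\s*(?:-{2,}|\u2013{2,}|\u2014+)\s*")


def normalize_subdivisions(value, separator="--"):
    """Rewrite every subdivision separator in a heading to one chosen form."""
    return SUBDIVISION_SEPARATOR.sub(separator, value.strip())
```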

 

Element confusion

 With place names in particular, Library of Congress subject headings are often encoded incorrectly as <geogname> control access terms. Many authorized Library of Congress subject headings are built by appending a time period or subject to a city or country name, which may explain why what is technically a subject (Dallas (Tex.)––History.) so often ends up being encoded as a geographic name. EAD3 (discussed later) allows for the parsing of encoded values and may help eliminate this confusion.

  • <geogname>Houston (Tex.)––History.</geogname>
  • SHOULD BE <subject>Houston (Tex.)––History.</subject>
  • OR <geogname>Houston (Tex.)</geogname>
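Candidates for this kind of confusion can be listed automatically, though the decision to re-tag them belongs to a person. A minimal sketch, assuming un-namespaced EAD 2002 and assuming that any <geogname> containing a subdivision separator deserves a second look (the function name and the separator pattern are illustrative):

```python
import re
import xml.etree.ElementTree as ET

# Two or more hyphens, two or more en dashes, or an em dash.
SEPARATOR = re.compile(r"-{2,}|\u2013{2,}|\u2014")


def geognames_to_review(xml_text):
    """List <geogname> values that contain a subdivision separator."""
    root = ET.fromstring(xml_text)
    flagged = []
    for geogname in root.iter("geogname"):
        text = "".join(geogname.itertext()).strip()
        if SEPARATOR.search(text):
            flagged.append(text)
    return flagged
```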

 

Contradictory/dissimilar values

A set of birth and death years might appear within one <persname> element while a different set (or none at all) appears in another, even though both occurrences refer to the same individual. This happens both across and within repositories.

  • <persname>Moore, Charles Willard, 1925-1993</persname>
  • <persname>Moore, Charles Willard, 1925-1992</persname>
  • <persname>Lipscomb, Mance</persname>
  • <persname>Lipscomb, Mance, 1895-1976</persname>
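Conflicts like these can be surfaced by separating each name from its trailing life dates and grouping on the name alone. The sketch below is illustrative only. It assumes dates appear at the end of the value as "YYYY-YYYY" or "YYYY-", and the helper names are hypothetical. Two different people who share a name would also be grouped together, so the output is a review list, not a list of errors.

```python
import re
from collections import defaultdict

# A trailing ", YYYY-YYYY" or ", YYYY-" with an optional final period.
DATES = re.compile(r",\s*(\d{4}-(?:\d{4})?)\.?\s*$")


def split_name_and_dates(value):
    """Return (name, dates) where dates is None if no trailing dates are found."""
    value = value.strip()
    match = DATES.search(value)
    if match:
        return value[: match.start()].strip().rstrip("."), match.group(1)
    return value.rstrip("."), None


def conflicting_dates(values):
    """Map each name to its set of date forms when more than one form exists."""
    forms = defaultdict(set)
    for value in values:
        name, dates = split_name_and_dates(value)
        forms[name].add(dates)
    return {name: dates for name, dates in forms.items() if len(dates) > 1}
```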

 

Encoding levels

EAD2002 allows <controlaccess> terms to be nested within a main <controlaccess> heading. Repositories sometimes include <controlaccess> elements within this top level, sometimes one level down, and sometimes at both levels. When extracting TARO’s 153,000 index terms, BaseX queries thus had to be performed at two levels. This could cause unnecessary problems for a script that attempts to cull all <controlaccess> instances for display during search and retrieval.

  • <controlaccess><head>Index Terms</head><corpname>Daughters of the Republic of Texas.</corpname></controlaccess>
  • <controlaccess><head>Index Terms</head><controlaccess><head>Organizations:</head><corpname>Daughters of the Republic of Texas.</corpname></controlaccess></controlaccess>
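The extraction for the report was done with BaseX queries. As a stand-in illustration in Python (not the queries actually used), the sketch below collects terms from <controlaccess> elements at any depth in one pass, so both encodings above yield the same result. It assumes un-namespaced EAD 2002, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_terms(xml_text):
    """Return (tag, value) pairs for index terms at any <controlaccess> depth."""
    root = ET.fromstring(xml_text)
    terms = []
    # iter() walks the whole tree, so nested wrappers are visited too.
    for controlaccess in root.iter("controlaccess"):
        for child in controlaccess:
            # Skip headings and nested wrappers; each nested wrapper is
            # visited in its own right by the outer loop.
            if child.tag in ("head", "controlaccess"):
                continue
            terms.append((child.tag, "".join(child.itertext()).strip()))
    return terms
```

Because each term is read only from its immediate parent, a term is never counted twice, however deep the nesting goes.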

Authority Control at TARO

Yesterday, TARO Steering Committee Co-chair Amy Bowman e-mailed members of the consortium a link to all of the EAD <controlaccess> datasets, broken down by repository, that we extracted and wrangled this spring. I spoke briefly about our work during last month’s TARO brown bag presentation at the Society of Southwest Archivists’ annual meeting in Oklahoma City. Both Amy and I also thought it would be a good idea to publish here on TARO Today the more relevant sections of the report on consortial authority control that I wrote and submitted to the TARO standards committee as part of my final master’s degree Capstone project at UT-Austin’s School of Information. Each (work) day for the next week or so I’ll be posting, sequentially, another section of the document, beginning with the overview below. For a copy of the full report, please get in touch with Amy or e-mail me at tim [dot] kindseth [at] utexas [dot] edu. —Tim Kindseth


 

OVERVIEW

Control access, or index, terms are a well-established bibliographic convention. Within archival practice, however, the selection, use, search, and browsing of such terms is not so straightforward. Whereas books and other published items typically have well-defined scopes (and thereby topics), making the choice of control access terms rather intuitive or self-evident, it is much more difficult to choose just a handful of subjects or other authorities (persons, corporate bodies, genres, geographic place names) for, say, a collection of twenty-five boxes of unpublished manuscript material generated over four decades in the course of entirely unrelated activities and life events. Yet since the adoption of Encoded Archival Description in the late 1990s, archivists across the United States, Texas included, have been trying to do just that: select three to five (occasionally ten or more) representative index terms that will somehow do justice, will encompass, the startling breadth and depth of topics that a single archival collection can cover.

The hope is that these representative control access terms might function as arterials into archival finding aids, a genre that is still the source of much researcher confusion. Before EAD, the reference archivist was, for most researchers, among the main sources of information about any particular repository’s collections. Online EAD finding aids, one could argue, have come to play a similar role, transmitting to researchers, many of whom cannot easily travel to this or that collecting institution, not just information about individual collections but, in the case of an EAD consortium like the Online Archive of California (OAC) or Texas Archival Resources Online (TARO), information about how those collections relate to one another as well.

Relational collection mapping in theory makes material easier to find, more accessible and retrievable, and is the basis and goal of larger movements within information science like Linked Open Data and the Semantic Web. To get collections to talk to collections, though, is no easy task. Metadata from one finding aid must be able to converse with that of another, which requires an unforgiving level of shared data structure. For index terms to link up and self-aggregate across the repositories that comprise any consortium, control access terms must be crafted in exactly the same way across potentially dozens of institutions with varied familiarity with EAD and generally differing levels of archival expertise. Enter controlled vocabularies and best practices guidelines, both gentle nudges toward synchronicity in the ways in which archivists, many with dissimilar levels of experience or institutional support, encode their repository’s finding aids.

Rules are one thing; following them, however, is another. Katherine M. Wisser and Jackie Dean’s analysis of EAD tag usage across 1,136 finding aids from 108 anonymized repositories, published in The American Archivist in 2013, found that “little uniformity exists in encoding practices.” They concluded, “Variability in implementation of encoding standards has the potential to diminish the ability to aggregate records and effectively leverage structures for management and retrievability.” In 2014, Dr. Ciaran Trace and three others at UT-Austin looked at a set of 8,729 TARO finding aids and reached conclusions similar to Wisser and Dean’s about EAD data quality. “With humans in the mix,” they realized, “issues with the quality of the encoding can be expected.” This human hurdle must first be recognized before the issue of inconsistency can be surmounted. “Finding and documenting such problems with EAD encoding,” they argued, “is a key first step in instituting more rigorous control over descriptive and encoding practices that facilitate the aggregation, visualization and analysis of archival data.” Such aggregations and visualizations, which make possible the subject browsing and searching (faceted or otherwise) features that TARO is considering during its redesign, require clean data, and in order to clean it, you first have to locate the mess.

From January through May 2016, for my master’s Capstone project at UT-Austin’s School of Information, that was precisely my task: find where and in which ways TARO <controlaccess> values were dirty and, moreover, come up with ways to clean, or normalize, that data so that index terms, not currently searchable through TARO’s online interface, might in the future, with a revamping of that interface, be harnessed to provide subject searching and/or browsing, thereby increasing discoverability of the archival material described by TARO’s online finding aids. Amy Bowman of the Briscoe Center for American History, who supervised the project, and I performed BaseX queries on the more than 14,000 EAD documents from 46 repositories currently stored on TARO’s server. Over 153,000 <controlaccess> terms were extracted, converted into spreadsheets (grouped both by institution and by EAD element), and analyzed for common encoding errors or inconsistencies using OpenRefine’s clustering algorithms. All the while, a literature review on authority control and subject searching in archival settings was conducted. Several underlying, interrelated, unresolved sets of questions emerged during the project:

  • If the 153,000-plus <controlaccess> terms encoded in TARO finding aids are to be normalized, against which controlled vocabularies should they be reconciled, and should the reconciliation occur federally (by TARO) or individually by each contributing repository?
  • What are the online information-seeking behaviors of archives researchers? In the age of Google and keyword searching, is topic/name browsing a thing of the past? If so, is consortial authority control a hobgoblin, an unnecessary expenditure of time and other resources? Have subject browsing features been effective for the consortiums, like Archives West, that have implemented them?
  • How will eventual implementation of EAD3, which was released last year, change the way contributing institutions must encode <controlaccess> terms, and what will be the benefits for search and discovery? To avoid repeating the same (rather complicated and onerous) process twice, should TARO wait until consortial adoption of EAD3 to normalize those terms in accordance with new encoding requirements?
  • How can the future selection and encoding of index terms (whether per EAD2002 or EAD3) be standardized (and remain so) across 46 contributing repositories? What best practices should be in place, and how strictly should they be enforced?

That final set of questions is perhaps the most crucial. My own personal belief is that for authority control to work, control must be part of the equation. Even if TARO is able to normalize all of its current <controlaccess> terms, without consortial enforcement of some kind there will be no guarantee, given the heterogeneous ways that institutions encode finding aids (manually keying the EAD in a text editor vs. generating it automatically with archival management software tools like ArchivesSpace), that future <controlaccess> metadata will be crafted uniformly across all repositories. To date, as our extraction and analysis of TARO’s 153,000 index terms has revealed, there has been very little consistency in the encoding of such terms. Tables breaking down the extracted data in various broad categories, by element, by controlled vocabulary, and by individual repository, can be found near the end of this document. What follows in the next section details some of the more frequent encoding errors and inconsistencies both across and within TARO’s contributing members. It is not at all unusual, for instance, for a subject, person, corporation, place, or other <controlaccess> element to be encoded in divergent ways by the same repository.

The section following that is more speculative, outlining general issues to bear in mind as TARO redesigns its interface. How well that interface functions hinges on the quality of the metadata beneath it, which the title of a 2009 OCLC report written by Jennifer Schaffner makes clear: “The Metadata is the Interface.” Schaffner emphasizes what’s at stake in any effort (like TARO’s) to improve the quality of descriptive metadata: “It would be heartbreaking,” she writes, “if special collections and archives remained invisible because they might not have the kinds of metadata that can easily be discovered by users on the open Web.”

1st draft available for review: TARO schema-compliant encoding guidelines

On behalf of Rebecca Romanchuk and Carla Alvarez, TARO Standards Committee co-chairs, please read the following message asking for your feedback on the new schema-compliant encoding guidelines, which will be used by all TARO repositories after each repository is converted to schema compliance later this year.

Please know that during your conversion, you will have one-on-one contact with a TARO volunteer to help you get started submitting finding aids in schema format using these guidelines, but we welcome your feedback on the guidelines now.

___________________________________________________________________________

The TARO Standards subcommittee is pleased to announce that we have completed our first draft of the EAD 2002 Schema Best Practice Guidelines for TARO!

Texas Archival Resources Online (TARO), Texas’ EAD finding aid consortial site (https://www.lib.utexas.edu/taro/), is in the midst of an NEH planning grant to develop improved systems and updated standards for TARO as it achieves sustainability to serve the archival research community into the future. Part of this work is to create new encoding guidelines for TARO repositories that conform to the EAD 2002 Schema encoding standard, to which TARO will complete conversion in 2016. These best practice guidelines (BPG) are available as a PDF at http://bit.ly/1Wk6p6W. The BPG appendices are a TARO-friendly sample Schema-compliant template for EAD encoding for your use, and an EAD finding aid example. These appendices are also available at the same link as XML files.

We welcome feedback addressing every aspect of our BPG.

Go to http://goo.gl/forms/gaJXiCVtp4 to complete a brief survey to give us your ideas for how the BPG can better address your needs for EAD encoding. The survey is configured to adapt its questions depending on whether your repository is a TARO member, or if you are in Texas and have not yet joined TARO, or if you are outside of Texas and want to give us your general feedback.

Please complete the survey by Friday, June 3, 2016.

If you encode for TARO, we need to hear from you. The BPG, which will be a key tool for TARO participants, offers detailed guidance on creating EAD XML files. Even participants who export XML from software such as ArchivesSpace (and don’t see the raw XML) will need to follow TARO protocols as described in the BPG, such as formatting the <eadid>. You will need to follow the BPG in order to submit your Schema-compliant files to TARO, which each repository will be required to do by the end of 2016.

The co-chairs of the TARO Standards subcommittee extend sincere thanks to its members for their superb contributions to the BPG. Invaluable support has been provided during our drafting process by TARO Steering Committee co-chairs Amanda Focke and Amy Bowman, UT Libraries TARO technical support staff Minnie Rangel, and our NEH planning grant project manager Leigh Grinstead and grant consultant Jodi Allison-Bunnell. We are also grateful to the EAD consortial community at large for the encoding documentation they make available online, in particular Online Archive of California and Archives West, which are models that have guided us.

Cordially,

Carla Alvarez, MA, CA (co-chair – TARO Standards subcommittee)
Rare Books and Manuscripts
Nettie Lee Benson Latin American Collection
University of Texas at Austin

Rebecca Romanchuk, MLIS, CA (co-chair – TARO Standards subcommittee)
Team Lead, Archives / Archivist II
Archives and Information Services
Texas State Library and Archives Commission

TARO Standards subcommittee members:  
Maristella Feustle (UNT-Music Library),
Cynthia Franco (SMU-DeGolyer Library),
Molly Hults (Austin Public Library-Austin History Center),
Benna Vaughan (Baylor University-Texas Collection),
Jeffrey Warner (Rice University-Woodson Research Center).

Overview of Encoding Survey

Last month, I solicited EAD templates and documentation from partner institutions to get a clearer picture of TARO’s EAD landscape. Thank you to the 24 institutions that answered the questionnaire and provided documentation. The responses and accompanying documentation illuminate some of the shared (or similar) encoding practices across the TARO partners, as well as areas of encoding diversity. This knowledge will help me and the Steering Committee make useful recommendations for incorporating a schema-compliant workflow into existing practices. The goal is to find that sweet spot between breadth and specificity, so that participation in TARO is both convenient and beneficial.

Overall, there is plenty of common ground among the respondents with regard to encoding workflows and processes. The following is a very general overview of the survey responses:

24 total responses

17 of the 24 institutions that responded to the survey described a process of encoding by hand using previous finding aids and/or templates as guides. MS Word and Excel are common tools used for creating collection inventories that are then copied and pasted into an XML editor.

13 use Oxygen XML editor  

Finding aid creation is a multi-step, multi-tool process for everyone, and common ground bodes well as TARO moves toward greater standardization. Common tools, such as MS Excel and Oxygen XML editor, can be incorporated and leveraged in best practices guidelines.

As of right now, fewer organizations use archival management systems (AMS), while a handful of respondents expressed plans to adopt an AMS in the near future.

7 use AMS

3 ArchivesSpace
2 Archivists’ Toolkit
1 Archon
1 CuadraStar

As you may be aware, ArchivesSpace generates schema-compliant EAD. In fact, the ArchivesSpace output is sometimes stricter than the EAD 2002 schema. Currently, the institutions that use these archival management systems must reverse edit their EAD back to DTD to make it TARO compliant. With more organizations adopting (or at least considering) management systems, TARO must plan to accommodate current and future developments in technology. Updating the XML in TARO will not only improve the front-end user experience, but will also broaden potential participation.

The greatest variation across the respondents appears (quite obviously) in the documentation, instructions, and templates of each contributing institution. A large consideration going forward is finding the optimal level of standardization that benefits all contributing institutions. Participation in TARO should be easy, perhaps effortless. With this goal in mind, the question we need to ask is:

How can we reduce redundancies between unique institutional workflows and contributing to TARO?

Feel free to continue this conversation, especially if you feel that the overview above does not represent how your institution creates EAD.