Library Journal "Digital Libraries" Columns 1997-2007, Roy Tennant
Please note: these columns are an archive of my Library Journal column from 1997-2007. They have not been altered in content, so please keep in mind some of this content will be out of date.
Cataloging has remained basically unchanged for decades. Despite the development of Machine-Readable Cataloging (MARC) and the Anglo-American Cataloging Rules, 2d Edition (AACR2), what is recorded about a library item is the same as it was when we used handwritten catalog cards. Today, many library catalogs simply duplicate the catalog card on a computer screen.

Now, the game has changed. In the digital library, we no longer deal only with the typical printed book or serial. We may need to describe a collection of digitized photographs, or a series of pages that must somehow be navigable as a logical whole, like the printed book from which they are derived. And we must keep track of such things as how the digital representation was captured and manipulated. Librarians have historically called this "cataloging." We in digital library work call it "metadata."

Metadata, simply put, is structured information about information. The key is "structured." In metadata as in cataloging, a free-text description usually won't suffice. Rather, in order to limit a search to a particular field, the information must be structured, often highly so. That is why MARC has tag and subfield markers, which allow software to understand exactly how to treat each descriptive element.

Does MARC translate?

So why don't digital library projects use MARC? Some do, such as when records for digital objects are merged with a library catalog and loaded into it just like the record for a print book. The State Library of Victoria Multimedia Catalogue has over 120,000 records for digital objects. However, for many purposes, MARC is a poor fit. In some cases it is too complex, requiring highly trained staff and specialized input systems; in others, it is too focused on print material and can't be extended for digital collections.

Digital librarians have identified three categories of metadata about digital resources: descriptive (also called intellectual), structural, and administrative.
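
As a rough illustration, the three categories might sit side by side in a single record for a scanned book. The sketch below is in Python, and every class name, field name, and file name in it is hypothetical, chosen only for illustration and not drawn from MARC or any other standard. The descriptive part holds what a searcher would look for, the structural part is an ordered list of page images that lets software step through the "book," and the administrative part notes how the files were produced and who owns them.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DescriptiveMetadata:
    """Elements a user would search on to locate the item."""
    creator: str
    title: str
    subjects: List[str] = field(default_factory=list)


@dataclass
class StructuralMetadata:
    """Ordered page-image files that together form one logical book."""
    page_files: List[str] = field(default_factory=list)

    def next_page(self, current_file: str) -> Optional[str]:
        """Return the file that follows current_file, or None at the last page."""
        index = self.page_files.index(current_file)
        if index + 1 < len(self.page_files):
            return self.page_files[index + 1]
        return None


@dataclass
class AdministrativeMetadata:
    """How the digital files were produced and who owns them."""
    capture_method: str
    owner: str


@dataclass
class DigitalObjectRecord:
    """One record that groups the three categories (hypothetical layout)."""
    descriptive: DescriptiveMetadata
    structural: StructuralMetadata
    administrative: AdministrativeMetadata


# Made-up sample values, for illustration only.
record = DigitalObjectRecord(
    descriptive=DescriptiveMetadata(
        creator="Example, Author",
        title="An Example Scanned Book",
        subjects=["Digital libraries", "Metadata"],
    ),
    structural=StructuralMetadata(
        page_files=["page_001.tif", "page_002.tif", "page_003.tif"],
    ),
    administrative=AdministrativeMetadata(
        capture_method="flatbed scan",
        owner="Example Library",
    ),
)

# "Turning the page" only needs the structural part of the record.
print(record.structural.next_page("page_001.tif"))
```

The point of keeping the three parts separate is that each serves a different job: a search system reads only the descriptive part, a page-turning viewer reads only the structural part, and staff managing the files read the administrative part.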
Of these categories, MARC really deals well only with descriptive metadata. Descriptive metadata includes the creator of the resource, its title, and appropriate subject headings: basically, the kinds of elements that will be used to search for and locate the item. Structural metadata describes how the item is put together. In a printed book, one page simply follows another. But in a digital object, if each page is scanned as an image, metadata must "bind" hundreds of separate computer files together into a logical whole and provide ways to navigate the digital "book." Administrative metadata may include such things as how the digital file was produced and who owns it.

All of this potential metadata needs containers. However, most of the metadata described above has no standard container waiting to receive it, the way MARC receives the information specified by AACR2. There are, however, some emerging standards that may be to digital libraries what MARC was to print-based libraries.

Dublin Core emerges

The best general-purpose metadata draft standard is the Dublin Core. The Dublin Core represents a multiyear (and ongoing) effort by librarians, computer scientists, museum professionals, and others to devise a simple yet extensible standard that could be used to describe a wide variety of objects within a wide variety of subject disciplines and systems. The Dublin Core consists of 15 elements, such as title and subject. The element names and basic purposes are fixed, but most details regarding them remain unresolved. Meanwhile, dozens of projects around the world are now using it.

The Nordic Metadata Project. A consortium of Nordic countries working to create a system for producing and using metadata has created a utility for translating Dublin Core records into MARC and vice versa.

DSTC Resource Discovery Unit. This Australian organization uses the Dublin Core in a variety of projects.

UK Office of Library Networking (UKOLN).
UKOLN provides a wealth of software for metadata production and use, focusing on the Dublin Core.

Outside the Core

While the Dublin Core specifies certain elements to describe an item, it does not specify a transfer syntax, that is, a MARC equivalent. For now, it appears that the emerging Resource Description Framework (RDF), produced by the World Wide Web Consortium (W3C), will provide one of the best methods for encoding this information in a machine-parsable form.

RDF is itself based on the Extensible Markup Language (XML), an emerging standard that will likely have a great impact not only on resource description but on the web itself. XML provides users with a structured way to encode just about anything, from web pages to database entries. XML represents an advance over current HTML, which embeds very little structural information in a document. For now, searching on the web is scattershot. XML will allow users to search for words in section headings or in an author field, so we will be able to search web documents the way we now search library catalogs.

It is likely that the upcoming 5.0 versions of both Netscape Navigator and Microsoft Internet Explorer will offer some level of native support for XML. This would allow users to add more powerful and flexible services to a web server while still providing other information in HTML. But while XML 1.0 is now stable, related standards, including RDF, are still being developed. There seems to be a groundswell of industry opinion, however, that XML is the future of the web. Keep your eye on the W3C and the XML.com site.

While the Dublin Core is useful for describing individual objects, there is another draft standard that is useful for describing collections of objects, specifically archival materials. The Encoded Archival Description (EAD) is the emerging standard for creating machine-readable archival finding aids.
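
To give a feel for what "machine-readable" means for a finding aid, here is a minimal sketch in Python. It rests on several assumptions: the tag names (archdesc, did, unittitle, unitdate, dsc, c01, c02) follow the EAD tag set as I understand it, the fragment is heavily simplified and has not been validated against the EAD DTD, it is written as XML so that Python's standard library can parse it, and the collection it describes is invented. The helper function names are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# A simplified, invented finding-aid fragment (not validated against EAD).
FINDING_AID = """
<archdesc level="collection">
  <did>
    <unittitle>Example Family Papers</unittitle>
    <unitdate>1900-1950</unitdate>
  </did>
  <dsc>
    <c01 level="series">
      <did><unittitle>Correspondence</unittitle></did>
      <c02 level="file">
        <did><unittitle>Letters, 1900-1910</unittitle></did>
      </c02>
    </c01>
    <c01 level="series">
      <did><unittitle>Photographs</unittitle></did>
    </c01>
  </dsc>
</archdesc>
"""


def child_components(element):
    """Yield the numbered components (c01, c02, ...) nested in an element."""
    for child in element:
        if child.tag == "dsc":
            # <dsc> is only a wrapper; its components belong to the parent.
            yield from child_components(child)
        elif child.tag.startswith("c0"):
            yield child


def outline(component, depth=0):
    """Yield (depth, level, title) for a component and everything below it."""
    title = component.findtext("did/unittitle", default="")
    yield depth, component.get("level", ""), title
    for child in child_components(component):
        yield from outline(child, depth + 1)


root = ET.fromstring(FINDING_AID)
entries = list(outline(root))

# Print an indented outline of the collection.
for depth, level, title in entries:
    print("  " * depth + f"{title} ({level})")
```

Because the hierarchy is carried by the tags rather than by page layout, the same file can drive an indented display, a search limited to series titles, or any other view of the collection.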
EAD is an example of a Document Type Definition (DTD), which specifies how archival finding aids should be tagged using the Standard Generalized Markup Language (SGML). Although the standards effort began at UC-Berkeley, it is now managed by the Library of Congress. See examples at EAD Sites on the Web.

These emerging standards all attempt to provide a highly structured way to describe various digital objects and make them easy to locate and use. That, after all, is what cataloging is all about.

LINK LIST

DSTC Resource Discovery Unit http://www.dstc.edu.au/RDU/
Dublin Core http://purl.org/metadata/dublin_core
EAD Sites on the Web http://www.loc.gov/ead/eadsites.html
Encoded Archival Description (EAD) http://lcweb.loc.gov/ead
MARC http://lcweb.loc.gov/marc/
Metadata Information http://www.nlc-bnc.ca/ifla/II/metadata.htm
Nordic Metadata Project http://linnea.helsinki.fi/meta/
Resource Description Framework (RDF) http://www.w3.org/RDF/
UK Office of Library Networking (UKOLN) http://www.ukoln.ac.uk/metadata/
State Library of Victoria Multimedia Catalogue http://www.slv.vic.gov.au/slv/catalogue/
XML at the W3C http://www.w3.org/XML/
XML.com site http://www.xml.com/