Library Journal "Digital Libraries" Columns 1997-2007, Roy Tennant
Please note: these columns are an archive of my Library Journal column from 1997-2007. They have not been altered in content, so please keep in mind some of this content will be out of date.
Interoperability: The Holy Grail
As fast as digital libraries are being built, they still remain islands of order in a sea of chaos. Locating them by using web search engines or subject directories is just the first step in a long process. Users must then go to each one, searching or browsing it before moving on to the next. This laborious method for locating digital library objects (from full-length books to individual photographs) is obviously anachronistic.
What digital librarians envision instead is an infrastructure that supports simultaneous searching of multiple and geographically distant collections. Anyone who discovers any individual digital library should be able to search easily (and perhaps transparently) across a wide variety of collections from other libraries worldwide. That, at least, is the vision.
So what will make this vision a reality? There are different models for achieving this level of interoperability between libraries in support of resource discovery and retrieval. In the end, it probably isn't so important what model we use as long as we get the result we seek.
Whereas I recently focused on metadata standards and draft standards for cataloging digital objects ("21st-Century Cataloging"), here I look at how to provide cross-collection searching of these records. Because digital libraries are still largely in the experimentation and research stages, diversity is more prevalent than standardization.
The union catalog model
One way to achieve seamless access to a variety of physically distant collections is to contribute bibliographic records or access aids to a central database. Librarians are, of course, experienced at this, having built OCLC, the largest union catalog of bibliographic records in the world. For digital library objects, however, we do not have an equivalent union catalog. Probably the best example of this model is presented by the Library of Congress (LC).
As part of its National Digital Library Competition (jointly sponsored with Ameritech), LC has proposed serving as a central repository for "coherent access aids" (e.g., MARC records, Dublin Core records, or archival finding aids encoded in SGML), while the actual digital objects themselves would remain at their individual host institutions. Caroline Arms's 1997 paper "Access Aids and Interoperability" describes this model.
Another example is the University of New Brunswick Library's metadata project. Records for digital objects hosted at several institutions were automatically "crawled," or gathered, by a software program on a regular basis. They were then processed into a common format (in SGML) and indexed using Open Text software. This project proved the viability of a union catalog built by regularly gathering records from distributed collections and indexing them centrally. Once the appropriate routines or programs are in place, records can be regularly produced without human intervention.
The distributed searching model
A quite different way to approach interoperability is to establish standards to which all digital libraries would adhere and then provide an interface to search all the collections simultaneously. This exists to some degree now in the Networked Computer Science Technical Reports Library (NCSTRL). NCSTRL provides one-stop shopping for CS tech reports from hundreds of institutions around the world by requiring that each site install the same software package (Dienst) and create bibliographic records using the same format (RFC 1807). A search entered at any NCSTRL site is sent simultaneously to all the other sites. Each of those sites searches its local index and returns its results, which the initiating site receives, collates, and displays to the user. When a particular record or report is requested, the remote server that has the report responds to the request. That, at least, is the model.
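The broadcast-and-collate flow just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Dienst software or the RFC 1807 record format: the `Site` and `Record` classes, the `federated_search` function, and the sample titles are all hypothetical, and each "site" is an in-memory list standing in for a remote server and its local index.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical minimal bibliographic record (not RFC 1807)."""
    site: str
    title: str


class Site:
    """Hypothetical stand-in for one participating server and its local index."""

    def __init__(self, name, titles):
        self.name = name
        self.titles = titles

    def search(self, term):
        # A real site would answer over the network; here we scan a list.
        term = term.lower()
        return [Record(self.name, t) for t in self.titles if term in t.lower()]


def federated_search(sites, term):
    """Send the query to every site at once, then collate the answers."""
    with ThreadPoolExecutor(max_workers=max(1, len(sites))) as pool:
        # map() waits for every site, so the slowest one sets the pace.
        per_site = list(pool.map(lambda s: s.search(term), sites))
    merged = [rec for hits in per_site for rec in hits]
    # Collate into one list, ordered by title and then by site.
    return sorted(merged, key=lambda r: (r.title.lower(), r.site))


if __name__ == "__main__":
    sites = [
        Site("site-a", ["Parallel Sorting Algorithms", "Garbage Collection Survey"]),
        Site("site-b", ["Sorting Networks Revisited"]),
        Site("site-c", ["Distributed File Systems"]),
    ]
    for rec in federated_search(sites, "sorting"):
        print(rec.site, "|", rec.title)
```

Because the initiating site in this sketch waits for every other site before collating, a single slow or unreachable server delays the whole response.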
In practice, however, poor response times mean that the bibliographic records are instead gathered from NCSTRL sites and indexed centrally at two or three index servers. In a sense, this model has retreated to that of a union catalog.
The "intelligent agent" model
Yet another method may be to create an "intelligent agent" (a special kind of software program) that can roam the network searching digital libraries for objects of interest. The agent would report back periodically with any results. Requirements for success include, at minimum, that the agent know where to find digital libraries, have the capacity to query these libraries appropriately, and possess methods to process search results into a common format for merging and browsing. One benefit: as long as the agent knows how to perform queries, the underlying architecture of each digital library can be different.
Intelligent agents are unlikely to do well with an uncategorized, all-inclusive database like the web. But digital library catalogs are, if anything, the exact opposite. They are organized collections of selected objects of a similar nature. They usually support highly specific queries and will frequently return useful results. These factors make intelligent agents a real possibility for providing an appearance of interoperability when none may exist by design. However intriguing the idea, I don't know of a working example of such an agent.
Whither interoperability?
Of these models, only the union catalog model is fully functional with present technology. Although the distributed searching model is interesting, slow server and network response times make it presently impractical. The lack of prototype systems makes it difficult to assess the intelligent agent model.
Differing levels of bibliographic description create a barrier to interoperability with all of these models.
Some items are described only at the collection level (in the case of archival finding aids), while others are described at the item level (MARC and Dublin Core records). Thus, a user may be required to search different systems or else navigate results that mix individual items with collection descriptions. This watershed divide in how digital objects are described probably presents the biggest barrier to seamless interoperability.
We seem at least on the right path. Most digital library projects describe their objects using some type of standard or developing standard, thus making it possible to migrate their records to whichever becomes the clear winner. A number of cooperative projects are underway, in which libraries work together to provide easy access to their combined collections. And organizations like the Digital Library Federation and LC work toward the goal of interoperability.
So, although we are still in the early stages of achieving the kind of vision that many digital librarians have of easy access to digital collections around the world, we are close enough to have gained some experience along the way.
LINK LIST
"Access Aids and Interoperability" http://memory.loc.gov/ammem/award/docs/interop.html
Digital Library Federation http://lcweb.loc.gov/loc/ndlf/
Dublin Core http://purl.org/metadata/dublin_core
Encoded Archival Description (EAD) http://lcweb.loc.gov/ead/
Intelligent Agents http://www.cs.umbc.edu/agents/
Library of Congress http://www.loc.gov/
National Digital Library Competition http://memory.loc.gov/ammem/award/
NCSTRL http://www.ncstrl.org/
Request for Comments (RFC) 1807 ftp://ftp.isi.edu/in-notes/rfc1807.txt
University of New Brunswick Library's Metadata Project http://www.lib.unb.ca/metadata/