Resource Cataloging and
Distribution Service (RCDS)

Keith Moore
Shirley Browne
Stan Green
Reed Wade

Netlib Development Group
University of Tennessee

January 16, 1996

Abstract

We describe an architecture for cataloging the characteristics of Internet-accessible resources, for replicating such resources to improve their accessibility, and for cataloging the current locations of the resources so replicated. Message digests and public-key authentication are used to ensure the integrity of the files provided to users. The service is designed to provide increased functionality with only minimal changes to either a client or a server. Resources can be named either by URNs or by existing URLs, and the service is designed to facilitate long-term resolution of resource names.

1. Introduction

Almost any user of the World Wide Web will be familiar with the following problems:

Frequently, the file server mentioned in a particular URL is ``down'', unreachable, or busy. This happens due to normal system or link failures, and also because servers of popular files must limit the number of concurrent users to keep from getting swamped.
While additional copies of the file named by that URL may exist (on so-called ``mirror'' sites), there is no mechanism for finding them, no way to know whether such copies are current, and no means of ensuring that the mirrored copy has not been altered.
URLs become ``stale'', that is, a URL which once pointed to a particular file no longer points to any version of that file. Any ``links'' to the file using the old URL are therefore no longer usable. This may be for any of several reasons: the domain name of the server has been changed, the file has been moved to a different server, the file has been renamed on the same server, the file was part of a multi-file document which has been re-organized, or the file has simply been deleted from that server (even though it may still be available elsewhere). In general, there is no way for the user to know which of these has happened, and no way to recover even in those cases where the file is still accessible from some other location.
Search services are often out-of-date due to the sheer size of the net and the necessity to periodically ``poll'' each server to see whether its files have changed.
Search services that return URLs often return duplicate ``hits'' because the same file is accessible by multiple URLs, with no way for the search service to know that they indicate the same file.
There is a need to be able to label resources according to certain criteria, and for the user to be able to examine such labels before attempting to access the resource. The most publicized need for such a facility is to label files which might be considered obscene, but in general a user (or his web browser) would like to be able to examine catalog information that describes a resource to determine its suitability for the user's needs and/or whether the user has the requisite hardware, software, budget, and permissions to use the resource.
Finally, given the ease by which many file servers (whether primary servers or ``mirrors'') may be compromised, there is a need for a service that allows the integrity and authenticity of any file to be checked. This protection is required not only for computer programs, but also for documents of various types which can contain embedded macros and/or exploit security weaknesses in their associated applications. Even the so-called ``safe execution environments'' such as Java cannot be relied on to protect the user, absent careful development and extensive analysis of both the design and the implementation of the execution environment.

We therefore propose an architecture for a system which attempts to address these problems.

1.1 Design Goals

The goals of our system include:

It must be easy to deploy in the current Internet.
It must be highly reliable and fault-tolerant.
It must use the network efficiently.
It must provide adequate security, both to ensure that its authentication/integrity assurance services are trustworthy, and to thwart denial-of-service attacks.
It must be flexible and general so it can evolve to meet future needs.
It must be scalable to several orders of magnitude without fundamental changes in the structure of resource names, or in the means by which a resource name is resolved into the network location of a server that provides the resource.

These goals have certain implications for our design:

The flexibility goal dictates that the system should not assume present-day notions of roles such as ``author'', ``publisher'', or ``editor'' in determining who can supply information about a resource. It also compels us to accommodate multiple data models for use by catalog records, as well as a variety of cryptographic authentication and integrity checking algorithms. Likewise, the service should accommodate several different protocols for access and/or retrieval of resources, including those that will be defined in the future as well as those in use today.
The scalability, reliability, and network efficiency goals dictate that the system maintain replicated copies of the information which it provides and keep those copies in reasonable synchronization.
The goal of ease of deployment implies that the service should augment, rather than replace, the current World Wide Web infrastructure. Furthermore, authors or publishers should find it easy to provide and maintain their own servers for the resources that they own.

1.2 Issues

The following issues must be considered:

Transition issues. In general, it is difficult to build new infrastructure in the Internet, because the infrastructure must be in place before its costs can be justified by its benefits, and because there is no mechanism by which a particular solution to these problems may be dictated. For a solution to win favor, it must therefore be more attractive to information providers and to browser implementors than both its competitors and the status quo. Other factors being equal, a solution which provides a smaller transition burden will be favored over a solution which imposes a larger one.
Security. It is difficult to provide network services which are immune to hostile attack. Doing so requires careful attention to both the implementation and the operation of the server machines, the ability to detect probable intrusions, sufficient logging to facilitate analysis of possible security breaches, and physical security of the machines. On the other hand, it is somewhat easier for non-networked machines to be secure.
DNS. There are both advantages and disadvantages to using the domain naming system as a component of a resource cataloging system. On the positive side, DNS is widely deployed and implementations are already available for most platforms. On the negative side, DNS is known to be insecure against attack, to have problems with stale data, to have difficulty tolerating domains with a large fan-out (like the .COM domain), and to be easy to misconfigure. All but the last of these problems are being addressed in the IETF, and solutions have been proposed in draft documents. Similar issues would be encountered in any other widely distributed database.

The assumed significance of transition issues on the success of the project influenced our design in the following ways: we allow ordinary URLs as one kind of resource name, we use existing file servers and file access protocols, and we employ DNS as a component of the system rather than building a new distributed database from the ground up. The need for reliable authentication and integrity assurances, coupled with the difficulty of providing secure servers, influenced us to use end-to-end (between information provider and user) authentication, consisting of public-key signatures and cryptographically signed certificates, rather than depending on the security of resource catalog servers or file servers (though reasonable security for these is still required to thwart denial-of-service attacks). Finally, some of the inherent limitations of DNS and the desire to separate administration of ``naming authority'' names from administration of resource names for a particular naming authority, led us to use DNS only as a means to identify one or more resource catalog servers for a particular resource naming authority, rather than to provide actual location or catalog information directly through DNS.

1.3 Non-Goals

The following were deliberately omitted from our design goals:

The system does not perform searches. Resource discovery tools are still an active area of research. The search engines which are effective today (which return mostly relevant citations without returning irrelevant ones) are likely to be highly tuned to a particular subject domain and/or to require significant user expertise. Rather than attempt to engineer a resource discovery system that would work well for all existing subject areas, we chose to engineer a cataloging and distribution system that could be used as a common substrate for present and future resource discovery tools.
The system does not explicitly support protection of intellectual property. While everyone agrees that some form of intellectual property protection is needed to protect the interests of information providers, there is wide disagreement about what kind of protection is appropriate, and about the appropriate form of copyright in cyberspace. While it is possible to include pricing information and usage restrictions in the description of a resource, it would be inappropriate to impose a single model for such restrictions on all resources.

2. Description of RCDS

The Resource Cataloging and Distribution System (RCDS) consists of the following components:

Clients, which are the consumers of the resources provided by the system. RCDS clients are ordinary WWW browsers with slight modifications to make use of the resolution system. Unmodified WWW browsers can also use RCDS through an RCDS-aware proxy server. A browser which supports Java may access RCDS via a special applet.
File servers, which provide access to the files themselves. These can be ordinary http, ftp, etc. servers.
Resource catalog servers, which maintain information about the characteristics of network-accessible resources and accept queries about the characteristics of such resources from clients.
Location servers, which maintain information about the locations of network-accessible resources, and accept queries for location data from clients.
Collection managers. The collection of files on a file server is maintained by a collections manager, which learns about newly published files and determines when a file server should acquire new files and reap old ones, according to site-specified criteria. The collections manager is also responsible for actually acquiring and deleting the chosen files. Finally, when a new file is added to the collection or an old one removed, the collections manager informs the location servers about changes in file availability.
Publication tools, which accept new files and descriptions from content providers (e.g. authors), and inject them into the system.

2.1 Resource names

RCDS uses three kinds of resource names: URLs, URNs, and LIFNs. Web users will already be familiar with the syntax of URLs and how they are used. For those who are also familiar with URNs, RCDS assumes a specific format for URNs which is described below.

2.1.1 URNs and LIFNs

In RCDS, URNs are used to provide stable names for resources whose characteristics may vary over time. By contrast, a LIFN is used to name a specific instance of a resource, all copies of which must be identical. A URN is associated with a description of the resource it names, while a LIFN is associated with one or more locations of identical copies of that resource.

The description associated with a URN will normally contain one or more LIFNs, which describe particular instances of that resource and the differences between them. For instance, if the resource named by a particular URN exists in several different data formats (e.g. plain text, PostScript, PDF, HTML), the description for that URN will list each of these, along with a LIFN for that specific instance. Similarly, if the resource associated with a URN has changed over time, and multiple versions of the resource are still accessible, the description of that resource might contain a list of the current and previous versions along with the LIFNs for each. Since the LIFN can then be used to find the current locations of a resource, it serves as a ``link'' or ``file handle'' from the description of a resource to the list of its current locations.

The distinction between URNs and LIFNs was crafted for several reasons:

Location data and descriptions are maintained by different parties. The description of a resource will normally be maintained by its ``author'', ``publisher'', ``editor'', or ``reviewers'', while the location data will be maintained by the managers of specific file servers.
Since the set of replicated copies of a file may need to change quickly according to demand, location data for a resource is expected to change more frequently than the description of a resource itself. It is therefore useful if the location directory can react quickly to changes to the available locations of a file.
The location directory must be replicated to ensure high availability. And yet, it is rarely important to locate all copies of a file; most clients only need a single location from which the file is available. There is therefore little need to maintain a consistent list of locations across each of the location servers. On the other hand, it can be very important to have an up-to-date description of a resource and for the replicated copies of that description to be consistent with one another. The location directory (accessed by LIFN) and the description server (accessed by URN) therefore have different needs for consistency across replicas.
The portions of RCDS responsible for replicating files and keeping track of their locations need an unambiguous name for a particular instance of a resource, to avoid confusing it with other instances of the same resource.
If all instances of a file associated with a LIFN are identical, the client's choice of which instance of a resource to access (data type, version, etc) may be cleanly separated from its choice of which location to use when accessing the resource. The former choice can then be made on the basis of browser capabilities, user requirements, etc., while the latter choice can be based on (say) proximity estimates.

2.1.2 Format of URNs and LIFNs

An RCDS URN or LIFN consists of three parts, separated by the "/" character:

A fixed prefix string, e.g. URN:/ or LIFN:/.
A naming authority name, which is simply an Internet domain name (though the domain name used by a naming authority may be chosen to have certain useful characteristics).
A suffix string, which is an identifier assigned by the naming authority.

So "URN://foo.bar/mumblefrotz" would be a URN that was assigned by the naming authority foo.bar. URNs, at least in the current RCDS prototype, are thus syntactically similar to URLs.
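As an illustration of this three-part format, the following Python sketch splits a name into its prefix, naming authority, and suffix. The names `ResourceName` and `parse_resource_name` are hypothetical, and two details are assumptions rather than part of the prototype: the suffix is allowed to contain further "/" characters, and the naming authority is lower-cased because domain names are case-insensitive.

```python
from typing import NamedTuple


class ResourceName(NamedTuple):
    scheme: str      # "URN" or "LIFN"
    authority: str   # naming authority, an Internet domain name
    suffix: str      # identifier assigned by the naming authority


def parse_resource_name(name: str) -> ResourceName:
    """Split a name such as URN://foo.bar/mumblefrotz into its three parts."""
    scheme, sep, rest = name.partition("://")
    if not sep or scheme.upper() not in ("URN", "LIFN"):
        raise ValueError(f"not an RCDS URN or LIFN: {name!r}")
    # Assumption: everything after the first "/" belongs to the suffix.
    authority, sep, suffix = rest.partition("/")
    if not authority or not sep or not suffix:
        raise ValueError(f"missing naming authority or suffix: {name!r}")
    # Domain names are case-insensitive, so normalise the authority only.
    return ResourceName(scheme.upper(), authority.lower(), suffix)
```

For example, `parse_resource_name("URN://foo.bar/mumblefrotz")` yields the naming authority `foo.bar`, which is the only part needed to locate a server for the name.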

2.2 Publication and Distribution

Figure 1 illustrates how files are published in RCDS.

An author submits a file to RCDS using a publication tool. If this is a new file, a new description (containing catalog information) of that file is created and a new URN is assigned; otherwise, the description of the old URN is updated to reflect the new version of the file. A LIFN is assigned to the new file, and this LIFN is included in the description of that file. The part of the description containing the LIFN and file fingerprint (and perhaps other parts of it) is cryptographically signed by the author using the publication tool.
The publication tool deposits a copy of the file on a file server, and a copy of the description on a ``master'' resource catalog server. It also sends a copy of the description of the new file to interested parties, which might include file servers and search services.
The ``master'' resource catalog server updates its slave servers with the new description.
The ``master'' file server informs a location server that it has a copy of the file with that particular LIFN.
As other file servers find out about the existence of the new file, their collections managers decide whether to acquire it. When a file server acquires the new file and makes it accessible, it informs a location server about it.
The location servers propagate new file location information to one another.
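The role of the file fingerprint in the publication steps above can be sketched as follows. This is a minimal illustration, not the prototype's record format: the dictionary layout, the function names, and the choice of SHA-256 as the message digest are all assumptions, and the public-key signature over the LIFN and fingerprint is deliberately left out.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Message digest of the file contents (SHA-256 is an assumption here)."""
    return hashlib.sha256(data).hexdigest()


def make_description(urn: str, lifn: str, data: bytes, fmt: str) -> dict:
    """Build a hypothetical catalog description for one instance of a resource."""
    return {
        "urn": urn,
        "instances": [
            {"lifn": lifn, "format": fmt, "fingerprint": fingerprint(data)},
        ],
    }


def verify_copy(description: dict, lifn: str, data: bytes) -> bool:
    """Check a retrieved copy against the fingerprint recorded for its LIFN."""
    for instance in description["instances"]:
        if instance["lifn"] == lifn:
            return instance["fingerprint"] == fingerprint(data)
    return False  # unknown LIFN: nothing to verify against
```

Because every copy named by a LIFN must be identical, a single recorded fingerprint is enough to check a copy fetched from any mirror, whichever file server supplied it.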

2.3 Access and Retrieval

Figure 2 illustrates how files are accessed or retrieved in RCDS.

A user acquires a URN of a resource that seems to suit his needs from a search service, hypertext link, or other means. This URN is resolved using DNS (see below) to find the network addresses of one or more resource catalog servers. One of those servers is selected by the client, perhaps based on network proximity estimates.
The resource catalog server is queried for a description of the resource named by the URN. The description may contain multiple LIFNs, each describing a different version of the resource. The client selects a particular LIFN from those available.
The client resolves the LIFN using DNS to find the network addresses of one or more location servers. One of those location servers is then queried for locations of the file named by that LIFN.
The location server returns one or more URLs at which the file can be obtained.
The client chooses one of those file servers (again, perhaps based on network proximity estimates) and fetches the file from that server.

The interaction with RCDS may be accomplished either directly by a client, or via a proxy server which communicates with the client via HTTP. This arrangement is shown in Figure 3.
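The retrieval steps in this section separate two choices: which instance (LIFN) to use, and which location to fetch it from. The sketch below illustrates that separation with in-memory dictionaries standing in for a resource catalog server and a location server. The tables `CATALOG` and `LOCATIONS`, the function names, the example LIFNs and hosts, and the `distance` callback (any function that maps a URL to a cost estimate) are all hypothetical.

```python
# Hypothetical in-memory stand-ins for the two directories.
CATALOG = {
    "URN://foo.bar/mumblefrotz": [
        {"lifn": "LIFN://foo.bar/mumblefrotz-ps", "format": "PostScript"},
        {"lifn": "LIFN://foo.bar/mumblefrotz-txt", "format": "plain text"},
    ],
}
LOCATIONS = {
    "LIFN://foo.bar/mumblefrotz-txt": [
        "http://mirror-1.example/mumblefrotz.txt",
        "ftp://mirror-2.example/pub/mumblefrotz.txt",
    ],
}


def choose_instance(urn, accepted_formats):
    """Look up the description for a URN and pick a LIFN the client can use."""
    for instance in CATALOG.get(urn, []):
        if instance["format"] in accepted_formats:
            return instance["lifn"]
    return None


def choose_location(lifn, distance):
    """Look up the locations of a LIFN and pick the one with the lowest cost."""
    urls = LOCATIONS.get(lifn, [])
    return min(urls, key=distance) if urls else None
```

A client that accepts only plain text would call `choose_instance` with `{"plain text"}` and then pass the resulting LIFN to `choose_location` together with its own proximity estimate; neither choice depends on the other.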

3. Protocols

Because understanding some of the protocol details is important for judging how well RCDS achieves its goals, this section outlines important aspects of the protocols used by the current prototype.

3.1 URN/LIFN resolution

RC servers are registered for a particular naming authority by adding resource records to the DNS. A new record type of RCS is assumed. It has a format identical to an MX record, but instead of designating a mail exchanger host, it designates a host which operates a resource catalog server for that domain.

So the records:

foo.bar         RCS     10      server-1.foo.bar.
                RCS     20      server-2.foo.bar.

say that the resource catalog servers for the naming domain foo.bar can be found at server-1.foo.bar and server-2.foo.bar, respectively.
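To make the record layout concrete, the following sketch parses MX-style RCS lines like those above into a table keyed by naming authority. The function name `parse_rcs_records` is hypothetical. Sorting by the numeric field, lowest first, mirrors how MX preferences are ordered; whether RCDS clients must honour that order is an assumption.

```python
def parse_rcs_records(text):
    """Parse MX-style lines of the form: [owner] RCS preference host."""
    records = {}
    owner = None
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split()
        # A line that starts with whitespace inherits the previous owner name.
        if not line[0].isspace():
            owner = fields.pop(0)
        if owner is None or len(fields) != 3 or fields[0] != "RCS":
            raise ValueError(f"malformed RCS record: {line!r}")
        preference, host = int(fields[1]), fields[2].rstrip(".")
        records.setdefault(owner.rstrip("."), []).append((preference, host))
    # Lowest preference value first, as with MX records (an assumption).
    return {name: sorted(entries) for name, entries in records.items()}
```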

Given a URN or a LIFN, an RC server for that URN or LIFN may be found using DNS as follows:

The naming authority name is extracted from the URN or LIFN.
A DNS lookup is performed on the naming authority name with QTYPE=RCS. The query returns a list of the ``official'' servers for that domain. (If no DNS records were returned, no official servers are available.)
The client chooses one of the available servers.
The client then sends query or update requests to that server.

If the first server chosen fails to respond to the query, the client may choose another of the listed servers. Clients may also be configured to consult ``proxy'' RC servers (which perform queries on behalf of clients and cache results) as well as ``fallback'' (e.g., custodial) servers (which can be consulted when there are no ``official'' servers for a domain or when the ``official'' servers do not respond).
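The failover behaviour described above might look like the following sketch. It is not the prototype's client code: `lookup_rcs` (returning the official servers for a naming authority) and `send_query` (contacting one server) are hypothetical callables supplied by the caller, and a server that does not respond is assumed to surface as an `OSError`, which covers Python's timeout and connection errors.

```python
def query_with_failover(authority, lookup_rcs, send_query, fallback_servers=()):
    """Try each official server in turn, then any configured fallback servers."""
    candidates = list(lookup_rcs(authority)) + list(fallback_servers)
    for server in candidates:
        try:
            return server, send_query(server)
        except OSError:
            continue  # no response from this server; try the next one
    raise LookupError(f"no RC server answered for {authority!r}")
```

Placing the fallback servers after the official ones covers both cases mentioned in the text: a naming authority with no official servers, and official servers that do not respond.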

moore@cs.utk.edu
