Resource Cataloging and
Distribution Service (RCDS)
Keith Moore
Shirley Browne
Stan Green
Reed Wade
Netlib Development Group
University of Tennessee
January 16, 1996
Abstract
We describe an architecture for cataloging the characteristics of
Internet-accessible resources, for replicating such resources to
improve their accessibility, and for cataloging the current locations
of the resources so replicated. Message digests and public-key
authentication are used to ensure the integrity of the files provided
to users. The service is designed to provide increased functionality
with only minimal changes to either a client or a server. Resources
can be named either by URNs or by existing URLs, and the service is
designed to facilitate long-term resolution of resource names.
Almost any user of the World Wide Web will be familiar with the
following problems:
-
Frequently, the file server mentioned in a particular URL is ``down'',
unreachable, or busy. This happens due to normal system or link
failures, and also because servers of popular files must limit the
number of concurrent users to keep from getting swamped.
-
While additional copies of the file named by that URL may exist (on
so-called ``mirror'' sites), there is no mechanism for finding them,
no way to know whether such copies are current, and no means of
ensuring that the mirrored copy has not been altered.
-
URLs become ``stale'', that is, a URL which once pointed to a
particular file no longer points to any version of that file. Any
``links'' to the file using the old URL are therefore no longer
usable. This may be for any of several reasons: the domain name of
the server has been changed, the file has been moved to a different
server, the file has been renamed on the same server, the file was
part of a multi-file document which has been re-organized, or the file
has simply been deleted from that server (even though it may still be
available elsewhere). In general, there is no way for the user to
know which of these has happened, and no way to recover even in those
cases where the file is still accessible from some other location.
-
Search services are often out-of-date due to the sheer size of the net
and the necessity to periodically ``poll'' each server to see whether
its files have changed.
-
Search services that return URLs often return duplicate ``hits''
because the same file is accessible by multiple URLs, with no way for
the search service to know that they indicate the same file.
-
There is a need to be able to label resources according to certain
criteria, and for the user to be able to examine such labels before
attempting to access the resource. The most publicized need for such
a facility is to label files which might be considered obscene, but in
general a user (or his web browser) would like to be able to examine
catalog information that describes a resource to determine its
suitability for the user's needs and/or whether the user has the
requisite hardware, software, budget, and permissions to use the
resource.
-
Finally, given the ease with which many file servers (whether primary
servers or ``mirrors'') may be compromised, there is a need for a
service that allows the integrity and authenticity of any file to be
checked. This protection is required not only for computer programs,
but also for documents of various types which can contain embedded
macros and/or exploit security weaknesses in their associated
applications. Even the so-called ``safe execution environments'' such
as Java cannot be relied on to protect the user, absent careful
development and extensive analysis of both the design and the
implementation of the execution environment.
We therefore propose an architecture for a system which attempts to
address these problems.
1.1 Design Goals
The goals of our system include:
-
It must be easy to deploy in the current Internet.
-
It must be highly reliable and fault-tolerant.
-
It must use the network efficiently.
-
It must provide adequate security, both to ensure that its
authentication/integrity assurance services are trustworthy, and to
thwart denial-of-service attacks.
-
It must be flexible and general so it can evolve to meet future needs.
-
It must be scalable to several orders of magnitude without fundamental
changes in the structure of resource names, or in the means by which a
resource name is resolved into the network location of a server that
provides the resource.
These goals have certain implications for our design:
-
The flexibility goal dictates that the system should not assume
present-day notions of roles such as ``author'', ``publisher'', or
``editor'' in determining who can supply information about a resource.
It also compels us to accommodate multiple data models for use by
catalog records, as well as a variety of cryptographic authentication and
integrity checking algorithms. Likewise, the service should
accommodate several different protocols for access and/or retrieval of
resources, including those that will be defined in the future as well
as those in use today.
-
The scalability, reliability, and network efficiency goals dictate
that the system maintain replicated copies of the information
which it provides and keep those copies in reasonable synchronization.
-
The goal of ease of deployment implies that the service should
augment, rather than replace, the current World Wide Web
infrastructure. Furthermore, authors or publishers should
find it easy to provide and maintain their own servers for the
resources that they own.
1.2 Issues
The following issues must be considered:
- Transition issues. In general, it is difficult to build new
infrastructure in the Internet, because the infrastructure must be in
place before its costs can be justified by its benefits, and because
there is no mechanism by which a particular solution to these problems
may be dictated. For a solution to win favor, it must therefore be
more attractive to information providers and to browser implementors
than both its competitors and the status quo. Other factors being
equal, a solution which provides a smaller transition burden will be
favored over a solution which imposes a larger one.
-
Security. It is difficult to provide network services which are
immune to hostile attack. Doing so requires careful attention to both
the implementation and the operation of the server machines, the
ability to detect probable intrusions, sufficient logging to
facilitate analysis of possible security breaches, and physical
security of the machines. On the other hand, it is somewhat easier
for non-networked machines to be secure.
-
DNS. There are both advantages and disadvantages to using the
domain naming system as a component of a resource cataloging system.
On the positive side, DNS is widely deployed and implementations are
already available for most platforms. On the negative side, DNS is
known to be insecure against attack, to have problems with stale data,
to have difficulty tolerating domains with a large fan-out (like the
.COM domain), and to be easy to misconfigure. All but the last of
these problems are being addressed in the IETF, and solutions
have been proposed in draft documents. Similar issues would be
encountered in any other widely distributed database.
The assumed significance of transition issues on the success of the
project influenced our design in the following ways: we allow ordinary
URLs as one kind of resource name, we use existing file servers and file
access protocols, and we employ DNS as a component of the system
rather than building a new distributed database from the ground up.
The need for reliable authentication and integrity assurances, coupled
with the difficulty of providing secure servers, influenced us to use
end-to-end (between information provider and user) authentication,
consisting of public-key
signatures and cryptographically signed certificates, rather than
depending on the security of resource catalog servers or file servers
(though reasonable security for these is still required to thwart
denial-of-service attacks). Finally, some of the inherent limitations
of DNS and the desire to separate administration of ``naming
authority'' names from administration of resource names for a
particular naming authority, led us to use DNS only as a means to
identify one or more resource catalog servers for a particular
resource naming authority, rather than to provide actual location or
catalog information directly through DNS.
1.3 Non-Goals
The following were deliberately omitted from our design goals:
-
The system does not perform searches. Resource discovery tools are
still an active area of research. The search engines which are effective today
(which return mostly relevant citations without returning irrelevant
ones) are likely to be highly tuned to a particular subject domain
and/or to require significant user expertise. Rather than attempt to
engineer a resource discovery system that would work well for all
existing subject areas, we chose to engineer a cataloging and
distribution system that could be used as a common substrate for
present and future resource discovery tools.
-
The system does not explicitly support protection of intellectual
property. While everyone agrees that some form of intellectual
property protection is needed to protect the interests of information
providers, there is wide disagreement about what kind of protection is
appropriate, and about the appropriate form of copyright in
cyberspace. While it is possible to include pricing information and
usage restrictions in the description of a resource, it would be
inappropriate to impose a single model for such restrictions on all
resources.
The Resource Cataloging and Distribution Service (RCDS) consists of the
following components:
- Clients, which are the consumers of the resources provided by
the system. RCDS clients are ordinary WWW browsers with slight
modifications to make use of the resolution system. Unmodified WWW
browsers can also use RCDS through an RCDS-aware proxy
server. A browser which supports Java may access RCDS via a special
applet.
-
File servers, which provide access to the files themselves.
These can be ordinary http, ftp, etc. servers.
-
Resource catalog servers, which maintain information about the
characteristics of network-accessible resources and accept queries
about those characteristics from clients.
-
Location servers, which maintain information about the locations
of network-accessible resources, and accept queries for location data
from clients.
-
Collections managers. The collection of files on a file server
is maintained by a collections manager, which learns about newly
published files and determines when a file server should acquire new
files and reap old ones, according to site-specified criteria. The
collections manager is also responsible for actually acquiring and
deleting the chosen files. Finally, when a new file is added to the
collection or an old one removed, the collections manager informs the
location servers about changes in file availability.
-
Publication tools, which accept new files and descriptions from
content providers (e.g. authors), and inject them into the system.
2.1 Resource names
RCDS uses three kinds of resource names: URLs, URNs, and LIFNs
(Location-Independent File Names). Web users will already be familiar
with the syntax of URLs and how they are used. For those who are also
familiar with URNs, note that RCDS assumes a specific format for URNs,
which is described below.
2.1.1 URNs and LIFNs
In RCDS, URNs are used to provide stable names for resources whose
characteristics may vary over time. By contrast, a LIFN is used to
name a specific instance of a resource, all copies of which must be
identical. A URN is associated with a description of the
resource it names, while a LIFN is associated with one or more
locations of identical copies of that resource.
The description associated with a URN will normally contain one or
more LIFNs, which describe particular instances of that resource and
the differences between them. For instance, if the resource named by
a particular URN exists in several different data formats (e.g. plain
text, PostScript, PDF, HTML), the description for that URN will list
each of these, along with a LIFN for that specific instance.
Similarly, if the resource associated with a URN has changed over
time, and multiple versions of the resource are still accessible, the
description of that resource might contain a list of the current and
previous versions along with the LIFNs for each. Since the LIFN can
then be used to find the current locations of a resource, it serves as
a ``link'' or ``file handle'' from the description of a resource to
the list of its current locations.
The distinction between URNs and LIFNs was crafted for several
reasons:
-
Location data and descriptions are maintained by different parties.
The description of a resource will normally be maintained by its
``author'', ``publisher'', ``editor'', or ``reviewers'', while the
location data will be maintained by the managers of specific file
servers.
-
Since the set of replicated copies of a file may need to change
quickly according to demand, location data for a resource is expected
to change more frequently than the description of a resource itself.
It is therefore useful if the location directory can react quickly to
changes to the available locations of a file.
-
The location directory must be replicated to ensure high availability.
Yet it is rarely important to locate all copies of a file;
most clients only need a single location from which the file is
available. There is therefore little need to maintain a consistent
list of locations across each of the location servers. On the other
hand, it can be very important to have an up-to-date description of a
resource and for the replicated copies of that description to be
consistent with one another. The location directory (accessed by
LIFN) and the description server (accessed by URN) therefore have
different needs for consistency across replicas.
-
The portions of RCDS responsible for replicating files and keeping
track of their locations need an unambiguous name for a particular
instance of a resource, to avoid confusing it with other instances of
the same resource.
-
If all instances of a file associated with a LIFN are identical, the
client's choice of which instance of a resource to access (data type,
version, etc) may be cleanly separated from its choice of which
location to use when accessing the resource. The former choice can
then be made on the basis of browser capabilities, user requirements,
etc., while the latter choice can be based on (say) proximity
estimates.
2.1.2 Format of URNs and LIFNs
An RCDS URN or LIFN consists of three parts, separated by the "/"
character.
- A fixed prefix string, e.g. URN:/ or LIFN:/.
-
A naming authority name, which is simply an Internet domain name
(though the domain name used by a naming authority may be chosen to
have certain useful characteristics).
-
A suffix string, which is an identifier assigned by the naming
authority.
So "URN://foo.bar/mumblefrotz" would be a URN that was
assigned by the naming authority foo.bar. URNs, at least in the
current RCDS prototype, are thus syntactically similar to URLs.
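The three-part structure described above can be illustrated with a small parser. This is only a sketch, not part of the RCDS prototype: the names ResourceName and parse_resource_name are hypothetical, and the choice to let the suffix contain further "/" characters is an assumption that the text does not settle.

```python
from typing import NamedTuple


class ResourceName(NamedTuple):
    kind: str        # "URN" or "LIFN"
    authority: str   # naming authority (an Internet domain name)
    suffix: str      # identifier assigned by the naming authority


def parse_resource_name(name: str) -> ResourceName:
    """Split a name such as 'URN://foo.bar/mumblefrotz' into its three parts."""
    for kind in ("URN", "LIFN"):
        prefix = kind + "://"   # fixed prefix "URN:/" followed by the "/" separator
        if name.startswith(prefix):
            rest = name[len(prefix):]
            # Assumption: everything after the first "/" belongs to the suffix.
            authority, sep, suffix = rest.partition("/")
            if not authority or not sep or not suffix:
                raise ValueError(f"missing naming authority or suffix: {name!r}")
            # Domain names are case-insensitive, so normalize the authority.
            return ResourceName(kind, authority.lower(), suffix)
    raise ValueError(f"not a URN or LIFN: {name!r}")
```

For the example in the text, parse_resource_name("URN://foo.bar/mumblefrotz") yields the naming authority foo.bar and the suffix mumblefrotz.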
2.2 Publication and Distribution
Figure 1 illustrates how files are published in RCDS.
- An author submits a file to RCDS using a publication tool. If this is
a new file, a new description (containing catalog information) of that
file is created and a new URN is assigned; otherwise, the description
of the old URN is updated to reflect the new version of the file. A
LIFN is assigned to the new file, and this LIFN is included in the
description of that file. The part of the description containing the
LIFN and file fingerprint (and perhaps other parts of it) is
cryptographically signed by the author using the publication tool.
-
The publication tool deposits a copy of the file on a file server, and
a copy of the description on a ``master'' resource catalog server. It
also sends a copy of the description of the new file to interested
parties, which might include file servers and search services.
-
The ``master'' resource catalog server updates its slave servers with
the new description.
-
The ``master'' file server informs a location server that it has a
copy of the file with that particular LIFN.
-
As other file servers find out about the existence of the new file,
their collections managers decide whether to acquire it. When a file
server acquires the new file and makes it accessible, it informs a
location server about it.
-
The location servers propagate new file location information to one
another.
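The bookkeeping in the publication steps above can be sketched with in-memory dictionaries standing in for the resource catalog and the location directory. Everything here is illustrative: the function names, the record layout, and the use of SHA-256 as the fingerprint algorithm are assumptions, and the public-key signature over the LIFN and fingerprint is omitted because it would need a cryptographic library outside the standard library.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Message digest of a file's contents (SHA-256 is an assumption here)."""
    return hashlib.sha256(data).hexdigest()


def publish(catalog: dict, locations: dict, urn: str, lifn: str,
            data: bytes, url: str) -> None:
    """Record a new instance of the resource named by `urn`."""
    # New resource: create a description; otherwise update the existing one.
    description = catalog.setdefault(urn, {"instances": []})
    description["instances"].append(
        {"lifn": lifn, "fingerprint": fingerprint(data)})
    # The file server holding the copy reports its location for this LIFN.
    locations.setdefault(lifn, []).append(url)


def verify(description: dict, lifn: str, data: bytes) -> bool:
    """Check retrieved bytes against the fingerprint listed for `lifn`."""
    for instance in description["instances"]:
        if instance["lifn"] == lifn:
            return instance["fingerprint"] == fingerprint(data)
    return False
```

A client that later retrieves the file from any mirror could call verify with the description it obtained for the URN, so the check does not depend on trusting the file server that supplied the bytes.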
2.3 Access and Retrieval
Figure 2 illustrates how files are accessed or retrieved in RCDS.
- A user acquires a URN of a resource that seems to suit his needs from
a search service, hypertext link, or other means. This URN is resolved
using DNS (see below) to find the network addresses of one or more
resource catalog servers. One of those servers is selected by the
client, perhaps based on network proximity estimates.
-
The resource catalog server is queried for a description of the
resource named by the URN. The description may contain multiple
LIFNs, each describing a different version of the resource. The
client selects a particular LIFN from those available.
-
The client resolves the LIFN using DNS to find the network addresses
of one or more location servers. One of those location servers is
then queried for locations of the file named by that LIFN.
-
The location server returns one or more URLs at which the file can be
obtained.
-
The client chooses one of those file servers (again, perhaps based on
network proximity estimates) and fetches the file from that server.
The interaction with RCDS may be accomplished either directly by a
client, or via a proxy server which communicates with the client via
HTTP. This arrangement is shown in Figure 3.
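The retrieval sequence described above can be sketched as a single function over in-memory stand-ins. This is a simplified illustration rather than the prototype's client: the dictionaries dns, catalogs, and location_servers, the callbacks choose_instance and fetch, and the record layout are all hypothetical, and the sketch always takes the first server or URL where a real client might use proximity estimates.

```python
from typing import Callable, Dict, List


def authority_of(name: str) -> str:
    """Naming authority of 'URN://authority/suffix' or 'LIFN://authority/suffix'."""
    return name.split("://", 1)[1].split("/", 1)[0]


def retrieve(urn: str,
             dns: Dict[str, List[str]],
             catalogs: Dict[str, Dict[str, dict]],
             location_servers: Dict[str, Dict[str, List[str]]],
             choose_instance: Callable[[List[dict]], dict],
             fetch: Callable[[str], bytes]) -> bytes:
    # Step 1: resolve the URN's naming authority to a catalog server.
    catalog_host = dns[authority_of(urn)][0]
    # Step 2: obtain the description and select one LIFN from it.
    description = catalogs[catalog_host][urn]
    lifn = choose_instance(description["instances"])["lifn"]
    # Step 3: resolve the LIFN's naming authority to a location server.
    location_host = dns[authority_of(lifn)][0]
    # Step 4: the location server returns one or more URLs.
    urls = location_servers[location_host][lifn]
    # Step 5: choose a file server (here simply the first) and fetch the file.
    return fetch(urls[0])
```

Note how the choice of instance (step 2) is made from the description alone, while the choice of location (step 5) is made only after the LIFN has been resolved, which mirrors the separation between URNs and LIFNs.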
Because an understanding of some of the protocol details is important
to understanding how well RCDS achieves its goals, this section outlines
important aspects of the protocols used by the current prototype.
3.1 URN/LIFN resolution
RC servers are registered for a particular naming authority by adding
resource records to the DNS. A new record type of RCS is
assumed. It has a format identical to an MX record, but
instead of designating a mail exchanger host, it designates a host
which operates a resource catalog server for that domain.
So the records:
    foo.bar    RCS 10 server-1.foo.bar.
               RCS 20 server-2.foo.bar.
say that the resource catalog servers for the naming domain
foo.bar can be found at server-1.foo.bar and
server-2.foo.bar.
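To make the MX-like layout concrete, the following sketch reads records written in the textual form shown above. It is illustrative only: parse_rcs_records is a hypothetical helper, it handles just the owner, type, preference, and host fields, and it assumes that a line beginning with the record type inherits the owner name of the previous line, as in conventional zone-file notation.

```python
from typing import List, Tuple


def parse_rcs_records(text: str) -> List[Tuple[str, int, str]]:
    """Return (owner, preference, host) tuples from zone-file-style lines."""
    records = []
    owner = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # A line that starts with the record type reuses the previous owner.
        if fields[0] != "RCS":
            owner, fields = fields[0], fields[1:]
        if owner is None or len(fields) != 3 or fields[0] != "RCS":
            raise ValueError(f"malformed RCS record: {line!r}")
        # Drop the trailing dot that marks a fully qualified host name.
        records.append((owner, int(fields[1]), fields[2].rstrip(".")))
    return records
```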
Given a URN or a LIFN, an RC server for that URN or LIFN may be found
using DNS as follows:
-
The naming authority name is extracted from the URN or LIFN.
-
A DNS lookup is performed on the naming authority name with QTYPE=RCS.
The query returns a list of the ``official'' servers for that domain.
(If no DNS records are returned, no official servers are available.)
-
The client chooses one of the available servers.
-
The client then sends query or update requests to that server.
If the first server chosen fails to respond to the query, the client
may choose another of the listed servers. Clients may also be
configured to consult ``proxy'' RC servers (which perform queries on
behalf of clients and cache results) as well as ``fallback''
(e.g., custodial) servers (which can be consulted when there are no
``official'' servers for a domain or when the ``official'' servers do
not respond).
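The failover behaviour described above might look like the following sketch. The function query_with_failover and its query callback are hypothetical, and trying servers in order of ascending preference value is an assumption borrowed from MX records; the text only says that the client chooses one of the available servers.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def query_with_failover(records: List[Tuple[int, str]],
                        fallbacks: Iterable[str],
                        query: Callable[[str], Optional[dict]]) -> Optional[dict]:
    """Try official servers, then fallback servers, until one answers."""
    # Assumption: lower preference values are tried first, as with MX records.
    official = [host for _, host in sorted(records)]
    for host in [*official, *fallbacks]:
        try:
            reply = query(host)
        except OSError:
            continue          # server unreachable: move on to the next one
        if reply is not None:
            return reply
    return None               # no official or fallback server responded
```

Passing an empty list of records models the case where the DNS lookup returns nothing, in which case only the fallback servers are consulted.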
moore@cs.utk.edu