Public Identifiers, System Identifiers, and Catalog Files

When a DTD or other external file is referenced from a document, the reference can be specified in three ways: using a public identifier, a system identifier, or both. In XML, the system identifier is generally required and the public identifier is optional. In SGML, neither is required, but at least one must be present.[1]

A public identifier is a globally unique, abstract name, such as the following, which is the official public identifier for DocBook V3.1:

-//OASIS//DTD DocBook V3.1//EN


The introduction of XML has added some small complications to system identifiers. In SGML, a system identifier generally points to a single, local version of a file using local system conventions. In XML, it must be a Uniform Resource Identifier (URI). The most common form of URI today is the Uniform Resource Locator (URL), which is familiar to anyone who browses the Web. URLs are a lot like SGML system identifiers, because they generally point to a single version of a file on a particular machine. In the future, Uniform Resource Names (URNs), another form of URI, will allow XML system identifiers to have the abstract characteristics of public identifiers.

The following filename is an example of an SGML system identifier:

/usr/local/sgml/docbook/3.1/docbook.dtd

An equivalent XML system identifier might be:

file:///usr/local/sgml/docbook/3.1/docbook.dtd
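
The relationship between the two forms above can be sketched in a few lines of Python. This is only an illustration for absolute POSIX-style paths; the function name posix_path_to_file_uri is hypothetical, and other platforms (drive letters, for instance) would need different handling.

```python
from urllib.parse import quote


def posix_path_to_file_uri(path):
    """Turn an absolute POSIX path into a file: URI (illustrative helper)."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute POSIX path")
    # quote() percent-encodes characters that are not allowed in a URI path
    # and leaves "/" untouched.
    return "file://" + quote(path)


print(posix_path_to_file_uri("/usr/local/sgml/docbook/3.1/docbook.dtd"))
# file:///usr/local/sgml/docbook/3.1/docbook.dtd
```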


The advantage of using the public identifier is that it makes your documents more portable. For any system on which DocBook is installed, the public identifier will resolve to the appropriate local version of the DTD (if public identifiers can be resolved at all).

Public identifiers have two disadvantages:



Public Identifiers

An important characteristic of public identifiers is that they are globally unique. Referring to a document with a public identifier should mean that the identifier will resolve to the same actual document on any system even though the location of that document on each system may vary. As a rule, you should never reuse public identifiers, and a published revision should have a new public identifier. Not following these rules defeats one purpose of the public identifier.

A public identifier can be any string of upper- and lowercase letters, digits, white space (including line breaks), and any of the following symbols: “'”, “(”, “)”, “+”, “,”, “-”, “.”, “/”, “:”, “=”, “?”.
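
A check against this character set might look like the following sketch. It assumes that "white space" means space, carriage return, and line feed; whether a given tool also accepts tabs is not covered here. The function name is hypothetical.

```python
import string

# Characters permitted by the rule above. Treating white space as
# space, CR, and LF only is an assumption of this sketch.
ALLOWED_CHARS = set(
    string.ascii_letters + string.digits + "'()+,-./:=?" + " \r\n"
)


def uses_only_public_id_chars(text):
    """Return True if every character of text is in the permitted set."""
    return all(ch in ALLOWED_CHARS for ch in text)


print(uses_only_public_id_chars("-//OASIS//DTD DocBook V3.1//EN"))  # True
print(uses_only_public_id_chars("-//OASIS//DTD DocBook_V3.1//EN"))  # False
```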

Formal public identifiers

Most public identifiers conform to the ISO 8879 standard that defines formal public identifiers. Formal public identifiers, frequently referred to as FPI, have a prescribed format that can ensure uniqueness:[2]

prefix//owner-identifier//
text-class text-description//
language//display-version

Here are descriptions of the identifiers in this string:

prefix

The prefix is either a “+” or a “-”. Registered public identifiers begin with “+”; unregistered identifiers begin with “-”.

(ISO standards sometimes use a third form beginning with ISO and the standard number, but this form is only available to ISO.)

The purpose of registration is to guarantee a unique owner-identifier. There are few authorities with the power to issue registered public identifiers, so in practice unregistered identifiers are more common.

The Graphic Communications Association (GCA) can assign registered public identifiers. It does this by issuing the applicant a unique string and declaring the format of the owner identifier. For example, the Davenport Group was issued the string “A00002” and could have published DocBook using an FPI of the following form:


+//ISO/IEC 9070/RA::A00002//...


Another way to use a registered public identifier is to use the format reserved for internet domain names. For example, O'Reilly can issue documents using an FPI of the following form:


+//IDN oreilly.com//...


As of DocBook V3.1, the OASIS Technical Committee responsible for DocBook has elected to use the unregistered owner identifier OASIS, so its prefix is “-”:


-//OASIS//...


owner-identifier

Identifies the person or organization that owns the identifier. Registration guarantees a unique owner identifier. Short of registration, some effort should be made to ensure that the owner identifier is globally unique. A company name, for example, is a reasonable choice, as is an Internet domain name. It's also not uncommon to see the names of individuals used as the owner-identifier, although this clearly may introduce collisions over time.

The owner-identifier for DocBook V3.1 is OASIS. Earlier versions used the owner-identifier Davenport.

text-class

The text class identifies the kind of document that is associated with this public identifier. Common text classes are:

DOCUMENT

An SGML or XML document.

DTD

A DTD or part of a DTD.

ELEMENTS

A collection of element declarations.

ENTITIES

A collection of entity declarations.

NONSGML

Data that is not in SGML or XML.



DocBook is a DTD, thus its text class is DTD.

text-description

This field provides a description of the document. The text description is free-form, but cannot include the string //.

The text description of DocBook is DocBook V3.1.

In the uncommon case of unavailable public texts (FPIs for proprietary DTDs, for example), there are a few other options available (technically in front of or in place of the text description), but they're rarely used. [3]

language

Indicates the language in which the document is written. It is recommended that the ISO standard two-letter language codes be used if possible.

DocBook is an English-language DTD, thus its language is EN.

display-version

This field, which is not frequently used, distinguishes between public texts that are the same except for the display device or system to which they apply.

For example, the FPI for the ISO Latin 1 character set is:

-//ISO 8879-1986//ENTITIES Added Latin 1//EN


A reasonable FPI for an XML version of this character set is:

-//ISO 8879-1986//ENTITIES Added Latin 1//EN//XML
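
The field layout described above can be illustrated with a small parser. This is a sketch, not a complete ISO 8879 implementation: it handles only the “+” and “-” prefixes, ignores the ISO-only form and the rarely used unavailable-text options, and the FPI type and parse_fpi name are hypothetical.

```python
from typing import NamedTuple, Optional


class FPI(NamedTuple):
    prefix: str
    owner: str
    text_class: str
    text_description: str
    language: str
    display_version: Optional[str]


def parse_fpi(fpi):
    """Split a formal public identifier into its fields (simplified)."""
    # Collapse runs of white space, since line breaks are allowed.
    normalized = " ".join(fpi.split())
    parts = normalized.split("//")
    if len(parts) not in (4, 5):
        raise ValueError("expected 4 or 5 fields separated by //")
    prefix, owner, text, language = parts[:4]
    if prefix not in ("+", "-"):
        raise ValueError("prefix must be + or -")
    # The text class is the first word; the rest is the description.
    text_class, _, description = text.partition(" ")
    display_version = parts[4] if len(parts) == 5 else None
    return FPI(prefix, owner, text_class, description, language,
               display_version)


docbook = parse_fpi("-//OASIS//DTD DocBook V3.1//EN")
print(docbook.owner, docbook.text_class, docbook.language)
# OASIS DTD EN
```

Splitting on // works only because the text description itself may not contain that string.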




System Identifiers

System identifiers are usually filenames on the local system. In SGML, there's no constraint on what they can be. Anything that your SGML processing system recognizes is allowed. In XML, system identifiers must be URIs (Uniform Resource Identifiers).

The use of URIs as system identifiers introduces the possibility that a system identifier can be a URN. This gives the system identifier the same global uniqueness as the public identifier. It seems likely that XML system identifiers will eventually move in this direction.

Catalog Files

Catalog files are the standard mechanism for resolving public identifiers into system identifiers. Some resolution mechanism is necessary because DocBook refers to its component modules with public identifiers, and those must be mapped to actual files on the system before any piece of software can actually load them.

The catalog file format was defined in 1994 by SGML Open (now OASIS). The formal specification is contained in OASIS Technical Resolution 9401:1997.

Informally, a catalog is a text file that contains a number of keyword/value pairs. The most frequently used keywords are PUBLIC, SYSTEM, SGMLDECL, DTDDECL, CATALOG, OVERRIDE, DELEGATE, and DOCTYPE.

PUBLIC

The PUBLIC keyword maps public identifiers to system identifiers:


PUBLIC "-//OASIS//DTD DocBook V3.1//EN" "docbook/3.1/docbook.dtd"

SYSTEM

The SYSTEM keyword maps system identifiers to system identifiers:


SYSTEM "http://nwalsh.com/docbook/xml/1.3/db3xml.dtd"
    "docbook/xml/1.3/db3xml.dtd"

SGMLDECL

The SGMLDECL keyword identifies the system identifier of the SGML Declaration that should be used:


SGMLDECL "docbook/3.1/docbook.dcl"

DTDDECL

Like SGMLDECL, DTDDECL identifies the SGML Declaration that should be used. DTDDECL associates a declaration with a particular public identifier for a DTD:

DTDDECL "-//OASIS//DTD DocBook V3.1//EN" "docbook/3.1/docbook.dcl"

Unfortunately, it is not supported by the free tools that are available. The practical benefit of DTDDECL can usually be achieved, albeit in a slightly cumbersome way, with multiple catalog files.

CATALOG

The CATALOG keyword allows one catalog to include the content of another. This can make maintenance somewhat easier and allows a system to directly use the catalog files included in DTD distributions. For example, the DocBook distribution includes a catalog file. Rather than copying each of the declarations in that catalog into your system catalog, you can simply include the contents of the DocBook catalog:

CATALOG "docbook/3.1/catalog"

OVERRIDE

The OVERRIDE keyword indicates whether or not public identifiers override system identifiers. If a given declaration includes both a system identifier and a public identifier, most systems attempt to process the document referenced by the system identifier, and consequently ignore the public identifier. Specifying

OVERRIDE YES

in the catalog informs the processing system that resolution should be attempted first with the public identifier.
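
The effect of OVERRIDE can be sketched as a simple preference rule. The function below is a simplification of what a real catalog processor does; the name choose_identifier and the dictionary-based public_map are assumptions made for illustration.

```python
def choose_identifier(public_id, system_id, public_map, override):
    """Pick what to load for a declaration (simplified model).

    public_map maps public identifiers to system identifiers, as
    PUBLIC catalog entries do. override models OVERRIDE YES / NO.
    """
    mapped = public_map.get(public_id) if public_id else None
    if override and mapped is not None:
        # OVERRIDE YES: try the public identifier first.
        return mapped
    if system_id:
        # Otherwise an explicit system identifier wins.
        return system_id
    # No system identifier: fall back to the catalog mapping, if any.
    return mapped


public_map = {"-//OASIS//DTD DocBook V3.1//EN": "docbook/3.1/docbook.dtd"}
print(choose_identifier("-//OASIS//DTD DocBook V3.1//EN",
                        "docbook.dtd", public_map, override=True))
# docbook/3.1/docbook.dtd
```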

DELEGATE

The DELEGATE keyword allows you to specify that some set of public identifiers should be resolved by another catalog. Unlike the CATALOG keyword, which loads the referenced catalog, DELEGATE does nothing until an attempt is made to resolve a public identifier.

The DELEGATE entry specifies a partial public identifier and an alternate catalog:

DELEGATE "-//OASIS" "/usr/sgml/oasis/catalog"


Partial public identifiers are simply initial substring matches. Given the preceding entry, if an attempt is made to match any public identifier that begins with the string -//OASIS, the alternate catalog /usr/sgml/oasis/catalog will be used instead of the current catalog.
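
The initial-substring rule can be expressed in a couple of lines. In this sketch, delegates is a list of (partial public identifier, catalog) pairs, and returning every match in catalog order is a simplifying assumption; the function name and the second public identifier in the usage example are hypothetical.

```python
def delegated_catalogs(public_id, delegates):
    """Return the catalogs whose DELEGATE prefix matches public_id."""
    return [catalog for prefix, catalog in delegates
            if public_id.startswith(prefix)]


delegates = [("-//OASIS", "/usr/sgml/oasis/catalog")]

print(delegated_catalogs("-//OASIS//DTD DocBook V3.1//EN", delegates))
# ['/usr/sgml/oasis/catalog']
print(delegated_catalogs("-//Example//DTD Sample//EN", delegates))
# []
```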

DOCTYPE

The DOCTYPE keyword allows you to specify a default system identifier. If an SGML document begins with a DOCTYPE declaration that specifies neither a public identifier nor a system identifier (or is missing a DOCTYPE declaration altogether), the DOCTYPE catalog entry may provide a default:


DOCTYPE BOOK n:/share/sgml/docbook/3.1/docbook.dtd

A small fragment of an actual catalog file is shown in Example 2-1.

Example 2-1. A Sample Catalog

                                                           (1)
-- Comments are delimited by pairs of double-hyphens,
   as in SGML and XML comments. --
                                                          (2)
OVERRIDE YES
                                                          (3)
SGMLDECL "n:/share/sgml/docbook/3.1/docbook.dcl"
                                                          (4)
DOCTYPE  BOOK  n:/share/sgml/docbook/3.1/docbook.dtd
                                                          (5)
PUBLIC "-//OASIS//DTD DocBook V3.1//EN" 
  n:/share/sgml/docbook/3.1/docbook.dtd
                                                          (6)
SYSTEM "http://nwalsh.com/docbook/xml/1.3/db3xml.dtd"
  n:/share/sgml/Norman_Walsh/db3xml/db3xml.dtd
(1)
Catalog files may also include comments.
(2)
This catalog specifies that public identifiers should be used in favor of system identifiers, if both are present.
(3)
The default declaration specified by this catalog is the DocBook declaration.
(4)
Given an explicit (or implied) SGML DOCTYPE of

<!DOCTYPE BOOK SYSTEM>

use n:/share/sgml/docbook/3.1/docbook.dtd as the default system identifier. Note that this can only apply to SGML documents because the DOCTYPE declaration above is not valid in XML.
(5)
Map the OASIS public identifier to the local copy of the DocBook V3.1 DTD.
(6)
Map a system identifier for the XML version of DocBook to a local version.
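
To make the keyword/value structure concrete, here is a minimal reader for catalogs like the one in Example 2-1. It is a sketch rather than a conforming implementation of the OASIS specification: it knows only the keywords discussed in this section, assumes the argument counts shown in the examples above, and silently skips anything else. The names tokenize, parse_catalog, and ARG_COUNT are hypothetical.

```python
import re

# A comment (-- ... --), a double- or single-quoted literal, or a bare token.
TOKEN = re.compile(r'--.*?--|"([^"]*)"|\'([^\']*)\'|(\S+)', re.DOTALL)

# Number of arguments each keyword takes in the examples above.
ARG_COUNT = {"PUBLIC": 2, "SYSTEM": 2, "DOCTYPE": 2, "DELEGATE": 2,
             "DTDDECL": 2, "SGMLDECL": 1, "CATALOG": 1, "OVERRIDE": 1}


def tokenize(text):
    tokens = []
    for match in TOKEN.finditer(text):
        if match.lastindex is None:
            continue  # a comment matched; no group captured anything
        tokens.append(match.group(match.lastindex))
    return tokens


def parse_catalog(text):
    """Return a list of (KEYWORD, arg, ...) tuples."""
    tokens = tokenize(text)
    entries = []
    i = 0
    while i < len(tokens):
        count = ARG_COUNT.get(tokens[i].upper())
        if count is None:
            i += 1  # unknown token: skip it
            continue
        entries.append((tokens[i].upper(), *tokens[i + 1:i + 1 + count]))
        i += 1 + count
    return entries


SAMPLE = '''
-- Comments are delimited by pairs of double-hyphens,
   as in SGML and XML comments. --
OVERRIDE YES
SGMLDECL "n:/share/sgml/docbook/3.1/docbook.dcl"
DOCTYPE  BOOK  n:/share/sgml/docbook/3.1/docbook.dtd
PUBLIC "-//OASIS//DTD DocBook V3.1//EN"
  n:/share/sgml/docbook/3.1/docbook.dtd
SYSTEM "http://nwalsh.com/docbook/xml/1.3/db3xml.dtd"
  n:/share/sgml/Norman_Walsh/db3xml/db3xml.dtd
'''

for entry in parse_catalog(SAMPLE):
    print(entry)
```

Building a dictionary from the PUBLIC entries then gives the public-to-system mapping that a resolver would consult.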

A few notes:

Locating catalog files

Catalog files go a long way towards making documents more portable by introducing a level of indirection. A problem still remains, however: how does a processor locate the appropriate catalog file(s)? OASIS outlines a complete interchange packaging scheme, but for most applications the answer is simply that the processor looks for a file called catalog or CATALOG.

Some applications allow you to specify a list of directories that should be examined for catalog files. Other tools allow you to specify the actual files.

Note that even if a list of directories or catalog files is provided, applications may still load catalog files that occur in directories in which other documents are found. For example, SP and Jade always load the catalog file that occurs in the directory in which a DTD or document resides, even if that directory is not on the catalog file list.
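
The simple lookup convention described here, a file called catalog or CATALOG in each directory on a search list, might be sketched as follows. The function name find_catalogs is hypothetical, and real tools differ in how the directory list is supplied and in which extra directories they consult.

```python
from pathlib import Path


def find_catalogs(directories):
    """Return the catalog file found in each directory, in order."""
    found = []
    for directory in directories:
        for name in ("catalog", "CATALOG"):
            candidate = Path(directory) / name
            if candidate.is_file():
                found.append(candidate)
                # Stop at the first hit so that a case-insensitive
                # filesystem does not yield the same file twice.
                break
    return found
```

A processor that behaves like SP or Jade would, in addition, add the directory of each DTD or document it opens to this search.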

Notes

[1]

This is not absolutely true. SGML allows for the possibility that the reference could be implied by the application, but this is very rarely the case.

[2]

Essentially, it can ensure that two different owners won't accidentally tread on each other. Nothing can prevent a given owner from reusing public identifiers, except maybe common sense.

[3]

See Appendix A of [maler96] for more details.