When a DTD or other external file is referenced from a document, the reference can be specified in three ways: using a public identifier, a system identifier, or both. In XML, the system identifier is generally required and the public identifier is optional. In SGML, neither is required, but at least one must be present.[1]
A public identifier is a globally unique, abstract name, such as the following, which is the official public identifier for DocBook V3.1:
-//OASIS//DTD DocBook V3.1//EN
The introduction of XML has added some small complications to system identifiers. In SGML, a system identifier generally points to a single, local version of a file using local system conventions. In XML, it must be a Uniform Resource Identifier (URI). The most common form of URI today is the Uniform Resource Locator (URL), which is familiar to anyone who browses the Web. URLs are a lot like SGML system identifiers, because they generally point to a single version of a file on a particular machine. In the future, Uniform Resource Names (URNs), another form of URI, will allow XML system identifiers to have the abstract characteristics of public identifiers.
The following filename is an example of an SGML system identifier:
/usr/local/sgml/docbook/3.1/docbook.dtd
An equivalent XML system identifier, expressed as a URI, might be:
file:///usr/local/sgml/docbook/3.1/docbook.dtd
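The relationship between a local filename and a file: URI can be sketched in a few lines of Python. This is only an illustration: the function name posix_path_to_file_uri is hypothetical, and the sketch assumes an absolute Unix-style path.

```python
from urllib.parse import quote


def posix_path_to_file_uri(path):
    """Turn an absolute POSIX path into a file: URI (illustrative helper)."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute POSIX path")
    # quote() percent-encodes characters such as spaces but keeps "/" intact.
    return "file://" + quote(path)


print(posix_path_to_file_uri("/usr/local/sgml/docbook/3.1/docbook.dtd"))
```

Going the other way (from URI back to filename) depends on local system conventions, which is exactly why SGML left system identifiers unconstrained.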
The advantage of using the public identifier is that it makes your documents more portable. For any system on which DocBook is installed, the public identifier will resolve to the appropriate local version of the DTD (if public identifiers can be resolved at all).
Public identifiers have two disadvantages:
Because XML requires system identifiers but not public identifiers, XML tools that are still under development may not provide adequate support for public identifiers. To work with such tools you must use system identifiers.
Public identifiers aren't magical. They're simply a method of indirection. For them to work, there must be a resolution mechanism for public identifiers. Luckily, several years ago, SGML Open (now OASIS) described a standard mechanism for mapping public identifiers to system identifiers using catalog files.
See OASIS Technical Resolution 9401:1997 (Amendment 2 to TR 9401).
An important characteristic of public identifiers is that they are globally unique. Referring to a document with a public identifier should mean that the identifier will resolve to the same actual document on any system even though the location of that document on each system may vary. As a rule, you should never reuse public identifiers, and a published revision should have a new public identifier. Not following these rules defeats one purpose of the public identifier.
A public identifier can be any string made up of upper- and lowercase letters, digits, white space (including line breaks), and any of the following symbols: “'”, “(”, “)”, “+”, “,”, “-”, “.”, “/”, “:”, “=”, and “?”.
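If you want to check a candidate public identifier against this character set, a regular expression is enough. The following Python sketch is an assumption-laden illustration: the name is_valid_public_id is hypothetical, and "white space" is taken to mean spaces, carriage returns, and line feeds only.

```python
import re

# Letters, digits, the listed symbols, and white space.
# Assumption: white space means space, CR, and LF (tabs are excluded).
_PUBLIC_ID_RE = re.compile(r"[A-Za-z0-9'()+,\-./:=? \r\n]*")


def is_valid_public_id(text):
    """Return True if every character of text is allowed in a public identifier."""
    return _PUBLIC_ID_RE.fullmatch(text) is not None


print(is_valid_public_id("-//OASIS//DTD DocBook V3.1//EN"))
print(is_valid_public_id("-//OASIS//DTD DocBook_V3.1//EN"))  # underscore is not in the set
```

This checks only the characters; it says nothing about whether the string is a well-formed formal public identifier.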
Most public identifiers conform to the ISO 8879 standard that defines formal public identifiers. Formal public identifiers, frequently referred to as FPI, have a prescribed format that can ensure uniqueness:[2]
prefix//owner-identifier//text-class text-description//language//display-version
Here are descriptions of the identifiers in this string:
The prefix is either a “+” or a “-”. Registered public identifiers begin with “+”; unregistered identifiers begin with “-”.
(ISO standards sometimes use a third form beginning with ISO and the standard number, but this form is only available to ISO.)
The purpose of registration is to guarantee a unique owner-identifier. There are few authorities with the power to issue registered public identifiers, so in practice unregistered identifiers are more common.
The Graphic Communications Association (GCA) can assign registered public identifiers. They do this by issuing the applicant a unique string and declaring the format of the owner identifier. For example, the Davenport Group was issued the string “A00002” and could have published DocBook using an FPI of the following form:
+//ISO/IEC 9070/RA::A00002//...
Another way to use a registered public identifier is to use the format reserved for internet domain names. For example, O'Reilly can issue documents using an FPI of the following form:
+//IDN oreilly.com//...
As of DocBook V3.1, the OASIS Technical Committee responsible for DocBook has elected to use the unregistered owner identifier OASIS; thus its prefix is “-”:
-//OASIS//...
The owner-identifier identifies the person or organization that owns the identifier. Registration guarantees a unique owner identifier. Short of registration, some effort should be made to ensure that the owner identifier is globally unique. A company name, for example, is a reasonable choice, as is an Internet domain name. It's also not uncommon to see the names of individuals used as the owner-identifier, although clearly this may introduce collisions over time.
The owner-identifier for DocBook V3.1 is OASIS. Earlier versions used the owner-identifier Davenport.
The text class identifies the kind of document that is associated with this public identifier. Common text classes are:
DOCUMENT: An SGML or XML document.
DTD: A DTD or part of a DTD.
ELEMENTS: A collection of element declarations.
ENTITIES: A collection of entity declarations.
NONSGML: Data that is not in SGML or XML.
DocBook is a DTD, thus its text class is DTD.
The text-description field provides a description of the document. The text description is free-form, but cannot include the string //.
The text description of DocBook is DocBook V3.1.
In the uncommon case of unavailable public texts (FPIs for proprietary DTDs, for example), there are a few other options available (technically in front of or in place of the text description), but they're rarely used. [3]
The language field indicates the language in which the document is written. It is recommended that the ISO standard two-letter language codes (ISO 639) be used if possible.
DocBook is an English-language DTD, thus its language is EN.
The display-version field, which is not frequently used, distinguishes between public texts that are the same except for the display device or system to which they apply.
For example, the FPI for the ISO Latin 1 character set is:
-//ISO 8879-1986//ENTITIES Added Latin 1//EN
A reasonable FPI for an XML version of this character set is:
-//ISO 8879-1986//ENTITIES Added Latin 1//EN//XML
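The field layout described above can be made concrete with a small parser. This Python sketch is not a complete implementation of ISO 8879 formal public identifiers: the names FPI and parse_fpi are hypothetical, it handles only the “+” and “-” prefixes (not the ISO-only form), and it ignores the rarely used options for unavailable public texts.

```python
from typing import NamedTuple, Optional


class FPI(NamedTuple):
    prefix: str
    owner_identifier: str
    text_class: str
    text_description: str
    language: str
    display_version: Optional[str] = None


def parse_fpi(fpi):
    """Split a formal public identifier into its fields (simplified sketch)."""
    parts = [part.strip() for part in fpi.split("//")]
    if len(parts) not in (4, 5):
        raise ValueError("expected 4 or 5 fields separated by //")
    prefix, owner, class_and_description, language = parts[:4]
    if prefix not in ("+", "-"):
        raise ValueError("prefix must be + or -")
    # The text class is the first word; the rest is the free-form description.
    fields = class_and_description.split(None, 1)
    if len(fields) != 2:
        raise ValueError("expected a text class followed by a description")
    display_version = parts[4] if len(parts) == 5 else None
    return FPI(prefix, owner, fields[0], fields[1], language, display_version)


print(parse_fpi("-//OASIS//DTD DocBook V3.1//EN"))
print(parse_fpi("-//ISO 8879-1986//ENTITIES Added Latin 1//EN//XML"))
```

Splitting on // works because the text description is not allowed to contain that string.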
System identifiers are usually filenames on the local system. In SGML, there's no constraint on what they can be. Anything that your SGML processing system recognizes is allowed. In XML, system identifiers must be URIs (Uniform Resource Identifiers).
The use of URIs as system identifiers introduces the possibility that a system identifier can be a URN. This allows the system identifier to share the global uniqueness of the public identifier. It seems likely that XML system identifiers will eventually move in this direction.
Catalog files are the standard mechanism for resolving public identifiers into system identifiers. Some resolution mechanism is necessary because DocBook refers to its component modules with public identifiers, and those must be mapped to actual files on the system before any piece of software can actually load them.
The catalog file format was defined in 1994 by SGML Open (now OASIS). The formal specification is contained in OASIS Technical Resolution 9401:1997.
Informally, a catalog is a text file that contains a number of keyword/value pairs. The most frequently used keywords are PUBLIC, SYSTEM, SGMLDECL, DTDDECL, CATALOG, OVERRIDE, DELEGATE, and DOCTYPE.
The PUBLIC keyword maps public identifiers to system identifiers:
PUBLIC "-//OASIS//DTD DocBook V3.1//EN" "docbook/3.1/docbook.dtd"
The SYSTEM keyword maps system identifiers to system identifiers:
SYSTEM "http://nwalsh.com/docbook/xml/1.3/db3xml.dtd" "docbook/xml/1.3/db3xml.dtd"
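Conceptually, PUBLIC and SYSTEM entries are just two lookup tables. The Python sketch below holds the two entries shown above in dictionaries; the names PUBLIC_MAP, SYSTEM_MAP, and resolve are hypothetical, and trying the SYSTEM table before the PUBLIC table is an assumption made for the illustration, not a statement about any particular tool.

```python
# Hypothetical in-memory form of the two catalog entries shown above.
PUBLIC_MAP = {
    "-//OASIS//DTD DocBook V3.1//EN": "docbook/3.1/docbook.dtd",
}
SYSTEM_MAP = {
    "http://nwalsh.com/docbook/xml/1.3/db3xml.dtd": "docbook/xml/1.3/db3xml.dtd",
}


def resolve(public_id=None, system_id=None):
    """Return the mapped system identifier, or the original one if nothing matches."""
    if system_id in SYSTEM_MAP:
        return SYSTEM_MAP[system_id]
    if public_id in PUBLIC_MAP:
        return PUBLIC_MAP[public_id]
    return system_id


print(resolve(public_id="-//OASIS//DTD DocBook V3.1//EN"))
print(resolve(system_id="http://nwalsh.com/docbook/xml/1.3/db3xml.dtd"))
```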
The SGMLDECL keyword identifies the system identifier of the SGML Declaration that should be used:
SGMLDECL "docbook/3.1/docbook.dcl"
Like SGMLDECL, DTDDECL identifies the SGML Declaration that should be used. DTDDECL associates a declaration with a particular public identifier for a DTD:
DTDDECL "-//OASIS//DTD DocBook V3.1//EN" "docbook/3.1/docbook.dcl"
Unfortunately, DTDDECL is not supported by the free tools that are available. Its practical benefit can usually be achieved, albeit in a slightly cumbersome way, with multiple catalog files.
The CATALOG keyword allows one catalog to include the content of another. This can make maintenance somewhat easier and allows a system to directly use the catalog files included in DTD distributions. For example, the DocBook distribution includes a catalog file. Rather than copying each of the declarations in that catalog into your system catalog, you can simply include the contents of the DocBook catalog:
CATALOG "docbook/3.1/catalog"
The OVERRIDE keyword indicates whether or not public identifiers override system identifiers. If a given declaration includes both a system identifier and a public identifier, most systems attempt to process the document referenced by the system identifier, and consequently ignore the public identifier. Specifying
OVERRIDE YES
in the catalog tells the processing system to attempt resolution with the public identifier first.
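As a rough illustration of the choice that OVERRIDE controls, consider a declaration that supplies both identifiers. In this Python sketch the function choose_identifier and its parameters are hypothetical, and a single global override flag stands in for whatever bookkeeping a real processor does.

```python
def choose_identifier(public_id, system_id, public_map, override):
    """Pick the system identifier to load for one declaration (simplified sketch)."""
    mapped = public_map.get(public_id) if public_id is not None else None
    if system_id is None:
        # Only a public identifier was given, so the catalog is the only option.
        return mapped
    if override and mapped is not None:
        # OVERRIDE YES: prefer the catalog's mapping of the public identifier.
        return mapped
    # OVERRIDE NO: the explicit system identifier wins.
    return system_id


catalog = {"-//OASIS//DTD DocBook V3.1//EN": "docbook/3.1/docbook.dtd"}
fpi = "-//OASIS//DTD DocBook V3.1//EN"
print(choose_identifier(fpi, "docbook.dtd", catalog, override=True))
print(choose_identifier(fpi, "docbook.dtd", catalog, override=False))
```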
The DELEGATE keyword allows you to specify that some set of public identifiers should be resolved by another catalog. Unlike the CATALOG keyword, which loads the referenced catalog, DELEGATE does nothing until an attempt is made to resolve a public identifier.
The DELEGATE entry specifies a partial public identifier and an alternate catalog:
DELEGATE "-//OASIS" "/usr/sgml/oasis/catalog"
Partial public identifiers are simply initial substring matches. Given the preceding entry, if an attempt is made to match any public identifier that begins with the string -//OASIS, the alternate catalog /usr/sgml/oasis/catalog will be used instead of the current catalog.
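The initial-substring match is easy to sketch. In the Python fragment below, the name delegated_catalogs is hypothetical, and trying the longest matching prefix first is an assumption made for the illustration.

```python
def delegated_catalogs(public_id, delegates):
    """Return the catalogs whose DELEGATE prefix matches public_id.

    delegates is a list of (partial_public_id, catalog_path) pairs.
    Assumption: longer (more specific) prefixes are tried first.
    """
    matches = [(prefix, catalog) for prefix, catalog in delegates
               if public_id.startswith(prefix)]
    matches.sort(key=lambda match: len(match[0]), reverse=True)
    return [catalog for _, catalog in matches]


delegates = [("-//OASIS", "/usr/sgml/oasis/catalog")]
print(delegated_catalogs("-//OASIS//DTD DocBook V3.1//EN", delegates))
print(delegated_catalogs("-//Davenport//DTD DocBook V3.0//EN", delegates))
```

Note that nothing is loaded here; the function only reports which catalogs would need to be consulted, which mirrors the lazy behavior of DELEGATE.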
The DOCTYPE keyword allows you to specify a default system identifier. If an SGML document begins with a DOCTYPE declaration that specifies neither a public identifier nor a system identifier (or is missing a DOCTYPE declaration altogether), a DOCTYPE entry in the catalog may provide a default:
DOCTYPE BOOK n:/share/sgml/docbook/3.1/docbook.dtd
A small fragment of an actual catalog file is shown in Example 2-1.
Example 2-1. A Sample Catalog
<!DOCTYPE BOOK SYSTEM>
A few notes:
It's not uncommon to have several catalog files. See the section “Locating catalog files” below.
As with attribute values on elements, the public identifier and system identifier can be surrounded by either single or double quotes.
White space in the catalog file is generally irrelevant. You can use spaces, tabs, or new lines between keywords and their arguments.
When a relative system identifier is used, it is considered to be relative to the location of the catalog file, not the document being processed.
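These notes translate fairly directly into code. The following Python sketch tokenizes catalog text and resolves relative system identifiers against the catalog's own location. It is a simplification under several assumptions: the names ARITY and parse_catalog are hypothetical, only the keywords discussed in this section are recognized, catalog comments are not handled, and system identifiers are treated as POSIX paths rather than URLs.

```python
import posixpath
import re

# Number of arguments each keyword takes.
ARITY = {"PUBLIC": 2, "SYSTEM": 2, "SGMLDECL": 1, "DTDDECL": 2,
         "CATALOG": 1, "OVERRIDE": 1, "DELEGATE": 2, "DOCTYPE": 2}

# A token is a double-quoted literal, a single-quoted literal, or a bare word.
TOKEN_RE = re.compile(r'"([^"]*)"|\'([^\']*)\'|(\S+)')


def parse_catalog(text, catalog_path):
    """Return a list of (keyword, arg, ...) tuples for the entries in text."""
    tokens = [next(g for g in m.groups() if g is not None)
              for m in TOKEN_RE.finditer(text)]
    base = posixpath.dirname(catalog_path)
    entries = []
    i = 0
    while i < len(tokens):
        keyword = tokens[i].upper()
        if keyword not in ARITY:
            raise ValueError("unsupported keyword: " + tokens[i])
        count = ARITY[keyword]
        args = tokens[i + 1:i + 1 + count]
        if len(args) != count:
            raise ValueError("too few arguments for " + keyword)
        if keyword != "OVERRIDE":
            # The last argument is a system identifier; a relative one is
            # taken relative to the catalog file, not the document.
            args[-1] = posixpath.join(base, args[-1])
        entries.append((keyword, *args))
        i += 1 + count
    return entries


sample = """PUBLIC "-//OASIS//DTD DocBook V3.1//EN"
       'docbook/3.1/docbook.dtd'
OVERRIDE YES"""
print(parse_catalog(sample, "/usr/local/sgml/catalog"))
```

Because white space is irrelevant, the parser works on a flat token stream and never looks at line boundaries.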
Catalog files go a long way towards making documents more portable by introducing a level of indirection. A problem still remains, however: how does a processor locate the appropriate catalog file(s)? OASIS outlines a complete interchange packaging scheme, but for most applications the answer is simply that the processor looks for a file called catalog or CATALOG.
Some applications allow you to specify a list of directories that should be examined for catalog files. Other tools allow you to specify the actual files.
Note that even if a list of directories or catalog files is provided, applications may still load catalog files that occur in directories in which other documents are found. For example, SP and Jade always load the catalog file that occurs in the directory in which a DTD or document resides, even if that directory is not on the catalog file list.
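A search of this kind might look like the following Python sketch. It is not how SP, Jade, or any other tool is actually implemented: the name find_catalogs and its parameters are hypothetical, and appending the document's own directory to the search list merely imitates the behavior described above.

```python
import os


def find_catalogs(directories, document=None, names=("catalog", "CATALOG")):
    """Return the catalog files found in directories, in search order."""
    search = list(directories)
    if document is not None:
        # Also look next to the document itself, even if that directory
        # was not on the list.
        search.append(os.path.dirname(os.path.abspath(document)))
    found = []
    for directory in search:
        for name in names:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                if candidate not in found:
                    found.append(candidate)
                # Stop at the first hit so a case-insensitive filesystem
                # does not report the same file twice.
                break
    return found


print(find_catalogs(["/usr/local/sgml", "/usr/sgml/oasis"]))
```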
[1] This is not absolutely true. SGML allows for the possibility that the reference could be implied by the application, but this is very rarely the case.
[2] Essentially, it can ensure that two different owners won't accidentally tread on each other. Nothing can prevent a given owner from reusing public identifiers, except maybe common sense.
[3] See Appendix A of [maler96] for more details.