Basic SGML/XML Concepts

Here are the basic SGML/XML concepts you need to grasp:

Structured and Semantic Markup

An essential characteristic of structured markup is that it explicitly distinguishes (and accordingly “marks up” within a document) the structure and semantic content of a document. It does not mark up the way in which the document will appear to the reader, in print or otherwise.

In the days before word processors it was common for a typed manuscript to be submitted to a publisher. The manuscript identified the logical structures of the document (chapters, section titles, and so on), but said nothing about its appearance. Working independently of the author, a designer then developed a specification for the appearance of the document, and a typesetter marked up the document and applied the designer's format to it.

Because presentation or appearance is usually based on structure and content, SGML markup logically precedes and generally determines the way a document will look to a reader. If you are familiar with strict, simple HTML markup, you know that the same document, structurally unchanged, can look different on different computers. That's because the markup does not specify many aspects of a document's appearance, although it does specify many aspects of a document's structure.

Many writers type their text into a word processor, line-by-line and word-for-word, italicizing technical terms, underlining words for emphasis, or setting section headers in a font complementary to the body text, and finally, setting the headers off with a few carriage returns fore and aft. The format such a writer imposes on the words on the screen imparts structure to the document by changing its appearance in ways that a reader can more or less reliably decode. The reliability depends on how consistently and unambiguously the changes in type and layout are made. By contrast, an SGML/XML markup of a section header explicitly specifies that a specific piece of text is a section header. This assertion does not specify the presentation or appearance of the section header, but it makes the fact that the text is a section header completely unambiguous.

SGML and XML use named elements, delimited by angle brackets (“<” and “>”), to identify the markup in a document. In DocBook, a top-level section is sect1, so the title of a top-level section named My First-Level Header would be identified like this:

<sect1><title>My First-Level Header</title> 

Note the following features of this markup:

Clarity

A title begins with <title> and ends with </title>. The sect1 also has an ending </sect1>, but we haven't shown the whole section so it's not visible.

Hierarchy

“My First-Level Header” is the title of a top-level section because it occurs inside a title in a sect1. A title element occurring somewhere else, say in a Chapter element, would be the title of the chapter.

Plain text

SGML documents can have varying character sets, but most are ASCII. XML documents use the Unicode character set. This makes SGML and XML documents highly portable across systems and tools.
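The point about hierarchy can be made concrete with a short sketch. It assumes the markup is well-formed XML and uses Python's standard xml.etree.ElementTree module; the chapter title text and the variable names are invented for the illustration. It shows that a program can tell what kind of title it is looking at purely from the element that contains it.

```python
import xml.etree.ElementTree as ET

# Illustrative document: the chapter title text is made up for this sketch.
SOURCE = """<chapter><title>Getting Started</title>
<sect1><title>My First-Level Header</title></sect1>
</chapter>"""

root = ET.fromstring(SOURCE)

# ElementTree elements do not record their parent, so walk every element
# and look at its direct children to pair each title with its container.
titles = {
    child.text: parent.tag
    for parent in root.iter()
    for child in parent
    if child.tag == "title"
}

for text, container in titles.items():
    print(f"{text!r} is the title of a {container}")
```

The same title element is used in both places; only its position in the hierarchy distinguishes a chapter title from a section title.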

In an SGML document, there is no obligatory difference between the size or face of the type in a first-level section header and the title of a book in a footnote or the first sentence of a body paragraph. All SGML files are simple text files without font changes or special characters.[1] Similarly, an SGML document does not specify the words in a text that are to be set in italic, bold, or roman type. Instead, SGML marks certain kinds of texts for their semantic content. For example, if a particular word is the name of a file, then the tags around it should specify that it is a filename:

Many mail programs read configuration information from the
user's <filename>.mailrc</filename> file.

If the meaning of a phrase is particularly audacious, it might get tagged for boldness of thought instead of appearance. An SGML document contains all the information that a typesetter needs to lay out and typeset a printed page in the most effective and consistent way, but it does not specify the layout or the type.[2]

Not only is the structure of an SGML/XML document explicit, but it is also carefully controlled. An SGML document makes reference to a set of declarations—a document type definition (DTD)—that contains an inventory of tag names and specifies the combination rules for the various structural and semantic features that make up a document. What the distinctive features are and how they should be combined is “arbitrary” in the sense that almost any selection of features and rules of composition is theoretically possible. The DocBook DTD chooses a particular set of features and rules for its users.

Here is a specific example of how the DocBook DTD works. DocBook specifies that a third-level section can follow a second-level section but cannot follow a first-level section without an intervening second-level section.

This is valid:

<sect1><title>...</title>
  <sect2><title>...</title>
    <sect3><title>...</title>
      ...
    </sect3>
  </sect2>
</sect1>

This is not:

<sect1><title>...</title>
  <sect3><title>...</title>
    ...
  </sect3>
</sect1>

Because an SGML/XML document has an associated DTD that describes the valid, logical structures of the document, you can test the logical structure of any particular document against the DTD. This process is performed by a parser. An SGML processor must begin by parsing the document and determining if it is valid, that is, if it conforms to the rules specified in the DTD. XML processors are not required to check for validity, but it's always a good idea to do so when authoring. Because you can test and validate the structure of an SGML/XML document with software, a DocBook document containing a first-level section followed immediately by a third-level section will be identified as invalid, meaning that it's not a valid instance or example of a document defined by the DocBook DTD. Presumably, a document with a logical structure won't normally jump from a first- to a third-level section, so the rule is a safeguard, though not a guarantee, of good writing, or at the very least, reasonable structure. A parser also verifies that the names of the tags are correct and that every tag requiring an ending tag has one. This means that a valid document is also one that should format correctly, without runs of paragraphs incorrectly appearing in bold type or similar monstrosities that everyone has seen in print at one time or another. For more information about SGML/XML parsers, see Chapter 3.
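To give a feel for what such a check involves, here is a toy sketch in Python. It is not a validating parser and does not read a DTD: the ALLOWED_CHILD_SECTIONS table and the nesting_errors function are hypothetical names invented for this illustration, and the table hand-codes only the section-nesting rule described above. It also assumes the input is well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for one small rule of the DTD: which section
# element may appear directly inside which. A real DTD covers far more.
ALLOWED_CHILD_SECTIONS = {
    "sect1": {"sect2"},
    "sect2": {"sect3"},
    "sect3": {"sect4"},
    "sect4": {"sect5"},
    "sect5": set(),
}

def nesting_errors(xml_text):
    """Return messages for section elements nested where the toy rules forbid them."""
    root = ET.fromstring(xml_text)
    errors = []
    for parent in root.iter():
        allowed = ALLOWED_CHILD_SECTIONS.get(parent.tag)
        if allowed is None:
            continue  # not a section element; this sketch has no rule for it
        for child in parent:
            if child.tag in ALLOWED_CHILD_SECTIONS and child.tag not in allowed:
                errors.append(f"{child.tag} is not allowed directly inside {parent.tag}")
    return errors

VALID = ("<sect1><title>A</title><sect2><title>B</title>"
         "<sect3><title>C</title></sect3></sect2></sect1>")
INVALID = "<sect1><title>A</title><sect3><title>C</title></sect3></sect1>"

print(nesting_errors(VALID))    # no messages for the valid example
print(nesting_errors(INVALID))  # one message about sect3 inside sect1
```

A real parser derives these rules from the DTD itself rather than from a hand-written table, and checks every element and attribute, not just sections.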

In general, adherence to the explicit rules of structure and markup in a DTD is a useful and reassuring guarantee of consistency and reliability within documents, across document sets, and over time. This makes SGML/XML markup particularly desirable to corporations or governments that have large sets of documents to manage, but it is a boon to the individual writer as well.

How can this markup help you?

Semantic markup makes your documents more amenable to interpretation by software, especially publishing software. You can publish a white paper, authored as a DocBook Article, in the following formats:

  • On the Web in HTML

  • As a standalone document on 8½×11 paper

  • As part of a quarterly journal, in a 6×9 format

  • In Braille

  • In audio

You can produce each of these publications from exactly the same source document using the presentational techniques best suited to both the content of the document and the presentation medium. This versatility also frees the author to concentrate on the document content. For example, as we write this book, we don't know exactly how O'Reilly will choose to present chapter headings, bulleted lists, SGML terms, or any of the other semantic features. And we don't care. It's irrelevant; whatever presentation is chosen, the SGML sources will be transformed automatically into that style.
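As a rough illustration of one source feeding several presentations, the following Python sketch renders the same small section two ways. The function names to_html and to_plain_text, the sample paragraph, and the formatting choices are all invented for this example; real publishing systems use stylesheets rather than hand-written functions like these.

```python
import xml.etree.ElementTree as ET
from html import escape

# One source document; the para text is made up for the illustration.
SOURCE = ("<sect1><title>My First-Level Header</title>"
          "<para>Some body text.</para></sect1>")

def to_html(sect):
    """One possible Web presentation: the title becomes an h1 heading."""
    title = sect.findtext("title", default="")
    paras = [p.text or "" for p in sect.findall("para")]
    body = "".join(f"<p>{escape(text)}</p>" for text in paras)
    return f"<h1>{escape(title)}</h1>{body}"

def to_plain_text(sect):
    """One possible plain-text presentation: an upper-cased, underlined title."""
    title = sect.findtext("title", default="")
    paras = [p.text or "" for p in sect.findall("para")]
    return "\n".join([title.upper(), "=" * len(title)] + paras)

sect = ET.fromstring(SOURCE)
print(to_html(sect))
print(to_plain_text(sect))
```

Neither function changes the source. Each one decides for itself how a title should look in its medium, which is exactly the decision the author did not have to make.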

Semantic markup can relieve the author of other, more significant burdens as well (after all, careful use of paragraph and character styles in a word processor document theoretically allows us to change the presentation independently from the document). Using semantic markup opens up your documents to a world of possibilities. Documents become, in a loose sense, databases of information. Programs can compile, retrieve, and otherwise manipulate the documents in predictable, useful ways.

Consider the online version of this book: almost every element name (Article, Book, and so on) is a hyperlink to the reference page that describes that element. Maintaining these links by hand would be tedious and might be unreliable, as well. Instead, every element name is marked as an element using SGMLTag: a Book is a <sgmltag>Book</sgmltag>.

Because each element name in this book is tagged semantically, the program that produces the online version can determine which occurrences of the word “book” in the text are actually references to the Book element. The program can then automatically generate the appropriate hyperlink when it should.
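A minimal sketch of that idea in Python follows. It is not the program actually used to produce the online version: the function name paragraph_to_html, the sample sentence, and the ref/....html URL pattern are assumptions made up for the illustration, and the input is assumed to be well-formed XML.

```python
import xml.etree.ElementTree as ET
from html import escape

def paragraph_to_html(xml_text):
    """Render a para as HTML, linking each sgmltag element to a reference page."""
    para = ET.fromstring(xml_text)
    parts = [escape(para.text or "")]
    for child in para:
        name = escape(child.text or "")
        if child.tag == "sgmltag":
            # The URL pattern is invented for this example.
            parts.append(f'<a href="ref/{name.lower()}.html"><code>{name}</code></a>')
        else:
            parts.append(name)
        # Text that follows the child element, up to the next element.
        parts.append(escape(child.tail or ""))
    return "<p>" + "".join(parts) + "</p>"

SOURCE = "<para>This book is written as a <sgmltag>Book</sgmltag>.</para>"
print(paragraph_to_html(SOURCE))
```

The untagged word “book” in the sample sentence is left alone; only the tagged element name becomes a link.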

There's one last point to make about the versatility of SGML documents: how much you have depends on the DTD. If you take a good photo with a high-resolution lens, you can print it, copy it, scan it, and put it on the Web, and it will still look good. If you start with a low-resolution picture, it will not survive those transformations so well. DocBook SGML/XML has this advantage over, say, HTML: because DocBook has specific and unambiguous semantic and structural markup, you can convert its documents with ease into other presentational forms and search them more precisely. If you start with HTML, whose markup is at a lower resolution than DocBook's, your versatility and searchability are substantially restricted and cannot be improved.

What are the shortcomings of structured authoring?

There are a few significant shortcomings to structured authoring:

  • It requires a significant change in the authoring process. Writing structured documents is very different from writing with a typical word processor, and change is difficult. In particular, authors don't like giving up control over the appearance of their words, especially now that they have acquired it with the advent of word processors. But many publishing companies need authors to relinquish that control, because book design and production remain their job, not their authors'.

  • Because semantics are separate from appearance, in order to publish an SGML/XML document, a stylesheet or other tool must create the presentational form from the structural form. Writing stylesheets is a skill in its own right, and though not every author among a group of authors has to learn how to write them, someone has to.

  • Authoring tools for SGML documents are generally expensive. While it's not entirely unreasonable to edit SGML/XML documents with a simple text editor, it's a bit tedious to do so. However, there are a few free tools that are SGML-aware. The widespread interest in XML may well produce new, clever, and less expensive XML editing tools.

Notes

[1]

Some structured editors apply style to the document while it's being edited, using fonts and color to make the editing task easier, but this stylistic information is not stored in the actual SGML/XML document. Instead, it is provided by the editing application.

[2]

The distinction between appearance or presentation and structure or content is essential to SGML, but there is a way to specify the appearance of an SGML document: attach a stylesheet to it. There are several standards for such stylesheets: CSS, XSL, FOSIs, and DSSSL. See Chapter 4.