(Originally published in TechComm, Vol. 27, No. 4, April 1999)

Get on Board the XML Train

by William H. DuBay

Introduction

Extensible Markup Language (XML), a simplified version of Standard Generalized Markup Language (SGML), is now an industry standard for delivering content on the Internet. It has implications that go far beyond the Web, however, and will rapidly influence the way we create and publish information.

The next century will be an XML century, make no mistake about it. All our documents, even checks, credit card slips, personal letters, recipes, technical documents, everything, will benefit from XML technologies. Students are already learning XML in schools, and big businesses are using it to publish their databases on the web. The appearance of the electronic spreadsheet ten years ago changed the way we do business. XML will change the way we write documents.

An engineer asked me recently which tools to use for a new documentation project he is starting. I said, "Whatever tools you choose, get ready for XML. Within a few months, all the tools will support XML. Start defining the structure of information you need. The XML train is just pulling out of the station. You want to be on board."

What is Generalized Markup?

Like its older sibling SGML, XML uses tags to specify the internal structure of a document. At first, the tags look a lot like those in Hypertext Markup Language (HTML), but they are different in many ways. HTML uses tags for formatting pages on the Web. Here is an example of how part of this article might look in HTML:

<BODY>
<H1>Get on Board the XML Train</H1>
<H4>by Bill DuBay</H4>
<H3>Introduction</H3>
<P>Extensible Markup Language (XML)...
</BODY>

Since we started reading, we have been using things like formatting, indexes, tables of contents, chapters, section headers, scenes, paragraphs, lists, tables, etc. as clues to figure out the underlying structure of information. While the formatting commands of HTML, like the header <H1> and paragraph <P> tags, imply the underlying structure, XML makes it much more explicit. It identifies "what it is" rather than what it should look like. Here is how you could mark up the same article in XML:

<ARTICLE>
  <FRONT>
    <TITLE>Get on Board the XML Train</TITLE>
    <AUTHOR>by Bill DuBay</AUTHOR>
  </FRONT>
  <MAIN>
    <HEADING1>Introduction</HEADING1>
    <PARA>Extensible Markup Language (XML)... </PARA>
    ...
  </MAIN>
  ...
</ARTICLE>

One of the first things we notice is that XML tagging (called generalized markup) identifies the hierarchical structure of the information by showing the relationships between different items (called elements in XML). This structure expresses the dependencies between different items of information. As in real life, every element of information is part of some other element. Learning is basically the process of finding where something new fits into what we already know.
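The hierarchy described here can be made visible with a few lines of code. The following is a minimal sketch, assuming Python and its standard-library xml.etree.ElementTree parser, working on a shortened copy of the article markup. The helper name outline is invented for this illustration.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the ARTICLE markup shown above.
ARTICLE_XML = """
<ARTICLE>
  <FRONT>
    <TITLE>Get on Board the XML Train</TITLE>
    <AUTHOR>by Bill DuBay</AUTHOR>
  </FRONT>
  <MAIN>
    <HEADING1>Introduction</HEADING1>
    <PARA>Extensible Markup Language (XML)...</PARA>
  </MAIN>
</ARTICLE>
"""


def outline(element, depth=0):
    """Return (depth, tag) pairs for an element and all of its descendants."""
    rows = [(depth, element.tag)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows


root = ET.fromstring(ARTICLE_XML.strip())

# Print each element name, indented by how deeply it is nested.
for depth, tag in outline(root):
    print("  " * depth + tag)
```

Every element except ARTICLE appears under exactly one parent, which is the "part of some other element" relationship the markup records.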

The next thing we notice is that, unlike HTML, XML lets us define our own tags for our own documents. This is what makes XML "eXtensible."

A set of tags consistently used throughout a set of similar documents constitutes an XML language or schema. For a look at some XML languages already in use, go to the schema Web site at http://www.schema.net. There are XML languages for different branches of science and for mathematics, education, software, commerce, and multimedia.

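The rest of this article covers data interchange, reusability, classification, rules-based documents, and XML style sheets.

There is also a hands-on section describing the basic parts of an XML document and how to create an XML file, a Document Type Definition (DTD), and an XSL stylesheet.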

Data Interchange

The most important advantage that drives business use of generalized markup is that it provides an open, non-binary, platform-independent language for the exchange of information between different programs and systems. Previously, large companies would spend months and millions of dollars trying to get their systems to communicate with those of their suppliers and distributors. They leased expensive, private lines. Hospitals typically had enormous problems. They would access the patient records from large databases, print them out, and then re-key them into their own accounting systems. XML now makes this task as simple as using the mouse to drag-and-drop information from one window into another.

Large industries such as the aerospace, automotive, telecommunications, and computer and software industries are moving from private lines to the Internet to conduct business. They use hub or middle-tier systems to exchange data with the generalized markup of SGML or XML as the single input-and-output format. This was, in fact, the main purpose for which SGML was invented.

The simpler-to-use XML merely extends this proven technology to individuals and small businesses. Within a few months, we can expect that most applications will support XML markup, providing a universal language for exchanging data between all computers, operating systems, and programs.

Reusability

The second important advantage of generalized markup is that it makes the internal structure of documents evident to machines, making them much more reusable.

If easy data interchange is saving big companies big bucks, it is the reusability features of XML that will save small companies the most money in the long run. Companies waste an enormous amount of money re-writing the same things over and over again because they have no way of identifying and storing individual items of information. Generalized markup addresses that problem.

Once you give a name to each element in a document, you can then address that element as a separate document and re-use it over and over again. Experience has shown that the more completely the structure of a document is tagged, the more reusable it becomes.
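As a rough illustration of addressing a named element as a separate document, here is a minimal sketch assuming Python's standard xml.etree.ElementTree module. The extract helper and the CATALOG element are hypothetical names made up for the example.

```python
import xml.etree.ElementTree as ET

# A compact copy of the ARTICLE markup used earlier in this article.
ARTICLE_XML = (
    "<ARTICLE><FRONT><TITLE>Get on Board the XML Train</TITLE>"
    "<AUTHOR>by Bill DuBay</AUTHOR></FRONT>"
    "<MAIN><PARA>Extensible Markup Language (XML)...</PARA></MAIN></ARTICLE>"
)


def extract(document_text, tag):
    """Return the first descendant element named `tag` as standalone XML text."""
    root = ET.fromstring(document_text)
    element = root.find(f".//{tag}")
    if element is None:
        return None
    return ET.tostring(element, encoding="unicode")


# Pull the TITLE out of the article ...
title_fragment = extract(ARTICLE_XML, "TITLE")

# ... and reuse it inside a new, hypothetical CATALOG document.
catalog = ET.Element("CATALOG")
catalog.append(ET.fromstring(title_fragment))
print(ET.tostring(catalog, encoding="unicode"))
```

The extracted fragment is itself a well-formed XML document, which is what makes it easy to store and reuse on its own.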

The editors of the Oxford English Dictionary, for example, put an enormous, up-front investment into SGML tagging of all the elements in their dictionary. Now, they can create new books and articles on the fly by using the tags to retrieve specified information, like all the new English words invented in the American colonies between 1600 and 1800.

Classification

Another important advantage of generalized markup is its ability to classify individual elements of a document for easy search and retrieval. It makes them much more reusable by associating them with similar items in other documents. In so doing, it identifies the class to which the elements belong. For example, the <AUTHOR> elements in different articles all belong to the same class and enjoy a certain proximity to one another.

If you are familiar with databases, you can recognize the usefulness of this type of classification. If you create a number of XML documents using markup, you can search the class for specific data. For example, you could search the <AUTHOR> elements to retrieve all the articles by authors who live in Orange County or New York City. You could then select various elements from the articles retrieved by your search, and re-combine them to create new articles.
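A search of this kind might be sketched as follows, assuming Python's standard xml.etree.ElementTree module. The three stand-in articles, the author names, and the city attribute on <AUTHOR> are all invented for the illustration; the article markup shown earlier does not record where an author lives.

```python
import xml.etree.ElementTree as ET

# Three tiny stand-in articles. Titles, names, and the `city` attribute are invented.
ARTICLES = [
    '<ARTICLE><TITLE>Tagging Basics</TITLE>'
    '<AUTHOR city="Orange County">A. Writer</AUTHOR></ARTICLE>',
    '<ARTICLE><TITLE>Stylesheet Notes</TITLE>'
    '<AUTHOR city="Seattle">B. Editor</AUTHOR></ARTICLE>',
    '<ARTICLE><TITLE>Reusing Content</TITLE>'
    '<AUTHOR city="New York City">C. Author</AUTHOR></ARTICLE>',
]


def titles_by_city(documents, cities):
    """Return the titles of articles with at least one AUTHOR in `cities`."""
    titles = []
    for text in documents:
        root = ET.fromstring(text)
        for author in root.iter("AUTHOR"):
            if author.get("city") in cities:
                titles.append(root.findtext("TITLE"))
                break  # one matching author is enough for this article
    return titles


matches = titles_by_city(ARTICLES, {"Orange County", "New York City"})
print(matches)
```

Because every document uses the same <AUTHOR> element, one query works across the whole collection without knowing anything else about each article.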

Many industries, such as the health industry, already reap enormous benefits from markup classification. For example, hospital personnel can quickly search a patient’s records for elements such as <allergies> or <drug_reactions> to find critical data.

Rules-Based Documents

To standardize markup across a large body of information, you need a way of enforcing the rules for tagging. With XML, you can embody the rules for marking up each class of documents in a Document Type Definition (DTD). You could have, for example, a DTD that describes the rules for all <ARTICLE> documents like the one above. You could specify that each <AUTHOR> element include at least one <NAME> and, optionally, one <STC_STATUS> and one <STC_CHAP>. In XML, the document type named in the document type declaration must be identical to the name of the root element in the XML document. In this case, it would be "ARTICLE".
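The <AUTHOR> rule would normally be written in a DTD and enforced by a validating parser. Purely to show what such a rule checks, here is a hand-rolled sketch in Python. It is not a DTD validator; the function name and the sample element values are invented, and unlike a real content model it ignores the order of the child elements.

```python
import xml.etree.ElementTree as ET


def author_problems(author):
    """List the ways an AUTHOR element breaks the rule described above."""
    counts = {}
    for child in author:
        counts[child.tag] = counts.get(child.tag, 0) + 1

    problems = []
    # At least one NAME is required.
    if counts.get("NAME", 0) < 1:
        problems.append("AUTHOR needs at least one NAME")
    # STC_STATUS and STC_CHAP are optional, but may appear only once each.
    for optional in ("STC_STATUS", "STC_CHAP"):
        if counts.get(optional, 0) > 1:
            problems.append(f"AUTHOR allows at most one {optional}")
    return problems


# Sample elements with invented values.
good = ET.fromstring(
    "<AUTHOR><NAME>William H. DuBay</NAME><STC_CHAP>Orange County</STC_CHAP></AUTHOR>"
)
bad = ET.fromstring(
    "<AUTHOR><STC_STATUS>Member</STC_STATUS><STC_STATUS>Fellow</STC_STATUS></AUTHOR>"
)

print(author_problems(good))
print(author_problems(bad))
```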

The DTD defines the names of each element, which elements are required and which ones are optional, and how they are laid out and organized. One of the big differences between SGML and XML is that in SGML, the DTD is required, while in XML it is optional. In XML, the DTD can be implicit (that is, only in your head), spelled out formally in a separate document, embedded in the current XML document itself, or declared partially in a separate document and partially in the current one. You can use the internal DTD for declaring elements that are specific to a document or for overriding entity and attribute-list declarations made in the external DTD.

An XML document is said to be well formed if it conforms to the basic requirements of XML (such as case sensitivity, one unique root element, nested elements that do not overlap, and a closing tag for every start tag). A document is said to be valid when it also conforms to its own DTD. If you have a program with a validating parser, like the one in the beta version of Microsoft’s Internet Explorer 5, it checks the XML document to make sure it conforms to both the rules of XML and of a DTD if one is specified. If the document is not well formed, or if a DTD is specified and the document does not conform to it, the browser displays the error and does not display the document.
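The well-formedness rules listed above are easy to try out. The following is a small sketch assuming Python's standard xml.etree.ElementTree parser, which checks well-formedness only and does not validate against a DTD. The sample fragments and the helper name are made up for the illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# One good fragment, then one violation of each rule mentioned in the text.
samples = {
    "properly nested": "<STEP><B>Close</B> the door</STEP>",
    "overlapping elements": "<STEP><B>Close</STEP></B>",
    "missing end tag": "<STEP>Close the door",
    "case mismatch": "<STEP>Close the door</step>",
    "two root elements": "<STEP>Open</STEP><STEP>Close</STEP>",
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

Only the first sample passes. Checking validity would additionally require a validating parser and a DTD.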

You should be aware that there are several other proposed technologies for validating XML documents besides the DTD carried over from SGML. One is proposed by Microsoft, another by Netscape. The World Wide Web Consortium seems to be settling on a compromise called Resource Description Framework (RDF), which is somewhat simpler than a DTD.

XML Style Sheets

Because XML removes the formatting information from the structure, it opens enormous possibilities for formatting and outputting documents, a process we now call presentation or rendition. What SGML and XML make possible is the separation of the content of a document from its rendition.

One of the features driving the use of XML on the Web is the ability to customize XML documents for the preferences of individual users. This requirement was summed up by Matthew Fuchs of Disney Imagineering when he said, "Information needs to know about itself, and information needs to know about me."

I run into information that knows about me each time I log on to amazon.com. "Welcome back, William H. DuBay," it says. "Check out these book recommendations." The recommendations are so well targeted that they frequently inspire me to make a purchase. The system uses the marked-up data of my past purchases to search marked-up data to find books it knows I will like.

Amazon.com has invested deeply in XML technology and has a subsidiary, Junglee.com, which sells that technology to other companies. Junglee treats the whole Web like a virtual database. Examples of Junglee’s work are on the job-search Web sites of companies like the Boston Globe. These sites gather, parse, and mark up information from other web sites to create XML databases. When you click on "Technical Writer," the system searches all the XML databases to find the job you need. In another application, available on Junglee’s Web site, a search for a book will query several different online bookstores at once.

Big companies such as amazon.com use a combination of XML, SGML, and complex programs to output highly customized content on the web. Smaller companies and individuals also have a lot of options. They can put HTML tags right in XML documents, or they can put XML tags in HTML documents. The most powerful and flexible option, however, is using a separate stylesheet.

With XML stylesheets, we can not only describe the format of the elements in a document, we can also re-arrange them, suppress them, and mix them with elements from other documents. Stylesheets can also provide different views of the same XML document. For example, we can use a stylesheet to deliver an installation guide in multiple languages, letting the user switch from one language to another. Or we can deliver a list of products that the user can quickly sort by price or availability. The new-found ability of documents to quickly re-arrange, re-format, and re-publish themselves in response to a user’s actions will change the way we think about usability and audience.
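The product-list example depends only on the data being tagged, not on any particular stylesheet language. As a rough sketch of the same idea in Python (standard xml.etree.ElementTree module), the snippet below produces two views of one document. The PRODUCTS markup, the product names, and the prices are all invented.

```python
import xml.etree.ElementTree as ET

# An invented product list; element names and values are for illustration only.
PRODUCTS_XML = """
<PRODUCTS>
  <PRODUCT><NAME>Toner cartridge</NAME><PRICE>89.00</PRICE></PRODUCT>
  <PRODUCT><NAME>Paper tray</NAME><PRICE>34.50</PRICE></PRODUCT>
  <PRODUCT><NAME>Copier drum</NAME><PRICE>149.95</PRICE></PRODUCT>
</PRODUCTS>
"""


def names_sorted_by(root, field, numeric=False):
    """Return product names ordered by the text of the child element `field`."""
    def key(product):
        value = product.findtext(field)
        return float(value) if numeric else value

    ordered = sorted(root.findall("PRODUCT"), key=key)
    return [product.findtext("NAME") for product in ordered]


root = ET.fromstring(PRODUCTS_XML.strip())

# Two different views of the same source document.
by_price = names_sorted_by(root, "PRICE", numeric=True)
by_name = names_sorted_by(root, "NAME")
print(by_price)
print(by_name)
```

The source document is never changed; each view is computed from it on request, which is the separation of content from rendition that stylesheets provide.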

The types of stylesheets that can be used with XML documents include Cascading Style Sheets (CSS) developed for HTML, Document Style Semantics and Specification Language stylesheets (DSSSL) developed for SGML, and the Extensible Style Language (XSL), made for XML. Which stylesheet you use depends, of course, on the type supported by the tools you use. Adobe’s FrameMaker+SGML, for example, uses an Element Definition Document (EDD). Microsoft’s Internet Explorer 5 supports both XSL and CSS stylesheets.

While the standards for XML itself are relatively simple and stable, the standards for the stylesheets are undergoing considerable development even as I write. Considering the ability of the stylesheets to make self-customizing documents, experts are preparing a flood of books on the subject.

Jump on Board

Although the specification for XML itself is established and stable, the specifications for the supporting technologies are still being developed by the World Wide Web Consortium. There is, however, an abundance of information about XML on the Web. Do a web search on "XML" and you will get lots of XML sites, tools, specifications, and software, some of them very primitive and free, some of them very sophisticated and expensive.

I recommend The XML Handbook (Prentice-Hall) by Charles Goldfarb and Paul Prescod. Besides featuring a number of XML tutorials, applications and demos, this book and the CD ROM that comes with it feature a number of case studies showing how organizations are already using SGML and XML.

Many people feel that XML will make us better writers because of its focus on the organization and mapping of information. Determining what elements of a document are responsible for what information is something only a human brain can do, and something writers do exceptionally well. As corporations discover the myriad utility of XML, they will increasingly call upon writers to analyze and organize information.

When I asked Charles Goldfarb, the international standards editor who invented SGML, how XML will affect the field of technical communication, he responded:

"Although XML is really designed for machine-to-machine data interchange, it is safe to say that many people will use it for human-written documents as well. Marshall McLuhan said, ‘The medium is the message.’ Professional technical communicators know the truth: the abstract idea is the message. The medium, like other aspects of the physical rendition, influence the reader’s perception of the message, but could distort that message just as easily as it could enhance it. XML will make more people (and more tools) aware of the critical distinction between the abstraction and the rendition. That in turn should enable the technical communicator to work more effectively."

XML Hands On

Basic Parts of an XML Document

An XML document is made up of markup and data. Markup consists of tags, each made up of a pair of angle brackets (< and >) surrounding some XML statement. Data is the information we want to publish.

You can start an XML document with a prolog, which can contain an XML declaration like this:

<?xml version="1.0"?>

A prolog can also contain an optional document type declaration that references the document’s DTD and an optional stylesheet processing instruction that pulls in the stylesheet. The following example declares that the current XML document conforms to the COPIER Document Type Definition, which is located in the current directory in a file called "copier.dtd." It references an XSL stylesheet, "copier.xsl," that has the rules for formatting and outputting the document.

<!DOCTYPE COPIER SYSTEM "copier.dtd">

<?xml:stylesheet href="copier.xsl" type="text/xsl"?>

The body of the document usually consists of a number of structured XML elements. An element is made up of some data (usually text) surrounded by a start tag and an end tag. Both tags must contain a name, like the word "author" in this element:

<author>William Rankin</author>

An element’s start tag can also contain an attribute, which generally specifies any kind of information you want to associate with the element but usually don’t want to publish as part of the data. An attribute contains a name, an equals sign, and a value in quotes. This <author> element has an attribute that specifies William Rankin’s email address:

<author email="wrankin@home.com">William Rankin</author>
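A program reading this element sees the data and the attribute as separate things. Here is a minimal sketch of that distinction, assuming Python's standard xml.etree.ElementTree module:

```python
import xml.etree.ElementTree as ET

# The <author> element from the text, with its end tag written as </author>.
author = ET.fromstring('<author email="wrankin@home.com">William Rankin</author>')

name = author.text          # the data between the start tag and the end tag
email = author.get("email")  # the attribute value carried in the start tag

print(author.tag, name, email)
```

The name is what would be published; the email address travels with the element but stays out of the displayed text unless a stylesheet or program chooses to use it.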

Besides elements, XML documents also make use of entities, which perform much like fields in Microsoft Word. An entity is a symbol pointing to an object that an XML processor expands, displays, or executes. An entity can point to objects like special characters, a filename, a string of characters, a Java routine, a graphic, or a web site. Entities are often used to break up large documents into smaller ones. You use entities in master documents to pull in the subdocuments. The following example shows an entity declaration and a subsequent entity reference. Whenever the processor encounters the reference &title; it replaces it with the string defined in the declaration.

<!ENTITY title "Replacing the Toner" >
...
<TITLE>&title;</TITLE>
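To watch a processor expand the reference, the declaration can be placed in an internal DTD subset so that the whole example fits in one document. This is a small sketch assuming Python's standard xml.etree.ElementTree parser; wrapping the declaration in a DOCTYPE with TITLE as the root element is a choice made for the illustration.

```python
import xml.etree.ElementTree as ET

# The entity declaration sits in the internal DTD subset, inside the brackets.
DOCUMENT = """<!DOCTYPE TITLE [
  <!ENTITY title "Replacing the Toner">
]>
<TITLE>&title;</TITLE>"""

title = ET.fromstring(DOCUMENT)

# By the time the program sees the element, the reference has been replaced.
expanded = title.text
print(expanded)
```

Changing the string in the declaration changes every place in the document where &title; is referenced.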

Finally, you can also pepper XML docs with comments, which look like this:

<!--This procedure is really good! Let’s not change it. -->

Create an XML File

The following XML file, which you can name "toner.xml," covers the procedures for changing the toner. The !DOCTYPE statement declares its conformance to the COPIER DTD. The SYSTEM directive shows the parser where to find that DTD file on the local disk. The third line pulls in the XSL stylesheet, "copier.xsl." Note that the name of the root node <COPIER> matches the document type.

<?xml version="1.0"?>
<!DOCTYPE COPIER SYSTEM "copier.dtd"> <!--Specify the document type-->
<?xml:stylesheet href="copier.xsl" type="text/xsl"?> <!--Specify the stylesheet-->
<COPIER> <!--Specify the root node-->
<TITLE>How to Change the Copier Toner</TITLE>
<AUTHOR>William Rankin</AUTHOR>
<SECTION>
<SUBTITLE>Opening the Package</SUBTITLE>
<STEP>Carefully unwrap the cartridge</STEP>
<STEP>Rock it once or twice</STEP>
</SECTION>
<SECTION>
<SUBTITLE>Inserting the Toner</SUBTITLE>
<STEP>Open the door and insert the cartridge</STEP>
<STEP>Close the door</STEP>
</SECTION>
</COPIER>
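Once the procedure is marked up this way, a program can pick it apart by element name. The following is a minimal sketch assuming Python's standard xml.etree.ElementTree module. To keep it short, it embeds only the <COPIER> element of "toner.xml" as a string and leaves out the prolog; the helper name procedure_outline is invented.

```python
import xml.etree.ElementTree as ET

# The <COPIER> element from toner.xml, without the prolog lines.
TONER_BODY = """<COPIER>
<TITLE>How to Change the Copier Toner</TITLE>
<AUTHOR>William Rankin</AUTHOR>
<SECTION>
<SUBTITLE>Opening the Package</SUBTITLE>
<STEP>Carefully unwrap the cartridge</STEP>
<STEP>Rock it once or twice</STEP>
</SECTION>
<SECTION>
<SUBTITLE>Inserting the Toner</SUBTITLE>
<STEP>Open the door and insert the cartridge</STEP>
<STEP>Close the door</STEP>
</SECTION>
</COPIER>"""


def procedure_outline(copier):
    """Map each SECTION's SUBTITLE to the list of its STEP texts."""
    return {
        section.findtext("SUBTITLE", default="").strip(): [
            (step.text or "").strip() for step in section.findall("STEP")
        ]
        for section in copier.findall("SECTION")
    }


copier = ET.fromstring(TONER_BODY)
steps = procedure_outline(copier)

for subtitle, items in steps.items():
    print(subtitle)
    for number, item in enumerate(items, start=1):
        print(f"  {number}. {item}")
```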

Create a Document Type Definition (DTD)

A written DTD declares the name of a document type, which elements are required, how they are arranged, and what other elements they can contain. A DTD can also specify the attributes and entities that can be used.

The following is a DTD, which you can save in a file called "copier.dtd." It defines the rules for XML files, like "toner.xml", which describe different procedures for using the copier. It specifies that the root element <COPIER>, which is also the name of the document class, requires at least one <TITLE>, one <AUTHOR>, and one <SECTION>. It also requires that each <SECTION> contain an optional <SUBTITLE> and at least one <STEP>. Finally, the <TITLE>, <AUTHOR>, <SUBTITLE>, and <STEP> elements consist of ordinary text characters (#PCDATA).

<!ELEMENT COPIER (TITLE+, AUTHOR+, SECTION+)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT AUTHOR (#PCDATA)>
<!ELEMENT SECTION (SUBTITLE?, STEP+)>
<!ELEMENT SUBTITLE (#PCDATA)>
<!ELEMENT STEP (#PCDATA)>

The question mark, the asterisk, and the plus sign are occurrence indicators, which regulate how many times an element can occur within another element. Here are the element occurrence indicators:

? = Optional (0 or 1 time)
* = Optional and repeatable (0 or more times)
+ = Required and repeatable (1 or more times)
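The occurrence indicators behave much like the quantifiers of the same name in regular expressions. To make that concrete, here is a rough sketch in Python that turns the two sequence content models from "copier.dtd" into regular expressions and checks an element's children against them. It is an illustration only, not a DTD validator: it handles nothing beyond flat, comma-separated lists of names, and the function names are invented.

```python
import re
import xml.etree.ElementTree as ET

# Content models copied from copier.dtd; the #PCDATA elements are left out.
CONTENT_MODELS = {
    "COPIER": "(TITLE+, AUTHOR+, SECTION+)",
    "SECTION": "(SUBTITLE?, STEP+)",
}


def model_to_regex(model):
    """Turn a flat sequence content model into a regex over child names."""
    parts = []
    for item in model.strip("()").split(","):
        item = item.strip()
        indicator = item[-1] if item[-1] in "?*+" else ""
        name = item.rstrip("?*+")
        # Each name is followed by a space so that names cannot run together.
        parts.append(f"(?:{re.escape(name)} ){indicator}")
    return re.compile("".join(parts))


def children_allowed(element):
    """Check an element's child sequence against its content model, if any."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return True
    sequence = "".join(child.tag + " " for child in element)
    return model_to_regex(model).fullmatch(sequence) is not None


ok = ET.fromstring("<SECTION><STEP>Close the door</STEP></SECTION>")
no_step = ET.fromstring("<SECTION><SUBTITLE>Inserting the Toner</SUBTITLE></SECTION>")
two_subtitles = ET.fromstring(
    "<SECTION><SUBTITLE>A</SUBTITLE><SUBTITLE>B</SUBTITLE><STEP>C</STEP></SECTION>"
)

for section in (ok, no_step, two_subtitles):
    print(children_allowed(section))
```

The first section passes because <SUBTITLE> is optional; the second fails because at least one <STEP> is required; the third fails because "?" allows <SUBTITLE> at most once.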

Create an XSL Stylesheet

You can use the same XSL stylesheets for all the XML documents that use the same elements. It is up to the XML document to reference and pull in the stylesheet, not the other way around. Once the stylesheet is engaged, however, it takes control of the output process.

An XSL stylesheet contains a set of template rules. A template rule has two parts: a pattern, which is matched against nodes (elements) in the source tree, and a template, which forms part of the result tree.

When loaded by the XML document into the browser, the stylesheet transforms the XML document into another kind of document for a specific medium (here, HTML for IE 5). It first makes a template for the root node, which includes the whole document. Then, one by one, it selects the child nodes within the root template and makes a template for each of them, adding formatting commands in each template as it goes along. It repeats the process for each new template until a template has been created for every child node (element) of the XML document. When finished constructing the template tree, it selects the root node (whole tree) and outputs the structured templates to the specified medium. For our purposes, the HTML document is output directly to IE 5 for display, but it can also output to a disk file, a network card, etc.

The following is an XSL stylesheet you can name "copier.xsl" for displaying the file "toner.xml" in Internet Explorer 5. This stylesheet could be used with any XML doc based on the COPIER DTD. Note that IE 5 requires the specification of the XSL namespace in the second line exactly as shown.

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/TR/WD-xsl"> <!--Open stylesheet-->

<xsl:template>

<xsl:value-of /> <!--Create a template for the whole document-->
</xsl:template>

<xsl:template match="TITLE"> <!--Load matching TITLE nodes-->

<H3 STYLE="color:blue"> <!--Apply <H3></H3> style in blue-->
<xsl:apply-templates/> <!--Create TITLE template-->
</H3>
</xsl:template>

<xsl:template match="AUTHOR"> <!--Load matching AUTHOR nodes-->

<P><I>by <!--Add "by" to author name-->
<xsl:apply-templates/> <!--Create AUTHOR template in italics-->
</I></P>
</xsl:template>

<xsl:template match="SECTION"> <!--Load matching SECTION parent node-->

<xsl:apply-templates/> <!--Create empty template for children-->
</xsl:template>

<xsl:template match="SUBTITLE"> <!--Load matching SUBTITLE nodes-->

<B STYLE="color:blue">
<xsl:apply-templates/></B> <!--Create SUBTITLE template in blue bold-->
</xsl:template>

<xsl:template match="STEP"> <!--Load matching STEP child nodes-->

<UL><LI>
<xsl:apply-templates/></LI></UL> <!--Create bulleted STEP template-->
</xsl:template>

<xsl:template match="/"> <!--Load root template-->

<HTML> <!--Create HTML document-->
<HEAD>
<TITLE>Copier Procedures</TITLE> <!--HTML title-->
</HEAD>
<BODY bgcolor="#E9C2A6" text="#5C3317"> <!--Text and background colors-->
<xsl:apply-templates select="COPIER/*"/> <!--Create result tree-->
<I>Copyright 1999 Luddite Corp.</I> <!--Add copyright-->
</BODY>
</HTML>
</xsl:template>

</xsl:stylesheet> <!--Close stylesheet-->
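The pattern-and-template idea behind this stylesheet can be mimicked in a few lines of ordinary code. The following Python sketch is not an XSL processor and does not reproduce IE 5's behavior; it only imitates the element rules above with a lookup table and a recursive function, and leaves out the surrounding HTML page and copyright line. The names RULES and apply_templates are invented, and the source is a shortened copy of "toner.xml".

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Each rule pairs a pattern (an element name) with a template (an HTML wrapper).
RULES = {
    "TITLE": ('<H3 STYLE="color:blue">', "</H3>"),
    "AUTHOR": ("<P><I>by ", "</I></P>"),
    "SECTION": ("", ""),
    "SUBTITLE": ('<B STYLE="color:blue">', "</B>"),
    "STEP": ("<UL><LI>", "</LI></UL>"),
}


def apply_templates(element):
    """Render an element by wrapping its text and children in its template."""
    before, after = RULES.get(element.tag, ("", ""))
    pieces = [before, escape((element.text or "").strip())]
    for child in element:
        pieces.append(apply_templates(child))
        pieces.append(escape((child.tail or "").strip()))
    pieces.append(after)
    return "".join(pieces)


# A shortened copy of toner.xml, without the prolog.
source = ET.fromstring(
    "<COPIER><TITLE>How to Change the Copier Toner</TITLE>"
    "<AUTHOR>William Rankin</AUTHOR>"
    "<SECTION><SUBTITLE>Opening the Package</SUBTITLE>"
    "<STEP>Carefully unwrap the cartridge</STEP></SECTION></COPIER>"
)

html = apply_templates(source)
print(html)
```

As in the stylesheet, the SECTION rule adds no formatting of its own and simply hands control to the rules for its children.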

If you feel so inclined, you can create the three files described above, "copier.dtd," "copier.xsl," and "toner.xml," and place them in the same directory. Using Internet Explorer 5 (available from the Microsoft Web site), open the XML file, "toner.xml." If you have made a mistake, the browser displays the error. If you have everything right, it displays the title as a blue heading, the author line in italics, each subtitle in blue bold type, each step as a bulleted item, and the copyright notice at the end.

© 1999 William H. DuBay