Methodologies and Technologies for Rapid Enterprise Architecture Delivery






OK, So What is This XML Thing?


David Hay, Essential Strategies

You've heard about it. It doesn’t mean "extra medium large" on a shirt. It has something to do with the web. It has something to do with metadata. But what is it?

The "Extensible Markup Language" (XML) is a document description language, much like the "Hypertext Markup Language" (HTML) used to construct web pages. It is much more versatile than HTML, however, and as such it has profound implications for how we view what the web is and what it can do.

Books describing XML tend to weigh in at five pounds or more, which is a shame, since the basic structure and purpose of the language isn’t nearly that big. This article is a much briefer presentation (in English) of the essential concepts involved.

It's More Than HTML

HTML is a language used to create web pages, using a series of "tags" which instruct the software reading it how to present the material. See sidebar for a further description of HTML.

Like HTML, XML is a system of tags that describe components of a document. In its simplest incarnation, it could be viewed simply as an advanced version of HTML. In fact it is not: both descend from something called "Standard Generalized Markup Language", or SGML. XML is a simplified subset of SGML, while HTML is one particular set of tags defined using SGML. SGML itself is a sophisticated tag language, which, "due to [its] complexity, and the complexity of the tools required," as the Object Management Group has so delicately put it, "has not achieved widespread uptake."

HTML consists of a set of predefined "tags" that instruct a piece of software called a "browser" to do certain things with the document. Typically these tags describe aspects of presentation, such as font style and size, line spacing, and so forth. Some tags, however, also identify links to other pages, drawings, artwork and so forth. The point is that every browser used by everyone on the internet knows how to interpret these tags and what to do with them. Since these tags are primarily concerned with presentation of the data, however, it is not possible to use tags to describe data structure or in any other way to describe the contents of a document.

XML allows tags to be defined by users. This gives users tremendous power to describe the structure and nature of the information presented in a document. This means, however, that standard browsers will not be able to do anything with these extensions. This makes the software environment for XML more complex, as described below.

Unlike HTML, XML does not allow description of the presentation of data. An associated language, the "Extensible Stylesheet Language" (XSL), must be used to address this.

What is it?


Here is an example of XML used to describe a data record that might be presented in a document:

<?xml version="1.0"?>
<!-- **** Basket **** -->

Note a few interesting things about this example.

First of all, as with HTML, each tag is surrounded by less than and greater than brackets (<>), and is usually followed by text. The text is in turn followed by an end tag, in the form </...>. A tag may have no content, in which case either the end tag follows immediately upon the tag (as in <specification></specification>), or the tag itself ends with a forward slash (as in <specification/>). Unlike with HTML, however, the end tag is always required.

A second thing to note is that, in this case, following the tag for product, a set of related tags follow, describing characteristics (columns, in this case) of product. In this particular case, the tag <PRODUCT> has been defined such that it must be followed by exactly one tag for <product_id> and one for <product_name>. You can’t see this from the example, but <unit_of_measure> is optional. The tag <specification> is also optional, and it may occur any number of times.
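As a rough illustration of the structure just described, the sketch below parses a small product record with Python's standard xml.etree.ElementTree module. The record is hypothetical: the product_id of 1 and the unit of measure "each" are invented for the sketch and are not taken from the article's data.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the id and unit_of_measure values are invented here.
document = """<?xml version="1.0"?>
<PRODUCT>
  <product_id>1</product_id>
  <product_name>Basket</product_name>
  <unit_of_measure>each</unit_of_measure>
  <specification>
    <variable>color</variable>
    <value>blue</value>
  </specification>
  <specification>
    <variable>size</variable>
    <value>large</value>
  </specification>
</PRODUCT>"""

root = ET.fromstring(document)

# The related tags that follow <PRODUCT>, in document order.
child_names = [child.tag for child in root]

# Each <specification> carries one variable/value pair.
specifications = {
    spec.findtext("variable"): spec.findtext("value")
    for spec in root.findall("specification")
}

# The two spellings of an empty element parse to the same thing.
long_form = ET.fromstring("<specification></specification>")
short_form = ET.fromstring("<specification/>")
empty_forms_agree = (long_form.tag, long_form.text) == (short_form.tag, short_form.text)

# Leaving out an end tag is an error, not something the parser repairs.
try:
    ET.fromstring("<product_name>Basket")
    missing_end_tag_accepted = True
except ET.ParseError:
    missing_end_tag_accepted = False
```

Here child_names lists the five children of PRODUCT and specifications maps "color" to "blue" and "size" to "large".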

An XML document should begin with the declaration <?xml version="1.0"?> (or whatever version number is appropriate). Note that "xml" in this declaration is written in lowercase.

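Comments are in the form <!-- . . . -->. Note that the double hyphens are part of the comment delimiters, and that a double hyphen may not appear anywhere inside the comment itself. A comment can be used to surround lines of code that you want to disable, provided those lines contain no comments of their own.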
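A small sketch of how a parser treats comments, again using Python's standard xml.etree.ElementTree module. The disabled unit_of_measure line and its value "each" are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one line of markup is disabled with a comment.
document = """<?xml version="1.0"?>
<!-- **** Basket **** -->
<PRODUCT>
  <product_name>Basket</product_name>
  <!-- <unit_of_measure>each</unit_of_measure> -->
</PRODUCT>"""

root = ET.fromstring(document)

# By default the parser discards comments, so the disabled element is gone.
child_tags = [child.tag for child in root]
disabled_element = root.find("unit_of_measure")
```

Only product_name survives as a child of PRODUCT; the commented-out element is never seen by the software reading the document.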

The meaning of a tag is defined in a "document type definition" (DTD). This is a body of code that defines tags through a set of "ELEMENT" declarations.

The DTD for the above example looks like this:

<!DOCTYPE PRODUCT [
<!ELEMENT PRODUCT (product_id, product_name, unit_of_measure?, specification*)>
<!ELEMENT product_id (#PCDATA)>
<!ELEMENT product_name (#PCDATA)>
<!ELEMENT unit_of_measure (#PCDATA)>
<!ELEMENT specification (variable, value)>
<!ELEMENT variable (#PCDATA)>
<!ELEMENT value (#PCDATA)>
]>

The DTD for an XML document can be either part of the document or in an external file. If it is external, the DOCTYPE statement still occurs in the document, with the argument SYSTEM "-filename-", where "-filename-" is the name of the file containing the DTD. For example, if the above declarations were in an external file called "xxx.dtd", the DOCTYPE statement would read:

<!DOCTYPE PRODUCT SYSTEM "xxx.dtd">

The file xxx.dtd would then contain only the ELEMENT declarations themselves, without a DOCTYPE line of its own.

The definition for the element PRODUCT includes a list of other elements that must follow: in this case, product_id, product_name, unit_of_measure, and specification. The "?" after unit_of_measure means that one occurrence may or may not follow. It’s optional. The "*" after specification means that it is optional as well, and that any number of occurrences may follow.

If there were a "+" after any element in the list, it would mean that at least one occurrence of the element is required, and that more may follow.

Each of the elements in the list is then defined in turn in one of the lines that follow. "#PCDATA" ("parsed character data") means that the tag will contain text, which the software reading the document parses for any markup it contains. Specification is further elaborated upon as being followed by variable and value.
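The occurrence markers can be read as a pattern over the sequence of child tags. The sketch below is one way to check that, translating a simple comma-separated content model into a regular expression. It is only an illustration: the function names are invented, and it handles flat sequences like the one above, not choices or nested groups, so it is no substitute for a validating parser.

```python
import re
import xml.etree.ElementTree as ET

def content_model_to_regex(model):
    """Translate a flat DTD sequence such as
    'product_id, product_name, unit_of_measure?, specification*'
    into a regular expression over space-terminated child names."""
    parts = []
    for item in model.split(","):
        item = item.strip()
        # A trailing ?, * or + has the same meaning in a DTD and in a regex.
        suffix = item[-1] if item[-1] in "?*+" else ""
        name = item[: len(item) - len(suffix)]
        parts.append("(?:" + re.escape(name) + " )" + suffix)
    return re.compile("".join(parts))

def children_match(element, model):
    """Check an element's direct children against the content model."""
    sequence = "".join(child.tag + " " for child in element)
    return content_model_to_regex(model).fullmatch(sequence) is not None

PRODUCT_MODEL = "product_id, product_name, unit_of_measure?, specification*"

# Hypothetical records used to exercise the check.
minimal = ET.fromstring(
    "<PRODUCT><product_id>1</product_id><product_name>Basket</product_name></PRODUCT>"
)
no_name = ET.fromstring("<PRODUCT><product_id>1</product_id></PRODUCT>")
wrong_order = ET.fromstring(
    "<PRODUCT><product_name>Basket</product_name><product_id>1</product_id></PRODUCT>"
)

minimal_ok = children_match(minimal, PRODUCT_MODEL)
no_name_ok = children_match(no_name, PRODUCT_MODEL)
wrong_order_ok = children_match(wrong_order, PRODUCT_MODEL)
```

The minimal record passes because unit_of_measure and specification may both be absent; the other two fail because product_name is required and the elements must appear in the declared order.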


XML is case sensitive. DTD keywords such as DOCTYPE, ELEMENT and #PCDATA are in all uppercase, while the "xml" in the opening declaration is in lowercase. The case of a tag name must be the same as in its DTD definition. By convention, entity/table names in the above example are all in uppercase, while attribute/column names are all in lowercase. Conventions will vary.
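Case sensitivity is easy to demonstrate with a parser. In this sketch (Python's standard xml.etree.ElementTree; the wrapping "catalog" element is hypothetical), a start tag and end tag that differ only in case are rejected, and PRODUCT and product are treated as two unrelated names.

```python
import xml.etree.ElementTree as ET

# The start and end tags differ only in case, so they do not match.
try:
    ET.fromstring("<PRODUCT><product_id>1</product_id></product>")
    mismatched_case_accepted = True
except ET.ParseError:
    mismatched_case_accepted = False

# PRODUCT and product are two different element names.
root = ET.fromstring("<catalog><PRODUCT/><product/></catalog>")
distinct_names = [child.tag for child in root]
upper_only = root.findall("PRODUCT")
```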


Tags can have attributes. For example, instead of listing associated tags as in <!ELEMENT specification (variable, value)>, above, the DTD could declare specification as an empty element with two attributes:

<!ELEMENT specification EMPTY>
<!ATTLIST specification variable CDATA #REQUIRED>
<!ATTLIST specification value CDATA #REQUIRED>

This creates "variable" and "value" as two attributes of specification, so they do not have to appear as elements in their own right. The data from the above example would then look like this:

<?xml version="1.0"?>
<!-- **** Basket **** -->

<specification variable="color" value="blue"/>
<specification variable="size" value="large"/>


Note that this places yet another design decision in the lap of the XML designer. There are advantages and disadvantages to each way of doing this.
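To make the trade-off concrete, the following sketch reads the same color specification in both styles with Python's standard xml.etree.ElementTree module. It shows only how the two forms are accessed, not which design is preferable.

```python
import xml.etree.ElementTree as ET

# Element style: variable and value are child elements.
element_style = ET.fromstring(
    "<specification><variable>color</variable><value>blue</value></specification>"
)
# Attribute style: variable and value are attributes of an empty element.
attribute_style = ET.fromstring('<specification variable="color" value="blue"/>')

# Child elements are looked up by name and read through their text.
from_elements = (element_style.findtext("variable"), element_style.findtext("value"))
# Attributes are read directly from the element that carries them.
from_attributes = (attribute_style.get("variable"), attribute_style.get("value"))
```

Both approaches yield the pair ("color", "blue"). The attribute form is more compact, while the element form leaves room for a variable or value to acquire structure of its own later.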


Three levels of correctness are associated with an XML document:

  • A "well-formed" XML document is one where the elements are properly structured as a tree, with the opening and closing tags correctly nested. Well-formed documents are essential for information exchange.
  • A "valid" XML document is well formed and has tags that correspond to the document type definition. It contains only elements and attribute values that conform to the DTD. While an XML document can be prepared and read without a DTD, a DTD is essential for establishing validity.
  • A "semantically correct" XML document is beyond the control of XML. It is incumbent upon the preparer of the document to ensure that it is logically structured and makes sense.
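The first two levels can be sketched in code; the third cannot, which is the point. The example below is a deliberate simplification: ALLOWED_CHILDREN is a hand-written stand-in for a DTD that records only which child names each element may contain, ignoring order and occurrence counts, and correctness_level is an invented helper name.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a DTD: element name -> child names it may contain.
ALLOWED_CHILDREN = {
    "PRODUCT": {"product_id", "product_name", "unit_of_measure", "specification"},
    "specification": {"variable", "value"},
    "product_id": set(),
    "product_name": set(),
    "unit_of_measure": set(),
    "variable": set(),
    "value": set(),
}

def correctness_level(text, allowed_children):
    """Return 'not well-formed', 'well-formed' or 'valid' for a document."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return "not well-formed"  # tags are not properly nested or closed
    for element in root.iter():
        allowed = allowed_children.get(element.tag)
        if allowed is None:
            return "well-formed"  # element is not declared at all
        if any(child.tag not in allowed for child in element):
            return "well-formed"  # child is not permitted here
    return "valid"

misnested = correctness_level(
    "<PRODUCT><product_id>1</PRODUCT></product_id>", ALLOWED_CHILDREN
)
undeclared = correctness_level("<PRODUCT><price>5</price></PRODUCT>", ALLOWED_CHILDREN)
conforming = correctness_level(
    "<PRODUCT><product_id>1</product_id><product_name>Basket</product_name></PRODUCT>",
    ALLOWED_CHILDREN,
)
```

No function of this kind can report "semantically correct": a valid document that lists a basket's color as "large" would pass every check here.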


The question remains, what does all this mean? The answer to that question is not obvious. Clearly web screens that display data from a database can be designed to do so more easily and with more control.

Not in the language, however, is the mechanism by which data will actually be retrieved from a database and placed in this page. If web pages are to be created with database data, software must be written to retrieve those data and create the pages. Presumably this would be in some combination of Java and SQL.
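A minimal sketch of that missing piece, written here in Python with SQL rather than Java, and using only the standard sqlite3 and xml.etree.ElementTree modules. Everything in it is assumed for illustration: the in-memory product table, its two rows, the wrapping "catalog" element, and the function name products_to_xml.

```python
import sqlite3
import xml.etree.ElementTree as ET

def products_to_xml(connection):
    """Build one PRODUCT element per row of a hypothetical product table."""
    catalog = ET.Element("catalog")
    rows = connection.execute(
        "SELECT product_id, product_name, unit_of_measure "
        "FROM product ORDER BY product_id"
    )
    for product_id, product_name, unit_of_measure in rows:
        product = ET.SubElement(catalog, "PRODUCT")
        ET.SubElement(product, "product_id").text = str(product_id)
        ET.SubElement(product, "product_name").text = product_name
        if unit_of_measure is not None:  # the element is optional in the DTD
            ET.SubElement(product, "unit_of_measure").text = unit_of_measure
    return ET.tostring(catalog, encoding="unicode")

# Set up a throwaway database with made-up rows.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE product ("
    "product_id INTEGER PRIMARY KEY, product_name TEXT, unit_of_measure TEXT)"
)
connection.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [(1, "Basket", "each"), (2, "Ribbon", None)],
)

xml_text = products_to_xml(connection)
```

The resulting xml_text holds two PRODUCT elements, the second without a unit_of_measure because that column is null for its row.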

In addition, a standard browser, by definition, cannot properly interpret customized tags.

This can be addressed in one of three ways:

  1. Software "applets" may be written and attached to the page. These would understand the data structure and respond accordingly to each tag.
  2. Generic software may read the DTD and respond to tags accordingly. In this case, the response would be limited to what can be inferred from the DTD.
  3. A community may define a set of tags for its purposes, agree to use them, and develop community-specific software to respond to them.
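The third option amounts to a shared table of tag names and the behavior agreed for each. The sketch below shows the idea in Python with the standard xml.etree.ElementTree module; the handler functions, the HANDLERS table and the sample record are all hypothetical.

```python
import xml.etree.ElementTree as ET

def show_product_name(element):
    """Agreed presentation for a product_name tag."""
    return "Product: " + (element.text or "")

def show_specification(element):
    """Agreed presentation for a specification tag."""
    return "  " + element.findtext("variable", "") + " = " + element.findtext("value", "")

# Hypothetical community agreement: tag name -> what the software does with it.
HANDLERS = {
    "product_name": show_product_name,
    "specification": show_specification,
}

def render(text):
    """Walk the document in order and apply the handler for each known tag."""
    lines = []
    for element in ET.fromstring(text).iter():
        handler = HANDLERS.get(element.tag)
        if handler is not None:  # tags without a handler are skipped
            lines.append(handler(element))
    return lines

rendered = render(
    "<PRODUCT><product_id>1</product_id><product_name>Basket</product_name>"
    "<specification><variable>color</variable><value>blue</value></specification>"
    "</PRODUCT>"
)
```

Tags outside the agreement, such as product_id here, are silently skipped, which is roughly the position a standard browser is in with every customized tag.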

Presumably the first two options will be in Java or a similar language, but the standard tools for doing this remain to be written. The third option has already begun to take effect. For example, the chemical industry has set up an XML-based Chemical Markup Language, and astronomers, mathematicians and the like have similarly defined sets of tags for describing things in their respective fields.

Used to Describe Data

One feature of XML that has captured the industry's imagination is its ability to describe data structures and hold data. As was seen in the above example, with XML, you can define new tags specifically to describe the equivalent of tables and columns in a relational database structure. More significantly, the tags for a set of columns or attributes can be related to the tags for their parent table or entity.

While the tag structure does seem to be a good vehicle for describing and communicating database structure, the requirement for discipline in the way we organize data is more present than ever. XML doesn’t care if we have repeating groups, monstrous data structures, or whatever. If we are to use XML to express a data structure, it is incumbent upon us to do as good a job with the tool as we can.

Following in the tradition of the chemists and astronomers described above, the Object Management Group (OMG) has settled on a set of XML tags they call the XML Metadata Interchange (XMI) as a way to describe in standard terms the structure of data about data ("metadata"). This is useful in communicating between CASE tools, and in describing a "metadata repository". Along the same lines, a group of companies is in the process of defining a Common Warehouse Metadata Interchange (CWMI) that comprises a subset of the XMI tags to support data warehouses.

This means that there are actually two ways that a database structure can be described in XML:

First, an application database can be described in the DTD of an XML document. In this case the operational data contained in the described database could be placed between sets of the described tags. The DTD could, for example, be generated by one CASE tool and read by another one as a way of communicating data structure from one to the other.
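As a sketch of this first approach, the function below generates DTD declarations from a table definition, much as a CASE tool might. The function name and the (column_name, required) input format are assumptions made for the illustration, and every column is simply treated as text.

```python
def table_to_dtd(table_name, columns):
    """Generate ELEMENT declarations for one table.

    columns is a list of (column_name, required) pairs in column order;
    optional columns are marked with '?' in the content model.
    """
    model = ", ".join(
        name if required else name + "?" for name, required in columns
    )
    lines = ["<!ELEMENT " + table_name + " (" + model + ")>"]
    for name, _required in columns:
        # Every column is declared as plain text content.
        lines.append("<!ELEMENT " + name + " (#PCDATA)>")
    return "\n".join(lines)

dtd_text = table_to_dtd(
    "PRODUCT",
    [("product_id", True), ("product_name", True), ("unit_of_measure", False)],
)
```

The first generated line is the PRODUCT declaration with unit_of_measure marked optional, followed by one #PCDATA declaration per column.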

A second approach is to make the table and column definitions data that appear between tags of an XMI metamodel. This is a little more arcane, since the XMI metamodel is very abstract, but using the XMI metamodel allows for description of much more than tables and columns.

Note, however, that the issue in defining a metadata repository or communicating between CASE tools is not the use of XML or any other particular language. The issue is the database structure and its semantics. The important question is not how a universal metadata repository will be represented. It could as easily be represented by a set of relational tables or an entity/relationship diagram. The questions are, what’s in it and what does it mean? XML by itself does not answer that question. Which objects are significant and should be described? That is the harder question. Having a new language for describing them doesn't seem to contribute to that conversation.

Indeed, in recognizing that XML is a good vehicle for describing database structure, the issue that seems most obvious is that this will put greater responsibility on data administrators to define data correctly. XML will not do that. XML will only record whatever data design (good or bad) human beings come up with.

As Clive Finkelstein has said, the advent of XML is going to make data modelers and designers even more important than they are now. "After fifteen years of obscurity, data modelers can finally become overnight successes."


Sidebar: About HTML

Around 1990, Tim Berners-Lee of the European particle physics laboratory (CERN) invented the World Wide Web, and with it the Hypertext Markup Language (HTML). HTML is a language used to create web pages, using a series of "tags" which instruct the software reading it how to present the material.

The pages created using HTML are read by a piece of software called a "browser". This software knows how to interpret the tags and present the results on a computer monitor.

Typical contents of a web page might be:

<font face="cg omega" size=4>
<img src="images/blank.gif" width=50 height=25>
Essential Strategies, Inc. is a consulting firm specializing in helping companies use <i><b> enterprise architecture</b></I>

Each tag is surrounded by greater than and less than brackets (<>), and conveys a single instruction to the browser. Some tags have attributes, conveying additional information.

In the example, <font> specifies characteristics of the font to be used in all text between the tag and the </font> tag. (In general, a tag is in effect until a corresponding end tag (the same word, preceded by a "/") is encountered.) Two attributes are shown, one describing the typeface to use ("cg omega"), and one describing the relative size of the text (4). Note that a user can tell a browser some of the characteristics of the page. The text size specified in the web page, for example, is only a relative number (1-7). The actual size is determined by browser settings.

The tag <br> inserts a line break.

The <img> tag retrieves a graphics file and displays it. In this case the file is "images/blank.gif", which is a blank rectangle. Other attributes of the <img> tag specify its width and height. Since line spacing can only be specified in terms of whole lines skipped, and there is no provision for indenting the first line of a paragraph, this is a trick for putting a half line of space (25 pixels) between two lines, and indenting the first line by 50 pixels.

In the middle of the text <i> and <b> cause the words which follow to be in italics and bold face, respectively. The tags </b> and </i> turn off the bold face and italics.
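For readers who want to see the tags and attributes as software sees them, this sketch feeds the sample page above to Python's standard html.parser module. The TagLister class is an invented helper that just records what the parser reports.

```python
from html.parser import HTMLParser

class TagLister(HTMLParser):
    """Record every start tag (with its attributes) and every end tag."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.end_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.end_tags.append(tag)

# The sample page contents from the sidebar, copied as written.
page = (
    '<font face="cg omega" size=4>\n'
    '<img src="images/blank.gif" width=50 height=25>\n'
    "Essential Strategies, Inc. is a consulting firm specializing in helping "
    "companies use <i><b> enterprise architecture</b></I>"
)

lister = TagLister()
lister.feed(page)
lister.close()

tag_names = [name for name, _attributes in lister.start_tags]
font_attributes = lister.start_tags[0][1]
```

The parser reports the closing </I> as "i": HTML tag names are not case sensitive, and the unquoted size=4 is accepted as well. Both would be errors in XML.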

Tags can also be used to bring in another document from anywhere else on the web. Specifically, the tag

<a HREF="">netscape</a>

displays the word "netscape" (the value between the <a> tag and the </a> tag), and if the user places the cursor on that word and clicks, the page at "" will appear.

About the Author . . . David Hay

A thirty year veteran of the Information Industry, Dave Hay has been producing data models to support strategic information planning and requirements planning for over twelve years. He has worked in a variety of industries, including, among others, power generation, clinical pharmaceutical research, oil refining, forestry, and broadcast. He is President of Essential Strategies, Inc., a consulting firm dedicated to helping clients define corporate information architecture, identify requirements, and plan strategies for the implementation of new systems. He is the author of the book, Data Model Patterns: Conventions of Thought, recently published by Dorset House, and producer of Data Model Patterns: Data Architecture in a Box™, an Oracle Designer repository containing his model templates.

He may be reached by phone at +1 (713) 464-8316.

The author wishes to express gratitude to Clive Finkelstein who introduced him to XML and graciously took time to proof-read and correct this article.

Copyright 1999 David Hay, Essential Strategies, Inc


(c) Copyright 1995-2015 Clive Finkelstein. All Rights Reserved.