OK, So What is This XML Thing?
David Hay, Essential Strategies, Inc.
You've heard about it. It doesn't mean
"extra medium large" on a shirt. It has something to do with the web. It has
something to do with metadata. But what is it?
The "Extensible Markup Language" (XML) is a
document description language, much like "Hypertext Markup Language" (HTML) used
to construct web pages. It is much more versatile than HTML, however, and as such it has
profound implications for how we view what the web is and what it can do.
Books describing XML tend to weigh in at
five pounds or more, which is a shame, since the basic structure and purpose of the
language isn't nearly that big. This article is a much briefer presentation (in
English) of the essential concepts involved.
It's More Than HTML
HTML is a language used to create web
pages, using a series of "tags" which instruct the software reading it how to
present the material. See sidebar for a further description of HTML.
Like HTML, XML is a system of tags that
describe components of a document. In its simplest incarnation, it could be viewed simply
as an advanced version of HTML. In fact it is not: XML is a simplified subset of
something called "Standard Generalized Markup Language", or SGML, while HTML is a
single tag set defined using SGML. SGML is a
sophisticated tag language, which, "due to [its] complexity, and the complexity of
the tools required," as the Object Management Group has so delicately put it,
"has not achieved widespread uptake."
HTML consists of a set of predefined
"tags" that instruct a piece of software called a "browser" to do
certain things with the document. Typically these tags describe aspects of presentation,
such as font style and size, line spacing, and so forth. Some tags, however, also identify
links to other pages, drawings, artwork and so forth. The point is that every browser used
by everyone on the internet knows how to interpret these tags and what to do with them.
Since these tags are primarily concerned with presentation of the data, however, it is not
possible to use tags to describe data structure or in any other way to describe the
contents of a document.
XML allows tags to be defined by users.
This gives users tremendous power to describe the structure and nature of the information
presented in a document. This means, however, that standard browsers will not be able to
do anything with these extensions. This makes the software environment for XML more
complex, as described below.
Unlike HTML, XML does not allow description
of the presentation of data. An associated language, "Extensible Stylesheet Language"
(XSL), must be used to address this.
What is it?
Here is an example of XML used to describe
a data record that might be presented in a document:
<!-- **** Basket **** -->
Note a few interesting things about this example.
First of all, as with HTML, each tag is
surrounded by less than and greater than brackets (<>), and is usually followed by
text. The text is in turn followed by an end tag, in the form </...>. A tag may have
no content, in which case either the end tag follows immediately upon the tag (as in
<specification></specification>), or the tag itself ends with a forward slash
(as in <specification/>). Unlike with HTML, however, the end tag is always required.
A second thing to note is that, in this
case, following the tag for product, a set of related tags follows, describing
characteristics (columns, in this case) of product. In this particular case, the tag
<PRODUCT> has been defined such that it must be followed by exactly one tag for
<product_id> and one for <product_name>. You can't see this from the
example, but <unit_of_measure> is optional. The tag <specification> is also
optional, and there also may be one or more occurrences of it.
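As a minimal sketch of how software sees such a record, the following Python fragment parses a PRODUCT element with the standard library's xml.etree.ElementTree module. The tag names come from the discussion above; the data values and the helper name summarize are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: tag names follow the article, the values are invented.
RECORD = """<?xml version="1.0"?>
<PRODUCT>
  <product_id>1234</product_id>
  <product_name>Basket</product_name>
  <unit_of_measure>each</unit_of_measure>
  <specification><variable>color</variable><value>brown</value></specification>
</PRODUCT>"""


def summarize(product):
    """Return the product's simple fields and its list of specifications."""
    # Children without sub-elements are treated as plain columns.
    fields = {child.tag: child.text for child in product if len(child) == 0}
    specs = [(s.findtext("variable"), s.findtext("value"))
             for s in product.findall("specification")]
    return fields, specs


fields, specs = summarize(ET.fromstring(RECORD))

# The two ways of writing an empty tag parse to the same thing.
empty_a = ET.fromstring("<specification></specification>")
empty_b = ET.fromstring("<specification/>")
```

Here fields holds the three simple columns and specs holds one (variable, value) pair. The two empty-tag spellings produce elements with no text and no children.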
XML documents should begin with the declaration <?xml
version="1.0"?> (or whatever version number is appropriate). The "xml" here is
written in lowercase.
Comments are in the form <!-- . . . -->.
Note that the double hyphens must be part of the comment. Note also that, unlike
HTML, XML lets you use a comment to surround lines of code that you want to disable.
The meaning of a tag is defined in a
"document type definition" (DTD). This is a body of code that defines tags
through a set of "ELEMENT" declarations.
The DTD for the above example looks like this:
<!ELEMENT PRODUCT (product_id, product_name, unit_of_measure?, specification*)>
<!ELEMENT product_id (#PCDATA)>
<!ELEMENT product_name (#PCDATA)>
<!ELEMENT unit_of_measure (#PCDATA)>
<!ELEMENT specification (variable, value)>
<!ELEMENT variable (#PCDATA)>
<!ELEMENT value (#PCDATA)>
The DTD for an XML document can be either
part of the document or in an external file. If it is external, the DOCTYPE statement
still occurs in the document, with the argument SYSTEM "-filename-", where
"-filename-" is the name of the file containing the DTD. For example, if the
above DTD were in an external file called "xxx.dtd", the DOCTYPE statement would
be:
<!DOCTYPE PRODUCT SYSTEM "xxx.dtd">
The file xxx.dtd would then contain only the element declarations shown above;
the DOCTYPE statement itself appears only in the document.
The definition for the element PRODUCT
includes a list of other elements that must follow: in this case, product_id,
product_name, unit_of_measure, and specification. The "?" after unit_of_measure
means that one occurrence may or may not follow. It's optional. The "*"
after specification means that it is optional, but that any number of occurrences may follow.
If there were a "+" after any
element in the list, it would mean the element is not optional, and that there may
be more than one occurrence of it.
Each of the elements in the list is then
defined in turn in one of the lines that follow. "#PCDATA" means that the tag
will contain text that can be parsed by browsing software. Specification is further
elaborated upon as being followed by variable and value.
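The occurrence markers map naturally onto regular-expression operators, which suggests one rough way to check a record against its element definition. The sketch below is an illustration only, not how a real validating parser works: it handles just flat, comma-separated lists like the ones in this DTD, and the names CONTENT_MODELS, model_to_regex and children_match are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Content models copied from the DTD above (flat sequences only).
CONTENT_MODELS = {
    "PRODUCT": "product_id, product_name, unit_of_measure?, specification*",
    "specification": "variable, value",
}


def model_to_regex(model):
    """Translate 'a, b?, c*' into a regex over space-terminated tag names."""
    parts = []
    for item in model.split(","):
        item = item.strip()
        suffix = item[-1] if item[-1] in "?*+" else ""  # occurrence marker, if any
        name = item.rstrip("?*+")
        parts.append("(?:%s )%s" % (re.escape(name), suffix))
    return re.compile("".join(parts))


def children_match(element, model):
    """True if the element's children appear in the order and number allowed."""
    child_names = "".join(child.tag + " " for child in element)
    return model_to_regex(model).fullmatch(child_names) is not None


good = ET.fromstring(
    "<PRODUCT><product_id>1</product_id><product_name>Basket</product_name>"
    "<specification><variable>color</variable><value>brown</value></specification>"
    "</PRODUCT>")
missing_name = ET.fromstring("<PRODUCT><product_id>1</product_id></PRODUCT>")
```

With these definitions, children_match(good, CONTENT_MODELS["PRODUCT"]) is true, while the record without a product_name is rejected.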
XML is case sensitive. DTD keywords such as ELEMENT and ATTLIST
are in all uppercase. The case of a tag name must be the same as in its DTD definition.
By convention, entity/table names in the above example are all in uppercase, while
attribute/column names are all in lowercase. Conventions will vary.
Tags can have attributes. For example,
instead of listing associated tags in defining <!ELEMENT specification (variable,
value)>, above, the following lines could be added to the DTD:
<!ATTLIST specification variable CDATA #REQUIRED>
<!ATTLIST specification value CDATA #REQUIRED>
This creates "variable" and "value" as
two attributes of specification, so they do not have to appear as elements in their own
right. The data from the above example would then look like this:
<!-- **** Basket **** -->
Note that this puts yet another design
decision in the lap of the XML designer. There are advantages and disadvantages to each
way of doing this.
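To make the trade-off concrete, here is a small Python sketch that reads a specification written either way. Only the tag and attribute names come from the article; the values and the function name read_spec are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical values; the same specification written in both styles.
AS_ELEMENTS = ("<specification><variable>color</variable>"
               "<value>brown</value></specification>")
AS_ATTRIBUTES = '<specification variable="color" value="brown"/>'


def read_spec(xml_text):
    """Return (variable, value) from either representation."""
    spec = ET.fromstring(xml_text)
    if "variable" in spec.attrib:
        # Attribute style: the data lives inside the start tag.
        return spec.get("variable"), spec.get("value")
    # Element style: the data lives in child elements.
    return spec.findtext("variable"), spec.findtext("value")
```

Both forms carry the same information, but software reading the document has to know which style the designer chose.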
Three levels of correctness are associated
with an XML document:
- A "well-formed" XML document is one where the
elements are properly structured as a tree, with the opening and closing tags correctly
nested. Well-formed documents are essential for information exchange.
- A "valid" XML document is well formed and has tags
that correspond to the document type definition. It contains only elements and attribute
values that conform to the DTD. While an XML document can be prepared and read without a
DTD, a DTD is essential for establishing validity.
- A "semantically correct" XML document is beyond
the control of XML. It is incumbent upon the preparer of the document to ensure that it is
logically structured and makes sense.
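The first level is the easiest to test mechanically. As a minimal sketch, Python's standard xml.etree.ElementTree parser rejects any document that is not well formed; the function name is_well_formed is hypothetical. Note that this parser does not check validity against a DTD, which would require a validating parser.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """True if the text parses as XML with correctly nested tags."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


nested_ok = "<PRODUCT><product_id>1</product_id></PRODUCT>"
nested_wrong = "<PRODUCT><product_id>1</PRODUCT></product_id>"
```

The first string passes; the second fails because its closing tags are in the wrong order.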
The question remains, what does all this
mean? The answer to that question is not obvious. Clearly web screens that display data
from a database can be designed to do so more easily and with more control.
Not in the language, however, is the
mechanism by which data will actually be retrieved from a database and placed in this
page. If web pages are to be created with database data, software must be written to
retrieve those data and create the pages. Presumably this would be in some combination of
Java and SQL.
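The same idea can be sketched in Python instead of Java, using the standard sqlite3 and xml.etree.ElementTree modules. Everything here is an assumption made for illustration: the product table, its rows, the wrapper element "catalog", and the function name products_to_xml.

```python
import sqlite3
import xml.etree.ElementTree as ET


def products_to_xml(connection):
    """Build one <PRODUCT> element per row of a hypothetical product table."""
    catalog = ET.Element("catalog")  # hypothetical wrapper element
    rows = connection.execute(
        "SELECT product_id, product_name, unit_of_measure "
        "FROM product ORDER BY product_id")
    for product_id, product_name, unit in rows:
        product = ET.SubElement(catalog, "PRODUCT")
        ET.SubElement(product, "product_id").text = str(product_id)
        ET.SubElement(product, "product_name").text = product_name
        if unit is not None:  # unit_of_measure is optional
            ET.SubElement(product, "unit_of_measure").text = unit
    return catalog


# A throwaway in-memory database standing in for the real one.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE product (product_id INTEGER, product_name TEXT, unit_of_measure TEXT)")
connection.execute("INSERT INTO product VALUES (1, 'Basket', 'each')")
connection.execute("INSERT INTO product VALUES (2, 'Ribbon', NULL)")

document = ET.tostring(products_to_xml(connection), encoding="unicode")
```

The result is a string of XML in which each database row has become a PRODUCT element, with the optional column left out where the database holds no value.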
In addition, a standard browser, by
definition, cannot properly interpret customized tags.
This can be addressed in one of three ways:
- Software "applets" may be written and attached to
the page. These would understand the data structure and respond accordingly to each tag.
- Generic software may read the DTD and respond to tags
accordingly. In this case, the response would be limited to what can be inferred from the
DTD.
- A community may define a set of tags for its purposes, agree
to use them, and develop community-specific software to respond to them.
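Whichever option is chosen, the software ends up associating each tag with some behavior. The following Python sketch shows one simple way this might look, with a table of handlers keyed by tag name. The handlers, their output text, and the names HANDLERS and render are all hypothetical.

```python
import xml.etree.ElementTree as ET


# Hypothetical handlers that a community might agree on for its own tags.
def show_name(element):
    return "Product: " + element.text


def show_unit(element):
    return "Sold by: " + element.text


HANDLERS = {"product_name": show_name, "unit_of_measure": show_unit}


def render(xml_text):
    """Walk the document and apply a handler to every tag that has one."""
    lines = []
    for element in ET.fromstring(xml_text).iter():
        handler = HANDLERS.get(element.tag)
        if handler is not None:  # tags without a handler are skipped
            lines.append(handler(element))
    return lines


output = render(
    "<PRODUCT><product_id>1</product_id><product_name>Basket</product_name>"
    "<unit_of_measure>each</unit_of_measure></PRODUCT>")
```

Tags that no handler recognizes (product_id here) are simply ignored, which is roughly what a standard browser does with tags it does not know.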
Presumably the first two options will be in
Java or a similar language, but the standard tools for doing this remain to be written.
The third option has already begun to take effect. For example, the chemical industry has
set up an XML-based Chemical Markup Language, and astronomers, mathematicians and
the like have similarly defined sets of tags for describing things in their respective
fields.
Used to Describe Data
One feature of XML that has captured the
industry's imagination is its ability to describe data structures and hold data. As was
seen in the above example, with XML, you can define new tags specifically to describe the
equivalent of tables and columns in a relational database structure. More significantly,
the tags for a set of columns or attributes can be related to the tags for their parent
table or entity.
While the tag structure does seem to be a
good vehicle for describing and communicating database structure, the requirement for
discipline in the way we organize data is more present than ever. XML doesn't care if
we have repeating groups, monstrous data structures, or whatever. If we are to use XML to
express a data structure, it is incumbent upon us to do as good a job with the tool as we
can.
Following in the tradition of the chemists
and astronomers described above, the Object Management Group (OMG) has settled on a set of
XML tags they call the XML Metadata Interchange (XMI) as a way to describe in standard
terms the structure of data about data ("metadata"). This is useful in
communicating between CASE tools, and in describing a "metadata repository".
Along the same lines, a group of companies is in the process of defining a Common
Warehouse Metadata Interchange (CWMI) that comprises a subset of the XMI tags to support
data warehousing.
This means that there are actually two ways
that a database structure can be described in XML:
First, an application database can be
described in the DTD of an XML document. In this case the operational data contained in
the described database could be placed between sets of the described tags. The DTD could,
for example, be generated by one CASE tool and read by another one as a way of
communicating data structure from one to the other.
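As a rough sketch of this first approach, the Python function below turns a simple table description into DTD element declarations of the kind shown earlier. The function name table_to_dtd and the (name, required) input format are assumptions for illustration; a real CASE tool would carry far more detail.

```python
def table_to_dtd(table, columns):
    """Generate DTD declarations for a table.

    columns is a list of (column_name, required) pairs; optional
    columns get the "?" occurrence marker.
    """
    model = ", ".join(name if required else name + "?"
                      for name, required in columns)
    lines = ["<!ELEMENT %s (%s)>" % (table, model)]
    # Every column is declared as plain parsed text.
    lines += ["<!ELEMENT %s (#PCDATA)>" % name for name, _ in columns]
    return "\n".join(lines)


dtd = table_to_dtd("PRODUCT", [("product_id", True),
                               ("product_name", True),
                               ("unit_of_measure", False)])
```

The generated text contains one declaration for the table and one for each column, which another tool could read back to recover the structure.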
A second approach is to make the table and
column definitions data that appear between tags of an XMI metamodel. (This is a little
more arcane, since the XMI metamodel is very abstract, but using the XMI metamodel
allows for description of much more than tables and columns.)
Note, however, that the issue in defining a
metadata repository or communicating between CASE tools is not the use of XML or any other
particular language. The issue is the database structure and its semantics. The important
question is not how a universal metadata repository will be represented. It could
as easily be represented by a set of relational tables or an entity/relationship diagram.
The questions are, what's in it and what does it mean? XML by itself
does not answer those questions. Which objects are significant and should be described? That
is the harder question. Having a new language for describing them doesn't seem to
contribute to that conversation.
Indeed, in recognizing that XML is a good
vehicle for describing database structure, the issue that seems most obvious is that this
will put greater responsibility on data administrators to define data correctly. XML will
not do that. XML will only record whatever data design (good or bad) human beings come up
with.
As Clive Finkelstein has said, the advent
of XML is going to make data modelers and designers even more important than they are now.
"After fifteen years of obscurity, data modelers can finally become overnight
Sidebar: About HTML
In 1992, Tim Berners-Lee of the
European High-Energy Particle Physics Lab (CERN) invented the World Wide Web, and with it
the Hypertext Markup Language (HTML). HTML is a language used to create web pages, using a
series of "tags" which instruct the software reading it how to present the material.
The pages created using HTML are read by a
piece of software called a "browser". This software knows how to interpret the
tags and present the results on a computer monitor.
Typical contents of a web page might be:
<font face="cg omega" size=4>
<img src="images/blank.gif" width=50 height=25>
Essential Strategies, Inc. is a consulting firm specializing in helping companies use
<i><b> enterprise architecture</b></I>
Each tag is surrounded by greater than and
less than brackets (<>), and conveys a single instruction to the browser. Some tags
have attributes, conveying additional information.
In the example, <font> specifies
characteristics of the font to be used in all text between the tag and the </font>
tag. (In general, a tag is in effect until a corresponding end tag (the same word,
preceded by a "/") is encountered.) Two attributes are shown, one describing the
typeface to use ("cg omega"), and one describing the relative size of the text
(4). Note that a user can tell a browser some of the characteristics of the page. The text
size specified in the web page, for example, is only a relative number (1-7). The actual
size is determined by browser settings.
The tag <br> inserts a line break.
The <img> tag retrieves a graphics
file and displays it. In this case the file is "images/blank.gif", which is a
blank rectangle. Other attributes of the <img> tag specify its width and height.
Since line spacing can only be specified in terms of whole lines skipped, and there is no
provision for indenting the first line of a paragraph, this is a trick for putting a half
line of space (25 pixels) between two lines, and indenting the first line by 50 pixels.
In the middle of the text <i> and
<b> cause the words which follow to be in italics and bold face, respectively. The
tags </b> and </i> turn off the bold face and italics.
Tags can also be used to bring in another
document from anywhere else on the web. Specifically, the tag
<a href="http://www.netscape.com">netscape</a>
displays the word "netscape" (the
value between the <a> tag and the </a> tag), and if the user places the cursor
on that word and clicks, the page at "http://www.netscape.com" will appear.
About the Author . . . David Hay
A thirty-year veteran of the Information
Industry, Dave Hay has been producing data models to support strategic information
planning and requirements planning for over twelve years. He has worked in a variety of
industries, including, among others, power generation, clinical pharmaceutical research,
oil refining, forestry, and broadcast. He is President of Essential Strategies, Inc., a
consulting firm dedicated to helping clients define corporate information architecture,
identify requirements, and plan strategies for the implementation of new systems. He is
the author of the book, Data Model Patterns: Conventions of Thought, recently
published by Dorset House, and producer of Data Model Patterns: Data Architecture in
a Box, an Oracle Designer repository containing his model templates.
He may be reached at email@example.com, phone:
+1 (713) 464-8316, or http://www.essentialstrategies.com.
The author wishes to express gratitude to Clive Finkelstein
who introduced him to XML and graciously took time to proof-read and correct this article.
Copyright © 1999 David Hay, Essential Strategies, Inc.