XML - The Future of Metadata
Clive Finkelstein
Extract from "Building Corporate Portals
with XML"
by Clive Finkelstein and Peter Aiken,
McGraw-Hill (Sep 1999) [ISBN: 0-07-913705-9]
Copyright © 1999,
The McGraw-Hill Companies, Inc. All rights reserved.
This
paper is based on an extract from the book: “Building
Corporate Portals with XML”, by Clive Finkelstein and Peter
Aiken, published by McGraw-Hill in September 1999. The paper
addresses one of the most significant developments of the Computer
industry for the future. It shows how Metadata and Data
Administration will shortly move into the mainstream and become
one of the most important aspects of the WWW, and of systems
development in general. The paper introduces the Extensible Markup
Language (XML) – the successor to HTML for the Internet, for
corporate Intranets and for Extranets. XML incorporates Metadata
in any document, to define the content and structure of that
document and any associated (or linked) resources. It has the
potential to transform integration of structured data (such as in
relational databases or legacy files) with unstructured data (such
as in text documents, reports, email, graphics, images, audio and
video files) for innovative application integration opportunities.
Corporate
Portals (also called Enterprise Portals - EPs) are based on Data
Warehousing technologies, using Metadata
and the Extensible Markup
Language (XML) to integrate both structured and unstructured
data throughout an enterprise. Metadata, XML and EPs will be vital
elements of the 21st century enterprise.
Structured
data exists in databases and data
files that are used by current and older operational systems in an
enterprise. We call these older systems legacy
systems; we call the
data they use legacy data. In most enterprises, structured data comprises only 10%
of the data, information and knowledge resources of the business;
the other 90% exists as unstructured
data in textual documents, or as graphics and images, or in
audio or video formats. These unstructured data sources are not
easily accessible to Data Warehouses, but EPs use metadata and XML
to integrate both structured and unstructured data seamlessly, for
easy access throughout the enterprise.
1. What are Metadata and XML?
IT staff in most
enterprises have a common problem. How can they convince managers
to plan, budget and apply resources for metadata management? What
is metadata and why is it important? What technologies are
involved? Internet and Intranet technologies are part of the
answer and will get the immediate attention of management. XML is
the other technology. The following analogy may help you outline
to management the important role that metadata takes in an
enterprise.
1.1 What is Metadata?
Every country is
now interconnected in a vast, global telephone network. We are now
able to telephone anywhere in the world. We can phone a number,
and the telephone assigned to that number would ring in Russia, or
China, or in Outer Mongolia. But when it is answered, we may not
understand the person at the other end. They may speak a different
language. So we can be connected, but what is said has no meaning.
We cannot share information.
Today, we also
use a computer and the World Wide Web. We enter a web site address
into a browser on our desktop machine – a unique address in
words that is analogous to a telephone number. We can then be
connected immediately to a computer assigned to that address and
attached to the Internet anywhere in the world. That computer
sends a web page based on the address we have supplied, to be
displayed in our browser. This is typically in English, but may be
in another language. We are connected, but like the telephone
analogy – if it is in another language, what is said has no
meaning. We cannot share information.
Now consider the
reason why it is difficult for some of the systems used in an
organization to communicate with and share information with other
systems. Technically, the programs in each system are able to be
interconnected and so can communicate with other programs. But
they use different terms to refer to the same data that needs to
be shared. For example, an accounting system may use the term
“customer” to refer to a person or organization that buys
products or services. Another system may refer to the same person
or organization as a “client”. Sales may use the term
“prospect”. They all use different terminology – different
language – to refer to the same data and information. But if
they use the wrong language, again they cannot share information.
The problem is
even worse. Consider terminology used in different parts of the
business. Accountants use a “jargon” – a technical language
– which is difficult for non-accountants to understand. So also
the jargon used by engineers, or production people, or sales and
marketing people, or managers is difficult for others to
understand. They all speak a different “language”. What is
said has no meaning. They cannot easily share common information.
In fact in some enterprises it is a miracle that people manage to
communicate meaning at all!
Each
organization has its own internal language, its own jargon, which
has evolved over time so similar people can communicate meaning.
As we saw above, there can be more than one language or jargon
used in an organization. Metadata identifies an organization’s
own “language”. Where different terms refer to the same thing,
a common term is agreed for all to use. Then people can
communicate more clearly. And systems and programs can
intercommunicate with meaning. But without a clear definition and
without common use of an organization’s metadata, information
cannot be shared effectively throughout the enterprise.
Previously each
part of the business maintained its own version of “customer”,
or “client” or “prospect”. They defined processes – and
assigned staff – to add new customers, clients or prospects to
their own files and databases. When common details about
customers, clients or prospects changed, each redundant version of
that data also had to be changed, and staff were required to make
those changes. Yet these are all redundant processes making the same
changes to redundant data versions. This is enormously expensive
in time and people. It is also quite unnecessary.
The importance
of metadata can now be seen. Metadata
defines the common language used within an enterprise so that all
people, systems and programs can communicate precisely.
Confusion disappears. Common data is shared. And enormous cost
savings are made, because the redundant processes (used to
keep redundant data versions up to date) are eliminated as
the redundant data versions are integrated into a common data
version for all to share.
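The idea of agreeing a common term can be sketched in a few lines of Python. This is only an illustration: the mapping table, the function names and the sample records are assumptions made for the example, not part of any real system described in the paper.

```python
# Hypothetical mapping from each system's local term to the agreed common term.
COMMON_TERMS = {
    "customer": "customer",  # term used by the accounting system
    "client": "customer",    # another system's name for the same thing
    "prospect": "customer",  # term used by sales
}


def to_common_term(local_term):
    """Return the agreed enterprise-wide term for a system-specific term."""
    return COMMON_TERMS.get(local_term.lower(), local_term)


def normalize_record(record):
    """Rename the keys of a record so every system uses the common terms."""
    return {to_common_term(key): value for key, value in record.items()}


# Two systems describing the same organization in their own "language".
accounting_record = {"customer": "XYZ Corporation"}
sales_record = {"Prospect": "XYZ Corporation"}

print(normalize_record(accounting_record))
print(normalize_record(sales_record))
```

Once both records are expressed in the common term they can be compared and merged, which is what allows the redundant versions to be integrated.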
1.2 What is XML?
Considerable effort has
previously gone into the definition and implementation of Electronic Data Interchange (EDI) standards to address the problem
of intercommunication between dissimilar systems and databases.
EDI has now been widely used for business-to-business commerce for
many years. It works well, but it is quite complex and very
expensive. As a result, it is cost-justifiable generally only for
large corporations.
Once an
organization’s metadata is defined and documented, all programs
can use it to communicate. EDI was the mechanism that was used
previously. But now this intercommunication has become much
easier.
Extensible
Markup Language (XML) is a new
Internet technology that has been developed to address this
problem. XML can be used to document the metadata used by one
system so that it can be integrated with the metadata used by
other systems. This is analogous to language dictionaries that are
used throughout the world, so that people from different countries
can communicate. Legacy files and other databases can now be
integrated more readily. Systems throughout the business can now
coordinate their activities more effectively as a direct result of
XML and management support for metadata.
XML now provides
the capability that was previously only available to large
organizations through the use of EDI. XML allows the metadata used
by each program and database to be published as the language to be
used for this intercommunication. But distinct from EDI, XML is
simple to use and inexpensive to implement for both small and
large organizations. Because of this simplicity, we like to think
of XML as:
“XML
is EDI for the Rest of Us”
XML
will become a major part of the application development
mainstream. It provides a bridge between structured and
unstructured data, delivered via XML then converted to HTML for
display in web browsers. Together with metadata, XML is a key
component in the design, development and deployment of Enterprise
Portals.
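As a rough sketch of "delivered via XML then converted to HTML", the Python fragment below turns a small XML record into an HTML table using only the standard library. The sample record and the function name xml_to_html_table are assumptions made for illustration; a production site would more likely use a style sheet language for this step.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical customer record, used only to illustrate the conversion.
CUSTOMER_XML = """<CUSTOMER>
  <customer_name>XYZ Corporation</customer_name>
  <city>Any Town</city>
</CUSTOMER>"""


def xml_to_html_table(xml_text):
    """Render each child element of the root as one row of an HTML table."""
    root = ET.fromstring(xml_text)
    rows = []
    for child in root:
        # The tag name (metadata) becomes the row heading, the text the cell.
        rows.append(
            f"<tr><th>{escape(child.tag)}</th>"
            f"<td>{escape(child.text or '')}</td></tr>"
        )
    return "<table>" + "".join(rows) + "</table>"


html_page = xml_to_html_table(CUSTOMER_XML)
print(html_page)
```

The data and its meaning stay in the XML; only the final rendering step knows anything about HTML layout.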
1.3 How Is Metadata Used with XML?
Metadata is used
to define the structure of an XML document or file. Metadata is
published in a Document Type
Definition (DTD) file for reference by other systems. A DTD
file defines the structure of an XML file or document. It is
analogous to the Data
Definition Language (DDL) file that is used to define the
structure of a database, but with a different syntax.
An example of an
XML document identifying data retrieved from a PERSON database is
illustrated in Figure 1. This includes metadata markup tags
(surrounded by < … >, such as <person_name>) that
provide various details about a person. From this, we can see that
it is easy to find specific contact information in <contact_details>,
such as <email>, <phone>, <fax> and
<mobile> (cell phone) numbers.
<PERSON person_id="p1100" sex="M">
  <person_name>
    <given_name>Clive</given_name>
    <surname>Finkelstein</surname>
  </person_name>
  <company>
    Information Engineering Services Pty Ltd
  </company>
  <country>Australia</country>
  <contact_details>
    <email>cfink@ies.aust.com</email>
    <phone>+61-8-9402-8300</phone>
    <phone>(08) 9309-6163</phone>
    <fax>+61-8-9402-8322</fax>
    <mobile>+61-411-472-375</mobile>
    <mobile>0411-472-375</mobile>
  </contact_details>
</PERSON>
Figure 1: An example of an XML
document with metadata tags (surrounded by < … >)
identifying the meaning of the data that follows
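To show how a program can use these tags to find specific contact information, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The document is an abridged copy of Figure 1; the variable names are our own.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the PERSON document from Figure 1.
PERSON_XML = """<PERSON person_id="p1100" sex="M">
  <person_name>
    <given_name>Clive</given_name>
    <surname>Finkelstein</surname>
  </person_name>
  <contact_details>
    <email>cfink@ies.aust.com</email>
    <phone>+61-8-9402-8300</phone>
    <phone>(08) 9309-6163</phone>
    <fax>+61-8-9402-8322</fax>
  </contact_details>
</PERSON>"""

person = ET.fromstring(PERSON_XML)

# The tag names act as metadata: we ask for data by meaning, not by position.
email = person.findtext("contact_details/email")
phones = [phone.text for phone in person.findall("contact_details/phone")]

print(email)
print(phones)
```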
Although we have
not shown it in Figure 1, the DTD can specify that certain tags
must exist or are optional, and whether some tags can exist more
than once – such as
multiple <phone> and <mobile> tags above. XML is
introduced in more detail later in this paper.
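The kind of rule a DTD expresses can be approximated in Python as follows. This is a sketch under stated assumptions: the DTD fragment in the comment is our own illustration (the paper does not show the actual DTD), and check_occurrences is a hypothetical helper that checks only how often each tag appears, not the order of tags, which a real validating parser would also check.

```python
import xml.etree.ElementTree as ET

# A DTD content model for this rule might read (illustrative assumption):
#   <!ELEMENT contact_details (email?, phone+, fax?, mobile*)>
# "?" means optional, "+" one or more, "*" zero or more,
# and no indicator means exactly once.
OCCURRENCE_RULES = {"email": "?", "phone": "+", "fax": "?", "mobile": "*"}

# Allowed (minimum, maximum) counts per indicator; None means no upper limit.
LIMITS = {"": (1, 1), "?": (0, 1), "+": (1, None), "*": (0, None)}


def check_occurrences(parent, rules):
    """Return a list of occurrence-rule violations (empty if there are none)."""
    problems = []
    for tag, indicator in rules.items():
        minimum, maximum = LIMITS[indicator]
        count = len(parent.findall(tag))
        if count < minimum:
            problems.append(f"<{tag}> must appear at least {minimum} time(s), found {count}")
        elif maximum is not None and count > maximum:
            problems.append(f"<{tag}> may appear at most {maximum} time(s), found {count}")
    return problems


valid = ET.fromstring(
    "<contact_details><email>a@example.com</email>"
    "<phone>1</phone><phone>2</phone></contact_details>"
)
invalid = ET.fromstring(
    "<contact_details><email>a@example.com</email>"
    "<email>b@example.com</email></contact_details>"
)

valid_problems = check_occurrences(valid, OCCURRENCE_RULES)
invalid_problems = check_occurrences(invalid, OCCURRENCE_RULES)
print(valid_problems)
print(invalid_problems)
```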
Metadata that is
used by various industries, communities or bodies can be used with
XML to define markup vocabularies. The World Wide Web Consortium
(W3C) has developed a standard framework that can be used to
define these vocabularies. This is called the Resource
Description Framework (RDF). It is a model for metadata
applications that support XML. RDF was initiated by the W3C to
build standards for XML applications so that they can
inter-operate and intercommunicate more easily, avoiding the
communication problems that we discussed earlier.
With XML, many
applications that were difficult to implement before – often due
to metadata differences – now become possible. For example, an
organization can define the unique metadata used by each
supplier’s legacy inventory systems. This enables the
organization to place orders via the Internet directly with those
suppliers' systems, for automatic fulfillment of product orders.
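A minimal sketch of this idea follows: once the supplier's metadata is known, an order expressed in our own vocabulary can be translated into the supplier's vocabulary. All tag names and the mapping table here are hypothetical, invented purely for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from our tag names to one supplier's tag names.
SUPPLIER_TAGS = {
    "order": "PurchaseRequest",
    "product_code": "ItemNo",
    "quantity": "Qty",
}


def translate(element, tag_map):
    """Return a copy of the element tree with tags renamed via tag_map."""
    renamed = ET.Element(tag_map.get(element.tag, element.tag), element.attrib)
    renamed.text = element.text
    renamed.tail = element.tail
    for child in element:
        renamed.append(translate(child, tag_map))
    return renamed


our_order = ET.fromstring(
    "<order><product_code>A-42</product_code><quantity>10</quantity></order>"
)
supplier_order = translate(our_order, SUPPLIER_TAGS)
supplier_xml = ET.tostring(supplier_order, encoding="unicode")
print(supplier_xml)
```

The data values are untouched; only the names, that is the metadata, change between the two systems.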
XML is an enabling
technology for integrating structured and unstructured data in
next-generation E-Commerce and EDI applications. Web sites will evolve
to use XML, with far greater power and flexibility than offered by
HTML. Netscape Communicator 5.0 and Microsoft Internet Explorer
5.0 browsers both support XML. Most productivity tools and office
suites (such as Microsoft Office 2000) support XML. Business
Intelligence and Knowledge Management tools will support XML. XML
development tools are also being released so that XML applications
can be developed more easily.
The acceptance
of XML is progressing rapidly, as it offers a very simple – yet
extremely powerful – way to intercommunicate between different
databases and systems, both within and outside an organization.
How well an organization accesses and uses its knowledge resources
can determine its competitive advantage and future prosperity. Use
and application of knowledge will become even more important in
the competitive Armageddon of the Internet, in which we will all
participate.
2. Transformations of the 1990s
There have been
three major transformations, or shifts, that have occurred in the
Computer Industry throughout the 1990s. Their impact extends far
beyond that industry. They are also transforming business and
society. They are moving us rapidly from the Industrial Age to the
Information Age.
2.1 The First Shift: The Internet
The First
Shift has already occurred: the impact that the World Wide Web
is having on business today. With the introduction of web browsers
in the early 90s, the Internet – already 20 years old at that
time – moved into the mainstream as organizations rushed to
establish their own web sites.
First generation
web sites – using Hypertext Markup Language (HTML) – were used
as billboards to the world. They provided static advertising and
marketing information for the benefit of customers and suppliers.
They implemented online information that was also available in
print advertisements, or as documentation in book or manual
formats. While effective with those static media, when transferred
to a web site they offered no benefit – only glitzy eye candy.
These static web sites also suffered from another disadvantage.
While they were easy to visit, they were also easy to leave with
the click of a mouse – when potential customers could not find
what they needed.
Second-generation
web sites added interactivity and more content to provide further
assistance. But alone, animated images or sounds and movie clips
do not provide real benefit to visitors. They are still
essentially "static" in their ability to bring real,
bottom-line benefit to the business. They need to be integrated
into the main purpose of the web site – as demonstration aids,
sales aids or information aids for example. When they provide this
purpose-focused capability, they move their web sites to the third
generation.
Electronic
Commerce sites that are extensively being established today are
part of these third generation web sites. They have the potential
to generate major revenue and profit for the business. But many of
these electronic storefronts are like the Lemonade stands of our
childhood – the first tentative ventures into a New World of
business. More is needed before the full potential of Electronic
Commerce can be realized.
2.2 The Second Shift: Java
The mid 90s saw
the start of the Second
Shift: the emergence of Java as a programming language able to
be executed anywhere regardless of hardware platform or operating
system. Java was first developed by a team led by James Gosling
at Sun Microsystems in 1991. It was planned as a portable language
that could be executed from embedded devices such as TV set-top
boxes. But its potential to become a major programming language
that could transcend the hardware platform and operating system
dependencies of other languages was also recognized. This saw the
introduction by Sun in early 1995 of Java as a portable
programming language. It was seen as the “Holy Grail of
Computing”: a hardware and operating-systems-independent
language.
Java presented a
potential threat to Microsoft, as it could offer an alternative
operating environment to Windows and threaten its desktop
monopoly. Microsoft therefore embraced Java, but it added
extensions to use Windows-specific capabilities – so limiting
the portability of the language. This was the subject of a suit
brought by Sun against Microsoft in 1997, decided against
Microsoft in late 1998. The legal judgement required that
Microsoft remove its Windows-specific Java extensions within 90
days of the ruling.
Java today is
being adopted widely as a major object-oriented language across
the industry. Java virtual machines are now available for all
major operating system and hardware environments. Java compilers
are also available for most operating systems: desktop, server and
mainframe. The shift to Java is gathering steam, but it will be
many years before its full promise of "write once, run
anywhere" can be fully realized.
2.3 The Third Shift: Extensible Markup Language (XML)
The Third Shift
is the emergence of the Extensible Markup Language (XML) in the
late 90s. This shift is just starting. It promises to be as
significant as the first two. It has the ability to bring real,
bottom-line benefits to business – in cost-reduction, in greater
efficiency, in greater competition and in greater revenue.
XML is one of
the most significant developments of the Computer industry since
the World Wide Web and Java moved to their present positions of
importance. For the next 2 - 5 years this will be one of the most
important aspects of the Internet, and of systems development in
general. It has the potential to move metadata and data
administration also into the mainstream of systems development.
XML will present major business opportunities, when used with the
Internet, as a delivery channel for information from Data
Warehouses and Enterprise Portals.
XML will be the
successor to HTML for the Internet, Intranets, and for secure
Extranets between customers, suppliers and business partners. XML
incorporates metadata in any document, to define the content and
structure of that document and any associated (or linked)
resources. It has the potential to transform the integration of
structured data (such as in legacy files or relational databases)
with unstructured data (such as in text documents, reports, email,
graphics and images, audio and video resources, and web pages).
XML will be a significant technology for the deployment of Data
Warehouses and Enterprise Portals.
XML uses the
Extensible Style Language (XSL) and the Extensible Linking
Language (XLL) to achieve this integration. XML, XSL and XLL allow
the easy integration of dissimilar systems for multiple worldwide
customers and suppliers in any industry. It permits the ready
integration of those systems, regardless of whether they are
legacy systems and databases, Electronic Data Interchange (EDI)
systems or Electronic Commerce. It represents the future direction
of metadata and the important role that data administration will
take in systems development in the years ahead.
There are steps
that you can take now to prepare for the coming shift to
XML.
2.4 Preparing for an XML World
XML assumes that
your metadata has already been defined. This is necessary not only
for the new systems that you want to develop, but also for the
legacy systems and databases that you need to integrate with those
new systems. XML will enable this integration to be carried
out dynamically.
Data modeling
and strategic modeling methods help you to define the metadata
required by XML. These are Forward Engineering methods. They will
also enable you to eliminate redundant data versions and redundant
processes, to develop integrated databases for the Internet and
Intranets. This is not just the responsibility of data
administrators. It requires business knowledge also, gained by the
active involvement of business experts.
A knowledge of
the metadata types, metadata activities and metadata capture
techniques using Reverse Engineering methods will also help you to
extract the metadata from existing legacy systems and databases,
or from relational or object databases. XML will enable you to
combine reverse-engineered metadata with forward-engineered
metadata, for the seamless integration of structured and
unstructured data that characterizes truly effective Enterprise
Portals.
Interest in XML,
metadata and data administration will grow strongly. The XML
specifications are now essentially complete [XML], while the XSL
and XLL specifications were still evolving at the time of writing.
These specifications are defined by the World Wide Web Consortium
and are all available from their web site [W3C].
Some browser
support for XML was first included in Microsoft Internet Explorer
4.0. The Channel Definition Format (CDF) capability of Internet
Explorer 4.0 was based on the use of XML. More complete support
for XML is provided in Microsoft Internet Explorer 5.0 and
Netscape Communicator 5.0. We will also see wide XML support added
to DBMS products, to CASE tools, to Data Warehouse tools and also
to Client / Server development tools. We will see a new generation
of Knowledge Management tools evolve rapidly to take advantage of
the structured/unstructured data integration opportunities offered
by XML.
Several books
provide good treatment of XML. An initial introduction to XML (and
also Cascading Style Sheets) is provided by “XML:
A Primer” [St Laurent 1998]. XML used for web site
development, with HTML, XSL and XLL, is addressed in “XML:
Extensible Markup Language” [Harold 1998]. “XML Complete” [Holzner 1998] covers the use of XML with Java.
These can be used as detailed references for XML. “Web Farming for the Data Warehouse” [Hackathorn 1998] uses the
Internet, Intranets and XML for access to external data sources
for warehouse deployment.
We will now
examine XML concepts. In a short paper, of necessity this can only
be an overview, and it ignores any treatment of XSL and XLL. They
are all covered in greater detail in “Building Corporate
Portals with XML” [Finkelstein 1999]. More detail is also
available from the references above. We will start with the
initial purpose of XML, which was to provide a more effective
capability for defining document content than that offered by
HTML.
2.5 Some Problems using HTML
Tim Berners-Lee
at CERN, the originator of the World Wide Web (WWW) in 1990,
developed Hypertext Markup Language (HTML) as a subset of the
Standard Generalized Markup Language (SGML). A standard for the
semantic tagging of documents, SGML evolved out of work done by
IBM in the 1970s. It is used in Defense and other industries that
deal with large amounts of structured text. SGML is powerful, but
it is also very complex and expensive.
HTML was defined
as a subset of SGML – specifically intended as an open
architecture language for the definition of WWW text files
transmitted using Hypertext Transfer Protocol (HTTP) across the
Internet. HTML defines the layout of a web page to a web browser
running as an open architecture client. Microsoft Internet
Explorer and Netscape Communicator share over 90% of the web
browser market; both are now available free.
An HTML page
contains text as the content of a web page, as well as tags that
define headings, images, links, lists, tables and forms to display
on that page. These HTML tags also contain attributes that define
further details associated with a tag. An example of such
attributes is the location of an image to be displayed on the
page, its width, height and border characteristics, and alternate
text to be displayed while the image is being transmitted to the
web browser.
Because of this
focus on layout, HTML is recognized as having some significant
problems:
1. No effective way to identify content of page: HTML tags
describe the layout of the page. Web browsers use the tags for
presentation purposes, but the actual text content has no specific
meaning associated with it. To a browser, text is only a series of
words to be presented on a web page for display purposes.
2. Problems locating content with search engines: Because of a lack of
meaning associated with the text in a web page, there is no
automatic way that search engines can determine meaning – except
by indexing relevant words, or by relying on manual definition of
keywords.
3. Problems accessing databases: Web pages are static. But when a web
form provides access to online databases, that data needs to be
displayed dynamically on the web page. Called “Dynamic HTML” (DHTML),
this capability enables dynamic content from a database to be
incorporated “on-the-fly” into an appropriate area on the web
page.
4. Complexity of dynamic programming: DHTML requires complex programming
to incorporate dynamic content into a web page. This may be
written as CGI, Perl, ActiveX, JavaScript or Java logic, executed
in the client, the web server, the database server, or all three.
5. Problems interfacing with back-end systems: This is a common
problem that has been with us since the beginning of the
Information Age. Systems written in one programming language for a
specific hardware platform, operating system and DBMS may not be
able to be migrated to a different environment without significant
change or a complete rewrite. Even though it is an open
architecture, HTML also is affected by our inability to move these
legacy systems to new environments.
Recognizing
these limitations of HTML, the W3C SGML working group (now called
the XML working group) was established in mid 1996. The purpose of
this group was to define a way to provide the power of SGML, while
also retaining the simplicity of HTML. The XML specifications were
born out of this activity [XML].
XML retains much
of the power and extensibility of SGML, while also being simple to
use and inexpensive to implement. It allows tags to be defined for
special purposes, with metadata definitions embedded internally in
a web document – or stored separately as a Document Type
Definition (DTD) script. A DTD is analogous to the Data Definition
Language script (DDL) used to define a database, but it has a
different syntax.
As we discussed
earlier, data modeling and metadata are key enablers in the use
and application of XML. The Internet and Intranets allow us to
communicate easily with other computers. Java allows us to write
program logic once, to be executed in many different environments.
But these technologies are useless if we cannot easily communicate
with and use existing legacy systems and databases.
We discussed
earlier that we can now make a phone call, instantly, anywhere in
the world. The telephone networks of every country are
interconnected. When we dial a phone number, a telephone assigned
to that number will ring in Russia, or China, or Outer Mongolia,
or elsewhere. It will be answered, but we may not understand the
language used by the person at the other end.
So it is also
with legacy systems. We need more than the simple communication
between computers afforded by the Internet. True, we could rewrite
the computer programs at each end in Java, C, C++, or some other
common language. But that alone would not enable effective and
automatic communication between those programs. Each program must
know the metadata used by the other program and its databases so
that they can communicate with each other.
Considerable
work has been carried out to address this problem. Much effort has
gone into definition and implementation of Electronic Data
Interchange (EDI) standards. EDI has now been widely used for
business-to-business commerce for many years. It works well, but
it is complex and expensive. As a result, it is cost-justifiable
generally only for larger corporations.
XML now also
provides this capability. It allows the metadata used by each
program and database to be published as the language to be used
for this intercommunication. But distinct from EDI, XML is simple
to use and inexpensive to implement. XML will become a major part
of the application development mainstream. It provides a bridge
between structured databases and unstructured text, delivered via
XML then converted to HTML during a transition period for display
in web browsers. Web sites will evolve over time to use XML, XSL
and XLL natively to provide the capability and functionality
presently offered by HTML, but with greater power and flexibility.
XML components are listed in Table 1.
Table 1: Components of XML
| Acronym | Name | Description |
|---|---|---|
| XML | Extensible Markup Language | Defines document content using metadata tags and namespaces |
| DTD | Document Type Definition | Defines XML document structure (analogous to DDL schema) |
| XSL | Extensible Style Language | XSL or Cascading Style Sheets (CSS) separate layout from data |
| XLL | Extensible Linking Language | XLL implements multi-directional links (single or multiple) |
| DOM | Document Object Model | Implements a standard API for processing XML in any language |
| RDF | Resource Description Framework | W3 Interoperability Project for data content interchange |
The rest of this
paper provides an introduction to XML and DTDs, with only brief
reference to XSL, XLL, DOM and RDF. Further information in each of
these areas can be obtained from the book and web site references
provided at the end of the paper.
3. A Simple XML Example
We will start
our introduction to XML with a customer example in Figure 2. This
illustrates some basic XML concepts. It shows customer data (in
italics), such as entered from an online web form or accessed from
a customer database. It shows the inclusion of metadata “tags”
(surrounded by < and >) – such as <customer_name>.
The tag: <customer_name>
is a start tag; the text following it is the actual content of the
customer name: XYZ
Corporation. It is terminated by an end tag: the same
tag-name, but now preceded by “/” – such as </customer_name>.
Other fields define <customer_address>,
<street>, <city>, <state> and <postcode>. Each of these tags is also terminated by an end
tag, such as </street>,
</city>, </state> and </postcode>.
The example concludes with </customer_address>
and </CUSTOMER>
end tags.
<CUSTOMER>
  <customer_name>XYZ Corporation</customer_name>
  <customer_address>
    <street>123 First Street</street>
    <city>Any Town</city>
    <state>WA</state>
    <postcode>12345</postcode>
  </customer_address>
</CUSTOMER>
Figure 2: A Simple XML Example
From this simple
example of XML metadata, we can see how the meaning of the text
between start and end tags is clearly defined. We can also see
that search engines can use these definitions for more accuracy in
identifying information to satisfy a specific query.
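The difference between a plain keyword search and a search that uses the tags can be sketched as follows. The first customer is the one from Figure 2; the wrapping <CUSTOMERS> element and the second customer are assumptions added so that the two kinds of search give different answers.

```python
import xml.etree.ElementTree as ET

CUSTOMERS_XML = """<CUSTOMERS>
  <CUSTOMER>
    <customer_name>XYZ Corporation</customer_name>
    <customer_address>
      <street>123 First Street</street>
      <city>Any Town</city>
      <state>WA</state>
      <postcode>12345</postcode>
    </customer_address>
  </CUSTOMER>
  <CUSTOMER>
    <customer_name>WA Plumbing</customer_name>
    <customer_address>
      <street>9 Second Street</street>
      <city>Other Town</city>
      <state>NSW</state>
      <postcode>54321</postcode>
    </customer_address>
  </CUSTOMER>
</CUSTOMERS>"""

root = ET.fromstring(CUSTOMERS_XML)


def keyword_search(word):
    """Match the word anywhere in a customer's text, as an HTML index might."""
    return [
        customer.findtext("customer_name")
        for customer in root.findall("CUSTOMER")
        if word in "".join(customer.itertext())
    ]


def customers_in_state(state):
    """Match only where the <state> tag says the value is a state."""
    return [
        customer.findtext("customer_name")
        for customer in root.findall("CUSTOMER")
        if customer.findtext("customer_address/state") == state
    ]


print(keyword_search("WA"))
print(customers_in_state("WA"))
```

The keyword search also returns the company whose name merely contains "WA", while the tag-aware search returns only the customer located in that state.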
Even more
effective applications become possible. For example, an
organization can define the unique metadata used by its suppliers'
legacy inventory systems. This will enable that organization to
place orders via the Internet directly with those suppliers'
systems, for automatic fulfillment of product orders. XML is
an enabling technology for integrating unstructured text and structured
databases in next-generation E-Commerce and EDI applications.
The following
pages now examine the XML syntax in more detail.
3.1 XML Naming Conventions
An XML document
must be “well formed”. To be well formed, a document must obey
the following rules:
- A tag name must start with a letter or underscore, with no spaces. Thus “person_id”
is correct, but not “person id” or “1st name”.
- XML names are case sensitive. For example, “PERSON”,
“Person” and “person” are all different names.
- Each tag must have surrounding < and > indicators, as
in the start tag <tag_name>.
- Each start tag must also have an end tag, as in </tag_name>.
- If a tag is empty, it must still have an end tag or be written as an
empty tag, such as <CUSTOMER></CUSTOMER> or <country/> (i.e. empty).
- Attribute values are preceded by an = sign and are surrounded by
double or single quotes, such as version="1.0" standalone="yes".
- The characters <, >, &, " and ' have special meaning in XML markup
and are replaced by their “escaped” versions when they appear as data.
Thus the character string &lt; represents “<” at all times until it is to be
displayed. Similarly &gt; is “>”, &amp; is “&”, &quot; is " and &apos; is '.
These character sequences are called “predefined entity
references”.
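These rules can be tried out with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree parser, which rejects text that is not well formed, and the escape and unescape helpers from xml.sax.saxutils. The function name is_well_formed and the sample strings are our own.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


checks = {
    "good document": is_well_formed('<PERSON person_id="p1100"><country/></PERSON>'),
    "missing end tag": is_well_formed("<PERSON><country></PERSON>"),
    "case mismatch": is_well_formed("<Person></person>"),
    "name starts with digit": is_well_formed("<1st_name></1st_name>"),
    "unquoted attribute": is_well_formed("<PERSON sex=M></PERSON>"),
}
for rule, result in checks.items():
    print(f"{rule}: {result}")

# Escaping replaces &, < and > with predefined entity references.
raw_text = "Smith & Sons <head office>"
escaped_text = escape(raw_text)
print(escaped_text)
```

Only the first document is accepted; each of the others breaks one of the rules listed above.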
A well-formed
XML document example follows in Figure 3, similar to the earlier
XML example in Figure 1.
<PERSON person_id="p1100" sex="M">            (Attributes in Element)
  <person_name>
    <given_name>Clive</given_name>            (Children of
    <surname>Finkelstein</surname>             “person_name” Element)
  </person_name>
  <email>cfink@ies.aust.com</email>
  <company>
    Information Engineering Services Pty Ltd
  </company>
  <country>Australia</country>
  <phone>+61-8-9402-8300</phone>
  <fax>+61-8-9402-8322</fax>
</PERSON>
Figure
3: Example of a Well-Formed
XML Document
Notice that
double quote characters in Figure 3 surround the attribute values
of PERSON, declared on
the first line with the values: person_id=“p1100” sex=“M”.
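To see how an application might read such a document, the sketch below parses the PERSON example with Python's standard xml.etree.ElementTree module and picks out the attributes and child elements. The variable names are assumptions for illustration; only the XML content comes from Figure 3.

```python
import xml.etree.ElementTree as ET

FIGURE_3 = """<PERSON person_id="p1100" sex="M">
  <person_name>
    <given_name>Clive</given_name>
    <surname>Finkelstein</surname>
  </person_name>
  <email>cfink@ies.aust.com</email>
  <company>
    Information Engineering Services Pty Ltd
  </company>
  <country>Australia</country>
  <phone>+61-8-9402-8300</phone>
  <fax>+61-8-9402-8322</fax>
</PERSON>"""

person = ET.fromstring(FIGURE_3)

# Attributes of the PERSON start tag, as a plain dictionary.
attributes = dict(person.attrib)

# Direct children of PERSON, in document order.
child_tags = [child.tag for child in person]

# Children of person_name are reached with a simple path.
given_name = person.findtext("person_name/given_name")
surname = person.findtext("person_name/surname")

print(attributes, child_tags, given_name, surname)
```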
3.2 The XML Document Prolog
Every XML document should start with an XML declaration as part of
its prolog. When present, this declaration must be the first
statement on the first line of the document. It uses the syntax of a
processing instruction (surrounded by <? … ?> tags) such as:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
The <?xml specifies that the document uses XML syntax. An XML parser
or application can analyze the content of the document prior to it
being processed. The version, encoding and standalone specifications
must appear in that order. Names that begin with “XML”, “xml” or any
upper and lower case combination of this sequence of letters are
reserved and cannot be used as tag names.
The version
number is specified for compatibility with future XML versions.
The standalone specification indicates whether a Document Type
Definition is included in-line (“standalone=yes”) or
out-of-line in an external file (“standalone=no”). We will
discuss this shortly in relation to DOCTYPE
Declarations.
The
“encoding” statement specifies the language-encoding format
used by the XML document. XML has been defined so it can be used
with any language, such as English and European languages, as well
as double byte Asian languages – Japanese, Chinese or Korean.
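The effect of the declaration can be observed with a small Python sketch based on the standard xml.dom.minidom module, which records the version, encoding and standalone values on the parsed document. The sample declaration (with an encoding of UTF-8) and the helper name parses are assumptions chosen for this example.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Bytes input, so that the parser reads the declared encoding itself.
source = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><CUSTOMER/>'
document = minidom.parseString(source)

# The three values given in the XML declaration.
declaration = (document.version, document.encoding, document.standalone)


def parses(data):
    """Return True if minidom accepts the data without a parse error."""
    try:
        minidom.parseString(data)
        return True
    except ExpatError:
        return False


# The declaration is rejected when it is not at the very start.
late_declaration = b'\n<?xml version="1.0"?><CUSTOMER/>'

print(declaration, parses(source), parses(late_declaration))
```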
3.3 DOCTYPE Declarations
A Document Type declaration (“DOCTYPE”) immediately follows the
<?xml … ?> statement. Every XML document contains a root element,
which encloses all other XML elements. The DOCTYPE statement
identifies the specific root name used by the document. It also
identifies the location of the Document Type Definition (DTD) file
that is to be used with the document.
A DOCTYPE declaration has the following formats, with examples:
<!DOCTYPE root_element_name [ … ]>
OR
<!DOCTYPE root_element_name SYSTEM "DTD_URL">
OR
<!DOCTYPE root_element_name PUBLIC "public_identifier" "DTD_URL">
1. <!DOCTYPE CUSTOMER [ … ]>
2. <!DOCTYPE CUSTOMER SYSTEM "customer.dtd">
3. <!DOCTYPE supplier PUBLIC "public_identifier"
   "http://www.ind-xml.com/supplier.dtd">
The first example specifies that the DTD is declared internally in
the same document. We will see an example of this format shortly.
The second example declares that an external DTD is used as a
private file (“SYSTEM”). It is the DTD file that is located at the
relative Uniform Resource Locator (URL) “customer.dtd” within the
same web site directory.
The third example specifies that the DTD is PUBLIC. The keyword is
followed by a public identifier (shown here only as the placeholder
“public_identifier”) and then by the location of the DTD file, at
the absolute URL “http://www.ind-xml.com/supplier.dtd”.
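A parser exposes the DOCTYPE information to an application. The sketch below, which assumes Python's standard xml.dom.minidom module, reads the root name and the DTD location from the second example. With its default settings this parser is not expected to fetch the external file or validate against it; it only records the identifiers.

```python
from xml.dom import minidom

# Example 2 from the text, followed by a minimal root element.
source = b'<!DOCTYPE CUSTOMER SYSTEM "customer.dtd"><CUSTOMER/>'
document = minidom.parseString(source)

doctype = document.doctype
root_name = doctype.name        # root element name given in the DOCTYPE
dtd_location = doctype.systemId  # the relative URL after SYSTEM

# The declared root name should agree with the actual root element.
matches_root = root_name == document.documentElement.tagName

print(root_name, dtd_location, matches_root)
```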
3.4 URL and URI
These DOCTYPE
examples use relative or absolute URLs to identify the location of
an external DTD file. But files and other resources can be moved
to different URL locations. With HTML web pages, every link that
refers to a moved resource must be updated to refer to its new
URL. HTML links can be from web sites anywhere in the world. These
can all refer to the same URL. Relocating a resource to a
different URL can therefore require considerable maintenance work.
To overcome this problem, in time XML and XLL will enable resources
to be located instead by a Uniform Resource Identifier (URI).
Distinct from a URL, a URI is intended never to change. XLL, with
XLinks and XPointers, defines a URI; the URI always points to that
resource.
3.5 XML Comments
Comments can be
used in an XML document to describe the purpose, intent and use of
different statements. Comments can also document and separate
logical sections of a document.
Comments in XML
are defined similarly to HTML comments, surrounded by <!-- …
--> tags. For example:
<!-- This is a comment and is not processed -->
Comments can contain any data except the literal “--”, but may not
be placed inside an XML tag. In the next two examples, the first
comment is wrong; the second comment is correct:
<customer_name <!-- Defines customer name --> >    (incorrect)
<!-- Defines customer name --> <customer_name>    (correct)
However, comments can be used to surround and hide tags, such as:
<!-- The following tag is used only for retail customers
<retail-code>2</retail-code>
and is ignored for wholesale customers -->
An
XML parser or XML application cannot process the
<retail-code> tag until the surrounding comment is removed,
or until the tag is moved outside the comment. While it remains
within the comment, for XML processing purposes the tag does not
exist.
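The hiding effect of a comment can be checked with a short Python sketch using the standard xml.etree.ElementTree parser, which discards comments by default. The surrounding CUSTOMER document is an assumption built around the retail-code example.

```python
import xml.etree.ElementTree as ET

source = """<CUSTOMER>
  <customer_name>XYZ Corporation</customer_name>
  <!-- The following tag is used only for retail customers
  <retail-code>2</retail-code>
  and is ignored for wholesale customers -->
</CUSTOMER>"""

customer = ET.fromstring(source)

# Only elements outside the comment are visible to the application.
visible_tags = [child.tag for child in customer]

# Searching for the hidden tag finds nothing.
hidden = customer.find("retail-code")

print(visible_tags, hidden)
```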
3.6 Processing Instructions
Processing instructions (PIs) declare applications that will be used
to process part (or all) of an XML document. Like comments, they are
not part of the character data of the XML document. An XML processor
must pass a PI unchanged to the relevant XML application. A PI has
the format:
<?PI_target_name PI_data?>
The PI_target_name identifies the application. PI_data following the
PI_target_name is optional; it is specified by and used by the PI
application. We saw an example of this syntax in XML Document
Prolog. A document to be processed by an XML parser or processor was
declared by the statement:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
XML
applications should process only the targets they recognize. PI
names that begin with “XML”, in any combination of upper or
lower case, are reserved for use in XML standards. PIs are used
for document-specific application processing.
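A minimal Python sketch of this behavior follows, assuming the standard xml.dom.minidom module. The PI targets report_formatter and other_tool are hypothetical names invented for this example; the application acts only on the target it recognizes.

```python
from xml.dom import Node, minidom

# Two PIs before the root element; both targets are hypothetical.
source = b'<?report_formatter columns="2"?><?other_tool verbose?><CUSTOMER/>'
document = minidom.parseString(source)

# The parser passes each PI through unchanged as (target, data).
instructions = [
    (node.target, node.data)
    for node in document.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]

# This application processes only the targets it recognizes.
KNOWN_TARGETS = {"report_formatter"}
recognized = [(target, data) for target, data in instructions if target in KNOWN_TARGETS]

print(instructions)
print(recognized)
```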
4. XML Elements, Attributes and Entities
XML defines
metadata tags using elements, attributes and entities. In the
following sections we will learn how XML uses these to declare
metadata tags.
4.1 Declaring XML Elements
The tags that we
have seen are all examples of XML “elements”. An element is a
named metadata tag that is declared in a DOCTYPE statement. As we
have seen, a DOCTYPE can be defined externally in a DTD file,
located using a relative or absolute URL. Alternatively, a DOCTYPE
can be defined internally. It is included in-line, immediately
following the XML processing declaration. Figure 4 shows an
internal declaration of the DOCTYPE statement that is used with
the customer example in Figure 2:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE CUSTOMER
[
<!ELEMENT CUSTOMER ANY>
<!ELEMENT customer_name (#PCDATA)>
<!ELEMENT customer_address EMPTY>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT postcode (#PCDATA)>
]>
Figure 4: DOCTYPE Declaration for the Customer Example in Figure 2
The DOCTYPE
statement in Figure 4 defines the XML root name as
<CUSTOMER>. Square left and right brackets ([ … ]) follow,
surrounding all element declarations. The DOCTYPE root name
<CUSTOMER> is declared as an ELEMENT, specified to be ANY
(case-sensitive). This indicates that any element, as well as
parsed character data (shown as italics in Figure 2) can appear in
a <CUSTOMER> element.
Each element
must be uniquely named within an XML document. As a DOCTYPE
declaration can be defined using more than one DTD, the concept of
namespaces has been included in XML. This allows an alias to be
assigned to a DTD. The namespace alias can then be used to qualify
named elements that would otherwise violate this rule, so ensuring
uniqueness.
The declaration of an XML Element in Figure 4 has the format:
<!ELEMENT element-name content_type>
As with all XML tags, the element-name starts with a letter or
underscore, can have no spaces and is case sensitive. Thus Customer,
customer and CUSTOMER are different XML element-names. To avoid
confusion, it is recommended that an element-name be declared with
one capitalization that is then used consistently wherever that
element is referenced. Once declared, that element-name is used as
the tag-name; the terms element-name and tag-name are therefore
synonymous.
The content_type
of an Element can have values of ANY, EMPTY, (#PCDATA), or a
(Child List) as discussed next.
We discussed an
example of ANY in relation to the root name element
<CUSTOMER> in Figure 4. This indicates that any element, as
well as parsed character data, can appear within it.
By default, tags are non-empty and are followed by data (see italics
in Figure 2). An element is declared EMPTY if it normally has no
data. For example, the element <customer_address> in Figure 2
contains <street>, <city>, <state> and <postcode> elements within
it. These are called child elements. As the parent element, the data
for <customer_address> is provided by its child elements. The
<customer_address> element is therefore declared in Figure 4 to be
EMPTY. Note that a validating parser treats EMPTY strictly: an
element declared EMPTY may contain neither data nor child elements,
so a parent of child elements needs a content model, as discussed
later in this section.
Figure 4 declares that the <customer_name> element is (#PCDATA).
This specifies that the element contains “Parsed Character Data”.
In Figure 2, we now know that XYZ Corporation is character data that
is parsed by an XML parser, or processed by an XML application.
Similarly, the <street>, <city>, <state> and <postcode> elements in
Figure 4 are also declared to be (#PCDATA). Note that <postcode> in
Figure 2 contains the numeric characters: “12345”, not the numeric
value. For example, consider the following element declaration and
corresponding tag:
<!ELEMENT customer_balance (#PCDATA)>
… … …
<customer_balance>$15,500.00</customer_balance>
Before
the <customer_balance> data of $15,500.00 can be processed
by an XML application, it must first be converted from the numeric
characters “$15,500.00” to the numeric currency value of
$15,500.00.
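The conversion step might look like the following Python sketch. The helper parse_currency is a hypothetical name, and it assumes a dollar sign and comma thousands separators as in the example above; other currency formats would need different handling.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

element = ET.fromstring("<customer_balance>$15,500.00</customer_balance>")


def parse_currency(text):
    """Strip the currency symbol and thousands separators, then convert."""
    return Decimal(text.strip().replace("$", "").replace(",", ""))


# element.text is only a string of characters until it is converted.
raw_text = element.text
balance = parse_currency(raw_text)

# Arithmetic is possible only on the converted value.
new_balance = balance + Decimal("100.00")

print(repr(raw_text), balance, new_balance)
```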
When we consider
address data in an application, there are many variations. For
example, in addition to the street number and name, some customers
may have a floor or level number, and/or an apartment, suite or
flat number. We could define each of these as separate elements
within <CUSTOMER>. But
this data could be considered as part of the normal content for
the <street>
element. Some customers may need two or more <street> elements. For others, <postcode> may not be available and so could be omitted.
To this point there is nothing in the declaration of CUSTOMER to
control whether any or all of the declared elements must exist. We
provide extra control by specifying a content model.
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE CUSTOMER
[
<!ELEMENT CUSTOMER ANY>
<!ELEMENT customer_name (#PCDATA)>
<!ELEMENT customer_address (street, city, state, postcode)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT postcode (#PCDATA)>
]>
Figure 5: Constraints added to the CUSTOMER DOCTYPE Declaration
Figure 5
indicates that <customer_address> has a (child
list). This child list is a content model, which specifies
that <customer_address>
has child elements of (street,
city, state, postcode). This comma-delimited format indicates
that each customer address has only one <street>,
<city>, <state> and <postcode>
element. Each element must appear in the specified sequence.
When a child list is defined using commas, each element is mandatory
and must exist in that sequence. Alternatively, child elements can
be separated by “|” to indicate a choice, where only one of the
listed elements appears, such as:
<!ELEMENT customer_address (street | city | state | postcode)>
If we also add a <contacts> element, with child elements of <phone>,
<fax>, <mobile> (cell phone) and <email>, we find even more
variations. We therefore need to be able to specify the number of
occurrences of child elements that are valid.
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE CUSTOMER
[
<!ELEMENT CUSTOMER ANY>
<!ELEMENT customer_name (#PCDATA)>
<!ELEMENT customer_address (street+, city, state, postcode?)>
<!-- ? = zero or one; * = zero or more; + = one or more -->
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT postcode (#PCDATA)>
<!ELEMENT contacts (phone+, fax*, mobile?, email?)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT mobile (#PCDATA)>
<!ELEMENT email (#PCDATA)>
]>
Figure 6: Further constraints added to the CUSTOMER DOCTYPE
declaration for customer_address and contacts.
Figure 6 adds validity constraints to child elements by including a
suffix character attached to the child name. A suffix of “?”
specifies that zero or one occurrence of the child element may exist
within the parent element. A suffix of “*” specifies that zero or
more occurrences may exist, while a suffix of “+” specifies that one
or more occurrences of the relevant child element must exist within
the parent element. No suffix indicates that the element must exist
exactly once.
Examining Figure
6 further, we see that <customer_address>
must have at least one <street>
element, but it can have more street
occurrences (street+). There must be only one <city> and one <state>
element (no suffix). But the <postcode>
element is optional; there may be none, or one occurrence
(postcode?). The comma delimiters specify that the elements must
appear in the declared sequence.
We may want to show that a valid customer_address can have several
addresses within it. We can place these child elements all within
brackets, with the relevant group suffix character following the
right bracket. We also use surrounding brackets to group other
elements.
<!ELEMENT customer_address (street+ | (city, state) | postcode)+>
The above fragment indicates by outer brackets with a suffix “+”
that there must be one or more address groups. Within an address
group, there must be one or more street elements (street+) OR a city
element followed by a state element OR a postcode element. Because
the group can repeat, all of these elements can also exist in the
above example.
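The way a parser checks child elements against a content model can be approximated in a few lines of Python. The standard library does not validate against a DTD, so the sketch below is a hand-written stand-in that translates a child list into a regular expression. The function names model_to_regex and children_are_valid are hypothetical, and the approach handles only child lists like those shown above (no #PCDATA, ANY or EMPTY).

```python
import re


def model_to_regex(model):
    """Translate a DTD child-list content model into a compiled regex.

    Each element name becomes a group that matches that name followed by
    one space, so the ?, * and + suffixes apply to the whole name.
    """
    compact = re.sub(r"\s+", "", model)  # whitespace is not significant
    pattern = re.sub(
        r"[A-Za-z_][\w.-]*",
        lambda match: "(?:" + re.escape(match.group(0)) + " )",
        compact,
    )
    # A comma means "followed by", which is plain concatenation in a regex.
    return re.compile(pattern.replace(",", ""))


def children_are_valid(model, child_tags):
    """Check a parent's child tag names, in document order, against model."""
    sequence = "".join(tag + " " for tag in child_tags)
    return model_to_regex(model).fullmatch(sequence) is not None


FIGURE_6_ADDRESS = "(street+, city, state, postcode?)"
GROUPED_ADDRESS = "(street+ | (city, state) | postcode)+"

print(children_are_valid(FIGURE_6_ADDRESS, ["street", "street", "city", "state"]))
print(children_are_valid(FIGURE_6_ADDRESS, ["city", "state", "postcode"]))
print(children_are_valid(GROUPED_ADDRESS, ["street", "city", "state", "postcode"]))
```

The “|” and bracket characters already mean choice and grouping in a regular expression, which is why only the names and commas need translating.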
In Figure 6 we also saw new element declarations of <contacts>,
<phone>, <fax>, <mobile> and <email>. We see that <contacts> has a
content model with child elements of (phone+, fax*, mobile?,
email?).
Based on the
suffix attached to each of the <contacts>
child names, Figure 6 specifies that there must be at least one or
more <phone>
occurrences (phone+) and zero or more <fax>
occurrences (fax*). There can be zero or one <mobile>
occurrence (mobile?), and also zero or one <email> occurrence (email?).
We can use a content model that includes PCDATA. For example, we can
alternatively specify <phone> and <fax> by the fragment:
<!ELEMENT phone (#PCDATA | country-code | area-code | phone-number)*>
<!ELEMENT fax (#PCDATA | country-code | area-code | phone-number)*>
<!ELEMENT country-code (#PCDATA)>
<!ELEMENT area-code (#PCDATA)>
<!ELEMENT phone-number (#PCDATA)>
A <phone> or <fax> element can then contain parsed character data,
or <country-code>, <area-code> and <phone-number> child elements, or
a mixture of both; the suffix “*” after the outer brackets allows
zero or more of each, in any order. All content models that include
PCDATA must have this format: #PCDATA must come first, vertical bars
must separate the element names (nested element groups are not
allowed), and the entire outer group must be followed by “*”.
These
constraints enable an XML Parser or XML application to confirm the
validity of the document, by checking the number of child element
occurrences within each parent element. They validate these
occurrences against those specified by the child list constraints
for the parent element in an internal DOCTYPE declaration, or a
DOCTYPE in an external DTD.
4.2 Declaring XML Attributes
An element may contain one or more attributes to provide additional
details about that element. Figure 3 earlier included an example of
attributes for the PERSON element. This specified a PERSON
occurrence, with a unique identification attribute called person_id
and another attribute called sex, repeated now in Figure 7.
<PERSON person_id="p1100" sex="M">
Figure 7: The PERSON Element, with Attributes
This example
shows that attributes and their values are enclosed within the
< and > characters of the start tag for an element,
immediately following the element name.
Each attribute of an element is declared in DOCTYPE ATTLIST, using
the format in Figure 8. The ATTLIST format specifies the
element_name and then defines an attribute_name as a unique XML name
within all of the element’s attributes. It observes all of the rules
detailed earlier in XML Naming Conventions.
<!ATTLIST element_name attribute_name type "default value">
Where type = (CDATA | ID | IDREF | IDREFS | ENTITY | ENTITIES |
NMTOKEN | NMTOKENS | NOTATION)
Figure 8: The ATTLIST Format
The type
specification in Figure 8 is defined from an enumerated list of
valid values: (CDATA | ID | IDREF | IDREFS | ENTITY | ENTITIES |
NMTOKEN | NMTOKENS | NOTATION).
CDATA represents
“Character Data”, as a character data type that is non-markup
text. This is somewhat analogous to a Data Definition Language
(DDL) SQL data type of VARCHAR, as used by DBMS products.
Note that XML
does not support the other DD |