XML - The Future of Metadata
Clive Finkelstein
Extract from "Building Corporate Portals with XML"
by Clive Finkelstein and Peter Aiken, McGraw-Hill (Sep 1999) [ISBN: 0-07-913705-9]
Copyright © 1999, The McGraw-Hill Companies, Inc. All rights reserved.
This
paper is based on an extract from the book: “Building
Corporate Portals with XML”, by Clive Finkelstein and Peter
Aiken, published by McGraw-Hill in September 1999. The paper
addresses one of the most significant developments of the Computer
industry for the future. It shows how Metadata and Data
Administration will shortly move into the mainstream and become
one of the most important aspects of the WWW, and of systems
development in general. The paper introduces the Extensible Markup
Language (XML) – the successor to HTML for the Internet, for
corporate Intranets and for Extranets. XML incorporates Metadata
in any document, to define the content and structure of that
document and any associated (or linked) resources. It has the
potential to transform integration of structured data (such as in
relational databases or legacy files) with unstructured data (such
as in text documents, reports, email, graphics, images, audio and
video files) for innovative application integration opportunities.
Corporate
Portals (also called Enterprise Portals - EPs) are based on Data
Warehousing technologies, using Metadata
and the Extensible Markup
Language (XML) to integrate both structured and unstructured
data throughout an enterprise. Metadata, XML and EPs will be vital
elements of the 21st century enterprise.
Structured
data exists in databases and data
files that are used by current and older operational systems in an
enterprise. We call these older systems legacy
systems; we call the
data they use legacy data. In most enterprises, structured data comprises only 10%
of the data, information and knowledge resources of the business;
the other 90% exists as unstructured
data in textual documents, or as graphics and images, or in
audio or video formats. These unstructured data sources are not
easily accessible to Data Warehouses, but EPs use metadata and XML
to integrate both structured and unstructured data seamlessly, for
easy access throughout the enterprise.
1. What is Metadata and XML?
IT staff in most
enterprises have a common problem. How can they convince managers
to plan, budget and apply resources for metadata management? What
is metadata and why is it important? What technologies are
involved? Internet and Intranet technologies are part of the
answer and will get the immediate attention of management. XML is
the other technology. The following analogy may help you outline
to management the important role that metadata takes in an
enterprise.
1.1 What is Metadata?
Every country is
now interconnected in a vast, global telephone network. We are now
able to telephone anywhere in the world. We can phone a number,
and the telephone assigned to that number would ring in Russia, or
China, or in Outer Mongolia. But when it is answered, we may not
understand the person at the other end. They may speak a different
language. So we can be connected, but what is said has no meaning.
We cannot share information.
Today, we also
use a computer and the World Wide Web. We enter a web site address
into a browser on our desktop machine – a unique address in
words that is analogous to a telephone number. We can then be
connected immediately to a computer assigned to that address and
attached to the Internet anywhere in the world. That computer
sends a web page based on the address we have supplied, to be
displayed in our browser. This is typically in English, but may be
in another language. We are connected, but like the telephone
analogy – if it is in another language, what is said has no
meaning. We cannot share information.
Now consider the
reason why it is difficult for some of the systems used in an
organization to communicate with and share information with other
systems. Technically, the programs in each system are able to be
interconnected and so can communicate with other programs. But
they use different terms to refer to the same data that needs to
be shared. For example, an accounting system may use the term
“customer” to refer to a person or organization that buys
products or services. Another system may refer to the same person
or organization as a “client”. Sales may use the term
“prospect”. They all use different terminology – different
language – to refer to the same data and information. But if
they use the wrong language, again they cannot share information.
The problem is
even worse. Consider terminology used in different parts of the
business. Accountants use a “jargon” – a technical language
– which is difficult for non-accountants to understand. So also
the jargon used by engineers, or production people, or sales and
marketing people, or managers is difficult for others to
understand. They all speak a different “language”. What is
said has no meaning. They cannot easily share common information.
In fact in some enterprises it is a miracle that people manage to
communicate meaning at all!
Each
organization has its own internal language, its own jargon, which
has evolved over time so similar people can communicate meaning.
As we saw above, there can be more than one language or jargon
used in an organization. Metadata identifies an organization’s
own “language”. Where different terms refer to the same thing,
a common term is agreed for all to use. Then people can
communicate more clearly. And systems and programs can
intercommunicate with meaning. But without a clear definition and
without common use of an organization’s metadata, information
cannot be shared effectively throughout the enterprise.
Previously each
part of the business maintained its own version of “customer”,
or “client” or “prospect”. They defined processes – and
assigned staff – to add new customers, clients or prospects to
their own files and databases. When common details about
customers, clients or prospects changed, each redundant version of
that data also had to be changed. It requires staff to make these
changes. Yet these are all redundant processes making the same
changes to redundant data versions. This is enormously expensive
in time and people. It is also quite unnecessary.
The importance
of metadata can now be seen. Metadata
defines the common language used within an enterprise so that all
people, systems and programs can communicate precisely.
Confusion disappears. Common data is shared. And enormous cost
savings are made. For it means that redundant processes (used to
maintain redundant data versions up-to-date) are eliminated, as
the redundant data versions are integrated into a common data
version for all to share.
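The idea of an agreed common term can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three systems, the `COMMON_TERMS` dictionary and the `to_common` helper are hypothetical and do not come from the book.

```python
# Hypothetical metadata dictionary: the local term used by each system,
# mapped to the single common term agreed for the enterprise.
COMMON_TERMS = {
    "customer": "customer",  # term used by the accounting system
    "client": "customer",    # term used by another system
    "prospect": "customer",  # term used by sales
}


def to_common(record: dict) -> dict:
    """Return a copy of the record with its keys renamed to the common terms."""
    return {COMMON_TERMS.get(key, key): value for key, value in record.items()}


accounting_record = {"customer": "XYZ Corporation"}
sales_record = {"prospect": "XYZ Corporation"}

# Once both records use the common term they can be compared directly.
same_entity = to_common(accounting_record) == to_common(sales_record)
print(same_entity)
```

With the mapping in place, the two systems no longer need their own redundant definition of the same data; each only needs to know how its local term maps to the common one.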
1.2 What is XML?
Much effort has previously gone into the definition and implementation of
Electronic Data Interchange (EDI) standards to address the problem
of intercommunication between dissimilar systems and databases.
EDI has now been widely used for business-to-business commerce for
many years. It works well, but it is quite complex and very
expensive. As a result, it is cost-justifiable generally only for
large corporations.
Once an
organization’s metadata is defined and documented, all programs
can use it to communicate. EDI was the mechanism that was used
previously. But now this intercommunication has become much
easier.
Extensible
Markup Language (XML) is a new
Internet technology that has been developed to address this
problem. XML can be used to document the metadata used by one
system so that it can be integrated with the metadata used by
other systems. This is analogous to language dictionaries that are
used throughout the world, so that people from different countries
can communicate. Legacy files and other databases can now be
integrated more readily. Systems throughout the business can now
coordinate their activities more effectively as a direct result of
XML and management support for metadata.
XML now provides
the capability that was previously only available to large
organizations through the use of EDI. XML allows the metadata used
by each program and database to be published as the language to be
used for this intercommunication. But distinct from EDI, XML is
simple to use and inexpensive to implement for both small and
large organizations. Because of this simplicity, we like to think
of XML as:
“XML
is EDI for the Rest of Us”
XML
will become a major part of the application development
mainstream. It provides a bridge between structured and
unstructured data, delivered via XML then converted to HTML for
display in web browsers. Together with metadata, XML is a key
component in the design, development and deployment of Enterprise
Portals.
1.3 How Is Metadata Used with XML?
Metadata is used
to define the structure of an XML document or file. Metadata is
published in a Document Type
Definition (DTD) file for reference by other systems. A DTD
file defines the structure of an XML file or document. It is
analogous to the Data
Definition Language (DDL) file that is used to define the
structure of a database, but with a different syntax.
An example of an
XML document identifying data retrieved from a PERSON database is
illustrated in Figure 1. This includes metadata markup tags
(surrounded by < … >, such as <person_name>) that
provide various details about a person. From this, we can see that
it is easy to find specific contact information in <contact_details>,
such as <email>, <phone>, <fax> and
<mobile> (cell phone) numbers.
<PERSON person_id="p1100" sex="M">
  <person_name>
    <given_name>Clive</given_name>
    <surname>Finkelstein</surname>
  </person_name>
  <company>
    Information Engineering Services Pty Ltd
  </company>
  <country>Australia</country>
  <contact_details>
    <email>cfink@ies.aust.com</email>
    <phone>+61-8-9402-8300</phone>
    <phone>(08) 9309-6163</phone>
    <fax>+61-8-9402-8322</fax>
    <mobile>+61-411-472-375</mobile>
    <mobile>0411-472-375</mobile>
  </contact_details>
</PERSON>
Figure 1: An example of an XML document with metadata tags (surrounded by < … >) identifying the meaning of the data that follows
Although we have
not shown it in Figure 1, the DTD can specify that certain tags
must exist or are optional, and whether some tags can exist more
than once – such as
multiple <phone> and <mobile> tags above. XML is
introduced in more detail later in this paper.
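To show how a program can use these tags, the following is a minimal sketch that reads the document of Figure 1 with Python's standard `xml.etree.ElementTree` module. The choice of Python and the variable names are assumptions made for illustration; the paper itself does not prescribe a language or parser.

```python
import xml.etree.ElementTree as ET

# The document of Figure 1, with straight quotes around attribute values.
PERSON_XML = """
<PERSON person_id="p1100" sex="M">
  <person_name>
    <given_name>Clive</given_name>
    <surname>Finkelstein</surname>
  </person_name>
  <company>Information Engineering Services Pty Ltd</company>
  <country>Australia</country>
  <contact_details>
    <email>cfink@ies.aust.com</email>
    <phone>+61-8-9402-8300</phone>
    <phone>(08) 9309-6163</phone>
    <fax>+61-8-9402-8322</fax>
    <mobile>+61-411-472-375</mobile>
    <mobile>0411-472-375</mobile>
  </contact_details>
</PERSON>
"""

person = ET.fromstring(PERSON_XML)
contact = person.find("contact_details")

# A tag that occurs once is read directly; repeated tags come back as a list.
email = contact.findtext("email")
phones = [element.text for element in contact.findall("phone")]
mobiles = [element.text for element in contact.findall("mobile")]

print(email)
print(phones)
print(mobiles)
```

Because the tags name the data, the program asks for `email` or `phone` by meaning and does not depend on the position of the values in the file.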
Metadata that is
used by various industries, communities or bodies can be used with
XML to define markup vocabularies. The World Wide Web Consortium
(W3C) has developed a standard framework that can be used to
define these vocabularies. This is called the Resource
Description Framework (RDF). It is a model for metadata
applications that support XML. RDF was initiated by the W3C to
build standards for XML applications so that they can
inter-operate and intercommunicate more easily, avoiding the
communication problems that we discussed earlier.
With XML, many
applications that were difficult to implement before – often due
to metadata differences – now become possible. For example, an
organization can define the unique metadata used by each
supplier’s legacy inventory systems. This enables the
organization to place orders via the Internet directly with those
suppliers' systems, for automatic fulfillment of product orders.
XML is an enabling
technology to integrate structured and unstructured data for next
generation E-Commerce and EDI applications. Web sites will evolve
to use XML, with far greater power and flexibility than offered by
HTML. Netscape Communicator 5.0 and Microsoft Internet Explorer
5.0 browsers both support XML. Most productivity tools and office
suites (such as Microsoft Office 2000) support XML. Business
Intelligence and Knowledge Management tools will support XML. XML
development tools are also being released so that XML applications
can be developed more easily.
The acceptance
of XML is progressing rapidly, as it offers a very simple – yet
extremely powerful – way to intercommunicate between different
databases and systems, both within and outside an organization.
How well an organization accesses and uses its knowledge resources
can determine its competitive advantage and future prosperity. Use
and application of knowledge will become even more important in
the competitive Armageddon of the Internet, in which we will all
participate.
2. Transformations of the 1990s
There have been
three major transformations, or shifts, that have occurred in the
Computer Industry throughout the 1990s. Their impact extends far
beyond that industry. They are also transforming business and
society. They are moving us rapidly from the Industrial Age to the
Information Age.
2.1 The First Shift: The Internet
The First
Shift has already occurred: the impact that the World Wide Web
is having on business today. With the introduction of web browsers
in the early 90s, the Internet – already 20 years old at that
time – moved into the mainstream as organizations rushed to
establish their own web sites.
First generation
web sites – using Hypertext Markup Language (HTML) – were used
as billboards to the world. They provided static advertising and
marketing information for the benefit of customers and suppliers.
They implemented online information that was also available in
print advertisements, or as documentation in book or manual
formats. While effective with those static media, when transferred
to a web site they offered no benefit – only glitzy eye candy.
These static web sites also suffered from another disadvantage.
While they were easy to visit, they were also easy to leave with
the click of a mouse – when potential customers could not find
what they needed.
Second-generation
web sites added interactivity and more content to provide further
assistance. But alone, animated images or sounds and movie clips
do not provide real benefit to visitors. They are still
essentially "static" in their ability to bring real,
bottom-line benefit to the business. They need to be integrated
into the main purpose of the web site – as demonstration aids,
sales aids or information aids for example. When they provide this
purpose-focused capability, they move their web sites to the third
generation.
Electronic
Commerce sites that are extensively being established today are
part of these third generation web sites. They have the potential
to generate major revenue and profit for the business. But many of
these electronic storefronts are like the Lemonade stands of our
childhood – the first tentative ventures into a New World of
business. More is needed before the full potential of Electronic
Commerce can be realized.
2.2 The Second Shift: Java
The mid 90s saw
the start of the Second
Shift: the emergence of Java as a programming language able to
be executed anywhere regardless of hardware platform or operating
system. Java was first developed by a team led by James Gosling
at Sun Microsystems in 1991. It was planned as a portable language
that could be executed from embedded devices such as TV set-top
boxes. But its potential to become a major programming language
that could transcend the hardware platform and operating system
dependencies of other languages was also recognized. This saw the
introduction by Sun in early 1995 of Java as a portable
programming language. It was seen as the “Holy Grail of
Computing”: a hardware and operating-systems-independent
language.
Java presented a
potential threat to Microsoft, as it could offer an alternative
operating environment to Windows and threaten its desktop
monopoly. Microsoft therefore embraced Java, but it added
extensions to use Windows-specific capabilities – so limiting
the portability of the language. This was the subject of a suit
brought by Sun against Microsoft in 1997, decided against
Microsoft in late 1998. The legal judgement required that
Microsoft remove its Windows-specific Java extensions within 90
days of the ruling.
Java today is
being adopted widely as a major object-oriented language across
the industry. Java virtual machines are now available for all
major operating system and hardware environments. Java compilers
are also available for most operating systems: desktop, server and
mainframe. The shift to Java is gathering steam, but it will be
many years before its full promise of "write once, run
anywhere" can be fully realized.
2.3 The Third Shift: Extensible Markup Language (XML)
The Third Shift
is the emergence of the Extensible Markup Language (XML) in the
late 90s. This shift is just starting. It promises to be as
significant as the first two. It has the ability to bring real,
bottom-line benefits to business – in cost-reduction, in greater
efficiency, in greater competition and in greater revenue.
XML is one of
the most significant developments of the Computer industry since
the World Wide Web and Java moved to their present positions of
importance. For the next 2 - 5 years this will be one of the most
important aspects of the Internet, and of systems development in
general. It has the potential to move metadata and data
administration also into the mainstream of systems development.
XML will present major business opportunities, when used with the
Internet, as a delivery channel for information from Data
Warehouses and Enterprise Portals.
XML will be the
successor to HTML for the Internet, Intranets, and for secure
Extranets between customers, suppliers and business partners. XML
incorporates metadata in any document, to define the content and
structure of that document and any associated (or linked)
resources. It has the potential to transform the integration of
structured data (such as in legacy files or relational databases)
with unstructured data (such as in text documents, reports, email,
graphics and images, audio and video resources, and web pages).
XML will be a significant technology for the deployment of Data
Warehouses and Enterprise Portals.
XML uses the
Extensible Style Language (XSL) and the Extensible Linking
Language (XLL) to achieve this integration. XML, XSL and XLL allow
the easy integration of dissimilar systems for multiple worldwide
customers and suppliers in any industry. It permits the ready
integration of those systems, regardless of whether they are
legacy systems and databases, Electronic Data Interchange (EDI)
systems or Electronic Commerce. It represents the future direction
of metadata and the important role that data administration will
take in systems development in the years ahead.
There are steps
that you can take now, to prepare today for the coming shift to
XML.
2.4 Preparing for an XML World
XML assumes that
your metadata has already been defined. This is necessary not only
for the new systems that you want to develop, but also for the
legacy systems and databases that you need to integrate with those
new systems. XML will enable this integration to be carried
out dynamically.
Data modeling
and strategic modeling methods help you to define the metadata
required by XML. These are Forward Engineering methods. They will
also enable you to eliminate redundant data versions and redundant
processes, to develop integrated databases for the Internet and
Intranets. This is not just the responsibility of data
administrators. It requires business knowledge also, gained by the
active involvement of business experts.
A knowledge of
the metadata types, metadata activities and metadata capture
techniques using Reverse Engineering methods will also help you to
extract the metadata from existing legacy systems and databases,
or from relational or object databases. XML will enable you to
combine reverse-engineered metadata with forward-engineered
metadata, for the seamless integration of structured and
unstructured data that characterizes truly effective Enterprise
Portals.
Interest in XML,
metadata and data administration will grow strongly. The XML
specifications are now essentially complete [XML], while the XSL
and XLL specifications were still evolving at the time of writing.
These specifications are defined by the World Wide Web Consortium
and are all available from their web site [W3C].
Some browser
support for XML was first included in Microsoft Internet Explorer
4.0. The Channel Definition Format (CDF) capability of Internet
Explorer 4.0 was based on the use of XML. More complete support
for XML is provided in Microsoft Internet Explorer 5.0 and
Netscape Communicator 5.0. We will also see wide XML support added
to DBMS products, to CASE tools, to Data Warehouse tools and also
to Client / Server development tools. We will see a new generation
of Knowledge Management tools evolve rapidly to take advantage of
the structured/unstructured data integration opportunities offered
by XML.
Several books
provide good treatment of XML. An initial introduction to XML (and
also Cascading Style Sheets) is provided by “XML:
A Primer” [St Laurent 1998]. XML used for web site
development, with HTML, XSL and XLL, is addressed in “XML:
Extensible Markup Language” [Harold 1998]. “XML Complete” [Holzner 1998] covers the use of XML with Java.
These can be used as detailed references for XML. “Web Farming for the Data Warehouse” [Hackathorn 1998] uses the
Internet, Intranets and XML for access to external data sources
for warehouse deployment.
We will now
examine XML concepts. In a short paper, of necessity this can only
be an overview, and it ignores any treatment of XSL and XLL. They
are all covered in greater detail in “Building Corporate
Portals with XML” [Finkelstein 1999]. More detail is also
available from the references above. We will start with the
initial purpose of XML, which was to provide a more effective
capability for defining document content than that offered by
HTML.
2.5 Some Problems using HTML
Tim Berners-Lee
at CERN, the originator of the World Wide Web (WWW) in 1990,
developed Hypertext Markup Language (HTML) as an application of the
Standard Generalized Markup Language (SGML). A standard for the
semantic tagging of documents, SGML evolved out of work done by
IBM in the 1970s. It is used in Defense and other industries that
deal with large amounts of structured text. SGML is powerful, but
it is also very complex and expensive.
HTML was defined
as an application of SGML – specifically intended as an open
architecture language for the definition of WWW text files
transmitted using Hypertext Transfer Protocol (HTTP) across the
Internet. HTML defines the layout of a web page to a web browser
running as an open architecture client. Microsoft Internet
Explorer and Netscape Communicator share over 90% of the web
browser market; both are now available free.
An HTML page
contains text as the content of a web page, as well as tags that
define headings, images, links, lists, tables and forms to display
on that page. These HTML tags also contain attributes that define
further details associated with a tag. An example of such
attributes is the location of an image to be displayed on the
page, its width, depth and border characteristics, and alternate
text to be displayed while the image is being transmitted to the
web browser.
Because of this
focus on layout, HTML is recognized as having some significant
problems:
1. No effective way to identify content of page: HTML tags describe the layout of the page. Web browsers use the tags for presentation purposes, but the actual text content has no specific meaning associated with it. To a browser, text is only a series of words to be presented on a web page for display purposes.
2. Problems locating content with search engines: Because of a lack of meaning associated with the text in a web page, there is no automatic way that search engines can determine meaning – except by indexing relevant words, or by relying on manual definition of keywords.
3. Problems accessing databases: Web pages are static. But when a web form provides access to online databases, that data needs to be displayed dynamically on the web page. Called “Dynamic HTML” (DHTML), this capability enables dynamic content from a database to be incorporated “on-the-fly” into an appropriate area on the web page.
4. Complexity of dynamic programming: DHTML requires complex programming to incorporate dynamic content into a web page. This may be written as CGI, Perl, ActiveX, JavaScript or Java logic, executed in the client, the web server, the database server, or all three.
5. Problems interfacing with back-end systems: This is a common problem that has been with us since the beginning of the Information Age. Systems written in one programming language for a specific hardware platform, operating system and DBMS may not be able to be migrated to a different environment without significant change or a complete rewrite. Even though it is an open architecture, HTML also is affected by our inability to move these legacy systems to new environments.
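The first two problems can be illustrated with a small Python sketch. Both fragments below are hypothetical: the HTML fragment is deliberately written so that an XML parser can read it, and the tag names in the second fragment are simply examples of descriptive metadata tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical HTML fragment: its tags describe layout only.
html_fragment = "<p><b>XYZ Corporation</b><br/>123 First Street</p>"

# The same data marked up with descriptive metadata tags.
xml_fragment = (
    "<CUSTOMER>"
    "<customer_name>XYZ Corporation</customer_name>"
    "<street>123 First Street</street>"
    "</CUSTOMER>"
)

# Collect the tag names of each fragment in document order.
html_tags = [element.tag for element in ET.fromstring(html_fragment).iter()]
xml_tags = [element.tag for element in ET.fromstring(xml_fragment).iter()]

print(html_tags)  # paragraph, bold and line-break tags: layout only
print(xml_tags)   # tags that name the content they enclose
```

A search engine reading the first fragment learns only that some text is bold; one reading the second fragment can tell which text is the customer name and which is the street.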
Recognizing
these limitations of HTML, the W3C SGML working group (now called
the XML working group) was established in mid 1996. The purpose of
this group was to define a way to provide the power of SGML, while
also retaining the simplicity of HTML. The XML specifications were
born out of this activity [XML].
XML retains much
of the power and extensibility of SGML, while also being simple to
use and inexpensive to implement. It allows tags to be defined for
special purposes, with metadata definitions embedded internally in
a web document – or stored separately as a Document Type
Definition (DTD) script. A DTD is analogous to the Data Definition
Language script (DDL) used to define a database, but it has a
different syntax.
As we discussed
earlier, data modeling and metadata are key enablers in the use
and application of XML. The Internet and Intranets allow us to
communicate easily with other computers. Java allows us to write
program logic once, to be executed in many different environments.
But these technologies are useless if we cannot easily communicate
with and use existing legacy systems and databases.
We discussed
earlier that we can now make a phone call, instantly, anywhere in
the world. The telephone networks of every country are
interconnected. When we dial a phone number, a telephone assigned
to that number will ring in Russia, or China, or Outer Mongolia,
or elsewhere. It will be answered, but we may not understand the
language used by the person at the other end.
So it is also
with legacy systems. We need more than the simple communication
between computers afforded by the Internet. True, we could rewrite
the computer programs at each end in Java, C, C++, or some other
common language. But that alone would not enable effective and
automatic communication between those programs. Each program must
know the metadata used by the other program and its databases so
that they can communicate with each other.
Considerable
work has been carried out to address this problem. Much effort has
gone into definition and implementation of Electronic Data
Interchange (EDI) standards. EDI has now been widely used for
business-to-business commerce for many years. It works well, but
it is complex and expensive. As a result, it is cost-justifiable
generally only for larger corporations.
XML now also
provides this capability. It allows the metadata used by each
program and database to be published as the language to be used
for this intercommunication. But distinct from EDI, XML is simple
to use and inexpensive to implement. XML will become a major part
of the application development mainstream. It provides a bridge
between structured databases and unstructured text, delivered via
XML then converted to HTML during a transition period for display
in web browsers. Web sites will evolve over time to use XML, XSL
and XLL natively to provide the capability and functionality
presently offered by HTML, but with greater power and flexibility.
XML components are listed in Table 1.
Table 1: Components of XML
| Acronym | Name | Description |
| XML | Extensible Markup Language | Defines document content using metadata tags and namespaces |
| DTD | Document Type Definition | Defines XML document structure (analogous to DDL schema) |
| XSL | Extensible Style Language | XSL or Cascading Style Sheets (CSS) separate layout from data |
| XLL | Extensible Linking Language | XLL implements multi-directional links (single or multiple) |
| DOM | Document Object Model | Implements a standard API for processing XML in any language |
| RDF | Resource Description Framework | W3 Interoperability Project for data content interchange |
The rest of this
paper provides an introduction to XML and DTDs, with only brief
reference to XSL, XLL, DOM and RDF. Further information in each of
these areas can be obtained from the book and web site references
provided at the end of the paper.
3. A Simple XML Example
We will start
our introduction to XML with a customer example in Figure 2. This
illustrates some basic XML concepts. It shows customer data (in
italics), such as entered from an online web form or accessed from
a customer database. It shows the inclusion of metadata “tags”
(surrounded by < and >) – such as <customer_name>.
The tag: <customer_name>
is a start tag; the text following it is the actual content of the
customer name: XYZ
Corporation. It is terminated by an end tag: the same
tag-name, but now preceded by “/” – such as </customer_name>.
Other fields define <customer_address>,
<street>, <city>, <state> and <postcode>. Each of these tags is also terminated by an end
tag, such as </street>,
</city>, </state> and </postcode>.
The example concludes with </customer_address>
and </CUSTOMER>
end tags.
<CUSTOMER>
  <customer_name>XYZ Corporation</customer_name>
  <customer_address>
    <street>123 First Street</street>
    <city>Any Town</city>
    <state>WA</state>
    <postcode>12345</postcode>
  </customer_address>
</CUSTOMER>
Figure 2: A Simple XML Example
From this simple
example of XML metadata, we can see how the meaning of the text
between start and end tags is clearly defined. We can also see
that search engines can use these definitions for more accuracy in
identifying information to satisfy a specific query.
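As a rough sketch of how a search tool might use these definitions, the following Python code looks up values in the customer document of Figure 2 by tag name. The `find_by_tag` helper is a hypothetical name introduced here for illustration.

```python
import xml.etree.ElementTree as ET

# The customer document of Figure 2.
CUSTOMER_XML = """
<CUSTOMER>
  <customer_name>XYZ Corporation</customer_name>
  <customer_address>
    <street>123 First Street</street>
    <city>Any Town</city>
    <state>WA</state>
    <postcode>12345</postcode>
  </customer_address>
</CUSTOMER>
"""


def find_by_tag(xml_text: str, tag: str) -> list:
    """Return the text of every element with the given tag name, at any depth."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(tag)]


print(find_by_tag(CUSTOMER_XML, "postcode"))
print(find_by_tag(CUSTOMER_XML, "city"))
```

The query names the meaning of the data it wants (a postcode, a city) and does not have to guess which of the words on a page might be relevant.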
Even more effective applications become possible. For example, an
organization can define the unique metadata used by its suppliers'
legacy inventory systems. This will enable that organization to
place orders via the Internet directly with those suppliers'
systems, for automatic fulfillment of product orders. XML is
an enabling technology to integrate unstructured text and structured
databases for next generation E-Commerce and EDI applications.
The following
pages now examine the XML syntax in more detail.
3.1 XML Naming Conventions
An XML document
must be “well formed”. To be well formed, a document must obey
the following rules:
§ A tag name must start with a letter or underscore, with no spaces. Thus “person_id” is correct, but not “person id” or “1st name”.
§ XML names are case sensitive. For example, “PERSON”, “Person” and “person” are all different names.
§ Each tag must have surrounding < and > indicators, as in the start tag <tag_name>.
§ Each start tag must also have an end tag, as in </tag_name>.
§ If a tag is empty, it must still have an end tag or be written as an empty tag, such as <CUSTOMER></CUSTOMER> or <country/>.
§ Attribute values are preceded by an = sign and are surrounded by double or single quotes, such as version="1.0" standalone="yes".
§ The characters <, >, &, " or ' cannot be used as literal data where they could be mistaken for markup; they are then replaced by their “escaped” versions. Thus the character string &lt; represents “<” at all times until it is to be displayed. Similarly &gt; is “>”, &amp; is “&”, &quot; is " and &apos; is '. These character sequences are called “predefined entity references”.
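Several of these rules can be checked mechanically. The sketch below uses the parser in Python's standard library, which rejects any document that is not well formed, together with the `escape` and `unescape` helpers from `xml.sax.saxutils`. The `is_well_formed` function and the sample strings are hypothetical names and data chosen for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# An empty tag written in its short form is acceptable.
empty_tag_ok = is_well_formed("<CUSTOMER><country/></CUSTOMER>")
# A start tag without its end tag is rejected.
missing_end_tag = is_well_formed("<CUSTOMER><country></CUSTOMER>")
# Names are case sensitive, so these two tags do not match.
case_mismatch = is_well_formed("<Person></person>")
# A space is not allowed inside a tag name.
space_in_name = is_well_formed("<person id>x</person id>")

# Replace &, < and > in data by their predefined entity references.
escaped = escape("Smith & Sons <Pty>")
restored = unescape(escaped)

print(empty_tag_ok, missing_end_tag, case_mismatch, space_in_name)
print(escaped)
```

Escaping the data before it is placed between tags keeps it from being read as markup, and `unescape` recovers the original characters when the data is to be displayed.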
A well-formed
XML document example follows in Figure 3, similar to the earlier
XML example in Figure 1.
<PERSON person_id="p1100" sex="M">            (Attributes in Element)
  <person_name>
    <given_name>Clive</given_name>            (Children of "person_name" Element)
    <surname>Finkelstein</surname>
  </person_name>
  <email>cfink@ies.aust.com</email>
  <company>
    Information Engineering Services Pty Ltd
  </company>
  <country>Australia</country>
  <phone>+61-8-9402-8300</phone>
  <fax>+61-8-9402-8322</fax>
</PERSON>
Figure
3: Example of a Well-Formed
XML Document
Notice that
double quote characters in Figure 3 surround the attribute values
of PERSON, declared on
the first line with the values: person_id=“p1100” sex=“M”.
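As a rough illustration of the well-formedness rules, the following Python sketch asks a parser whether it accepts a document. It assumes the standard library xml.etree.ElementTree module; the function name is_well_formed and the short sample strings are hypothetical, chosen to break one rule each.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples, each violating one naming or tagging rule.
GOOD = '<PERSON person_id="p1100" sex="M"><country/></PERSON>'
NO_END_TAG = '<PERSON person_id="p1100"><country></PERSON>'
BAD_NAME = '<1st_name>Clive</1st_name>'
UNQUOTED = '<PERSON person_id=p1100></PERSON>'
CASE_MISMATCH = '<PERSON></person>'

for sample in (GOOD, NO_END_TAG, BAD_NAME, UNQUOTED, CASE_MISMATCH):
    print(is_well_formed(sample), sample)
```

Only the first sample is accepted. This check covers well-formedness only; it says nothing about whether the document agrees with a DTD.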
3.2
The XML Document Prolog
Every XML
document should start with an XML declaration as part of its prolog.
When present, this declaration must be the first statement on the
first line of the document. It has the form of a processing
instruction (surrounded by <? … ?> tags) such as:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
The <?xml
specifies that the document uses XML syntax. An XML parser or
application can analyze the content of the document prior to it
being processed. The name “XML”, “xml” or any upper and
lower case combination of this sequence of letters is reserved and
cannot be used at the start of any tag name.
The version
number is specified for compatibility with future XML versions.
The standalone specification indicates whether the document depends
on markup declarations held outside it. A value of standalone="yes"
states that everything needed is included in-line; standalone="no"
states that an out-of-line Document Type Definition in an external
file may be required. We will discuss this shortly in relation to
DOCTYPE Declarations.
The
“encoding” statement specifies the character encoding used by the
XML document, such as UTF-8 or UTF-16. XML is based on Unicode so
it can be used with any language, such as English and European
languages, as well as double byte Asian languages: Japanese,
Chinese or Korean.
3.3
DOCTYPE Declarations
A Document Type
declaration (“DOCTYPE”) follows the <?xml … ?> statement. Every
XML document contains a single root element, which contains all
other XML elements. The DOCTYPE statement identifies the name of
that root element. It also either contains the Document Type
Definition (DTD) in-line, or identifies the location of the DTD
file that is to be used with the document.
A DOCTYPE
declaration has the following formats, with examples:
<!DOCTYPE root_element_name [ … ]>
OR
<!DOCTYPE root_element_name SYSTEM "DTD_URL">
1. <!DOCTYPE CUSTOMER [ … ]>
2. <!DOCTYPE CUSTOMER SYSTEM "customer.dtd">
3. <!DOCTYPE supplier PUBLIC "http://www.ind-xml.com/supplier.dtd">
The
first example specifies that the DTD is declared internally in
the same document. We will see an example of this format shortly.
The second
example declares that an external DTD is used as a private file
(“SYSTEM”). It is the DTD file that is located at the relative
Uniform Resource Locator (URL) “customer.dtd”
within the same web site directory.
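The third
example specifies that the DTD is PUBLIC. It is the DTD file at
the absolute URL “http://www.ind-xml.com/supplier.dtd”. Strictly,
XML 1.0 expects the PUBLIC keyword to be followed by a public
identifier and then the URL; the example shows only the URL.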
3.4
URL and URI
These DOCTYPE
examples use relative or absolute URLs to identify the location of
an external DTD file. But files and other resources can be moved
to different URL locations. With HTML web pages, every link that
refers to a moved resource must be updated to refer to its new
URL. HTML links can be from web sites anywhere in the world. These
can all refer to the same URL. Relocating a resource to a
different URL can therefore require considerable maintenance work.
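To overcome this
problem, in time XML and XLL are intended to enable resources to be
located instead by a Uniform Resource Identifier (URI). Whereas a
URL names a location, a URI identifies the resource itself, and a
persistent URI (such as a Uniform Resource Name) need not change
when the resource is moved. XLL, with XLinks and XPointers, uses
URIs to refer to a resource, so that a reference can continue to
point to that resource.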
3.5
XML Comments
Comments can be
used in an XML document to describe the purpose, intent and use of
different statements. Comments can also document and separate
logical sections of a document.
Comments in XML
are defined similarly to HTML comments, surrounded by <!-- …
--> tags. For example:
<!-- This is a comment and is not processed -->
Comments
can contain any data except the literal “--”, and may
not be placed inside an XML tag. In the next two examples, the
first comment is wrong; the second comment is correct:
<customer_name <!-- Defines customer name --> >     (incorrect)
<!-- Defines customer name --> <customer_name>      (correct)
However
comments can be used to surround and hide tags, such as:
<!-- The following tag is used only for retail customers
<retail-code>2</retail-code>
and is ignored for wholesale customers -->
An
XML parser or XML application cannot process the
<retail-code> tag until the surrounding comment is removed,
or until the tag is moved outside the comment. While it remains
within the comment, for XML processing purposes the tag does not
exist.
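The effect of hiding a tag inside a comment can be seen with a small Python sketch. It assumes the standard library xml.etree.ElementTree parser, which discards comments by default; the surrounding <CUSTOMER> document is an illustrative assumption built around the fragment above.

```python
import xml.etree.ElementTree as ET

# Illustrative document: the <retail-code> tag sits inside a comment.
DOCUMENT = """<CUSTOMER>
  <customer_name>XYZ Corporation</customer_name>
  <!-- The following tag is used only for retail customers
  <retail-code>2</retail-code>
  and is ignored for wholesale customers -->
</CUSTOMER>"""

root = ET.fromstring(DOCUMENT)

# Only elements outside the comment are visible to the application.
visible_tags = [child.tag for child in root]
print(visible_tags)
print(root.find("retail-code"))
```

For this parser the commented-out tag is simply absent from the element tree, which matches the statement that, for processing purposes, it does not exist.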
3.6
Processing Instructions
Processing
instructions (PIs) declare applications that will be used to
process part (or all) of an XML document.
Like comments, they are not part of the document's character data. An
XML processor must pass a PI unchanged to the relevant XML
application. A PI has the format:
<?PI_target_name PI_data?>
The
PI_target_name
identifies the application. PI_data
following the PI_target_name
is optional; it is specified by and used by the PI application. We
saw a statement with the same form in XML
Document Prolog. A document to be processed by an XML parser
or processor was declared by the statement:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
XML
applications should process only the targets they recognize. PI
names that begin with “XML”, in any combination of upper or
lower case, are reserved for use in XML standards. PIs are used
for document-specific application processing.
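How a parser hands PIs to an application can be sketched as follows. This is a minimal sketch, assuming Python's standard xml.parsers.expat module; the PI target "report_app", its data and the function name collect_instructions are hypothetical and used only for illustration.

```python
import xml.parsers.expat

# "report_app" is a hypothetical PI target, not a real application.
DOCUMENT = """<?xml version="1.0"?>
<?report_app layout="summary"?>
<CUSTOMER><customer_name>XYZ Corporation</customer_name></CUSTOMER>"""


def collect_instructions(document):
    """Return (target, data) pairs for every PI the parser passes on."""
    found = []

    def on_instruction(target, data):
        found.append((target, data))

    parser = xml.parsers.expat.ParserCreate()
    parser.ProcessingInstructionHandler = on_instruction
    parser.Parse(document, True)
    return found


instructions = collect_instructions(DOCUMENT)

# An application acts only on the targets it recognizes.
mine = [data for target, data in instructions if target == "report_app"]
print(instructions)
print(mine)
```

With this parser the <?xml … ?> declaration is not reported through the PI handler, so only the application-specific instruction is collected.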
4.
XML Elements, Attributes and Entities
XML defines
metadata tags using elements, attributes and entities. In the
following sections we will learn how XML uses these to declare
metadata tags.
4.1
Declaring XML Elements
The tags that we
have seen are all examples of XML “elements”. An element is a
named metadata tag that is declared in a DOCTYPE statement. As we
have seen, a DOCTYPE can be defined externally in a DTD file,
located using a relative or absolute URL. Alternatively, a DOCTYPE
can be defined internally. It is included in-line, immediately
following the XML processing declaration. Figure 4 shows an
internal declaration of the DOCTYPE statement that is used with
the customer example in Figure 2:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE CUSTOMER
[
<!ELEMENT CUSTOMER ANY>
<!ELEMENT customer_name (#PCDATA)>
<!ELEMENT customer_address EMPTY>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT postcode (#PCDATA)>
]>
Figure
4: DOCTYPE Declaration for
the Customer Example in Figure 2
The DOCTYPE
statement in Figure 4 defines the XML root name as
<CUSTOMER>. Square left and right brackets ([ … ]) follow,
surrounding all element declarations. The DOCTYPE root name
<CUSTOMER> is declared as an ELEMENT, specified to be ANY
(case-sensitive). This indicates that any element, as well as
parsed character data (shown as italics in Figure 2) can appear in
a <CUSTOMER> element.
Each element
must be uniquely named within an XML document. As a DOCTYPE
declaration can be defined using more than one DTD, the concept of
namespaces has been included in XML. This allows an alias to be
assigned to a DTD. The namespace alias can then be used to qualify
named elements that would otherwise violate this rule, so ensuring
uniqueness.
The declaration
of an XML Element in Figure 4 has the format:
<!ELEMENT element-name content_type>
As
with all XML tags, the element-name
starts with a letter or underscore, can have no spaces and all
names are case sensitive. Thus Customer,
customer and CUSTOMER
are different XML element-names. To avoid confusion, it is
recommended that one case convention be adopted, so that an
element-name is always written the same way wherever it refers to
the same element. Once declared, that element-name
is used as the tag-name;
the terms element-name
and tag-name are therefore synonymous.
The content_type
of an Element can have values of ANY, EMPTY, (#PCDATA), or a
(Child List) as discussed next.
We discussed an
example of ANY in relation to the root name element
<CUSTOMER> in Figure 4. This indicates that any element, as
well as parsed character data, can appear within it.
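By default, tags
are non-empty and are
followed by data (see italics in Figure 2). An element is
declared EMPTY if it normally has no data. For example, the
element <customer_address> in Figure 2 contains
<street>, <city>, <state> and <postcode>
elements within it. These are called child elements. As the parent
element, the data for <customer_address> is provided by its
child elements, and it has no character data of its own. The
<customer_address> element is therefore declared in Figure 4 to be
EMPTY. Strictly, a validating parser treats EMPTY as allowing no
content at all, including child elements, so a declaration that
lists the child elements (as in Figure 5) is needed before such a
document can be validated.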
Figure 4
declares the <customer_name> element is (#PCDATA). This
specifies that the element contains “Parsed Character Data”.
In Figure 2, we now know that XYZ
Corporation is character data that is parsed by an XML parser,
or processed by an XML application.
Similarly, the
<street>, <city>, <state> and <postcode>
elements in Figure 4 are also declared to be (#PCDATA). Note that
<postcode> in Figure 2 contains the numeric characters:
“12345”, not the numeric value. For example, consider the
following element declaration and corresponding tag:
<!ELEMENT customer_balance (#PCDATA)>
… … …
<customer_balance>$15,500.00</customer_balance>
Before
the <customer_balance> data of $15,500.00 can be processed
by an XML application, it must first be converted from the numeric
characters “$15,500.00” to the numeric currency value of
$15,500.00.
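The conversion step described above might look like the following Python sketch. It assumes the standard library decimal and xml.etree.ElementTree modules; the function name parse_balance is hypothetical, and the sketch handles only the "$" and "," characters shown in the example.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET


def parse_balance(element_text):
    """Convert currency characters such as "$15,500.00" to a Decimal."""
    cleaned = element_text.strip().replace("$", "").replace(",", "")
    return Decimal(cleaned)


element = ET.fromstring("<customer_balance>$15,500.00</customer_balance>")

# The parser delivers characters only; the application converts them.
text = element.text
balance = parse_balance(text)
print(repr(text))
print(repr(balance))
```

Decimal is used rather than a floating point type so that the currency amount keeps its exact two decimal places.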
When we consider
address data in an application, there are many variations. For
example, in addition to the street number and name, some customers
may have a floor or level number, and/or an apartment, suite or
flat number. We could define each of these as separate elements
within <CUSTOMER>. But
this data could be considered as part of the normal content for
the <street>
element. Some customers may need two or more <street> elements. For others, <postcode> may not be available and so could be omitted.
To this point
there is nothing in the declaration of CUSTOMER to control whether
any or all of the declared elements must exist. We provide extra
control by specifying a content
model.
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE CUSTOMER
[
<!ELEMENT CUSTOMER ANY>
<!ELEMENT customer_name (#PCDATA)>
<!ELEMENT customer_address (street, city, state, postcode)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT postcode (#PCDATA)>
]>
Figure
5: Constraints added to the
CUSTOMER DOCTYPE Declaration
Figure 5
indicates that <customer_address> has a (child
list). This child list is a content model, which specifies
that <customer_address>
has child elements of (street,
city, state, postcode). This comma-delimited format indicates
that each customer address has only one <street>,
<city>, <state> and <postcode>
element. Each element must appear in the specified sequence.
When a child
list is defined using commas, each element is mandatory and must
exist in that sequence. Alternatively, child elements can be
separated by “|” to indicate a choice: exactly one of the listed
alternatives appears, such as:
<!ELEMENT customer_address (street | city | state | postcode)>
If
we also add a <contacts>
element, with child elements of <phone>,
<fax>, <mobile> (cell phone) and <email>
elements we find even more variations. We therefore need to be
able to specify the number of occurrences of child elements that
are valid.
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE CUSTOMER
[
<!ELEMENT CUSTOMER ANY>
<!ELEMENT customer_name (#PCDATA)>
<!ELEMENT customer_address (street+, city, state, postcode?)>
<!-- ? = zero or one; * = zero or more; + = one or more -->
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT postcode (#PCDATA)>
<!ELEMENT contacts (phone+, fax*, mobile?, email?)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT mobile (#PCDATA)>
<!ELEMENT email (#PCDATA)>
]>
Figure
6: Further constraints added
to the CUSTOMER DOCTYPE declaration for customer_address and
contacts.
Figure 6 adds
validity constraints to child elements by including a suffix
character attached to the child name. A suffix of “?”
specifies that zero or one occurrence of the child element may
exist within the parent element. A suffix of “*” specifies
that zero or more occurrences may exist, while a suffix of “+”
specifies that one or more occurrences of the relevant
child element must exist within the parent element. No suffix
indicates that the element must exist exactly once.
Examining Figure
6 further, we see that <customer_address>
must have at least one <street>
element, but it can have more street
occurrences (street+). There must be only one <city> and one <state>
element (no suffix). But the <postcode>
element is optional; there may be none, or one occurrence
(postcode?). The comma delimiters specify that the elements must
appear in the declared sequence.
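The meaning of the suffix characters and comma delimiters can be illustrated with a short Python sketch. Python's standard library does not validate against a DTD, so this sketch assumes a hand translation of the content model (street+, city, state, postcode?) into a regular expression; the function name matches_address_model and the sample addresses are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of (street+, city, state, postcode?) into a regular
# expression over the child names, each followed by one space.
ADDRESS_MODEL = re.compile(r"(street )+city state (postcode )?")


def matches_address_model(xml_text):
    """Check the child sequence of one customer_address element."""
    address = ET.fromstring(xml_text)
    child_names = "".join(child.tag + " " for child in address)
    return ADDRESS_MODEL.fullmatch(child_names) is not None


# Sample addresses are invented for illustration.
TWO_STREETS = ("<customer_address><street>Level 2</street>"
               "<street>1 Main St</street><city>Perth</city>"
               "<state>WA</state></customer_address>")
NO_STREET = ("<customer_address><city>Perth</city><state>WA</state>"
             "<postcode>12345</postcode></customer_address>")
WRONG_ORDER = ("<customer_address><city>Perth</city>"
               "<street>1 Main St</street><state>WA</state>"
               "</customer_address>")

for sample in (TWO_STREETS, NO_STREET, WRONG_ORDER):
    print(matches_address_model(sample))
```

Only the first sample satisfies the model: it has two streets, one city, one state and no postcode, in the declared sequence. A validating parser performs the equivalent check directly from the DTD.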
We may want to
show that a valid customer_address
can have several addresses within it. We can place these child
elements all within brackets, with the relevant group suffix
character following the right bracket. We also use surrounding
brackets to group other elements.
<!ELEMENT customer_address (street+ | (city, state) | postcode)+>
The
above fragment indicates by outer brackets with a suffix “+”
that there must be one or more address groups. Within
an address group, there must be one or more street
elements (street+) OR a city
element followed by a state
element OR a postcode element.
Because the outer group can repeat, all of these elements can also
appear together in the above example.
In Figure 6 we
also saw new element declarations of <contacts>,
<phone>, <fax>, <mobile> and <email>.
We see that <contacts>
has a content model with child elements of (phone+,
fax*, mobile?, email?).
Based on the
suffix attached to each of the <contacts>
child names, Figure 6 specifies that there must be one or
more <phone>
occurrences (phone+) and zero or more <fax>
occurrences (fax*). There can be zero or one <mobile>
occurrence (mobile?), and also zero or one <email> occurrence (email?).
We can use a
content model that includes PCDATA, called mixed content. For
example, we can alternatively specify <phone> and <fax> by the
fragment:
<!ELEMENT phone (#PCDATA | country-code | area-code | phone-number)*>
<!ELEMENT fax (#PCDATA | country-code | area-code | phone-number)*>
<!ELEMENT country-code (#PCDATA)>
<!ELEMENT area-code (#PCDATA)>
<!ELEMENT phone-number (#PCDATA)>
The
content of <phone> and <fax> can be parsed character data, or
<country-code>, <area-code> and <phone-number> child elements, in
any number and in any order, as shown by the suffix “*” after the
outer brackets.
All content models that include PCDATA must have this format:
PCDATA must come first, vertical bars must separate the child
element names, nested element groups are not allowed, and the
entire outer group must carry the suffix “*”. A mixed content
model therefore cannot enforce the sequence or number of its child
elements.
These
constraints enable an XML Parser or XML application to confirm the
validity of the document, by checking the number of child element
occurrences within each parent element. They validate these
occurrences against those specified by the child list constraints
for the parent element in an internal DOCTYPE declaration, or a
DOCTYPE in an external DTD.
4.2
Declaring XML Attributes
An element may
contain one or more attributes to provide additional details about
that element. Figure 3 earlier included an example of attributes
for the PERSON element. This specified a PERSON occurrence, with a
unique identification attribute called person_id
and another attribute called
sex, repeated now in Figure 7.
<PERSON person_id=“p1100” sex=“M”>
Figure
7: The PERSON Element, with
Attributes
This example
shows that attributes and their values are enclosed within the
< and > characters of the start tag for an element,
immediately following the element name.
Each attribute
of an element is declared in DOCTYPE ATTLIST, using the format in
Figure 8. The ATTLIST format specifies the element_name
and then defines an attribute_name
as a unique XML name within all of the element’s attributes. It
observes all of the rules detailed earlier in XML
Naming Conventions.
<!ATTLIST element_name attribute_name type "default value">
Where type = (CDATA | ID | IDREF | IDREFS | ENTITY | ENTITIES |
NMTOKEN | NMTOKENS | NOTATION)
Figure
8: The ATTLIST Format
The type
specification in Figure 8 is defined from an enumerated list of
valid values: (CDATA | ID | IDREF | IDREFS | ENTITY | ENTITIES |
NMTOKEN | NMTOKENS | NOTATION).
CDATA represents
“Character Data”, as a character data type that is non-markup
text. This is somewhat analogous to a Data Definition Language
(DDL) SQL data type of VARCHAR, as used by DBMS products.
Note that XML
does not support the other DDL data types such as numeric or decimal (with a
defined length and precision), or money,
currency or CHAR
(with a defined length), or float, bit, Boolean or other data types. XML is used and read as text. An XML application must convert and
validate these other data types.
We will continue
with the other type
declarations for Figure 8. ID represents an identifying attribute
such as a primary key, with a unique name within the element.
There can only be one attribute in an element that is specified
with a type of ID.
Where an element
must have a compound primary key for uniqueness, a single unique
primary key is defined. In data modeling this is called a
“surrogate key”. The compound primary keys are instead defined
as foreign keys with a type of IDREF or IDREFS. These are discussed shortly.
Furthermore, the
value of each ID attribute must be unique within the XML document,
across all occurrences of all elements. This follows the uniqueness
rule of primary keys: a primary key cannot have duplicates. The
earlier PERSON example has a unique value of “p1100” for the
attribute named person_id,
repeated as Figure 7.
An attribute can
be defined as a foreign key, with a type
of IDREF. Or a single attribute can hold several foreign
keys, with a type
of IDREFS; its value is a list of references separated by spaces.
This offers more flexibility. Each referenced value must also
exist elsewhere in the document, as the value of an attribute that
is declared with a type of ID. As we discussed earlier, IDREF or
IDREFS can be used to specify compound primary keys, where a single
primary (surrogate) key is specified with a type
of ID.
In Figure 8 the
type declarations ENTITY and ENTITIES define an attribute
whose value is the name, or names, of entities declared in the DTD.
For attributes of these types the named entities must be unparsed
entities, such as image files. More generally, an entity name can
be used as a shorthand notation, analogous to a macro; it is
replaced by the substitution text wherever it is used. Entities can
be used within the main body of the XML document, or in a DTD. We
will cover Entity Declarations shortly.
Note that the
use of ENTITY and ENTITIES by XML is different to the use of these
terms in data modeling and normalization.
NMTOKEN and
NMTOKENS types specify
that the value of an attribute must be a valid XML name token
(NMTOKEN) or several valid name tokens separated by spaces
(NMTOKENS). A name token uses only the characters allowed in XML
names. A program can use an
attribute of this type to manipulate XML data. For example, it can be used to
associate a Java class with an element. A Java API can then be
used to pass the data to a method for that class.
A NOTATION type
typically is used to specify an application to process an unparsed
value of an attribute. A NOTATION attribute is associated with a
NOTATION declaration in a DTD. This declares the specific
application program name to be invoked. We saw earlier that
applications can be declared in a Processing Instruction (PI).
This declares the PI_target_name
as the application, with associated PI_data.
We will now
discuss the “default value”
specification in Figure 8. This is used to define a list of valid
values for an attribute, or it can declare an attribute as being #REQUIRED,
#IMPLIED or #FIXED.
For example,
attributes of PERSON in Figure 7 are specified by an ATTLIST
declaration in Figure 9. We can see that person_id
is an ID attribute. Every PERSON occurrence must have a unique person_id
value. Further, this ID attribute is mandatory (#REQUIRED).
<!ELEMENT PERSON EMPTY>
<!ATTLIST PERSON person_id ID #REQUIRED>
<!ATTLIST PERSON sex (M | F) #IMPLIED>
<!ATTLIST PERSON status (employee | trainee) "employee">
<!ATTLIST PERSON company CDATA #FIXED "XYZ">
Figure
9: A List of Valid Attribute
Values
The attribute sex
in Figure 9 has valid values of “M” (Male) or “F”
(Female). Any other values are invalid. This attribute example is
#IMPLIED. It is not mandatory for a value to be supplied. The sex
attribute can be omitted if it is not known.
A default value
can be provided if an attribute is able to be omitted. In the
example, status can only
have valid values of “employee” or “trainee”. If not
specified, status
defaults to “employee”.
Finally, an
attribute can be declared as #FIXED. This allows a default value
to be supplied for an attribute, which cannot be changed.
Figure 9 shows that company
is character data (CDATA). It has a fixed default value (#FIXED).
This attribute need not be provided in a document; if it is
provided, it must have the value “XYZ”. Otherwise it is
automatically supplied as the value “XYZ”.
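The way default and fixed values are supplied can be observed with a small Python sketch. It assumes the standard xml.parsers.expat module, which does not validate but does read ATTLIST declarations in an internal DTD and, by default, reports defaulted attributes along with those written in the document. The function name read_attributes is hypothetical; the declarations are those of Figure 9.

```python
import xml.parsers.expat

# Figure 9 declarations in an internal DTD; the document supplies
# only person_id and leaves sex, status and company unspecified.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE PERSON [
<!ELEMENT PERSON EMPTY>
<!ATTLIST PERSON person_id ID #REQUIRED>
<!ATTLIST PERSON sex (M | F) #IMPLIED>
<!ATTLIST PERSON status (employee | trainee) "employee">
<!ATTLIST PERSON company CDATA #FIXED "XYZ">
]>
<PERSON person_id="p1100"/>"""


def read_attributes(document):
    """Return {element name: attributes} as reported by the parser."""
    seen = {}

    def on_start(name, attrs):
        seen[name] = dict(attrs)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.Parse(document, True)
    return seen


attrs = read_attributes(DOCUMENT)["PERSON"]
print(attrs)
```

The application sees status and company even though the document omits them, while the #IMPLIED attribute sex is simply absent. Checking that sex is limited to “M” or “F” would need a validating parser.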
Another example
of element and attribute declarations is provided in Figure 10.
This defines a PHOTO element in XML, so it can be used by HTML to
display an image on a web page. The src
attribute specifies the location of the photo image source file.
It contains character data (CDATA) and is mandatory (#REQUIRED).
The width, depth, border
and alt specify the
image dimensions and border thickness, as well as alternate text
that is displayed while the image file is being transmitted. These
are all character data (CDATA) and are optional (#IMPLIED).
<!ELEMENT PHOTO EMPTY>
<!ATTLIST
PHOTO src CDATA
#REQUIRED>
<!ATTLIST
PHOTO width CDATA #IMPLIED>
<!ATTLIST
PHOTO depth CDATA #IMPLIED>
<!ATTLIST
PHOTO border CDATA
#IMPLIED>
<!ATTLIST
PHOTO alt CDATA
#IMPLIED>
Figure
10: Attribute Declarations
for the PHOTO Element
4.3
Valid XML Documents
An XML document
must be well formed, as discussed earlier. Where a DTD is supplied,
the document should also be
valid. An XML document is valid if the document tags and their
data content agree with the ELEMENT and ATTLIST declarations in
the Document Type Definition (DTD). We discussed that a DTD is
analogous to a DDL schema for a DBMS, but with different syntax. A
DOCTYPE declaration for the earlier PERSON examples, together with
the defined document tags and data content is shown in Figure 11.
From Figure 11,
we see that a PERSON document has two attributes: a person_id which must be unique (ID #REQUIRED) and sex.
This is an optional attribute (#IMPLIED), but if provided it can
only have the values (M | F).
A PERSON must
have one or more name
(name+). A name has zero
or more given_name (given_name*)
and one or more surname
(surname+). The document shows examples of these tags with the
relevant data content.
A PERSON can
have zero or more email
addresses (email*), zero or one company,
country or fax number (company?, country?, fax?), and zero or more phone
or mobile numbers (phone*, mobile*). We can see in Figure 11 that two
phone numbers and two mobile numbers are provided as part of the
PERSON document content. The data tags and content agree with the
DOCTYPE declaration. The PERSON root name and its contents
therefore comprise a valid XML document.
4.4
Entity Declarations
XML uses the
term ENTITY to declare a substitution name for insertion of
predefined values. This is quite different from the use of the
term “entity” in data modeling. There are two types of entity
declarations. The first is a General Entity declaration. Its
replacement text can be declared inside the document or its DTD
section, where it is called an internal entity. Or the replacement
content can be held externally to the document, when it is called
an external entity. A general entity reference is
distinguished by a prefix “&”.
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE PERSON
[
<!ELEMENT PERSON (name+, email*, company?, country?, phone*, fax?, mobile*)>
<!ATTLIST PERSON person_id ID #REQUIRED>
<!ATTLIST PERSON sex (M | F) #IMPLIED>
<!ELEMENT name (given_name*, surname+)>
<!ELEMENT given_name (#PCDATA)>
<!ELEMENT surname (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT company (#PCDATA)>
<!ELEMENT country (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT mobile (#PCDATA)>
]>
<PERSON person_id="p1100" sex="M">
  <name>
    <given_name>Clive</given_name>
    <surname>Finkelstein</surname>
  </name>
  <email>cfink@ies.aust.com</email>
  <company>Information Engineering Services Pty Ltd</company>
  <country>Australia</country>
  <phone>+61-8-9402-8300</phone>
  <phone>(08) 9309-6163</phone>
  <fax>+61-8-9402-8322</fax>
  <mobile>+61-411-472-375</mobile>
  <mobile>0411-472-375</mobile>
</PERSON>
Figure
11: A Valid Internal DTD,
with defined XML tags and data content
We saw internal
entity references earlier, in XML
Naming Conventions. XML supplies five predefined entities:
&lt; (which is replaced by <), &gt; (by >),
&amp; (by &), &quot; (by ") and &apos; (by ').
The replacement text replaces an internal entity reference only
when it is displayed, or is about to be processed by an
application. For example, the internal entity reference “&IES;”
can be declared with text “Information Engineering Services Pty
Ltd”. This text automatically replaces “&IES;” wherever
it occurs – but only when that entity is displayed or passed to
an application. Internal entities can contain references to other
internal entities, but they cannot be recursive.
Distinct from an
internal entity reference, the replacement text for an external
entity reference immediately replaces that entity wherever it
occurs. An XML parser or processor processes the replaced text as
if it was an original part of the document.
An entity name
must be a unique XML name. It is declared together with the
replacement text. This text is substituted for the entity wherever
it occurs. Figure 12 shows the internal and external format and
examples for a general entity.
Format:
<!ENTITY entity_name "replacement text">                    (Internal)
<!ENTITY entity_name SYSTEM "URL">                          (External)
Declaration Examples:
<!ENTITY IES "Information Engineering Services Pty Ltd">    (Internal)
<!ENTITY copy99 "© Copyright 1999">                         (Internal)
<!ENTITY ref1 SYSTEM "http://www.ref-xml.com/ref1.xml">     (External)
Usage Examples:
“&IES;” is replaced later by “Information Engineering Services Pty Ltd”
“&copy99;” is replaced later by “© Copyright 1999”
“&ref1;” is replaced immediately by the content of document “ref1.xml”
Figure
12: General Entity
Declaration and Usage Examples
The format and
two examples of an internal entity are illustrated in Figure 12.
The first declares “&IES;”
as an internal entity reference for
“Information Engineering Services Pty Ltd”. The second
declares “&copy99;” as
a shorthand for the text “©
Copyright 1999”. Whenever “&IES;” or “&copy99;” is found internally within a document it is
replaced by that text, but only when the document is about to be
processed or displayed.
The third
example in Figure 12 declares an external entity “&ref1;”
as a shorthand
reference for the document “ref1.xml”.
This is located externally at “http://www.ref-xml.com/ref1.xml”.
(This is a fictitious URL.) Because it is an external entity
reference, it is immediately replaced by the content of the
document “ref1.xml”.
An entity can be
a convenient shorthand way of including a much larger amount of
text, as shown in Figure 12. It also provides an XML document with
a single point for declaration of text that can change. If
volatile text appears in many places of an XML document, a general
entity can be used in each place. The replacement text is defined
once only when the entity is declared. Whenever that text is later
changed, the updated text automatically replaces every occurrence
of that entity.
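The single point of declaration can be illustrated with a brief Python sketch. It assumes the standard xml.etree.ElementTree parser, which expands internal general entities declared in an internal DTD; the <company> wrapper document is an illustrative assumption, while the IES entity is the one declared in Figure 12.

```python
import xml.etree.ElementTree as ET

# Internal DTD declaring the general entity IES, as in Figure 12.
# The <company> wrapper document is assumed for illustration.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE company [
<!ELEMENT company (#PCDATA)>
<!ENTITY IES "Information Engineering Services Pty Ltd">
]>
<company>&IES;</company>"""

company = ET.fromstring(DOCUMENT)

# By the time the application reads the text, the entity reference
# has been replaced by its declared replacement text.
print(company.text)
```

Changing the text in the one ENTITY declaration changes what every “&IES;” reference yields the next time the document is parsed.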
We discussed
that there are two types of entities. The second type is a Parameter Entity declaration, which can only be declared inside a
DTD. A parameter entity reference is distinguished by a prefix
“%”.
Figure 13 now
shows the declaration format and examples of a parameter entity,
which uses a prefix “%” – distinguished from general
entities that use a prefix of “&”. A parameter entity is
declared in a DTD, which can be internal or external. If declared
in an internal DTD, it is used within that same document similar
to a general entity. If declared in an external DTD, it references
a URL where the DTD exists.
In the first
example of Figure 13, the % character – followed by a space –
declares that person is
an external Parameter Entity. It specifies that content for “%person;”
(no spaces) is located in the DTD file “person.dtd”.
The content of this DTD file immediately replaces “%person;” as if it was an original part of the document.
Format:
<!ENTITY % entity_name "replacement text">      (Internal)
<!ENTITY % entity_name SYSTEM "URL">            (External)
Declaration Examples:
<!ENTITY % person SYSTEM "person.dtd">          (External)
<!ENTITY % idr 'ID #REQUIRED' >                 (Internal)
Usage Examples:
<!DOCTYPE PERSON SYSTEM %person;>               (External)
… … …
<!ATTLIST PERSON person_id %idr;>               (Internal)
Figure
13: Parameter Entity
Declaration and Usage Examples
The second
example declares “%idr;”
as an internal Parameter entity that is to be immediately
replaced by the text ‘ID #REQUIRED’. The example shows an
ATTLIST declaration for
person_id (as an attribute of
PERSON) with “%idr;” as an internal Parameter Entity. This is replaced immediately by “ID #REQUIRED”, as if it had
been written:
<!ATTLIST PERSON person_id ID #REQUIRED>
Any amount of
replacement text can be declared for general entities and for
parameter entities. This text is surrounded by quotes. As we have
seen, entities can be declared to insert fragments or complete
paragraphs of standard “boiler-plate” text in a document. That
insertion is immediate for parameter entities or external general
entities. Insertion is deferred for internal general entities; the
entity is replaced by the text only when it is about to be
displayed or passed to an application for processing.
4.5
XML Resolves Many HTML Problems
Earlier we
discussed a number of problems associated with the use of HTML.
XML resolves many of these problems, as summarized next. This
summary also concludes the main points of the paper.
1. XML defines content of page: We now know that XML offers a
powerful way to define tags describing the content of a document.
This document can be unstructured text, or it can be graphics,
images, audio or video files, or it can be structured data in
legacy files, relational or object databases.
2. Search engines can locate XML content: Search engines can
precisely locate required content based on defined XML tags. This
content has more meaning than earlier search methods that rely
only on word indexes or manually defined keywords.
3. XML can integrate dissimilar data sources: XML can be used to
define the structure of legacy files, relational and object
databases, as well as unstructured text, graphics, images, audio
or video. This makes it easier to integrate data content sourced
from dissimilar systems and databases.
4. Easier dynamic programming: XML and Document Object Model (DOM)
simplify programming to incorporate dynamic content from different
data sources. DOM offers a language-independent interface for
processing XML documents.
5. Easier interfacing with back-end systems: XML and DOM have been
designed to interface with back-end systems. Special Markup
languages can easily be defined for standard definition and
processing of data within industries and communities with common
interests based on agreed metadata definitions.
5.
Author
Clive
Finkelstein, acknowledged worldwide as the "Father" of
Information Engineering, is Managing Director of Information
Engineering Services Pty Ltd in Australia. He is the Chief
Scientist of Visible Systems Corporation in the USA and is
Managing Director of Visible Systems Australia Pty Ltd. He has over
45 years' experience in the Computer Industry.
This paper is
based on extracts from his book: “Building
Corporate Portals with XML”, co-authored with Peter
Aiken, published by McGraw-Hill (Sep 1999).
He
has published many books and papers throughout the world, including
the first publication on Information Engineering: a series of six
InDepth articles in US ComputerWorld in May - June 1981. He
co-authored with James Martin the influential two-volume report
titled "Information Engineering", published by the
Savant Institute in Nov 1981. He wrote two later IE books:
"An Introduction to Information Engineering", Addison-Wesley (1989);
and "Information Engineering: Strategic Systems Development",
Addison-Wesley (1992). He has contributed chapters and forewords to
books published by McGraw-Hill ["Software Engineering Productivity
Handbook" (1992), and the foreword to "Data Reverse Engineering:
Slaying the Legacy Dragon", by Peter Aiken (1996)], and by
Springer-Verlag ["Handbook on Architecture of Information Systems"
(1998)].
His latest book is
“Enterprise
Architecture for Integration: Rapid Delivery Methods and
Technologies”, Artech House, Norwood MA
(March 2006).
His current
focus is helping organizations evolve from Data Warehouses and Data
Marts to Corporate Portals (also called Enterprise Portals) using
the Extensible Markup Language (XML). These portals provide a central
gateway to the information and knowledge resources of an
enterprise on its corporate Intranet and via the Internet.
Enterprise Portal, XML and related technologies and products will
rapidly become available over the next 2 to 5 years. Enterprise
Portals will be the central computing focus and interface for most
enterprises in the 21st century.
Clive writes a
monthly column, "The Enterprise", for DM Review magazine
in the USA and also publishes a free quarterly technology
newsletter via email: "The Enterprise Newsletter (TEN)".
Past issues of TEN, and of the DM Review Enterprise column, are
available from http://www.ies.aust.com/articles.htm.
6.
References
1.
Document Object Model (DOM) Specifications – http://www.w3.org/TR/REC-DOM-Level-1/
2.
Finkelstein, Clive and Aiken, Peter (1999), “Building
Corporate Portals with XML”,
McGraw-Hill, ISBN: 0-07-913705-9. Covers the design, development
and deployment of Data Warehouses and Enterprise Portals using XML
(560 pages).
3.
Hackathorn, Richard (1998), “Web
Farming for the Data Warehouse”, Morgan Kaufmann, ISBN:
1-55860-503-7. Includes use of XML for data sources from the Internet
(368 pages).
4.
Harold, Elliotte Rusty (1998), “XML:
Extensible Markup Language”, IDG Books, ISBN:
0-7645-3199-9. Covers XML, XSL and XLL with HTML (426 pages +
CD-ROM).
5.
Holzner, Steven (1998), “XML
Complete”, McGraw-Hill, ISBN: 0-07-913702-4. Covers XML
with focus on Java (516 pages + CD-ROM).
6.
XML Namespaces Specifications – http://www.w3.org/TR/REC-xml-names/
7.
Resource Description Framework (RDF) Specifications – http://www.w3.org/Metadata/RDF/
8.
St Laurent, Simon (1998), “XML:
A Primer”, MIS Press [IDG Books], ISBN: 1-55828-592-X. A
good basic introduction to XML (348 pages).
9.
Extensible Markup Language (XML) Specifications – http://www.w3.org/XML/
10.
http://svc004.bne009i.server-web.com/catalogue/visible/default.shtml
lists many XML books that can be purchased
directly from Amazon.com.
11.
W3C: WWW Consortium and associated specifications – http://www.w3.org/
6.1
XML Information Web Sites
1.
James Clark’s XML Web Page – http://www.jclark.com/xml/
2.
James Tauber’s XMLINFO Web Site – http://www.xmlinfo.com/
3.
Microsoft XML Scenarios Web Site – http://microsoft.com/xml/scenario/intro.asp
4.
Microsoft XML Site – http://www.microsoft.com/xml/
5.
Microsoft XML Workshop Web Site – http://www.microsoft.com/workshop/xml/toc.htm
6.
Robin Cover’s XML Resources – http://www.sil.org/sgml/xml.html
7.
Web Farming Web Site – http://www.webfarming.com/
8.
WWW Consortium: Specifications and Standards – http://www.w3.org/
9.
XML.com Web Site – http://www.xml.com/
6.2
XML Development Tools: Validating Parsers
1.
Data Channel’s DXP Parser – http://www.datachannel.com/products/xdk/DXP/index.html
2.
IBM’s Alphaworks XML for Java Parser – http://www.alphaworks.ibm.com/formula/xml
3.
Microsoft’s MSXML Parser – http://www.microsoft.com/standards/xml/xmlparse.htm
4.
Object Design’s eXcelon XML database – http://www.objectdesign.com/
6.3
XML Development Tools: XML Browsers
1.
Microsoft Internet Explorer 5.0 – http://www.microsoft.com/
2.
Netscape Communicator 5.0 – http://www.netscape.com/
3.
Netscape’s Mozilla Browser – http://www.mozilla.org/
4.
Peter Murray-Rust’s Jumbo Browser – http://vsms.nottingham.ac.uk/vsms/jumbo
|