Introducing XML  

The XML buzzword is on everyone’s lips, but pinning down what it is really about can be hard. In this feature, Tim Anderson explains XML and why it matters. Adapted from an article first published in Personal Computer World.

XML is hugely important. Dr Charles Goldfarb, who was personally involved in its invention, claims it to be “the holy grail of computing, solving the problem of universal data interchange between dissimilar systems.” It is also a handy format for everything from configuration files to data and documents of almost any type. Through SOAP (Simple Object Access Protocol) or XML-RPC, you can also use it to invoke methods on remote objects, and so do true distributed computing over the Internet. So it is significant; but it is a hard subject to pin down, because the name covers a whole family of technologies and specifications. Most XML is programmatically generated, and you do not need to learn all the specifications in order to benefit from XML, any more than you need to learn HTTP (Hypertext Transfer Protocol) to use the World Wide Web. What matters is to understand what XML does, so that you can see how to exploit it in your own projects.
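To make the remote-method idea concrete, here is a minimal sketch in Python using the standard library's xmlrpc.client module. The method name add and its two arguments are invented for illustration, and no network call is made; the code only shows the XML that a client would send and how a server could decode it again.

```python
import xmlrpc.client

# Encode a call to a hypothetical remote method add(2, 3) as an XML-RPC request.
request_xml = xmlrpc.client.dumps((2, 3), methodname="add")
print(request_xml)

# A server would decode the same XML back into the arguments and the method name.
params, method = xmlrpc.client.loads(request_xml)
print(method, params)
```

The point is that the call itself travels as ordinary XML text, which is why the two ends need not share a platform or a programming language.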

How it all began

In the 1970s, three guys at IBM (the aforementioned Charles Goldfarb, along with Ed Mosher and Ray Lorie) invented GML, a way of marking up technical documents with structural tags. The initials stood for Goldfarb, Mosher and Lorie. According to Goldfarb, he coined the term “markup language” in order to make better use of the initials, so GML came to stand for Generalised Markup Language. Its successor, the Standard Generalised Markup Language (SGML), was adopted by the ISO in 1986. The name is a little confusing, because SGML is not itself a markup language, but rather a specification for defining markup languages. The best-known application of SGML is HTML (Hypertext Markup Language), which defines a specific set of tags suitable for web pages.

Although it took off like wildfire, HTML is where things began to go horribly wrong. The original thinking was to separate content from presentation. For example, the <em> tag in a web page means “emphasise”. It was left up to the user agent how to render that, say as bold text, or in a different colour, or with a different tone of voice in a speech reader. This type of thing does not please page designers, who want to nail down the exact appearance of a page. Therefore HTML got extended with things like <font> tags which went right against the initial concept. Another problem area was that fierce competition between Netscape and Microsoft led to fragmentation of the standard, which remains a huge problem for web developers. Web pages began to be used for things that went wildly beyond the original concept, including multimedia, animation, online applications, ecommerce and more. Browsers also tried to be tolerant of hastily written web pages that committed crimes like using an opening tag without a corresponding closing tag. Tolerance is normally commendable, but the resulting lack of discipline became a barrier to programmatic interpretation of web content, or the use of HTML for structured data.

In a nutshell, HTML is too limited and terminally polluted, while SGML itself is reckoned to be too complex for mortals to implement. In the late 1990s a group of people including Jon Bosak, Tim Bray and James Clark came up with XML, the Extensible Markup Language. Like SGML, XML is not itself a markup language, but a specification for defining markup languages. The W3C (World Wide Web Consortium) immediately set about reshaping HTML as an XML application, and the result is XHTML. That is only one small part of what XML is all about. The key point is that, using XML, the industry can specify how to store almost any kind of data in a form that applications running on any platform can easily import and process.

The Microsoft factor

In the mid 1990s Sun Microsystems introduced Java, with the ability to run applications securely on any supported platform. One early use was to create applets, applications designed to run safely in web browsers. But Java’s warm adoption across the industry has little to do with applets, and only a little to do with its strong and productive language features. Rather, Java helped companies like IBM make sense of their diverse range of operating systems. Having each system running Java greatly simplifies the business of creating interoperable applications. Another example is Oracle, which uses Java stored procedures as an ideal solution for its cross-platform database. Probably in response to the prospect of a Java-centric computing universe, Microsoft settled on XML as an alternative approach to the interoperability puzzle, and became XML’s greatest advocate. Unlike Java, which is controlled by Sun, XML is in the hands of the independent W3C, a factor which endeared it to Microsoft. The significance of XML to Microsoft is only now becoming clear, with the company describing its .Net initiative as “a platform for XML web services”. Through XML, Microsoft’s applications can communicate with those running on other platforms. A Java application can employ the services of a COM object (COM being Microsoft’s Windows-specific object technology), and vice versa. Hence Microsoft has been busy creating XML interfaces to its server products, such as SQL Server and Exchange.

Microsoft emphatically does not own XML, and the technology has transcended politics by virtue of its sheer usefulness. IBM is a big XML user, while listening to Sun you would think it to be a Java technology. The fact that XML is important to all three companies says a lot for its bridge-building potential.

No magic

Now that XML is a buzzword, it is vulnerable to abuse by marketers who pretend that “save as xml” is a virtue in itself. The mere fact of a document being in XML is no guarantee of usefulness. For example, Microsoft Visio can save drawings as XML. These are large documents with hundreds of Visio-specific elements and attributes. Just because it is XML does not mean that AutoCAD or Adobe Illustrator can make sense of it. It might make it easier for other vendors to create an import filter; but the real benefit will come if and when Microsoft and other drawing application vendors sit down to thrash out an agreed XML standard for drawing documents. With XML, standards are everything.

The heart of XML

An XML document is a tree of nested elements, each of which can have zero or more attributes. There can only be one root element. Each element has a start tag and an end tag, marked by angle brackets, with content in between:

<element>…content…</element>

The content can contain other elements, or can consist entirely of other elements, or might be empty.

Attributes are named values which are given in the start tag, with the values surrounded by single or double quotation marks:

<element attribute1="value1" attribute2="value2">
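As a rough illustration of the tree idea, the following Python sketch parses a small document with the standard library's xml.etree.ElementTree module and walks it from the root down. The library, book and title elements and their attributes are made up for this example; they are not part of any standard.

```python
import xml.etree.ElementTree as ET

# A made-up document: one root element, nested children, optional attributes.
SAMPLE = """<library name="local">
  <book id="1" lang="en"><title>First</title></book>
  <book id="2"><title>Second</title></book>
</library>"""

def describe(element, depth=0):
    """Return one line per element: indentation, tag name, attributes."""
    lines = ["  " * depth + element.tag + " " + str(element.attrib)]
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines

root = ET.fromstring(SAMPLE)
outline = describe(root)
print("\n".join(outline))
```

Each element reports its own attributes as a simple name-to-value mapping, and an element with no attributes simply reports an empty one.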

This is the essence of XML, and it’s nice and simple. Here’s an example that will look familiar:

<html>
<body background="mypic.gif">
<h1>HTML or XML?</h1>
</body>
</html>

HTML or XML? It can be read as either, and illustrates the close relationship between the two (however, note that not all HTML documents are well-formed XML).
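One way to check this claim is to hand the page to an XML parser. The sketch below uses Python's standard xml.etree.ElementTree module; the second, sloppy snippet is an invented example of the kind of HTML that browsers tolerate but an XML parser should reject.

```python
import xml.etree.ElementTree as ET

PAGE = """<html>
<body background="mypic.gif">
<h1>HTML or XML?</h1>
</body>
</html>"""

# An XML parser accepts the page because every tag is closed and properly nested.
root = ET.fromstring(PAGE)
body = root.find("body")
print(body.get("background"))
print(body.find("h1").text)

# Typical "tolerant" HTML: <br> is never closed, so an XML parser rejects it.
SLOPPY = "<html><body>line one<br>line two</body></html>"
try:
    ET.fromstring(SLOPPY)
    sloppy_ok = True
except ET.ParseError as err:
    sloppy_ok = False
    print("not well-formed:", err)
```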

There are a few other fundamentals in XML, such as processing instructions and namespaces, but elements and attributes are the heart of it. If you are familiar with object-oriented programming, it might help to think of elements as objects and attributes as properties. When designing XML applications, it can be hard to decide what should be an element and what should be an attribute. For example, instead of <body background="mypic.gif"> why not have:

<body>
<background>mypic.gif</background>
</body>

To some extent this is a matter of taste. Some XML is more element-centric, some more attribute-centric.
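From the point of view of a program reading the data, the two designs differ only in where it has to look. Here is a small Python sketch, again using the standard xml.etree.ElementTree module; the helper name background_of is hypothetical.

```python
import xml.etree.ElementTree as ET

ATTRIBUTE_STYLE = '<body background="mypic.gif"></body>'
ELEMENT_STYLE = "<body><background>mypic.gif</background></body>"

def background_of(xml_text):
    """Return the background value from either layout, or None if absent."""
    body = ET.fromstring(xml_text)
    value = body.get("background")            # attribute-centric form
    if value is None:
        value = body.findtext("background")   # element-centric form
    return value

print(background_of(ATTRIBUTE_STYLE))
print(background_of(ELEMENT_STYLE))
```

One practical difference is worth noting: an element can itself contain further elements, whereas an attribute value is always plain text, so data that may grow more structure later is often safer as an element.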

Validating XML

A well-formed XML document is one that conforms to the basic syntax rules: there is only one root element, every start tag has a matching end tag, elements do not overlap, and so on. You can make up elements and attributes as you go along, and still end up with a well-formed document. It is usually more useful to validate the document according to an agreed schema, of which XHTML is an example. The schema (with a small “s”) defines what elements may appear, what attributes they may have, and other constraints such as what is optional and what is required. The standard way to do this is with a DTD (Document Type Definition). DTDs have limitations and are not themselves XML documents, so a more powerful alternative called XML Schema (with a capital “S”) has more recently been agreed. A key advantage of XML Schema is support for strong data types, such as string, float, boolean, decimal and dateTime. XML Schema is the future, but DTDs are still valid and will be around for a long time. Valid XML is both well-formed and conformant to a specified schema.
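The difference between well-formed and valid can be sketched in a few lines of Python. The standard xml.etree.ElementTree parser checks well-formedness only, so the "schema" here is a pair of hand-written rule tables standing in for a real DTD or XML Schema. The order and item elements, the id and sku attributes, and the check function are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Toy rules standing in for a real schema (hypothetical, for illustration):
# which child elements each element may contain, and which attributes it requires.
ALLOWED_CHILDREN = {"order": {"item"}, "item": set()}
REQUIRED_ATTRS = {"order": {"id"}, "item": {"sku"}}

def check(xml_text):
    """Return 'not well-formed', 'well-formed but invalid', or 'valid'."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return "not well-formed"
    if root.tag != "order":
        return "well-formed but invalid"
    for element in root.iter():
        if element.tag not in ALLOWED_CHILDREN:
            return "well-formed but invalid"
        if not REQUIRED_ATTRS[element.tag] <= set(element.attrib):
            return "well-formed but invalid"
        if any(child.tag not in ALLOWED_CHILDREN[element.tag] for child in element):
            return "well-formed but invalid"
    return "valid"

print(check('<order id="7"><item sku="A1"/></order>'))
print(check('<order id="7"><widget/></order>'))
print(check('<order id="7"><item sku="A1"></order>'))
```

In real projects the rule tables would be replaced by a DTD or XML Schema document and a validating parser, but the two-stage logic is the same: first the syntax, then the agreed vocabulary.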

Continue to part 2: The parts of an XML document

Copyright Tim Anderson January 2004. All rights reserved.
You are welcome to post links to this article. If you wish to print, distribute, or copy all or part of it, please contact me for permission, which may be subject to a fee.