The primary function of XML is to consume RAM and data communication bandwidth. Presumably it was promoted to its
current frenzy by companies who sell either RAM or bandwidth. Others promoting it have patents they hope to spring on
the public once it is entrenched. XML is the biggest con game going in computers. You probably guessed, I am known for
my rabid dislike of XML.
The Basics
XML is the Extensible Markup Language,
a W3C proposed recommendation. Like HTML, XML is based on SGML, an International Standard (ISO 8879) for creating markup
languages. However, while HTML is a single SGML document type, with a fixed set of element type names (AKA "tag
names"), XML is a simplified profile of SGML: you can use it to define many different document types, each of which
uses its own element type names (instead of HTML’s html, body,
h1, ol, etc.). For example, in XML, you can mark up an online
transaction like this:
Fields that can occur at most once are usually specified as attributes, e.g. unit="box".
Fields that can occur many times are enclosed in tags, e.g. <item>…</item>.
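The attribute-versus-tag convention can be sketched in Java. The order document here is a hypothetical stand-in, not the original example, and the names OrderSketch, SAMPLE and units are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical order document; assumed for illustration only. */
final class OrderSketch {
    // "unit" occurs at most once per item, so it is an attribute.
    // "item" may repeat, so it is an element.
    static final String SAMPLE =
        "<order id=\"1001\">"
      + "<item unit=\"box\">paper clips</item>"
      + "<item unit=\"each\">stapler</item>"
      + "</order>";

    /** Returns the unit attribute of every item element, in document order. */
    static List<String> units(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList items = doc.getElementsByTagName("item");
            List<String> units = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                units.add(((Element) items.item(i)).getAttribute("unit"));
            }
            return units;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the sample", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(units(SAMPLE)); // prints [box, each]
    }
}
```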
Just like HTML, comments begin with <!-- and end with -->. You
can abbreviate <mytag myattrib="something"></mytag>
as <mytag myattrib="something" />.
XML was designed to make it easy to write a parser. I think this was an unfortunate decision. Only a handful of people
in the world will ever write an XML parser, but hundreds of thousands have to compose XML. They should have designed it
to be easy and terse to write. For example, its mandatory quotes around each field are there solely for the convenience
of the parser writer. The tag names in the closing </mytag> are redundant, and should be
optional. They are not needed at all in XML designed solely for machine consumption. Even in human-read XML, they add
nothing on the innermost nest on a single line.
Encoding
UTF-8 is the default encoding, but unfortunately the encoding could be any ruddy encoding ever invented. Using other
encodings destroys XML as an interchange format. Don’t do it!
Schemas
You describe your little XML subgrammar by writing a DTD (Document Type
Definition) file. Optionally, you can include the DTD inline inside your XML file. There
are other more elaborate schema grammars including RELAX NG, Schematron,
XSD and various other schemes.
Validation
Each schema language has its corresponding technique for checking that an XML file is valid against it. If you use a DTD, here
is how to do it.
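A minimal sketch of DTD validation with the JDK's built-in parser, assuming the DTD is included inline. The note/to/body grammar and the class name DtdValidationSketch are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

final class DtdValidationSketch {
    // Hypothetical inline DTD: a note holds exactly one <to> followed by one <body>.
    static final String DOCTYPE =
        "<!DOCTYPE note [\n"
      + "<!ELEMENT note (to, body)>\n"
      + "<!ELEMENT to (#PCDATA)>\n"
      + "<!ELEMENT body (#PCDATA)>\n"
      + "]>\n";

    /** Returns the validity problems found; an empty list means the document is valid. */
    static List<String> validate(String xml) {
        List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored in this sketch */ }
                @Override public void error(SAXParseException e) { problems.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Malformed XML (a fatal error) ends up here.
            problems.add(String.valueOf(e.getMessage()));
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(validate(DOCTYPE + "<note><to>Ann</to><body>hi</body></note>"));
        System.out.println(validate(DOCTYPE + "<note><to>Ann</to></note>"));
    }
}
```

Validity errors are collected rather than thrown, so one pass reports every violation.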
Parsing
There are two popular parsing techniques, SAX (Simple
API for XML), which hands you each field as it parses,
and W3C DOM (Document Object Model)
tree which creates a complete parse tree you can prune and repeatedly scan.
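The difference between the two styles can be sketched with the parsers bundled in the JDK. The sample document and the class name SaxVersusDomSketch are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxVersusDomSketch {
    static final String SAMPLE = "<list><item>a</item><item>b</item><item>c</item></list>";

    /** SAX: the parser calls back at each start tag; no tree is kept in RAM. */
    static int countWithSax(String xml, String tag) {
        int[] count = {0};
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes attributes) {
                        if (qName.equals(tag)) {
                            count[0]++;
                        }
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    /** DOM: the whole tree is built first, then it can be scanned as often as needed. */
    static int countWithDom(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName(tag).getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countWithSax(SAMPLE, "item")); // prints 3
        System.out.println(countWithDom(SAMPLE, "item")); // prints 3
    }
}
```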
I personally detest XML; however, it has caught on like a cocaine wave. It must have some redeeming features.
XML Benefits
- XML is the latest fad. Almost every program is learning to import and export data in XML format, which makes it a lot
easier to glue programs created by different people together.
- It unifies the grammar of thousands of little files so that you don’t have to learn the syntax quirks of each one.
- It is relatively easy to whip up a DTD to describe an XML grammar for some little data file. That DTD is all you need to
generate a parser.
- The XML files can be viewed or composed by humans using a text editor.
- XML is about as simple a grammar as you can get.
- XML can work with almost any 8-bit or 16-bit character set.
- XML is good at handling hierarchical data.
- You can have Pick OS-like data, with arbitrarily long fields, and arbitrarily repeated fields.
- XML is platform independent. It has no big-little endian problems.
- It is possible to parse XML without writing a DTD. This process presumes the XML file is perfectly formed.
- XML search engines can take into account the tag context, e.g. "Washington" inside tag <state>, <president>,
<mountain>, <moviestar>. An XML search engine can show you which tags the term was found in and let you choose the
relevant ones.
- XML settles on Unicode character encoding to allow transmitting data in any language, though it does require clumsy
entity encoding/decoding.
- A program does not need to understand the entire structure of a file. It can just pick out the tags of interest. This
means new tags can be easily added without disturbing existing software that uses the file.
XML Drawbacks
- XML is incredibly fluffy and repetitive. It wastes bandwidth in transmission. You must compress it. Happily, ZIP-style
compression works very well on XML. Unfortunately, you have to fluff it back up to process it, wasting RAM with
unprecedented abandon. In practice no one does compress it.
- It takes up huge amounts of RAM and disk space to store it.
- The DOM parse tree considers every space significant, even spaces between tags, even spaces for indenting, even
trailing spaces on a line, even double spaces embedded in data.
- There is no mechanism to describe the types of the data. To XML, everything is a string. There is no way to specify a
field must be numeric, that it needs two decimal places, that it must represent a date in some range, that it must not
have accented letters, that it be restricted to certain punctuation, or be one of a certain set of legal values. There
are scores of tack-ons trying to fix this and other shortcomings turning the simple XML into a tower of Babel.
- You can’t use the XML files directly, they need to be parsed first. Perhaps some day there will be pre-parsed,
compact, computer-friendly versions of XML. I have heard a rumour that such a beast, called XMLC, has been proposed.
- It uses HTML’s fluffy system of entities such as &lt;.
- There are a raft of recommendations surrounding XML, such as XPath, XPointer, XSL, CSS, XLink and so forth. In the
pipeline are XHTML, Metadata and Namespaces and a Schema system. XML is fast becoming very complicated, because it is
not really standalone. You need added extras to make it usable. Competing standards will have to fight it out. The #1
reason XML caught on was its raging-idiot simplicity. Now it has not even that advantage.
- XML advocates say "Memory is cheap and bandwidth is cheap, so what the hell, let’s squander it." However,
this is not true with handhelds. Memory consumes battery power, the main limit today of handheld capabilities. Bandwidth
consumes radio air time and battery time. We are running out of broadcast frequencies. You can’t manufacture more
of them once the channels are filled, just use them more efficiently. Further, the delays caused by bloated XML packets
consume precious people time, and frustrate the heck out of users completely needlessly.
- In an Applet or a hand held device, memory for data and code is at a premium. You normally carefully massage the data
off-line to be as predigested and as compact as possible, e.g. serialised objects. As well as being fat, XML needs
considerable processing before it can be used. This consumes RAM for both data and code, and battery power to do the
massaging.
- There is no standard way to compress XML. You can use ZIP, which is very CPU and RAM heavy. You can use WBXML (Wireless
Binary XML). The problem is that on receipt it is fluffed back up to regular XML then
parsed, so it has even more parsing overhead than regular XML. There are other compressed formats such as ASN.1
and WML. In practice most XML gets sent in its outrageously fluffy default form. People think XML files are always tiny
little 1K configuration files and so why worry. The point is once a format gets established, it gets used for all sorts
of things the originators would never have dreamed of, like 3 gig image files. ASN.1 schemas now can be used to validate
XML files. XML files with XML schemas can be automatically converted to ASN.1. ASN.1 files can be decoded 100
times faster than XML. I think it is time to start thinking of using ASN.1 instead of XML for large files, or for when
they must be transported over the wire.
- There is a sort of mania to convert everything to XML, even things for which it is only marginally well-suited.
“This obsession of XMLing everything (build scripts, database mapping, setup & configuration,… etc.)
without proper GUI tools to intelligently and efficiently edit and maintain such data contradicts the very
fundamental role of the programmers’ profession.”
~ Hani Hammami
- You pay for forcing all data into the XML mould in the circumlocutions necessary to say everything in XML, e.g.
about 8 lines of code to conditionally copy a file in ANT with XML.
- XML assumes all data in the universe come in the form of a tree. XML becomes a Procrustean bed if the data are not tree-structured.
- XML DTD uses an ugly syntax with gratuitous punctuation. #IMPLIED really means optional,
#PCDATA means string, and <!ATTLIST means attributes.
- There are no standard tag names for XML. Everyone still codes postal addresses differently which means data exchange
still requires custom coding. RDF ontologies address this problem.
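The whitespace drawback listed above is easy to see with the JDK's DOM parser. This is a small sketch, assuming no DTD is present to tell the parser which whitespace is ignorable. The class name DomWhitespaceSketch is made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class DomWhitespaceSketch {
    /** Returns how many child nodes the root element has, whitespace text nodes included. */
    static int rootChildCount(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Same data, but the indented form gains two whitespace-only text nodes.
        System.out.println(rootChildCount("<a><b/></a>"));         // prints 1
        System.out.println(rootChildCount("<a>\n  <b/>\n</a>"));   // prints 3
    }
}
```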
Recommended book: The Theory of the Leisure Class
| | paperback | hardcover |
|---|---|---|
| ISBN13: | 978-0-14-018795-3 | 978-0-8488-1659-9 |
| ISBN10: | 0-14-018795-2 | 0-8488-1659-5 |
| publisher: | Penguin | |
| published: | 1994-02-01 | |
| by: | Thorstein Veblen | |

This is one of the most amusing books I ever read. It is funny by being so spot on. He coined the terms conspicuous consumption and conspicuous waste to explain modern status displays.
XML is an example of
conspicuous waste, waste for waste’s sake. I find it morally repugnant. It reminds me of Roman Emperor Caligula who
took a bite of a peach, tossed it away, then grabbed a fresh one. The authors went out of their way to create a bloated,
ugly syntax.
Using XML to transmit data is the analog of insisting that all code be passed around as triple spaced Java source files,
with added dummy comments, rather than as binary byte code. There is no guarantee a source file is even syntactically
correct. It is impossible to create a syntactically incorrect byte code file. Byte code files can be processed without
time-consuming parsing. In byte code, repeating strings are naturally specified only once. XML, as it stands, suffers
from all those analogous drawbacks and more.
What Should Replace XML?
The characteristics of a good replacement include:
- It needs to be a binary format for compactness. Files have to both be transmitted and stored. Size does matter. People
think in terms of one page XML files, but they potentially could be gigabytes long. If XML becomes an established
interchange format we will pay for the slop in XML trillions of times over. It is not good enough to say XML files will
always be stored in compressed form. In my experience in practice XML files are never compressed. Files should be both
compact and quick to process. XML as it stands is neither.
- It needs to be a binary format to ensure correctness. Human readable formats tempt people to manually compose documents
that are almost syntactically correct, e.g. HTML. This is too sloppy for an interchange format. Consider how much better
chance you have of getting a working program first time if someone sends you Java byte code rather than Java source that
may not even compile.
- It needs to be computer-friendly so that a program can rapidly find the data it wants without having to parse for
delimiters of various flavours. If people want to examine the file detail for debugging, let them use a binary reader/editor.
You could use counted strings rather than delimited strings and use integers to encode the field types so they can be
used directly as table indexes. I would not go quite so far as to ask for a serialised tree of nodes, but push for a
representation that can rapidly be turned into one.
- For giant files, the representation should not have substantially more overhead than the raw binary. There need to be
ways of efficiently expressing repeating patterns. For example, there is no need for delimiters for fixed length data.
There is no need for individual field identifiers for standard groupings of fields. You want to push as much as possible
of the file format description into the descriptor file, out of the data file. The descriptor file need be transmitted
only once. The data file will typically be transmitted again and again. There is no need to make the format simple, just
compact and fast to process. All you need is a simple programmer’s interface to it. Only a handful of
programmers ever need concern themselves with its inner structure.
- XML currently only allows for hierarchical trees of data. There are one or two other types of data out there in the
world (e.g. tables, relations, references, graphs). A universal interchange format should be a little more flexible. If
it is worth doing, it is worth doing right. Obviously the format can’t be expected to handle every conceivable
data structure and obsolete every specialised interchange format ever devised. However, XML is talking big about
becoming universal and should deliver. It can’t even handle ordinary business data which is typically relational
not strictly hierarchical.
- One possible example of the sort of inner structure I am thinking of is my HTML
compactor project.
- The other thing it needs is some information in the DTD about the allowed data types. There need to be the usual bounded
ints, IEEE floats, IEEE doubles, 8-bit encoded strings in some reasonably small number of character sets, with maximum
and minimum lengths, as well as a variety of business types, such as zip, zip+4, state, country, Canusan phone,
international phone, date, time, credit card number, latitude, longitude, etc. When someone is handing you data you need
to know how clean it is. You need to know ahead of time the minimum and maximum enforced limits on various field sizes.
- Ideally the new binary format, or a variant of it would also handle the function HTML does now. This would, in a stroke,
give four benefits:
- Much more compact transmissions, which means much faster transmissions and lighter loaded servers.
- No more syntax errors. In the process of converting to binary format, all syntax errors would have to be corrected, either manually or
automatically. This means the browser no longer has to deal with both the official standard, and also all the
common variant errors that people type. This means pages would always render properly. As it is, pages render properly
only in the browser used by the author which forgives his particular errors. The binary protocol effectively blocks
human HTML coding errors from getting out on the net.
- Faster rendering since the data would arrive already preparsed. The browser would know for example how big tables are
before it had finished reading the entire file, and so could start rendering the top part of the document accurately
immediately.
- Consider the total dollars invested in equipment in the world to transmit HTML, including servers, satellite links,
fibre optic links, cable connections… In a stroke, you would double the capacity of that equipment to deliver
HTML, simply by switching to a binary delivery format.
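The counted-string idea from the list above can be sketched with the JDK's DataOutputStream. The record layout and field codes here are hypothetical, chosen only for illustration: a 2-byte field count, then for each field a 1-byte integer type code and a length-prefixed string.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class CountedStringSketch {
    // Hypothetical field-type codes: small integers that index straight into a table.
    static final int FIELD_CITY = 0;
    static final int FIELD_STATE = 1;
    static final String[] FIELD_NAMES = {"city", "state"};

    /** Writes a field count, then each field as a type byte plus a counted string. */
    static byte[] encode(int[] types, String[] values) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeShort(types.length);
            for (int i = 0; i < types.length; i++) {
                out.writeByte(types[i]);
                out.writeUTF(values[i]); // 2-byte length prefix, then the bytes
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the record back without ever scanning for a delimiter. */
    static String decode(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int fieldCount = in.readUnsignedShort();
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < fieldCount; i++) {
                int type = in.readUnsignedByte();
                String value = in.readUTF();
                if (i > 0) {
                    sb.append(", ");
                }
                sb.append(FIELD_NAMES[type]).append('=').append(value);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] record = encode(new int[] {FIELD_CITY, FIELD_STATE}, new String[] {"Seattle", "WA"});
        System.out.println(record.length);  // prints 17
        System.out.println(decode(record)); // prints city=Seattle, state=WA
    }
}
```

The same two fields written as <city>Seattle</city><state>WA</state> take 37 characters.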
One possible candidate for the XML replacement job is the Java serialised object format. It can handle just about any
data structure imaginable. It is platform independent. It has a simple DTD — Java source code for the
corresponding class. Some claim it is Java-only. Not so. It is no more difficult for C++ to parse than any other similar
newly concocted protocol. It is not tied to any hardware or OS. It is just that Java has a head start implementing it.
Java can implement it with no extra overhead.
There have been some efforts made to patch up the shortcomings of XML, in fact there are dozens of them. XML is no
longer simple. It is a raggedy patchwork quilt. People were sucked in by the initial simplicity, then discovered
that it was not really all that useful in its simple form. Schema was added to allow specifying types (but still only
permitting strings). Yes we need a standard interchange format, but XML was only a back of the envelope stab at it. XML
was destined to fail since it totally ignored so many factors in coming up with a good design.
One such effort is the Virtual Token Descriptor (VTD).
A VTD record is a 64-bit integer that encodes the starting offset, length, type and nesting depth of a token in an XML
document. Because VTD records don’t contain data fields, they work alongside of the original XML document, which
is maintained intact in memory by the processing model.
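A record of that kind can be sketched with plain bit packing. The field widths below (32-bit offset, 20-bit length, 4-bit type, 8-bit depth) are assumed for illustration and are not taken from the VTD specification. The class name TokenRecordSketch is also made up.

```java
final class TokenRecordSketch {
    // Hypothetical token type codes.
    static final int TYPE_ELEMENT = 0;
    static final int TYPE_TEXT = 1;

    /** Packs the four fields into one long: depth | type | length | offset. */
    static long pack(int offset, int length, int type, int depth) {
        if (offset < 0 || length < 0 || length > 0xFFFFF
                || type < 0 || type > 0xF || depth < 0 || depth > 0xFF) {
            throw new IllegalArgumentException("field out of range for this layout");
        }
        return ((long) depth << 56) | ((long) type << 52) | ((long) length << 32) | offset;
    }

    static int offset(long record) { return (int) (record & 0xFFFFFFFFL); }
    static int length(long record) { return (int) ((record >>> 32) & 0xFFFFF); }
    static int type(long record)   { return (int) ((record >>> 52) & 0xF); }
    static int depth(long record)  { return (int) ((record >>> 56) & 0xFF); }

    /** The record holds no data itself; the text is sliced out of the original document. */
    static String text(String document, long record) {
        return document.substring(offset(record), offset(record) + length(record));
    }

    public static void main(String[] args) {
        String document = "<a><b>hi</b></a>";
        long record = pack(6, 2, TYPE_TEXT, 2); // the text token "hi"
        System.out.println(text(document, record)); // prints hi
    }
}
```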
Due to the stupidity, duplicity and/or greed of those promoting XML, we will likely be stuck with some committee-patched
variant of it forever — something that will make even HTML look clean. We need a common data interchange format,
but not so inept.
DTD
You need to compose a DTD file that describes the format of the XML file. The <!ELEMENT
statement is used to list the various tags you will use, and which tags may be used inside which tags, and how often and
in which order. The <!ATTLIST statement is used to list the various attributes (mandatory
and optional) of each tag. The <!ENTITY statement lets you make up your own abbreviations.
Here is a simple example:
DTD:
<!ELEMENT square EMPTY>
<!ATTLIST square width CDATA "0">
The CDATA means the value of the field is a string.
XML:
<square width="100"></square>
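A minimal sketch of what the "0" in the ATTLIST does, assuming the DTD is embedded inline and the JDK's validating DOM parser is used. The class name SquareDtdSketch is made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class SquareDtdSketch {
    // The DTD from the example above, embedded inline as an internal subset.
    static final String DOCTYPE =
        "<!DOCTYPE square [\n"
      + "<!ELEMENT square EMPTY>\n"
      + "<!ATTLIST square width CDATA \"0\">\n"
      + "]>\n";

    /** Parses one square element against the DTD and returns its width attribute. */
    static String width(String squareXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXParseException {
                    throw e; // treat validity errors as failures
                }
            });
            return builder.parse(new InputSource(new StringReader(DOCTYPE + squareXml)))
                    .getDocumentElement()
                    .getAttribute("width");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(width("<square width=\"100\"></square>")); // prints 100
        System.out.println(width("<square/>"));                       // prints 0
    }
}
```

When the attribute is omitted, the parser supplies the default declared in the DTD.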
Schema
A schema is a document that describes what constitutes a legitimate XML document. It might be very generic, describing
all XML documents, or some particular class of XML documents, say ones describing an invoice for the XYZ company. The
original XML schema was called DTD, borrowed from SGML. It was clumsy and did not allow very tight
specification. It basically just let you specify the names of the tags and attributes. Since then there have been
several other flavours of schema: RELAX NG, Schematron and a
new one from W3C called XML Schema.
DTDs look nothing like XML itself. XML Schema is itself a flavour of XML. XML Schema is a major advance over DTD. It is
described in three documents: Primer, Structures and Data Types. It can define
datatypes, ranges, enumerations, dates and complex datatypes to specify much more rigidly what constitutes a valid XML file.
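A minimal sketch of XML Schema validation with the JDK's SchemaFactory and Validator classes. The one-element price schema is hypothetical: a non-negative decimal with at most two decimal places.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

final class XsdValidationSketch {
    // Hypothetical schema: <price> must be a decimal >= 0 with at most 2 fraction digits.
    static final String XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
      + "<xs:element name=\"price\">"
      + "<xs:simpleType><xs:restriction base=\"xs:decimal\">"
      + "<xs:minInclusive value=\"0\"/>"
      + "<xs:fractionDigits value=\"2\"/>"
      + "</xs:restriction></xs:simpleType>"
      + "</xs:element>"
      + "</xs:schema>";

    /** Returns true if the document is valid against the schema above. */
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the default error handler throws on the first violation
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<price>19.99</price>"));  // prints true
        System.out.println(isValid("<price>cheap</price>"));  // prints false
        System.out.println(isValid("<price>19.999</price>")); // prints false
    }
}
```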
Awkward Characters
XML has a similar problem to HTML with reserved characters. What if < incidentally
appears in your data? It would look like the beginning of some </end> tag. There are
only two truly awkward characters, namely < and &, and you deal with them the same way you do in
HTML, by encoding them as entity references, namely &lt; and &amp;. (They
are not called entities in XML since that term is already taken to mean a group of data.)
HTML has scores of entities whereas XML has only five:
&lt; ( < ), &amp; ( & ), &gt; ( > ), &quot; ( " ), &apos; ( ' ).
All of the entity references are optional except for &lt; and &amp;.
But what about awkward non-ASCII characters such as é and Ω
and ⇔? There are six ways around the restriction that XML does not support the full set
of HTML character entity references.
- If you use UTF-8 encoding, you can use any Unicode characters plain without entification.
- If you use an 8-bit encoding such as ISO-8859-1, you can stick to just 256 characters
defined in that encoding.
- You could use decimal NCRs (Numeric Character References),
e.g. &#8364; for the euro sign €. Values of
numeric character references are interpreted as Unicode characters, no matter what encoding you use for your
document. To be perverse, you could use decimal numeric character references for the basic entity references, i.e.
&#60; ( < ), &#38; ( & ), &#62; ( > ), &#34; ( " ), &#39; ( ' ).
- You could write a DTD to create the additional alphabetic character entity references you need, e.g. &euro;.
- You could use hexadecimal NCRs,
e.g. &#x20ac; for the euro sign €. Again the values
of numeric character references are interpreted as Unicode characters, no matter what encoding you use for your
document.
- If you take a depraved pleasure in deformity, you could use the CDATA sandwich. Place pretty
well whatever data you want, including raw (un-entified) <, >
and &, within a bizarre sandwich of characters, namely <![CDATA[ … ]]>,
e.g. <caption><![CDATA[Rah! <><><> Rah! & all that.]]></caption>
Handling awkward characters is a concern if:
- You compose XML “by hand” with a text editor.
- You are developing code and read XML files directly.
- You write code to generate XML directly without using any sort of XML package.
Otherwise, the XML package will transparently handle awkward characters for you both on writing and reading, so you can
forget about them.
UTF-8 files using the basic five character-entity encodings, or ISO-8859-1
with the basic five character entities (possibly excluding &apos;) plus decimal NCRs,
will create the files easiest to read and compose manually, XML’s saving grace.
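If you do generate XML without a package, the escaping might be sketched as follows. It assumes the basic five entity references plus decimal NCRs for everything outside ASCII. The class and method names are made up, and the sketch does not reject the control characters that XML 1.0 forbids outright.

```java
final class XmlEscapeSketch {
    /** Escapes text for use in element content or in a quoted attribute value. */
    static String escape(String raw) {
        StringBuilder sb = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); ) {
            int cp = raw.codePointAt(i); // handles surrogate pairs as one character
            i += Character.charCount(cp);
            switch (cp) {
                case '<':  sb.append("&lt;");   break;
                case '&':  sb.append("&amp;");  break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:
                    if (cp < 0x80) {
                        sb.appendCodePoint(cp); // plain ASCII passes through
                    } else {
                        sb.append("&#").append(cp).append(';'); // decimal NCR
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u20AC is the euro sign.
        System.out.println(escape("Rah! <> & \u20AC")); // prints Rah! &lt;&gt; &amp; &#8364;
    }
}
```

Because every non-ASCII character becomes an NCR, the output reads the same whatever encoding the file is saved in.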
XML Serialization
There is another form of serialization that produces XML instead of binary ObjectOutputStreams.
It uses the java.beans.XMLEncoder class. It does not use the Serializable
interface, but writes ordinary Objects that have JavaBean-style getter and setter methods
and a no-arg constructor. It does not persist fields, but rather properties (in the Delphi sense, not
System.setProperty), implemented with get/set. Basically it looks for all the
getXXX methods, and calls them, and emits a stream of tags named after the properties. To reconstitute, XMLDecoder
instantiates an Object of the class, and calls the corresponding setXXX
methods from the values in the XML stream. The source and target classes need not have matching code the way they do
with true serialization. Most trouble using this feature comes from thinking it behaves like ordinary serialization.
They have almost nothing in common.
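A minimal round-trip sketch with java.beans.XMLEncoder and XMLDecoder. The Square bean is hypothetical. Note that it does not implement Serializable, and that the encoder omits properties still at their default value.

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class XmlBeanSketch {
    /** Hypothetical bean: public class, public no-arg constructor, get/set pair. */
    public static class Square {
        private int width;

        public Square() { }

        public int getWidth() { return width; }

        public void setWidth(int width) { this.width = width; }
    }

    /** Writes the bean's properties as XML and returns the text. */
    static String toXml(Object bean) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (XMLEncoder encoder = new XMLEncoder(bytes)) {
            encoder.writeObject(bean);
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Instantiates the class named in the XML and calls its setters. */
    static Object fromXml(String xml) {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        try (XMLDecoder decoder = new XMLDecoder(new ByteArrayInputStream(bytes))) {
            return decoder.readObject();
        }
    }

    public static void main(String[] args) {
        Square square = new Square();
        square.setWidth(100);
        String xml = toXml(square);
        System.out.println(xml);
        Square copy = (Square) fromXml(xml);
        System.out.println(copy.getWidth());
    }
}
```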
Digitally Signing XML
You would think XML would be a nightmare for digital signing, with its variable amounts of whitespace, and variable
newline characters and lax attitude toward the encoding. However, W3C
has invented a slick scheme to let you digitally sign various fields in an XML document (by specifying #xxxx
HTML-like targets) and embed the signature in the document. You can also sign documents external to the XML file. The
secret is canonicalisation. You use an
algorithm to tidy the document to standard form. The transforms leave embedded, leading and trailing whitespace on fields
intact, but collapse the rest to standard patterns. The scheme allows for various canonicalisation transforms and
various signing algorithms. As you would expect from XML, the signature block is gargantuan.
Apache has written classes to make the work easier.
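A sketch of an enveloped signature using the JDK's own javax.xml.crypto.dsig package rather than the Apache classes. The throwaway RSA key pair, the sample payment document and the class name are assumptions for illustration.

```java
import java.io.StringReader;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.util.Collections;
import javax.xml.crypto.dsig.CanonicalizationMethod;
import javax.xml.crypto.dsig.DigestMethod;
import javax.xml.crypto.dsig.Reference;
import javax.xml.crypto.dsig.SignedInfo;
import javax.xml.crypto.dsig.Transform;
import javax.xml.crypto.dsig.XMLSignature;
import javax.xml.crypto.dsig.XMLSignatureFactory;
import javax.xml.crypto.dsig.dom.DOMSignContext;
import javax.xml.crypto.dsig.dom.DOMValidateContext;
import javax.xml.crypto.dsig.spec.C14NMethodParameterSpec;
import javax.xml.crypto.dsig.spec.TransformParameterSpec;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class XmlSignatureSketch {
    // Standard algorithm URI for RSA with SHA-256.
    static final String RSA_SHA256 = "http://www.w3.org/2001/04/xmldsig-more#rsa-sha256";

    /** Embeds a Signature element covering the whole document (URI ""). */
    static void sign(Document doc, KeyPair keys) throws Exception {
        XMLSignatureFactory factory = XMLSignatureFactory.getInstance("DOM");
        Reference wholeDocument = factory.newReference(
            "",
            factory.newDigestMethod(DigestMethod.SHA256, null),
            Collections.singletonList(
                factory.newTransform(Transform.ENVELOPED, (TransformParameterSpec) null)),
            null, null);
        SignedInfo info = factory.newSignedInfo(
            factory.newCanonicalizationMethod(
                CanonicalizationMethod.INCLUSIVE, (C14NMethodParameterSpec) null),
            factory.newSignatureMethod(RSA_SHA256, null),
            Collections.singletonList(wholeDocument));
        factory.newXMLSignature(info, null)
               .sign(new DOMSignContext(keys.getPrivate(), doc.getDocumentElement()));
    }

    /** Checks the embedded signature against the public key. */
    static boolean verify(Document doc, KeyPair keys) throws Exception {
        Node signature = doc.getElementsByTagNameNS(XMLSignature.XMLNS, "Signature").item(0);
        DOMValidateContext context = new DOMValidateContext(keys.getPublic(), signature);
        return XMLSignatureFactory.getInstance("DOM")
                .unmarshalXMLSignature(context)
                .validate(context);
    }

    /** Signs the document, optionally alters it afterwards, and reports whether it verifies. */
    static boolean signAndVerify(String xml, boolean tamperAfterSigning) {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            KeyPair keys = generator.generateKeyPair();
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for XML signatures
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            sign(doc, keys);
            if (tamperAfterSigning) {
                doc.getDocumentElement().setAttribute("amount", "999999");
            }
            return verify(doc, keys);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<payment amount=\"10\">rent</payment>";
        System.out.println(signAndVerify(xml, false));
        System.out.println(signAndVerify(xml, true));
    }
}
```

The untouched document should verify, and the one altered after signing should not, because its canonical form no longer matches the signed digest.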
Books
Recommended book: Java and XML
| | paperback |
|---|---|
| ISBN13: | 978-0-596-10149-7 |
| ISBN10: | 0-596-10149-X |
| publisher: | O’Reilly |
| published: | 2006-12-08 |
| by: | Brett McLaughlin, Justin Edelson |

Covers SAX2, DTDs, XML Schema, XSL, JDOM, JAXP, JAXB, RSS and remote procedure calls with XML.
Learning More
Sun’s Javadoc on the Schema class
Sun’s Javadoc on the SchemaFactory class
Sun’s Javadoc on the Validator class
Sun’s Javadoc on the XMLConstants class
Sun’s Javadoc on the SAXParser class
Sun’s Javadoc on the XMLEncoder class