Saturday, December 10, 2005

Foundations of XML: Validating XML Documents (1)

Apologies for my long silence. Things got unexpectedly hectic for me in the past two to three weeks for various reasons. Anyway, as I promised in my previous post, I will further talk about foundations of XML, should the reader be interested. I received two comments expressing interests, one from Jon Cohen, which assure me that at least somebody is going to read this post :-). So, on with our discussion on foundations of XML. Let's spend some time now talking about validation of XML documents. [I am still preparing my post on navigating XML documents and XPath.] In this post, I will talk about a practical validation language that goes by the name of DTD (Document Type Definition).

Roughly speaking, the aim of validation is to ensure that an XML document "makes sense". More precisely, the problem is that, given an XML document (i.e. a labeled unranked tree T) and a statement f in some specification language L, we want to test whether T satisfies f. The most frequently used language L for validation of XML documents is DTD (Document Type Definition). Here, we use a formal model of DTDs that does not incorporate XML "data values", e.g., texts within XML tags (see my previous post). Let us start with an example. As always, we fix a finite alphabet Σ of discourse. Continuing with our example in the previous post, suppose we want to test whether an unranked tree T satisfies the properties:

  1. its root is labeled "books",
  2. every node labeled "books" may have zero or more children labeled "book", and
  3. every node labeled "book" must have four children labeled "title", "author", "year", and "publisher" (in this order), optionally followed by one child labeled "rank".

A DTD that asserts this property is the following:

books -> book*
book -> title, author, year, publisher, rank?
title -> ε
author -> ε
year -> ε
publisher -> ε
rank -> ε

If you have studied regular languages, regular expressions, and context-free grammars, then chances are that you wouldn't have any problem understanding what the previous DTD says. Loosely, each rule (i.e. V -> e, where V is a "label" in Σ and e is a regular expression over Σ) specifies that a V-labeled node in an unranked tree T has children with labels c1, ..., cn that can be generated by e. The Kleene star '*' stands for "zero or more". The question mark '?' abbreviates "zero or one". Commas ',' mean "followed by" in the sense of string concatenation. Finally, the symbol 'ε' is the empty string. Therefore, formally, a DTD over Σ is a pair (r,P) where r is the root symbol in Σ and P is a set of rules over Σ such that r does not appear in the "body" of any rules in P (i.e. on the right hand side).

Although DTD is a nice practical specification language for XML, it does not satisfy some nice properties of both practical and theoretical interests. For example, the sets of trees that can be recognized by DTDs are not closed under unions and complementations. [Recall that regular languages are closed under unions, intersection, and complementation (see here for more closure properties of regular languages).] Also, DTDs do not recognize "contexts" (e.g. how do you create a DTD that uses the tag "name" to mean two different things such as "name" of a "book", which should have no children, and "name" of a "person", which should have exactly two children labeled "First Name" and "Last Name"?). Nonetheless, this problem does not seem to bother programmers that much. This explains why DTD is still the most popular XML validation language. However, there are some XML validation languages that address the aforementioned problems. I believe the most popular such language is XSD (XML Schema Definition). We shall consider a formal model of XSD next. We shall later see that XSD has the full power of the yardstick logic for XML validation, which is monadic second-order logic.

4 comments:

Anonymous said...

Very nice post. I would only add that XSD also allows the content type of a node to be defined such as string, date, integer, etc... which DTD doesn't. Also other things such as more evolved cardinality.

Thanks for posting. Keep it up mate!

anthonywlin said...

Thanks for your kind words :-)

I'll try :-)

Anonymous said...

Hi Anthony,

This is off-topic for this post, but I just got the new issue of SIGACT News in the mail, and I noticed that it features an article by Neil Immerman giving an overview of current directions in descriptive complexity.

anthonywlin said...

Hi Kurt,

Thanks for letting me know about Immerman's article in the latest issue of SIGACT Newsletter. I will try to review in the next few days.