Roughly speaking, the aim of validation is to ensure that an XML document "makes sense". More precisely, the problem is that, given an XML document (i.e. a labeled unranked tree T) and a statement f in some specification language L, we want to test whether T satisfies f. The most frequently used language L for validation of XML documents is DTD (Document Type Definition). Here, we use a formal model of DTDs that does not incorporate XML "data values", e.g., texts within XML tags (see my previous post). Let us start with an example. As always, we fix a finite alphabet Σ of discourse. Continuing with our example in the previous post, suppose we want to test whether an unranked tree T satisfies the properties:
- its root is labeled "books",
- every node labeled "books" may have zero or more children labeled "book", and
- every node labeled "book" must have four children labeled "title", "author", "year", and "publisher" (in this order), optionally followed by one child labeled "rank".
A DTD that asserts this property is the following:
books -> book*
book -> title, author, year, publisher, rank?
title -> ε
author -> ε
year -> ε
publisher -> ε
rank -> ε
If you have studied regular languages, regular expressions, and context-free grammars, then chances are that you wouldn't have any problem understanding what the previous DTD says. Loosely, each rule (i.e. V -> e, where V is a "label" in Σ and e is a regular expression over Σ) specifies that a V-labeled node in an unranked tree T has children with labels c1, ..., cn that can be generated by e. The Kleene star '*' stands for "zero or more". The question mark '?' abbreviates "zero or one". Commas ',' mean "followed by" in the sense of string concatenation. Finally, the symbol 'ε' is the empty string. Therefore, formally, a DTD over Σ is a pair (r,P) where r is the root symbol in Σ and P is a set of rules over Σ such that r does not appear in the "body" of any rules in P (i.e. on the right hand side).
Although DTD is a nice practical specification language for XML, it does not satisfy some nice properties of both practical and theoretical interests. For example, the sets of trees that can be recognized by DTDs are not closed under unions and complementations. [Recall that regular languages are closed under unions, intersection, and complementation (see here for more closure properties of regular languages).] Also, DTDs do not recognize "contexts" (e.g. how do you create a DTD that uses the tag "name" to mean two different things such as "name" of a "book", which should have no children, and "name" of a "person", which should have exactly two children labeled "First Name" and "Last Name"?). Nonetheless, this problem does not seem to bother programmers that much. This explains why DTD is still the most popular XML validation language. However, there are some XML validation languages that address the aforementioned problems. I believe the most popular such language is XSD (XML Schema Definition). We shall consider a formal model of XSD next. We shall later see that XSD has the full power of the yardstick logic for XML validation, which is monadic second-order logic.