Friday, August 24, 2012

.NET Regular Expressions and XML NAME tokens


Hello everyone. I'm back again. Before we get started, I just kind of wanted to put out a notice. I'm working on my blog template. So if you come here and see that stuff doesn't look exactly the way it should, well, you've been warned. This includes the mobile template (so far, I'm not impressed with Blogger's mobile templates).

Wow, so many articles in such a short amount of time! Yes, I'm being facetious ;). Let's hope I can continue this pattern. After all, I started this blog 3 years ago to write down things I learn and would like not to forget.

Well, this blog post certainly fits the bill. I know I've done some research on this once before, but I failed to write it down. So I wasted some more time re-researching this problem. What a waste. So, to help you not waste your time, I hope that you find this blog post helpful. Enough already, let's dive in!

Extensible Markup Language NAME Tokens

I first learned eXtensible Markup Language (XML) back in 2000-something-or-other and haven't used it much. A lot of people use XML everywhere for anything. I'm not a big believer in that, actually. It's a great tool and it makes sense to use it where and when it needs to be used.

Having said that, there have been 3 occasions over the last eight months where I had to brush up on my XML skills and really know it well. I'll put a shameless plug in for a book that, while old, has helped me tremendously (I haven't seen its equal): XML Primer Plus by Nicholas Chase.

Like I said, the book is a bit dated now, but it still has very valuable information in it and not much has changed. Having said that, one thing that has changed is the set of allowable characters in NMTOKENs. I recently needed to validate values I read from user input to ensure that they were in fact valid id attribute values. I naïvely assumed the following regular expression would do the job:

if ( ... && Regex.IsMatch(id, @"^\d.*|\w*[\p{P}-[_]]+.*$")) { ... }

So, basically, an XML NMTOKEN can have anything in it except punctuation characters (except for the underscore). I just realized as I'm typing this that I completely forgot to check if the NMTOKEN started with a number, which also isn't valid.

This might have been fine back in 1999, but the XML standard has changed since then. It's still version 1.0 (well, there is a version 1.1, but let's not go there), but it's the 5th edition.
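To make the shortcomings of that first attempt concrete, here is a minimal sketch. It assumes the pattern is meant to use single backslashes inside the verbatim string and that a match means "reject this id"; the class and method names are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class NaiveIdCheck
{
    // The naive pattern: a leading digit, or any punctuation character
    // other than the underscore, marks the id as invalid.
    private static readonly Regex LooksInvalid =
        new Regex(@"^\d.*|\w*[\p{P}-[_]]+.*$");

    // Hypothetical helper: true when the naive pattern rejects the id.
    public static bool IsRejected(string id)
    {
        return LooksInvalid.IsMatch(id);
    }
}
```

Under these assumptions, NaiveIdCheck.IsRejected("first-name") is true even though the hyphen is a legal name character, and NaiveIdCheck.IsRejected("a$b") is false even though the dollar sign is not, because .NET classifies "$" as a symbol rather than punctuation.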

The current XML specification defines a NAME token (to which ID tokens must adhere and which is a specialized form of an NMTOKEN) as follows:

[Definition: A Name is an Nmtoken with a restricted set of initial characters.] Disallowed initial characters for Names include digits, diacritics, the full stop and the hyphen.

Names beginning with the string "xml", or with any string which would match (('X'|'x') ('M'|'m') ('L'|'l')), are reserved for standardization in this or future versions of this specification.

Note:

The Namespaces in XML Recommendation [XML Names] assigns a meaning to names containing colon characters. Therefore, authors should not use the colon in XML names except for namespace purposes, but XML processors must accept the colon as a name character.

The first character of a Name MUST be a NameStartChar, and any other characters MUST be NameChars; this mechanism is used to prevent names from beginning with European (ASCII) digits or with basic combining characters. Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters. The intention is to be inclusive rather than exclusive, so that writing systems not yet encoded in Unicode can be used in XML names. See J Suggestions for XML Names for suggestions on the creation of names.

Document authors are encouraged to use names which are meaningful words or combinations of words in natural languages, and to avoid symbolic or white space characters in names. Note that COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and MIDDLE DOT are explicitly permitted.

The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are more useful as delimiters in contexts where XML names are used outside XML documents; providing this group gives those contexts hard guarantees about what cannot be part of an XML name. The character #x037E, GREEK QUESTION MARK, is excluded because when normalized it becomes a semicolon, which could change the meaning of entity references.

Names and Tokens

Rule   Production
[4]    NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a]   NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5]    Name ::= NameStartChar (NameChar)*

At this point, I thought, "No problem, I'll just use those productions in a Regular Expression while making sure the token doesn't start with xml or some variation thereof, and that'll be that." During unit testing, I discovered that out of 29 invalid sequences that I tried (which were all in the ASCII portion, so this unit test was not comprehensive in any way for the time being), only 7 were flagged. What's going on here?

Well, I read the MSDN documentation for .NET Regular Expressions. Here was the expression I used that was failing the unit test:

@"^(?i:(?!xml))(?inx:[a-z]|_|:|[\u00C0-\u00D6]|[\u00D8-\u00F6]|[\u00F8-\u02FF]|[\u0370-\u037D]|[\u037F-\u1FFF]|[\u200C-\u200D]|[\u2070-\u218F]|[\u2C00-\u2FEF]|[\u3001-\uD7FF]|[\uF900-\uFDCF]|[\uFDF0-\uFFFD]|[\u10000-\uEFFFF])(?inx:[a-z]|_|:|[\u00C0-\u00D6]|[\u00D8-\u00F6]|[\u00F8-\u02FF]|[\u0370-\u037D]|[\u037F-\u1FFF]|[\u200C-\u200D]|[\u2070-\u218F]|[\u2C00-\u2FEF]|[\u3001-\uD7FF]|[\uF900-\uFDCF]|[\uFDF0-\uFFFD]|[\u10000-\uEFFFF]|-|\.|\d|\u00B7|[\u0300-\u036F]|[\u203F-\u2040])*$"

Now, I must admit, .NET's Regular Expression syntax is a bit funky. Anyway, the (?i:(?!xml)) says if the string case-insensitively starts with XML, then it doesn't match. Otherwise, use the production rules as shown above from the W3C.
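The reserved-prefix check can be tried in isolation. The sketch below pulls just that lookahead out of the expression; the class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class ReservedPrefix
{
    // (?i: ... ) turns on case-insensitive matching for the group only.
    // (?!xml) is a zero-width negative lookahead: it fails when the next
    // three characters are "xml" and consumes nothing when it succeeds.
    private static readonly Regex NotReserved = new Regex(@"^(?i:(?!xml))");

    // Hypothetical helper: true when the name begins with xml in any casing.
    public static bool StartsWithReservedPrefix(string name)
    {
        return !NotReserved.IsMatch(name);
    }
}
```

With this helper, "xmlFoo" and "XmLbar" count as reserved while "myxml" does not, since the lookahead is only evaluated at the start of the string.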

The catch is, .NET strings are encoded in UTF-16 (and therefore, the escape sequence \u can only be followed by 4 hexadecimal digits), but according to the production for a NameStartChar I need to detect some characters outside of this range (the [\u10000-\uEFFFF] character class inside of the Regular Expression). Now, I had done some research on internationalization and Unicode, but I never really had to pay much attention to it. I'm the one who's mostly using my software. However, this is not the case this time. This software may be used by firms doing banking all over the country and around the world. So I definitely need to check that the ID string I'm getting is valid. During my research, I came across Joel Spolsky's blog on Unicode and character encoding. This was a good start (and you should read this if you haven't already), but I needed to know more. Specifically, how can I encode a Unicode character that falls outside of the range 0 - 65535 into a 16-bit value?
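It may help to see why the failing expression let so many invalid ASCII characters through. Because \u consumes exactly four hex digits, the class [\u10000-\uEFFFF] is read as the character \u1000, then the range from the digit 0 to \uEFFF, then a literal F. The following sketch isolates that class; the class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class BrokenRangeDemo
{
    // Intended: code points U+10000 through U+EFFFF.
    // Actual:   \u1000, the range '0' (U+0030) through \uEFFF, and 'F'.
    private static readonly Regex Broken = new Regex(@"^[\u10000-\uEFFFF]$");

    // Hypothetical helper: does a string match the broken class?
    public static bool Matches(string s)
    {
        return Broken.IsMatch(s);
    }
}
```

Here BrokenRangeDemo.Matches(";") is true because the semicolon (U+003B) sits inside the accidental range, while BrokenRangeDemo.Matches("!") is false because the exclamation mark (U+0021) falls below it. Only characters below U+0030 had any chance of being flagged.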

UTF-16 Surrogate Pairs

The answer is UTF-16 surrogate pairs. I had heard of these, but I didn't know how to generate them. This article helped me out. Unicode assigns no characters to the code points in the range 0xD800 - 0xDFFF. UTF-16 reserves this range for the surrogate pairs that represent Unicode character code points above 65535. The algorithm is really simple and is outlined below.

  1. Take the hex value of the Unicode character to encode as UTF-16 and subtract 0x10000 from it.
  2. Take the result from step 1 above and shift it right 10 bits (0xA).
  3. Take the result from step 2 and add 0xD800. This gives you the first surrogate of the surrogate pair.
  4. Again, taking the result from step 1, AND the value with 0x3FF to mask off the upper 10 bits.
  5. Add 0xDC00 to the result from step 4 above. This represents the second surrogate of the UTF-16 surrogate pair.
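The five steps above can be sketched as a small C# helper. The class and method names are my own, and the range check is an added safeguard that is not part of the steps.

```csharp
using System;

public static class Utf16
{
    // Returns the first (high) and second (low) surrogate for a code point
    // in the range U+10000 to U+10FFFF.
    public static char[] ToSurrogatePair(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
        {
            throw new ArgumentOutOfRangeException("codePoint");
        }

        int offset = codePoint - 0x10000;      // step 1
        int high = (offset >> 10) + 0xD800;    // steps 2 and 3
        int low = (offset & 0x3FF) + 0xDC00;   // steps 4 and 5
        return new[] { (char)high, (char)low };
    }
}
```

The two chars returned can be passed to new string(...) to get the same two-char string that the framework's char.ConvertFromUtf32 produces for that code point.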

The resulting 16-bit surrogates from above properly encode Unicode character code points above 65535 in UTF-16. Let's run through a quick example.

Encoding a Unicode Character Code Point Above 65535 into UTF-16

Let's start with the Unicode character code point U+18657. (I don't know what this character is, I just chose something at random.) Following our algorithm above:

  1. 0x18657 - 0x10000 = 0x8657
  2. 0x8657 >> 0xA = 0x21
  3. 0x21 + 0xD800 = 0xD821 (This is the first surrogate of the surrogate pair.)
  4. 0x8657 & 0x3FF = 0x257
  5. 0x257 + 0xDC00 = 0xDE57 (And this is the second surrogate of the surrogate pair.)

So, from the algorithm above, the Unicode character code point U+18657 can be encoded into UTF-16 using the surrogate pair U+D821 U+DE57.
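The hand calculation can be cross-checked against the framework, which already knows how to do this conversion. The sketch below uses char.ConvertFromUtf32 and char.ConvertToUtf32; the wrapper class and its method names are hypothetical.

```csharp
using System;

public static class SurrogateCheck
{
    // Formats the two UTF-16 code units of a code point above U+FFFF
    // as four-digit hex values separated by a space.
    public static string Describe(int codePoint)
    {
        string s = char.ConvertFromUtf32(codePoint);
        if (s.Length != 2)
        {
            throw new ArgumentException("Code point fits in one UTF-16 unit.", "codePoint");
        }
        return string.Format("{0:X4} {1:X4}", (int)s[0], (int)s[1]);
    }

    // Goes the other way: a surrogate pair back to its code point.
    public static int Combine(char high, char low)
    {
        return char.ConvertToUtf32(high, low);
    }
}
```

SurrogateCheck.Describe(0x18657) returns "D821 DE57", matching the worked example, and SurrogateCheck.Combine('\uD821', '\uDE57') returns 0x18657.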

Putting it All Together

Finally, we come to the end. I needed to replace the character class [\u10000-\uEFFFF] with a construct that uses only valid four-digit \u escape sequences. That involves calculating the range of the surrogate pairs for the character class. Running the algorithm above on the endpoints gives 0xD800 0xDC00 for U+10000 and 0xDB7F 0xDFFF for U+EFFFF, which results in the following pattern to replace the invalid character class: ([\uD800-\uDB7F][\uDC00-\uDFFF]). This pattern will match all UTF-16 encoded Unicode character code points in the range U+10000 to U+EFFFF, which is exactly what we want. Here's the final Regular Expression that will validate an XML ID token:

@"^(?i:(?!xml))(?inx:[a-z]|_|:|[\u00C0-\u00D6]|[\u00D8-\u00F6]|[\u00F8-\u02FF]|[\u0370-\u037D]|[\u037F-\u1FFF]|[\u200C-\u200D]|[\u2070-\u218F]|[\u2C00-\u2FEF]|[\u3001-\uD7FF]|[\uF900-\uFDCF]|[\uFDF0-\uFFFD]|([\uD800-\uDB7F][\uDC00-\uDFFF]))(?inx:[a-z]|_|:|[\u00C0-\u00D6]|[\u00D8-\u00F6]|[\u00F8-\u02FF]|[\u0370-\u037D]|[\u037F-\u1FFF]|[\u200C-\u200D]|[\u2070-\u218F]|[\u2C00-\u2FEF]|[\u3001-\uD7FF]|[\uF900-\uFDCF]|[\uFDF0-\uFFFD]|([\uD800-\uDB7F][\uDC00-\uDFFF])|-|\.|\d|\u00B7|[\u0300-\u036F]|[\u203F-\u2040])*$"
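The surrogate-pair portion can be spot-checked on its own at the edges of the range. This is a rough sketch with hypothetical names; note that char.ConvertFromUtf32 throws for values that are not valid code points, so it only makes sense to call this with real ones.

```csharp
using System.Text.RegularExpressions;

public static class SupplementaryRange
{
    // A high surrogate for U+10000..U+EFFFF followed by any low surrogate.
    private static readonly Regex InRange =
        new Regex(@"^[\uD800-\uDB7F][\uDC00-\uDFFF]$");

    // Hypothetical helper: is the code point covered by the pair pattern?
    public static bool IsInNameRange(int codePoint)
    {
        return InRange.IsMatch(char.ConvertFromUtf32(codePoint));
    }
}
```

Both endpoints, U+10000 and U+EFFFF, are accepted. U+F0000 is rejected because its high surrogate is 0xDB80, one past the end of the first class.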

A Few Closing Remarks

First, the second and third Regular Expressions shown in this article could be simplified (e.g. getting rid of all the alternation (|) constructs between character classes and creating one big character class). Second, and most important, this probably isn't exactly what you should normally do. The productions given by the W3C are meant to be inclusive, as they noted in their justification of the productions. But it would probably be more efficient to write a Regular Expression that matches on what should not be present in an XML ID token (though the Regular Expression would be almost as long and complex).

I did test my final Regular Expression as shown above, and it did pass (again, my unit test was not comprehensive with regard to characters outside the 7-bit ASCII character set). However, since I used the production shown in the XML specification, I have no reason to think that this Regular Expression would let an invalid ID token "slip" through.
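As a rough idea of what the first simplification might look like, here is a sketch that folds the 16-bit ranges into one character class per position. It is not the expression I tested; the names are invented, and it deviates in two small ways: it spells out A-Z and 0-9 instead of relying on the i option and \d, and it ends with \z instead of $ so that a trailing newline is not accepted.

```csharp
using System.Text.RegularExpressions;

public static class XmlId
{
    // 16-bit portion of NameStartChar, written as the inside of one class.
    private const string StartBmp =
        @":A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D" +
        @"\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF" +
        @"\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD";

    // Extra characters NameChar allows after the first position.
    private const string ExtraBmp = @"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040";

    // U+10000..U+EFFFF as a surrogate pair. A pair spans two chars,
    // so it has to stay outside the character class as an alternative.
    private const string Supplementary = @"[\uD800-\uDB7F][\uDC00-\uDFFF]";

    private static readonly Regex Pattern = new Regex(
        "^(?i:(?!xml))" +
        "(?:[" + StartBmp + "]|" + Supplementary + ")" +
        "(?:[" + StartBmp + ExtraBmp + "]|" + Supplementary + @")*\z");

    // Hypothetical helper: true when id is a non-reserved XML Name.
    public static bool IsValid(string id)
    {
        return Pattern.IsMatch(id);
    }
}
```

With this sketch, XmlId.IsValid("customer-42") and XmlId.IsValid("_id.1") are true, while "42customer", "xmlThing", and "a;b" are all rejected.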
