XML 1.1 long attributes in Java 6

XML 1.1 is not as widely adopted as XML 1.0 and is recommended for use only by those who need its unique features. I was writing a Java/EMF program that stored serialised objects in XML. The serialisation involved control characters, which XML 1.0 does not allow even as character references, so XML 1.1 seemed a good choice.
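To illustrate why control characters push you towards XML 1.1, here is a minimal sketch that parses the same tiny document under both versions. The class name, the element and attribute names, and the choice of U+0001 as the control character are assumptions made for illustration only; they are not taken from my EMF program.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original program.
class ControlCharDemo {

    // Parses <e v="&#x1;"/> declared with the given XML version
    // and returns the parsed value of the attribute "v".
    static String parseControlChar(String xmlVersion) throws Exception {
        String xml = "<?xml version=\"" + xmlVersion + "\"?><e v=\"&#x1;\"/>";
        final String[] value = new String[1];
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                value[0] = attrs.getValue("v");
            }
        });
        return value[0];
    }

    public static void main(String[] args) throws Exception {
        try {
            parseControlChar("1.0");
            System.out.println("XML 1.0: accepted");
        } catch (SAXException e) {
            // A conforming parser should end up here: U+0001 is not a legal XML 1.0 character.
            System.out.println("XML 1.0: rejected (" + e.getMessage() + ")");
        }
        String value = parseControlChar("1.1");
        System.out.println("XML 1.1: code point " + (int) value.charAt(0));
    }
}
```

A conforming parser should reject the XML 1.0 variant and accept the XML 1.1 one, which is the only reason XML 1.1 entered the picture here.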

To my surprise, I could not parse some documents that I had just serialised. Some attributes went missing, and EMF then failed to load the objects. To check whether the problem was on my side, I tried to find a minimal example that reproduces it. The rest of this post gives the example in detail, but to cut things short, it looks like there is a bug in the Java 6 XML libraries: the problem is not reproducible with the latest version of the Apache Xerces libraries.

Parsing long attributes of XML 1.1

I created an example that parses an XML 1.1 file with long String attributes using the Java 6 SAX parser. If an element with a long attribute also has another (usually shorter) attribute, the value of that other attribute is resolved erroneously. This only happens for XML 1.1; XML 1.0 works correctly.

The Java example below reproduces the problem. It generates an XML file and parses it back using the SAX parser, then prints out the attributes:
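A minimal sketch of such a test, reconstructed from the description in this post. The attribute name `target`, its value `targetAttr`, and the length 8175 come from the text; the class name, the element names, the attribute name `long`, the attribute order, and the use of a temporary file are assumptions, so the exact length that triggers the problem may differ with this layout.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical reproduction class; names and document layout are assumptions.
class LongAttributeDemo {

    // Writes a document whose single child element has a long attribute
    // followed by a short one, declared with the given XML version.
    static File writeDocument(String xmlVersion, int longLength) throws IOException {
        StringBuilder longValue = new StringBuilder(longLength);
        for (int i = 0; i < longLength; i++) {
            longValue.append('a');
        }
        File file = File.createTempFile("long-attr", ".xml");
        file.deleteOnExit();
        Writer out = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");
        try {
            out.write("<?xml version=\"" + xmlVersion + "\" encoding=\"UTF-8\"?>\n");
            out.write("<root><element long=\"" + longValue + "\" target=\"targetAttr\"/></root>\n");
        } finally {
            out.close();
        }
        return file;
    }

    // Parses the file with the default SAX parser and collects every
    // attribute name and value reported by startElement.
    static Map<String, String> parseAttributes(File file) throws Exception {
        final Map<String, String> seen = new LinkedHashMap<String, String>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(file, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                for (int i = 0; i < attrs.getLength(); i++) {
                    seen.put(attrs.getQName(i), attrs.getValue(i));
                }
            }
        });
        return seen;
    }

    public static void main(String[] args) throws Exception {
        for (String version : new String[] {"1.0", "1.1"}) {
            Map<String, String> attrs = parseAttributes(writeDocument(version, 8175));
            System.out.println("XML " + version);
            for (Map.Entry<String, String> e : attrs.entrySet()) {
                String value = e.getValue();
                // Abbreviate the long attribute so the output stays readable.
                String shown = value.length() > 40 ? value.length() + " chars" : value;
                System.out.println("Attr: " + e.getKey() + ", Value: " + shown);
            }
        }
    }
}
```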

When the above example is run, I get the following attribute values:

The size of the long attribute (8175 here) was chosen to exhibit the problematic result. If the long string is shorter (e.g. 8150), the value is resolved correctly: Attr: target, Value: targetAttr. For longer strings, the attribute value seems to slide in the long string’s buffer.

This issue appears only when the file is declared as XML 1.1. If the declaration is changed to XML 1.0, the problem disappears (even for much longer strings) and the attribute value is resolved correctly.

The problem is not limited to the Java SAX parser: the DOM parser suffers from the same issue. In my case, it appeared when using Java 1.6.0_29 on Mac OS X Lion.
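The DOM variant of the check can be sketched in the same way. As before, the class name, the element name, and the attribute name `long` are assumptions; only `target`, `targetAttr`, and the length 8175 come from this post.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical DOM counterpart of the SAX test; names are assumptions.
class DomLongAttributeCheck {

    // Builds a document in memory, parses it with the default DOM parser,
    // and returns the value of the short "target" attribute.
    static String readTarget(String xmlVersion, int longLength) throws Exception {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"").append(xmlVersion).append("\" encoding=\"UTF-8\"?>");
        xml.append("<root long=\"");
        for (int i = 0; i < longLength; i++) {
            xml.append('a');
        }
        xml.append("\" target=\"targetAttr\"/>");

        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(xml.toString().getBytes("UTF-8")));
        return doc.getDocumentElement().getAttribute("target");
    }

    public static void main(String[] args) throws Exception {
        System.out.println("XML 1.0 target: " + readTarget("1.0", 8175));
        System.out.println("XML 1.1 target: " + readTarget("1.1", 8175));
    }
}
```

On an unaffected parser both lines should show `targetAttr`; any difference between them points at the XML 1.1 code path.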

Finding the solution

The text above is an almost exact copy of the question I was about to post on StackOverflow. I could not find an answer online and was worried that I might be using the serialisation wrongly, or that some hidden attributes needed to be set on the parser.

Before you can post, StackOverflow makes you jump through several hoops to ensure that yours is indeed an unanswered question. The approach is very nicely explained by one of the creators as “Rubber duck problem solving”. In one of the suggested “related questions”, I found a suggestion that Java usually bundles old XML parser libraries, which may have bugs.

After I downloaded the latest Apache Xerces libraries and put them on the classpath, the problem disappeared. So to use long attributes with XML 1.1, I would have to bundle the additional libraries with my program. Eventually I chose an alternative: a different serialisation that avoids the control characters. This allowed me to use XML 1.0, which does not exhibit the problem.
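If you do bundle Xerces, it is worth confirming which parser implementation JAXP actually picks up. The sketch below simply prints the factory class names; the class name `ParserImplementationCheck` is an assumption. As far as I know, the JDK's built-in implementation lives under `com.sun.org.apache.xerces.internal`, while the Apache distribution uses `org.apache.xerces`, but treat the exact names as something to verify on your own setup.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper that reports which JAXP implementations are in use.
class ParserImplementationCheck {

    // Class name of the SAX parser factory selected by the JAXP lookup.
    static String saxFactoryName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    // Class name of the DOM builder factory selected by the JAXP lookup.
    static String domFactoryName() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("SAX factory: " + saxFactoryName());
        System.out.println("DOM factory: " + domFactoryName());
    }
}
```

Running it once with and once without the Xerces jars on the classpath shows whether the bundled libraries are really replacing the JDK parser.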