The state of XML compression

A recent discussion thread on the xml-dev mailing list once again brought up the issue of binary XML. With today's emphasis on programming for mobile devices, XML's bulkiness is becoming a concern again. As a past participant in the W3C's Efficient XML Interchange working group, we can offer some insight into the available options and comment on their status.

The first is good old ASN.1. Its "problem" in the XML world is that it requires schemas. If XML schemas are available, though, an ASN.1 standard (X.694) provides a standardized way of converting an XML schema to ASN.1. Once this is done, ASN.1's binary encoding rules can be used to produce more efficient encodings. These include the Basic and Distinguished Encoding Rules (BER/DER), which produce a relatively simple binary representation, and the more complicated Packed Encoding Rules (PER), which produce a more compact one.
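To give a feel for what "relatively simple" means here, the following is a minimal sketch of how a single integer value would be laid out as a DER INTEGER (a tag byte, a length byte, then the value in two's complement). It is not an X.694 mapping and not a full encoder: the function name der_integer and the element name count are made up for illustration, and only short-form lengths are handled.

```python
def der_integer(value: int) -> bytes:
    """Encode an int as a DER INTEGER: tag 0x02, length, two's-complement content.

    Illustrative sketch only: handles short-form lengths (content < 128 bytes).
    """
    # Number of content bytes needed, leaving room for the sign bit.
    magnitude = value if value >= 0 else ~value
    length = magnitude.bit_length() // 8 + 1
    if length > 127:
        raise ValueError("long-form lengths are not handled in this sketch")
    content = value.to_bytes(length, "big", signed=True)
    return bytes([0x02, length]) + content


# Hypothetical element carrying the same value as text.
xml_text = b"<count>300</count>"
encoded = der_integer(300)

print(len(xml_text), "bytes as XML text")
print(len(encoded), "bytes as DER:", encoded.hex())
```

PER would go further by using constraints in the schema (for example, a known value range) to drop the tag and often the length as well, which is where both the extra compactness and the extra complexity come from.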

If schemas aren't available, the ITU-T Fast Infoset standard (which has been available for several years) and EXI (which is still at the Candidate Recommendation stage) are options. The problem is that, without schemas, neither does a particularly good job of compressing XML data in general. Both get most of their gains on large documents containing lots of repeating data (in particular, XML element tags), because they remember the strings that were previously parsed and substitute compact identifiers for later occurrences. The result is about on par with what can be achieved using standard gzip compression, but does have some advantage in terms of streamability.
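The "remember and substitute" idea described above can be sketched in a few lines. This is only an illustration of the general string-table technique, not the actual Fast Infoset or EXI wire format; the function names and the ("lit", ...) / ("ref", ...) tuples are invented for the example.

```python
def encode_with_string_table(tokens):
    """Emit each new string literally once, then refer to it by table index."""
    table = {}
    out = []
    for tok in tokens:
        if tok in table:
            out.append(("ref", table[tok]))   # seen before: compact identifier
        else:
            table[tok] = len(table)           # first sighting: remember it
            out.append(("lit", tok))
    return out


def decode_with_string_table(encoded):
    """Rebuild the same table on the fly while reading the stream."""
    table = []
    out = []
    for kind, value in encoded:
        if kind == "lit":
            table.append(value)
            out.append(value)
        else:
            out.append(table[value])
    return out


# Element names as they might appear in a repetitive document.
tags = ["order", "item", "sku", "qty", "item", "sku", "qty", "order"]
encoded = encode_with_string_table(tags)
assert decode_with_string_table(encoded) == tags
```

Because the decoder builds its table from the stream itself, it never needs to see the whole document first, which is the streamability advantage over a general-purpose compressor applied to a complete file. It also shows why small documents gain little: every string still has to be sent in full the first time.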

EXI does have a schema-informed mode and if tight schemas are in place, it can deliver impressive results. This comes at a cost of considerable complexity, however, meaning a fairly large codebase is required to fully support the recommendation. This may make it unsuitable for small, memory-constrained devices, which are the ones that would benefit most.

And then there is JSON: a nice, simple, reasonably compact textual format that appears to be replacing XML in a lot of domains. Does it solve the bulkiness issue? It does not have the order-of-magnitude advantage the W3C set out to achieve, but it does provide a nice size advantage over XML simply by eliminating redundant start and end tags. Is this good enough for most applications, maybe with the use of gzip when needed? We tend to think so.
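A rough way to check this claim on your own data is to serialize the same records both ways and compare sizes, with and without gzip. The sketch below uses only the Python standard library; the record layout (sku, qty, price) and the element names are made-up sample data, and the actual ratios will depend heavily on the shape of the real documents.

```python
import gzip
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data: 50 small, similarly shaped records.
records = [{"sku": f"A{i}", "qty": i, "price": 1.5 * i} for i in range(50)]

# Same data as element-only XML.
root = ET.Element("items")
for rec in records:
    item = ET.SubElement(root, "item")
    for key, value in rec.items():
        ET.SubElement(item, key).text = str(value)
xml_bytes = ET.tostring(root)

# Same data as compact JSON (no spaces after separators).
json_bytes = json.dumps(records, separators=(",", ":")).encode("utf-8")

sizes = {
    "xml": len(xml_bytes),
    "json": len(json_bytes),
    "xml+gzip": len(gzip.compress(xml_bytes)),
    "json+gzip": len(gzip.compress(json_bytes)),
}
for name, size in sizes.items():
    print(f"{name:10s} {size:6d} bytes")
```

On repetitive data like this, gzip tends to narrow the gap between the two formats considerably, since the repeated tag names that make XML bulky are exactly what a general-purpose compressor handles well.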
