Using XML as a File Format

In my last post, I talked about some persistence issues we are working through and how persisting to XML could make things easier for us. In this post, I want to talk about using XML as a file format. Assuming we switch to an XML-based file format for our products, I have questions about the proper use of XML.

Looking around, I can see at least two different approaches:

  1. The XML file is a dump of your in-memory state at the time of saving.
  2. The XML file is a system unto itself and should be designed as a standalone deliverable of the product.

If you read my last post, you could guess that the system we built for persisting objects falls clearly into category #1. If you were to view the resulting XML, I doubt you’d be overwhelmed by the beauty of its structure. The XML is not being used to its full potential: no namespaces, no XPointer or XInclude. It is just a raw dump of state stored in XML elements, with very little use of attributes.
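To make category #1 concrete, here is a minimal sketch of what a "raw dump" looks like. The `Shape` class and `dump_state` helper are hypothetical and are not taken from our product; the point is only that the element names fall straight out of the object's fields.

```python
import xml.etree.ElementTree as ET


class Shape:
    """Hypothetical in-memory object, used only for illustration."""

    def __init__(self, name, width, height):
        self.name = name
        self.width = width
        self.height = height


def dump_state(obj):
    """Write every field as a child element, mirroring the object layout."""
    root = ET.Element(type(obj).__name__)
    for field, value in vars(obj).items():
        child = ET.SubElement(root, field)
        child.text = str(value)
    return root


xml_text = ET.tostring(dump_state(Shape("box", 10, 20)), encoding="unicode")
print(xml_text)
```

Rename a field in the class and the file format changes with it, which is exactly the coupling that makes this kind of XML a dump rather than a designed format.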

On the other hand, if you were to look at the XML created by an MS Office product, you would find a decent-looking XML document. Namespaces are used, and the XML elements don’t appear to map directly to the product’s underlying object model.

Want a better example? Look at a document file created using OpenOffice. These guys really cared about the XML file structure and use many XML standards (see the OpenOffice file format specification). In fact, there isn’t just a single XML file. They create a ZIP archive and place several XML files in the archive. They also include a manifest file to describe the other files contained in the archive. Whenever possible, they use or extend existing XML-based specs to implement their format. Clearly, this is not a direct mapping to their object models.
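The archive-plus-manifest idea can be sketched in a few lines. This is a simplified illustration, not the actual OpenOffice layout: the part names, the `manifest.xml` name, and the `file-entry` element with its `path` and `media-type` attributes are all assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def write_package(parts):
    """Build a ZIP archive in memory from {part name: XML text}, plus a manifest."""
    manifest = ET.Element("manifest")
    for name in parts:
        ET.SubElement(manifest, "file-entry", {"path": name, "media-type": "text/xml"})

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, xml_text in parts.items():
            archive.writestr(name, xml_text)
        archive.writestr("manifest.xml", ET.tostring(manifest, encoding="unicode"))
    return buffer.getvalue()


def list_parts(package_bytes):
    """Read the manifest back and return the part names it declares."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        manifest = ET.fromstring(archive.read("manifest.xml"))
    return [entry.get("path") for entry in manifest.findall("file-entry")]


package = write_package({
    "content.xml": "<content><p>Hello</p></content>",
    "styles.xml": "<styles/>",
})
print(list_parts(package))
```

Splitting content and styles into separate parts means a tool can read or replace one part without parsing the rest, and the manifest tells it what to expect without opening every file.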

How important is the XML structure when using XML as a native file format? We could always suggest people use XSLT to change our XML to suit their needs. That sounds reasonable, right?

Persisting Can Be Painful

We are considering changing from an OLE Structured Storage based file format to an XML-based file format for our flagship product. OLE Structured Storage has served us well, but we are starting to hit walls, especially with versioning and backward/forward compatibility. Yes, forward compatibility!

The product, mainly C++, has objects that save and load their state using a storage system. The storage system is IStorage/IStream, and all of the objects know it. That’s a minor problem that can be fixed. The bigger problem is that the object hierarchy defines the structure of the file stream. That means any change to the object hierarchy changes the structure of the file stream, so any significant change in the object model (version 1.0 to version 2.0) makes it very painful for v2.0 to read v1.0 files. Why the headache? Since each object is responsible for saving and loading its own piece, an object in v2.0 may need to remember what it looked like in v1.0. Even worse, what if an object in v1.0 no longer exists in v2.0? It can get pretty ugly.

A different product, still in development, uses a system that abstracts away the “how” of the persistent storage. The objects that make up the business logic can save and load their state using an abstract storage interface. It has implementations for persisting to binary streams, SQL databases and XML streams. It’s not revolutionary; COM’s IPersist has allowed developers to do this kind of thing for years. However, we find ourselves using the XML storage more than the others. In fact, we use it for more than persisting to files. We use it for copying state to the clipboard and maintaining our Undo/Redo system as well. But I am getting ahead of myself.
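Here is a rough sketch of that kind of abstraction, written in Python for brevity rather than in the C++ our products use. Every name in it (`Storage`, `XmlStorage`, `DictStorage`, `Account`) is hypothetical, and `DictStorage` is only a stand-in for a binary or database backend.

```python
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod


class Storage(ABC):
    """Hypothetical abstract storage; business objects only see this interface."""

    @abstractmethod
    def write(self, key, value): ...

    @abstractmethod
    def read(self, key): ...


class XmlStorage(Storage):
    """Keeps each value as a child element of a single root element."""

    def __init__(self, root=None):
        self.root = root if root is not None else ET.Element("state")

    def write(self, key, value):
        ET.SubElement(self.root, key).text = str(value)

    def read(self, key):
        return self.root.findtext(key)

    def to_xml(self):
        return ET.tostring(self.root, encoding="unicode")


class DictStorage(Storage):
    """In-memory stand-in for a binary stream or database backend."""

    def __init__(self):
        self.values = {}

    def write(self, key, value):
        self.values[key] = str(value)

    def read(self, key):
        return self.values[key]


class Account:
    """Hypothetical business object that saves and loads through Storage."""

    def __init__(self, owner="", balance=0):
        self.owner = owner
        self.balance = balance

    def save(self, storage):
        storage.write("owner", self.owner)
        storage.write("balance", self.balance)

    def load(self, storage):
        self.owner = storage.read("owner")
        self.balance = int(storage.read("balance"))


# The same save/load code works against either backend.
xml_store = XmlStorage()
Account("alice", 42).save(xml_store)
snapshot = xml_store.to_xml()  # could also go to a clipboard or an undo stack

restored = Account()
restored.load(XmlStorage(ET.fromstring(snapshot)))
print(snapshot, restored.owner, restored.balance)
```

Because the snapshot is just a string, the same mechanism that writes a file can feed a clipboard copy or an undo entry, which is roughly why the XML backend ends up doing more work than the others.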

I was reading a post by Chris Pratley where he talks about Word, RTF and XML:

RTF is used for this purpose instead since it is easier to deal with than Word binary for apps other than Word (remember that is why we created it – it stands for Rich Text interchange Format). The new XML format is designed for exactly that purpose – and it is easier to work with than RTF. You can create the WordML doc (or even a minimal subset) on a server using XML tools, then send the XML to Word on the client and Word will load it up. If you’re missing a lot of the Word specific stuff, that’s OK – Word will fill in the missing bits with defaults. In fact, you can skip generating the doc on the server if you want – just generate an XML data file in your own schema and provide an XSLT for Word to use when opening the file. That pushes a lot of the processing onto the client.

An idea for how to get past some of our versioning issues started to form:

  1. What if our objects persisted to XML?
  2. What if each version only cared about the structure of its XML?
  3. What if we used XSLT or XQuery to convert one version’s XML to another?

In my mind, forward compatibility (an old version opening a newer version’s file) would be hard. Backward compatibility would be easier. Just transform the old XML to the current version’s structure and load it. We could allow newer versions to save files in an older version’s XML structure. That would be one way of handling forward compatibility. We are researching this concept to see where it can take us.
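As a sketch of the backward-compatibility half of this idea, the snippet below upgrades an old document one version at a time before loading it. It uses plain Python tree manipulation as a stand-in for an XSLT or XQuery transform, and the `version` attribute, the `shape` document, and the v1-to-v2 change are all invented for the example.

```python
import xml.etree.ElementTree as ET

CURRENT_VERSION = 2


def upgrade_v1_to_v2(root):
    """Hypothetical change: v1 stored <width>/<height>, v2 nests them in <size>."""
    size = ET.SubElement(root, "size")
    for tag in ("width", "height"):
        old = root.find(tag)
        if old is not None:
            size.set(tag, old.text)
            root.remove(old)
    root.set("version", "2")
    return root


# One upgrade step per old version; each step moves a file one version forward.
UPGRADES = {1: upgrade_v1_to_v2}


def load_current(xml_text):
    """Upgrade an older document step by step until it has the current structure."""
    root = ET.fromstring(xml_text)
    version = int(root.get("version", "1"))
    if version > CURRENT_VERSION:
        raise ValueError("file was written by a newer version")
    while version < CURRENT_VERSION:
        root = UPGRADES[version](root)
        version = int(root.get("version"))
    return root


old_file = '<shape version="1"><name>box</name><width>10</width><height>20</height></shape>'
doc = load_current(old_file)
print(ET.tostring(doc, encoding="unicode"))
```

The objects themselves only ever see the current structure; knowledge of older layouts lives in the upgrade steps. Saving in an older version's structure would need the reverse transforms, which this sketch does not attempt.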