Monday, April 25, 2011

XML file comparison

Another problem is to compare two XML files with the same structure and produce a result that spells out the differences between them. Again, I set out to describe a solution more involved than the simple one we actually chose to implement.

Simple approach :
A simple approach is to start by referring to the topmost parent node in the code and looking for it in both XML files. If the node is found in both files, there is no difference at the first level and we proceed to query the next, deeper level of nodes. The actual XML tags are thus hard-coded, and the same process is repeated level by level. If a node is missing from either file, we have found the first difference. We note it and generate some other structure that describes the difference, a "delta". We can call this the difference structure. If the XML is a representation of a relational database, this difference structure can then be translated into a set of RDBMS queries that equalise the database contents. It is therefore quite plausible that the difference structure is a linear structure rather than a tree-like XML.
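To make this concrete, here is a minimal sketch in Python (ElementTree is used purely for illustration; the tag names and function names are hypothetical, not taken from our actual code) of what such hard-coded comparison looks like:

import xml.etree.ElementTree as ET

# Hard-coded comparison: every tag name is baked directly into the code.
def diff_files(file_a, file_b):
    root_a = ET.parse(file_a).getroot()
    root_b = ET.parse(file_b).getroot()
    delta = []  # the linear "difference structure"

    # First level: the topmost parent node, referred to by its literal name.
    in_a = root_a.tag == "Top_Parent_Node"
    in_b = root_b.tag == "Top_Parent_Node"
    if in_a != in_b:
        delta.append(("Top_Parent_Node", "present in only one file"))
        return delta

    # Next level: again a literal tag name; deeper levels repeat the pattern,
    # so the parsing logic and the tag names are inseparable.
    child_a = root_a.find("Service_Node")
    child_b = root_b.find("Service_Node")
    if (child_a is None) != (child_b is None):
        delta.append(("Top_Parent_Node/Service_Node", "present in only one file"))
    return delta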

Drawbacks of the simple approach :
The logic that parses the structure is inextricably mixed up with the logic that generates the linear difference structure. If the XML structure changes, the parsing logic must change, and changes to the generated delta follow. Ideally, the parsing of the nodes would be independent of what the XML is about, and only the actual difference structure (and the subsequent RDBMS queries) would depend on the actual differences in the XML. It is the changes needed to the parsing code that are the matter of contention. I think we can do better.

A better approach :
As usually happens with computer algorithms, the better approach is more complex than the simple one. As an aside, the situation is quite different with mathematical statements and proofs - the better proofs are shorter and have an "elegance" about them; in the long run, they prove to be more readily intuitive.
OK, here we should let the coding logic be independent of any actual XML tags. We create a multi-dimensional array of strings and "hand-write" the XML tags into it. So if the XML hierarchy goes two levels deep, the array is a list of lists of strings. We simply create a string structure that mirrors the XML. This looks tedious, but it is far simpler than coding that structure into parsing code the compiler finds acceptable.
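For illustration only (the tag names are hypothetical), such a hand-written descriptor in Python might look like this:

# Hand-written descriptor mirroring the XML hierarchy: each entry is a tag
# name paired with the list of its child entries, two levels deep here.
DESCRIPTOR = ("Top_Parent_Node", [
    ("Service_Node", [
        ("Prop", []),
    ]),
])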
The parsing code traverses this string array and treats it as a descriptor of the XML. It does the same for both files, and when it finds a difference it generates an in-memory linear difference structure. A separate piece of code translates the differences into something like a set of RDBMS queries as required; the important point is that this query generator is kept strictly separate from the parser.
If the XML structure changes (or, more likely, is extended), no change is needed to the parsing logic; only the hand-written string array has to be updated to reflect the change. Some skill is involved in coding the parsing logic, but once it is done, this approach scores over the earlier one on maintainability and extensibility.
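A rough sketch of that generic traversal, assuming the descriptor shape shown above (again with ElementTree only for illustration; the actual implementation would of course differ):

import xml.etree.ElementTree as ET

def compare(parent_a, parent_b, descriptor, path=""):
    """Walk the descriptor; no concrete tag name ever appears in the code."""
    tag, children = descriptor
    here = path + "/" + tag
    node_a = parent_a.find(tag) if parent_a is not None else None
    node_b = parent_b.find(tag) if parent_b is not None else None
    delta = []
    if (node_a is None) != (node_b is None):
        delta.append((here, "present in only one file"))
        return delta  # no point descending into a missing subtree
    for child in children:
        delta.extend(compare(node_a, node_b, child, here))
    return delta

def diff_with_descriptor(file_a, file_b, descriptor):
    # Wrap each root so the topmost tag is looked up like any other tag.
    wrap_a, wrap_b = ET.Element("wrapper"), ET.Element("wrapper")
    wrap_a.append(ET.parse(file_a).getroot())
    wrap_b.append(ET.parse(file_b).getroot())
    return compare(wrap_a, wrap_b, descriptor)

# The returned delta is a flat list of (path, difference) pairs; a separate,
# strictly independent piece of code would turn it into RDBMS queries.

Extending the XML then means editing only DESCRIPTOR; the traversal and the query generator stay untouched.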

I see parallels to this elsewhere, and it could be described as a "data-driven" approach (my term, my quotes). The actual code is only something like a markup processor, very generic and simplified. The key point is not that compilation is avoided but that changes are made not in code but in an array-like structure of simple strings or values.

XML specification and duplicate tag processing

XML has long been touted as a very promising method for information exchange. Some count it as too verbose and doubt how efficient it turns out to be when the information is voluminous. However, XML still reigns as the most widely accepted way to convey structured data in a human-readable form, one for which parsers are widely available and one that is extensible.

One usage pattern I noticed in the product I work on: referring to another tag in order to copy its content -

A huge XML file that carries the control configuration of the entire application is usually edited by humans. It basically stores configuration properties for the various services that run as part of the product. What should we do if there are multiple duplicate services that essentially have identical properties?

For example -

<Top_Parent_Node attr1="val1">
<Service_Node attr2="val21">
<Prop attr3="val3">
....
... Complex set of enclosed tags ....
....
</Prop>
</Service_Node>

<Service_Node attr2="val22"> <!-- duplicated service tag : we need this for the application -->
<Prop attr3="val3"> <!-- Forced to repeat this from the previous tag -->
....
... Complex set of enclosed tags ....
....
</Prop>
</Service_Node>
.... More such repetitions ....
</Top_Parent_Node>

The simplest way is to repeat the properties at both locations by copy-paste. We are rather good at that.
We, however, screw up miserably when it comes to propagating changes made to one set of properties to all the other identical locations.

I suspect this is a common situation that others run into as well, which makes a good case for formalising this requirement in the XML specification itself. The specification should allow a choice: either spell out the tags or make a reference to another tag whose content is, in effect, copied into this tag while parsing.
For example -

<Top_Parent_Node attr1="val1">
<Service_Node attr2="val21" ?xmlref="N1"> <!-- Label this tag so it can be referred to -->
<Prop attr3="val3">
....
... Complex set of enclosed tags ....
....
</Prop>
</Service_Node>

<Service_Node attr2="val22"> <!-- duplicated service tag : we need this for the application -->
<?xmlref="N1" /> <!-- No need to repeat - referred label is treated as copied -->
</Service_Node>
.... More such repetitions ....
</Top_Parent_Node>

A few points to note :
- The entire spec of a node that may be duplicated resides in only one place.
- Any change made in that one place is reflected in all the other places that refer to it.
- The first Service_Node, the one that carries the complete spec, is labelled uniquely. This labelling is part of the XML specification itself and any node can be labelled in this manner, so the label need not appear in any DTDs or XSDs as an available attribute.
- Any node can refer to this label by enclosing a <?xmlref> with a label identifier. The parser should copy the entire specification within the referred node into this node (a rough sketch of such resolution follows this list).
- The referring node and the referred node need not be at the same place in the hierarchy or at the same tree depth. The parser should deal with a referring node that appears before the referred node in the file, to keep the XML parsing independent of ordering. If the referred node is not found, the parser should throw an exception. DOM parsers can handle this in a straightforward manner; a SAX parser, however, would need to parse to the end of the file in search of the referred node.
- I don't quite see how to provide partial overriding on top of this idea without unnecessarily complicating it and obfuscating the XML specification.
- The ease with which integrity is maintained when the configuration changes gives this idea some credence and value, even though the readability of the XML is somewhat hampered.
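To convince myself this is tractable, here is a rough DOM-style sketch in Python of how a preprocessor could resolve such references. It is purely an assumption of how the feature might behave; since the '?' prefix is not legal in attribute or tag names for today's parsers, the sketch uses a plain xmlref attribute on the labelled node and an <xmlref ref="..."/> child on the referring node instead:

import copy
import xml.etree.ElementTree as ET

def resolve_refs(root):
    # First pass: collect every labelled node, wherever it sits in the tree,
    # so a reference may legally appear before its target in the file.
    labelled = {n.get("xmlref"): n for n in root.iter() if n.get("xmlref")}

    # Second pass: find every referring placeholder together with its parent.
    pending = [(parent, child)
               for parent in root.iter()
               for child in parent
               if child.tag == "xmlref"]

    # Third pass: replace each placeholder with copies of the referred content.
    for parent, child in pending:
        target = labelled.get(child.get("ref"))
        if target is None:
            raise ValueError("unresolved xmlref: %r" % child.get("ref"))
        parent.remove(child)
        for copied in target:
            parent.append(copy.deepcopy(copied))
    return root

# Usage: resolve_refs(ET.parse("config.xml").getroot()) yields a tree in which
# every referring Service_Node carries its own copy of the shared properties.

Run once when the configuration is loaded, before the rest of the application sees the tree, the copies stay consistent with the single labelled definition.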

When I encountered this problem at my workplace, the problem was, I must say, solved at the application layer, i.e. a new tag was added inside the duplicates to refer to the other node. It was a simple hack that does not so much solve the problem as work around it. As you might have guessed, this is what happens in a commercial context under time pressure.