Canonicalizing an XML file

The canonical form of an XML document is a normalized version of that XML document. Two XML documents that are physically different can still be logically the same. For example, consider an XML tag with two attributes. The order in which the attributes appear is of no importance:

(1)
<tag attrA="123" attrB="456"/>
  
is logically the same as
 
(2) 
<tag attrB="456" attrA="123"/>

The canonical form of an XML document matters when you sign it. Signing an XML document consists of calculating a message digest (hash) of the message, which ensures message integrity, and then signing that digest with the private key of the sender. The receiver then uses the sender's public key to verify the signature.
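The sign-and-verify round trip can be sketched with the standard java.security API. This is only an illustration over raw bytes, not an XML Signature implementation: the class name SignSketch, the SHA256withRSA algorithm and the 2048-bit throwaway key pair are assumptions chosen for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

// Hypothetical helper class, not part of the XML Security project
public class SignSketch
{
   // SHA256withRSA hashes the data and signs the resulting digest with the private key
   static byte[] sign(byte[] data, PrivateKey key) throws Exception {
      Signature signer = Signature.getInstance("SHA256withRSA");
      signer.initSign(key);
      signer.update(data);
      return signer.sign();
   }

   // The receiver recomputes the digest and checks it against the signature with the public key
   static boolean verify(byte[] data, byte[] signature, PublicKey key) throws Exception {
      Signature verifier = Signature.getInstance("SHA256withRSA");
      verifier.initVerify(key);
      verifier.update(data);
      return verifier.verify(signature);
   }

   public static void main(String args[]) throws Exception {
      // Throwaway key pair, generated only for this demonstration
      KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
      generator.initialize(2048);
      KeyPair pair = generator.generateKeyPair();

      byte[] message = "<tag attrA=\"123\" attrB=\"456\"/>".getBytes(StandardCharsets.UTF_8);
      byte[] signature = sign(message, pair.getPrivate());

      System.out.println("verified: " + verify(message, signature, pair.getPublic()));
   }
}
```

Verification succeeds here because the receiver checks exactly the bytes that were signed.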

Verification should succeed regardless of the physical representation of the XML document. This is where the problem comes in: the digest of example (1) is different from the digest of example (2), even though the information is the same.
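The difference is easy to see by hashing the two tags as plain strings. The following is a minimal sketch using java.security.MessageDigest; the class name DigestDemo and the choice of SHA-256 are assumptions made for illustration, and any digest algorithm would show the same effect.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Hypothetical helper class, not part of the XML Security project
public class DigestDemo
{
   // Returns the SHA-256 digest of the UTF-8 bytes of the text, as a hex string
   static String sha256Hex(String text) throws Exception {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      byte[] hash = md.digest(text.getBytes(StandardCharsets.UTF_8));

      StringBuilder hex = new StringBuilder();
      for (byte b : hash) {
         hex.append(String.format("%02x", b));
      }
      return hex.toString();
   }

   public static void main(String args[]) throws Exception {
      String example1 = "<tag attrA=\"123\" attrB=\"456\"/>";
      String example2 = "<tag attrB=\"456\" attrA=\"123\"/>";

      String digest1 = sha256Hex(example1);
      String digest2 = sha256Hex(example2);

      System.out.println("(1) " + digest1);
      System.out.println("(2) " + digest2);
      System.out.println("digests equal: " + digest1.equals(digest2));
   }
}
```

The last line prints "digests equal: false": the digest is computed over bytes, so it sees only the attribute order, not the logical equivalence.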

For this reason, the message digest must be calculated on the canonical form of the XML document, by both the sender and the receiver.

More information about canonical XML can be found in the W3C Canonical XML recommendation.

The following example uses the Canonicalizer class from the XML Security project at apache.org. Download the distribution and place the following libraries in your classpath:

bc-jce-jdk13-118.jar   (bouncycastle library)
log4j-1.2.5.jar
xalan.jar
xercesImpl.jar
xml-apis.jar
xmlsec.jar

Main.java:

import org.apache.xml.security.c14n.Canonicalizer;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import org.w3c.dom.*;
 
import java.io.ByteArrayInputStream;
 
public class Main
{
   // Three physically different but logically identical documents
   static String input1 = "<doc><field1 attr1=\"123\" attr2=\"456\"/><field2>abc</field2></doc>";
   static String input2 = "<doc><field1  attr1=\"123\" attr2=\"456\"   /><field2   >abc</field2></doc>";
   static String input3 = "<doc><field1 attr2=\"456\" attr1=\"123\"  /><field2>abc</field2></doc >";
 
   public static void main(String args[]) throws Exception 
   {
      org.apache.xml.security.Init.init();
 
      DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
      docFactory.setNamespaceAware(true);
      docFactory.setValidating(true);
 
      DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
 
      // This is to throw away all validation warnings
      docBuilder.setErrorHandler(new org.apache.xml.security.utils.IgnoreAllErrorHandler());
 
      System.out.println("Input:");
      System.out.println("  " + input1);
      System.out.println("  " + input2);
      System.out.println("  " + input3);
 
      byte[] output1 = canonicalize(docBuilder, input1);
      byte[] output2 = canonicalize(docBuilder, input2);
      byte[] output3 = canonicalize(docBuilder, input3);
 
      System.out.println("\nOutput:");
      // Canonical XML is always encoded as UTF-8
      System.out.println("  " + new String(output1, "UTF-8"));
      System.out.println("  " + new String(output2, "UTF-8"));
      System.out.println("  " + new String(output3, "UTF-8"));
   }
 
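   // Parses the input and returns its canonical form (Canonical XML 1.0, comments retained)
   public static byte[] canonicalize(DocumentBuilder docBuilder, String input) throws Exception {
      // Encode explicitly as UTF-8 rather than relying on the platform default charset
      byte inputBytes[] = input.getBytes("UTF-8");
      Document doc = docBuilder.parse(new ByteArrayInputStream(inputBytes));
      
      Canonicalizer c14n = Canonicalizer.getInstance(
        "http://www.w3.org/TR/2001/REC-xml-c14n-20010315#WithComments");
 
      return c14n.canonicalizeSubtree(doc);
   }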
}

Running it outputs:

Input:
  <doc><field1 attr1="123" attr2="456"/><field2>abc</field2></doc>
  <doc><field1  attr1="123" attr2="456"   /><field2   >abc</field2></doc>
  <doc><field1 attr2="456" attr1="123"  /><field2>abc</field2></doc >

Output:
  <doc><field1 attr1="123" attr2="456"></field1><field2>abc</field2></doc>
  <doc><field1 attr1="123" attr2="456"></field1><field2>abc</field2></doc>
  <doc><field1 attr1="123" attr2="456"></field1><field2>abc</field2></doc>