Monday, October 10, 2011

Remove Byte Order Mark (BOM) characters from XML in Java

Sometimes a Unicode-encoded XML file starts with three invisible bytes in front of the XML declaration. Most text editors hide them, but if you open the file in some XML editors you can see them as stray characters at the start of the content:
ï»¿<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
or
<U+FEFF><?xml version="1.0" encoding="UTF-8" standalone="yes"?>

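These three bytes (EF BB BF in UTF-8, typically shown as "ï»¿" when the file is misread as Latin-1) are called the Byte Order Mark (BOM). They are the encoded form of the character U+FEFF. In UTF-16 and UTF-32 the BOM tells a reader which byte order the file uses; UTF-8 has no byte-order ambiguity, so there the BOM only acts as an encoding signature. The following table summarizes some of the properties of each UTF encoding.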

Name                        UTF-8          UTF-16         UTF-16BE       UTF-16LE       UTF-32         UTF-32BE       UTF-32LE
Smallest code point         0000           0000           0000           0000           0000           0000           0000
Largest code point          10FFFF         10FFFF         10FFFF         10FFFF         10FFFF         10FFFF         10FFFF
Code unit size              8 bits         16 bits        16 bits        16 bits        32 bits        32 bits        32 bits
Byte order                  N/A            <BOM>          big-endian     little-endian  <BOM>          big-endian     little-endian
Fewest bytes per character  1              2              2              2              4              4              4
Most bytes per character    4              4              4              4              4              4              4
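
To make the byte-order row concrete, here is a minimal sketch of how a program could recognize a BOM from the first bytes of a file. The class and method names (BomSniffer, detect) are hypothetical and not part of any library. It is a heuristic: the bytes FF FE 00 00 could also be a UTF-16LE BOM followed by a NUL character, and the sketch treats them as UTF-32LE.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; the names are not from any library.
final class BomSniffer {

    /** Returns the encoding signalled by a leading BOM, or null if no BOM is present. */
    static String detect(byte[] b) {
        // Check the 4-byte UTF-32 marks first: the UTF-32LE mark begins with the UTF-16LE mark.
        if (startsWith(b, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(b, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(b, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(b, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(b, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    /** Compares the first bytes of data (as unsigned values) with the given prefix. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?', 'x', 'm', 'l'};
        System.out.println(detect(withBom)); // prints UTF-8
        System.out.println(detect("<?xml".getBytes(StandardCharsets.US_ASCII))); // prints null
    }
}
```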

In Java, the program should remove these three bytes before handing the XML content to the XML parser. Otherwise, the program can end up with a javax.xml.xpath.XPathExpressionException and the following error message:
Caused by: org.xml.sax.SAXParseException: Content is not allowed in prolog.
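
The failure can be reproduced with a small sketch. It assumes the JDK's built-in DOM parser and XML text that has already been decoded into a String, so the BOM shows up as the character U+FEFF. The class name PrologDemo and the method parses are made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; the names are illustrative only.
final class PrologDemo {

    /** Returns true if the default DOM parser accepts the text, false if it rejects it. */
    static boolean parses(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // SAXParseException ("Content is not allowed in prolog.") is a SAXException.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String clean = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><root/>";
        String withBom = "\uFEFF" + clean; // the BOM as a decoded character
        System.out.println("clean parses:    " + parses(clean));
        System.out.println("with BOM parses: " + parses(withBom));
    }
}
```

The parser treats U+FEFF as ordinary content, and no content is allowed before the XML declaration, which is why the error message mentions the prolog.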

The program needs to load the file content as a String and remove everything in front of the "<?xml" declaration. The following code removes the unwanted characters from the XML file.



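import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.regex.Pattern;

String fileName = "IDOPM.xml";
// Matches a run of non-word characters (the BOM included) in front of a '<'.
Pattern junkPattern = Pattern.compile("^([\\W]+)<");
StringBuilder fileContents = new StringBuilder();
// Decode as UTF-8 explicitly so the BOM becomes the single character U+FEFF.
try (BufferedReader input = Files.newBufferedReader(Paths.get(fileName), StandardCharsets.UTF_8)) {
    String line;
    boolean firstLine = true;
    while ((line = input.readLine()) != null) {
        if (firstLine) {
            // Only the first line can carry the BOM, so clean that line alone.
            line = junkPattern.matcher(line).replaceFirst("<");
            firstLine = false;
        }
        fileContents.append(line).append('\n');
    }
}
System.out.println(fileContents);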



Here, I used the regular expression (^([\\W]+)<) to remove all the non-word characters that appear in front of the < character.
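
As an alternative to the regular expression, the BOM can be skipped at the byte level before anything is decoded. The sketch below handles only the UTF-8 BOM and assumes the caller passes the returned stream on to the parser. BomSkippingStreams and skipUtf8Bom are hypothetical names, not an existing API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper for illustration; handles the UTF-8 BOM only.
final class BomSkippingStreams {

    /** Returns a stream positioned after a leading UTF-8 BOM (EF BB BF), if there is one. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 bytes or end of stream.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) break;
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            // Not a BOM: push the bytes back so the caller sees the stream unchanged.
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        InputStream in = skipUtf8Bom(new ByteArrayInputStream(data));
        System.out.println((char) in.read()); // prints <
    }
}
```

Working on bytes leaves the rest of the document untouched, so line breaks and whitespace inside the XML are preserved.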

5 comments:

  1. Nice post.
    This is simple and best solution.
    It helped me a lot.

  2. This is nice post. Simple, easy and quick. It helped me a lot.
    Thanks again. Keep up the good work.

  3. Damn Useful, Thanks for posting it.

  4. Thank you so much for the clear explanation.
