Breeze: Remove Byte Order Mark (BOM) characters from XML in Java

Sometimes a Unicode-encoded XML file starts with three invisible bytes ahead of the actual content. Depending on the editor you open the file in, they show up as stray characters at the very top:
ï»¿<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
or
<U+FEFF><?xml version="1.0" encoding="UTF-8" standalone="yes"?>
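
The two renderings above are the same three bytes (EF BB BF) seen through different decoders. Here is a minimal sketch of that; the class name BomRendering and its methods are made up for illustration:

```java
import java.nio.charset.StandardCharsets;

class BomRendering {
    // The UTF-8 encoding of the character U+FEFF.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Decodes the bytes as ISO-8859-1: one visible character per byte. */
    static String asLatin1() {
        return new String(UTF8_BOM, StandardCharsets.ISO_8859_1);
    }

    /** Decodes the bytes as UTF-8: a single invisible U+FEFF character. */
    static String asUtf8() {
        return new String(UTF8_BOM, StandardCharsets.UTF_8);
    }
}
```

An editor that assumes a single-byte encoding shows the first form; one that decodes UTF-8 but does not hide the mark shows the second.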

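These three bytes (shown as "ï»¿" when misread as Latin-1) are the UTF-8 encoding of the Byte Order Mark (BOM), the character U+FEFF. In UTF-16 and UTF-32 the BOM tells the reader the byte order of the stream; in UTF-8 byte order does not apply, so the BOM only acts as an encoding signature. The following table summarizes some properties of each UTF encoding.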

Name	UTF-8	UTF-16	UTF-16BE	UTF-16LE	UTF-32	UTF-32BE	UTF-32LE
Smallest code point	0000	0000	0000	0000	0000	0000	0000
Largest code point	10FFFF	10FFFF	10FFFF	10FFFF	10FFFF	10FFFF	10FFFF
Code unit size	8 bits	16 bits	16 bits	16 bits	32 bits	32 bits	32 bits
Byte order	N/A	<BOM>	big-endian	little-endian	<BOM>	big-endian	little-endian
Fewest bytes per character	1	2	2	2	4	4	4
Most bytes per character	4	4	4	4	4	4	4
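
Where the table lists the byte order as <BOM>, a reader has to look at the first bytes of the stream to decide. The following is a rough sketch of that check, assuming the standard BOM byte sequences; BomSniffer, detect and startsWith are hypothetical names, not part of any library:

```java
class BomSniffer {
    /** Returns the encoding implied by a leading BOM, or null if there is none. */
    static String detect(byte[] data) {
        // The UTF-32 checks come first: FF FE 00 00 also starts with the UTF-16LE BOM.
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    /** True if data begins with the given unsigned byte values. */
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

The FF FE 00 00 case is ambiguous in principle (it could also be a UTF-16LE BOM followed by U+0000), so treat the result as a best guess.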

In Java, the program should remove these three bytes before handing the XML to the parser. Otherwise it can end up with a javax.xml.xpath.XPathExpressionException and the following error message:
Caused by: org.xml.sax.SAXParseException: Content is not allowed in prolog.

The program needs to load the file content as a String and remove everything in front of the "<?xml" declaration. The following code removes the unwanted content from the XML file.

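import java.io.*;
import java.util.regex.Pattern;

// Fragment: place it inside a method that declares "throws IOException".
// Compile the pattern once: a run of non-word characters followed by '<'.
Pattern junkPattern = Pattern.compile("^([\\W]+)<");
String fileName = "IDOPM.xml";
StringBuilder fileContents = new StringBuilder();
// try-with-resources closes the reader; FileReader decodes with the JVM's default charset.
try (BufferedReader input = new BufferedReader(new FileReader(new File(fileName)))) {
    String line;
    while ((line = input.readLine()) != null) {
        // Replace any leading junk (including the BOM) in front of '<' with just '<'.
        line = junkPattern.matcher(line.trim()).replaceFirst("<");
        fileContents.append(line);
    }
}
System.out.println(fileContents.toString());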

Here I used the regular expression ^([\\W]+)< to remove all non-word characters (\W, anything other than a letter, digit or underscore) that appear in front of the < character.
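
A narrower alternative to the regular expression is to decode the file as UTF-8 and drop only a leading U+FEFF. This is a sketch under the assumption that the input really is UTF-8; BomStripper, decodeWithoutBom and parse are hypothetical names, and the parsing uses the standard javax.xml.parsers API:

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class BomStripper {
    /** Decodes the bytes as UTF-8 and drops one leading U+FEFF, if present. */
    static String decodeWithoutBom(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    /** Parses the already cleaned XML text into a DOM document. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }
}
```

The bytes could come from java.nio.file.Files.readAllBytes. Unlike the regular expression, this leaves every other line of the document untouched.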

Monday, October 10, 2011
