Remove Byte Order Mark (BOM) characters from XML in Java
Sometimes, in Unicode-encoded XML files, you might notice three non-visible bytes at the start of the file content. If you open such a file in an XML editor, you can see these characters in front of the XML declaration:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
or
<U+FEFF><?xml version="1.0" encoding="UTF-8" standalone="yes"?>
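These three bytes (EF BB BF in UTF-8, which some editors display as "ï»¿") are called the Byte Order Mark (BOM). For the encodings that do not fix a byte order themselves, the byte order is determined by the BOM. The following table summarizes some of the properties of each UTF encoding.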
Name | UTF-8 | UTF-16 | UTF-16BE | UTF-16LE | UTF-32 | UTF-32BE | UTF-32LE |
---|---|---|---|---|---|---|---|
Smallest code point | 0000 | 0000 | 0000 | 0000 | 0000 | 0000 | 0000 |
Largest code point | 10FFFF | 10FFFF | 10FFFF | 10FFFF | 10FFFF | 10FFFF | 10FFFF |
Code unit size | 8 bits | 16 bits | 16 bits | 16 bits | 32 bits | 32 bits | 32 bits |
Byte order | N/A | <BOM> | big-endian | little-endian | <BOM> | big-endian | little-endian |
Fewest bytes per character | 1 | 2 | 2 | 2 | 4 | 4 | 4 |
Most bytes per character | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
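To make the role of the BOM concrete, here is a minimal sketch that guesses the encoding from the first bytes of a file. The class and method names (BomDetector, detect) are made up for illustration, and the byte sequences are the standard BOM signatures for each UTF. Note that FF FE 00 00 is ambiguous in principle (it could also be a UTF-16LE BOM followed by a NUL character); this sketch simply checks the longer signature first.

```java
class BomDetector {
    // Returns the encoding implied by a leading BOM, or null if no BOM is present.
    static String detect(byte[] data) {
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE"; // check before UTF-16LE
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // Compares the first bytes of data (as unsigned values) with the given prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?', 'x', 'm', 'l'};
        System.out.println(detect(utf8WithBom));
    }
}
```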
In Java, the program should remove these three bytes before passing the XML file to the XML parser. Otherwise, the program will end up with a javax.xml.xpath.XPathExpressionException and the following error message.
Caused by: org.xml.sax.SAXParseException: Content is not allowed in prolog.
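The failure can be reproduced in a few lines. This is a rough sketch, assuming the JDK's built-in DOM parser and XML that has already been decoded into a String, so the BOM appears as the single character U+FEFF. The class name BomPrologDemo and the helper parses are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class BomPrologDemo {
    // Returns true if the text parses as XML, false if the parser rejects it.
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            // SAXParseException ("Content is not allowed in prolog.") ends up here.
            return false;
        }
    }

    public static void main(String[] args) {
        String clean = "<?xml version=\"1.0\"?><root/>";
        String withBom = "\uFEFF" + clean; // BOM character in front of the declaration
        System.out.println("clean parses: " + parses(clean));
        System.out.println("with BOM parses: " + parses(withBom));
    }
}
```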
The program needs to load the file content as a String and remove everything in front of the "<?xml" declaration. The following code removes the unwanted content from the XML file.
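import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.regex.Pattern;

String fileName = "IDOPM.xml";
File loadFile = new File(fileName);
StringBuilder fileContents = new StringBuilder();
// Matches a run of non-word characters (such as the BOM) in front of the first "<" on a line.
Pattern junkPattern = Pattern.compile("^([\\W]+)<");
// try-with-resources closes the reader even if reading fails.
try (BufferedReader input = new BufferedReader(new FileReader(loadFile))) {
    String line;
    while ((line = input.readLine()) != null) {
        // Drop the junk prefix but keep the "<" itself.
        line = junkPattern.matcher(line.trim()).replaceFirst("<");
        fileContents.append(line);
    }
}
System.out.println(fileContents.toString());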
Here, I used the regular expression (^([\\W]+)<) to remove all the non-word characters (\W) that appear in front of the < character.
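If the file is known to be UTF-8, a narrower alternative to the regular expression is to decode the bytes and drop only a leading U+FEFF character, leaving every other line untouched. This is a minimal sketch, not the approach used above; the class name BomStripper and its methods are made up for illustration, and it assumes the whole file fits in memory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class BomStripper {
    // Removes a single leading BOM character (U+FEFF) if present.
    static String stripBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    // Reads a file as UTF-8 and returns its content without a leading BOM.
    // Java's UTF-8 decoder keeps the BOM as U+FEFF, so it is stripped explicitly.
    static String readUtf8WithoutBom(Path path) throws IOException {
        String text = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        return stripBom(text);
    }

    public static void main(String[] args) {
        String withBom = "\uFEFF<?xml version=\"1.0\" encoding=\"UTF-8\"?><root/>";
        System.out.println(stripBom(withBom));
    }
}
```

Unlike the line-by-line regular expression, this keeps line breaks and does not touch anything after the first character.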