Breeze: Remove Byte Order Mark (BOM) characters from XML in Java

Remove Byte Order Mark (BOM) characters from XML in Java

Monday, October 10, 2011

Sometimes, in Unicode-encoded XML files, you might notice three normally invisible bytes at the very start of the file content. Some XML editors display them as stray characters in front of the XML declaration:
ï»¿<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
or
<U+FEFF><?xml version="1.0" encoding="UTF-8" standalone="yes"?>

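These three characters ("ï»¿") are the Byte Order Mark (BOM). In UTF-8 the BOM is the byte sequence EF BB BF, which encodes the code point U+FEFF; it shows up as "ï»¿" when those bytes are displayed as Latin-1. In UTF-16 and UTF-32, the byte order is determined by the BOM. The following table summarizes some of the properties of each UTF encoding.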

Name	UTF-8	UTF-16	UTF-16BE	UTF-16LE	UTF-32	UTF-32BE	UTF-32LE
Smallest code point	0000	0000	0000	0000	0000	0000	0000
Largest code point	10FFFF	10FFFF	10FFFF	10FFFF	10FFFF	10FFFF	10FFFF
Code unit size	8 bits	16 bits	16 bits	16 bits	32 bits	32 bits	32 bits
Byte order	N/A	<BOM>	big-endian	little-endian	<BOM>	big-endian	little-endian
Fewest bytes per character	1	2	2	2	4	4	4
Most bytes per character	4	4	4	4	4	4	4
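
To make the table more concrete, here is a minimal sketch that recognizes a BOM from the first bytes of a file. The class name BomSniffer and its detect method are hypothetical helpers, not part of the original code; the byte signatures are the encodings of U+FEFF in each UTF.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class BomSniffer {
    /** Returns the encoding implied by a leading BOM, or null if no BOM is present. */
    static String detect(byte[] data) {
        // Check the 4-byte UTF-32 marks first: FF FE 00 00 also begins with the UTF-16LE mark FF FE.
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare the signed Java byte as an unsigned value.
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] withBom = "\uFEFF<?xml version=\"1.0\"?>".getBytes(StandardCharsets.UTF_8);
        byte[] plain = "<?xml version=\"1.0\"?>".getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(withBom)); // prints UTF-8
        System.out.println(detect(plain));   // prints null
    }
}
```

Note that a file can be valid UTF-8 or UTF-16 without any BOM, so a null result here only means that no mark was found.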

In Java, the program should remove these three bytes before handing the XML file to the XML parser. Otherwise, the program will end up with a javax.xml.xpath.XPathExpressionException and the following error message:
Caused by: org.xml.sax.SAXParseException: Content is not allowed in prolog.
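
The failure can be reproduced with a small sketch. It assumes the JDK's built-in DOM parser and feeds it the XML as characters (through a StringReader), with the BOM already decoded to U+FEFF. The class name BomPrologDemo and the method parseError are made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical demo class for illustration only.
final class BomPrologDemo {
    /** Returns the parser's error message, or null if the text parses cleanly. */
    static String parseError(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            // The parser rejects any non-whitespace character before the XML declaration.
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><root/>";
        System.out.println(parseError(xml));            // clean input
        System.out.println(parseError("\uFEFF" + xml)); // input with a leading U+FEFF
    }
}
```

The exact wording of the message may depend on the parser implementation and locale, so the sketch only distinguishes between a clean parse and a failed one.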

The program needs to load the file content as a String and remove everything in front of the "<?xml" declaration. The following code removes the unwanted content from the XML file.

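import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.regex.Pattern;

String fileName = "IDOPM.xml";
File loadFile = new File(fileName);
StringBuilder fileContents = new StringBuilder();
// Compile once: a run of non-word characters at the start of a line, followed by '<'
Pattern junkPattern = Pattern.compile("^([\\W]+)<");
BufferedReader input = new BufferedReader(new FileReader(loadFile));
String line = null;
while ((line = input.readLine()) != null) {
    // Replace the leading junk (such as the BOM) with a bare '<'
    line = junkPattern.matcher(line.trim()).replaceFirst("<");
    // Lines are appended without line separators
    fileContents.append(line);
}
input.close();
System.out.println(fileContents.toString());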

Here, I used the regular expression ^([\\W]+)< to remove all the non-word characters in front of the < character.
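
The regular expression above runs on every line and strips any non-word characters before a <, not only the BOM. As a narrower alternative (not the approach used in this post), the following sketch removes only a single leading U+FEFF. It assumes the file is UTF-8 and decodes it explicitly instead of relying on the platform default charset; the names BomStripper, stripBom and readXml are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only.
final class BomStripper {
    /** Removes one leading U+FEFF if present; otherwise returns the text unchanged. */
    static String stripBom(String text) {
        return (!text.isEmpty() && text.charAt(0) == '\uFEFF') ? text.substring(1) : text;
    }

    /** Reads a file as UTF-8 (an assumption about the input) and drops a leading BOM. */
    static String readXml(Path path) throws IOException {
        byte[] bytes = Files.readAllBytes(path);
        // Java's UTF-8 decoder keeps the BOM as the character U+FEFF, so strip it afterwards.
        return stripBom(new String(bytes, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Write a small BOM-prefixed file to a temporary location for the demo.
        Path tmp = Files.createTempFile("bom-demo", ".xml");
        Files.write(tmp, "\uFEFF<?xml version=\"1.0\"?><root/>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readXml(tmp)); // prints <?xml version="1.0"?><root/>
        Files.delete(tmp);
    }
}
```

Because this version leaves the rest of the content untouched, line breaks and any text inside the document are preserved as they were in the file.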

5 comments:

Knowledge Sharing said...: Nice post.
This is simple and best solution.
It helped me a lot.; April 17, 2012 at 12:09 PM
Knowledge Sharing said...: This is nice post. Simple, easy and quick. It helped me a lot.
Thanks again. Keep up the good work.; April 17, 2012 at 12:09 PM
Anonymous said...: nice. thanks; January 22, 2014 at 3:04 AM
Anonymous said...: Damn Useful, Thanks for posting it.; April 24, 2014 at 6:37 AM
Bibhaw said...: Thank you soc much for the clear explanation.; May 25, 2015 at 4:01 AM
