Remove Byte Order Mark (BOM) characters from XML in Java

Monday, October 10, 2011

Sometimes, in UTF-8 encoded XML files, you might notice three invisible bytes at the start of the file content. Depending on the editor you open the file in, these bytes show up in one of the following ways.
ï»¿<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
or
<U+FEFF><?xml version="1.0" encoding="UTF-8" standalone="yes"?>

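These three bytes (EF BB BF, shown as "ï»¿" when displayed as Latin-1 text) are the UTF-8 encoding of the Byte Order Mark (BOM), the character U+FEFF. In UTF-16 and UTF-32 the BOM indicates the byte order; UTF-8 has no byte-order ambiguity, so there the BOM only acts as an encoding signature. The following table summarizes some properties of each UTF encoding.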

Name                        UTF-8   UTF-16   UTF-16BE    UTF-16LE       UTF-32   UTF-32BE    UTF-32LE
Smallest code point         0000    0000     0000        0000           0000     0000        0000
Largest code point          10FFFF  10FFFF   10FFFF      10FFFF         10FFFF   10FFFF      10FFFF
Code unit size              8 bits  16 bits  16 bits     16 bits        32 bits  32 bits     32 bits
Byte order                  N/A     <BOM>    big-endian  little-endian  <BOM>    big-endian  little-endian
Fewest bytes per character  1       2        2           2              4        4           4
Most bytes per character    4       4        4           4              4        4           4
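
The BOM has a different byte sequence in each encoding, so the first few bytes of a file are enough to tell which one is present. The following is a minimal sketch of that check; the class name BomDetector and the method detectBom are hypothetical names chosen for illustration, not part of any library.

```java
final class BomDetector {
    private BomDetector() {}

    /** Returns the encoding implied by a leading BOM, or null if no BOM is found. */
    static String detectBom(byte[] head) {
        // Check the 4-byte UTF-32 marks first: FF FE 00 00 begins with the UTF-16LE mark FF FE.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // Compares the leading bytes of data, treated as unsigned values, with the given prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Passing the first four bytes of a file is sufficient. A result of null only means that no BOM was found, not that the file uses any particular encoding.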

In Java, the program should remove these three bytes before passing the XML to the parser. Otherwise, it can end up with a javax.xml.xpath.XPathExpressionException and the following error message.
Caused by: org.xml.sax.SAXParseException: Content is not allowed in prolog.
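
To see the failure in isolation, the sketch below hands the parser a string that still starts with U+FEFF. It assumes the default JAXP parser bundled with the JDK; the class name PrologErrorDemo and the method failsToParse are hypothetical names used only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

final class PrologErrorDemo {
    private PrologErrorDemo() {}

    /** Returns true if the parser rejects the text with a SAXParseException. */
    static boolean failsToParse(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return false;
        } catch (SAXParseException e) {
            // With a leading U+FEFF this is where the "prolog" error is expected to surface.
            return true;
        } catch (Exception e) {
            // Anything else (parser configuration, I/O) is not what this demo is about.
            throw new IllegalStateException(e);
        }
    }
}
```

Calling failsToParse with "\uFEFF<?xml version=\"1.0\"?><root/>" should report a failure, while the same text without the leading character should parse cleanly.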

The program needs to load the file content as a String and remove everything in front of the "<?xml" declaration. The following code removes the unwanted content from the XML file.



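import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.regex.Pattern;

// Matches a run of non-word characters (such as the BOM) in front of the first '<' of a line.
Pattern junkPattern = Pattern.compile("^([\\W]+)<");
String fileName = "IDOPM.xml";
File loadFile = new File(fileName);
StringBuilder fileContents = new StringBuilder();
// FileReader decodes the file with the default charset; the reader is closed automatically.
try (BufferedReader input = new BufferedReader(new FileReader(loadFile))) {
    String line;
    while ((line = input.readLine()) != null) {
        // Replace the leading junk together with the '<' by a plain '<'.
        line = junkPattern.matcher(line.trim()).replaceFirst("<");
        fileContents.append(line);
    }
}
System.out.println(fileContents.toString());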



Here, I used the regular expression ^([\\W]+)< to remove all the non-word characters that appear in front of the < character.
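
The effect of the expression is easiest to see on a few sample strings. The sketch below wraps the same pattern in a helper and, for comparison, adds a narrower variant that drops only a single leading U+FEFF. That variant is an alternative introduced here, not the method used in this article, and the names LeadingJunk, stripWithRegex and stripBomOnly are hypothetical.

```java
import java.util.regex.Pattern;

final class LeadingJunk {
    // Same expression as in the article: non-word characters in front of the first '<'.
    private static final Pattern JUNK = Pattern.compile("^([\\W]+)<");

    private LeadingJunk() {}

    /** Removes any run of non-word characters that precedes the first '<'. */
    static String stripWithRegex(String text) {
        return JUNK.matcher(text).replaceFirst("<");
    }

    /** Narrower alternative: removes only one leading U+FEFF and leaves everything else alone. */
    static String stripBomOnly(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }
}
```

Both helpers turn "\uFEFF<?xml version=\"1.0\"?>" into "<?xml version=\"1.0\"?>". The regular expression also handles the "ï»¿" form, but because \W matches any non-word character, a line such as "--><root/>" is shortened to "<root/>" as well, which may matter when the pattern is applied to every line of a file.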

5 comments:

Knowledge Sharing said...

Nice post.
This is simple and best solution.
It helped me a lot.

Knowledge Sharing said...

This is nice post. Simple, easy and quick. It helped me a lot.
Thanks again. Keep up the good work.

Anonymous said...

nice. thanks

Anonymous said...

Damn Useful, Thanks for posting it.

Bibhaw said...

Thank you so much for the clear explanation.
