Posted to j-users@xerces.apache.org by Joseph Kesselman <ke...@us.ibm.com> on 2003/01/03 05:27:28 UTC

Re: Valid XML characters

On Thursday, 12/26/2002 at 07:23 ZE2, "Dima Gutzeit" <di...@mailvision.net> wrote:
> Sometimes when parsing XML files I get an error message (exception) about
> "invalid Unicode characters", is there any way to filter those before
> parsing?

There's no way to do that within the parser. "If it contains illegal 
characters, it isn't XML" and the error messages are entirely correct.

You could, of course, write your own stream filter and pass the data 
through that, then use its output as the input to the parser. That's 
fairly straightforward Java coding. The problem would be deciding what 
you're going to do with those characters when you see them -- if you just 
discard them you may be changing the meaning of the document, and if you 
turn them into some sort of private escape sequence only applications 
which understand that convention will be able to do anything with them. 
Fixing the source documents really is the cleanest answer.
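
As a rough illustration of such a stream filter, here is a minimal sketch that takes the "just discard them" option, with the caveat above that this may change the meaning of the document. The class name XmlCharFilterReader and its filter() helper are made up for this example and are not part of Xerces; the legal ranges follow the Char production of XML 1.0, and unpaired surrogates are dropped as well.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper (not a Xerces class): silently drops characters
// that the XML 1.0 Char production does not allow.
class XmlCharFilterReader extends FilterReader {
    private final PushbackReader src; // lets us peek one char past a high surrogate
    private int pendingLow = -1;      // low half of a valid pair, returned on the next read()

    XmlCharFilterReader(Reader in) {
        super(new PushbackReader(in, 1));
        this.src = (PushbackReader) this.in;
    }

    // XML 1.0: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    @Override
    public int read() throws IOException {
        if (pendingLow != -1) {
            int c = pendingLow;
            pendingLow = -1;
            return c;
        }
        int c;
        while ((c = src.read()) != -1) {
            if (Character.isHighSurrogate((char) c)) {
                int next = src.read();
                if (next != -1 && Character.isLowSurrogate((char) next)) {
                    pendingLow = next; // valid pair: always a legal supplementary character
                    return c;
                }
                if (next != -1) {
                    src.unread(next);  // re-examine it on the next loop iteration
                }
                continue;              // unpaired high surrogate: drop it
            }
            if (isXmlChar(c)) {
                return c;
            }
            // anything else (control characters, unpaired low surrogates,
            // #xFFFE, #xFFFF) is discarded
        }
        return -1;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) {
                break;
            }
            buf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }

    // Convenience wrapper for trying the filter out on a String.
    static String filter(String s) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new XmlCharFilterReader(new StringReader(s))) {
            int c;
            while ((c = r.read()) != -1) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

The filtered Reader can then be handed to the parser, for example by wrapping it in an org.xml.sax.InputSource. Note that this sketch only sees literal characters in the decoded stream: it does not touch numeric character references such as &#0; and it does not help if the bytes are invalid in the document's declared encoding. It also does not override skip(), so callers should stick to read().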

For what it's worth: It has been proposed that future versions of XML 
*may* relax the forbidden-character restrictions, but there's still no 
firm consensus on whether that change would be desirable or what version 
of XML it might find its way into.

______________________________________
Joe Kesselman  / IBM Research

