Posted to j-users@xerces.apache.org by Alistair Young <al...@smo.uhi.ac.uk> on 2005/03/14 14:08:05 UTC

"Unconvertible UTF-8 character beginning with 0x91"

I wonder if anyone can suggest a way of ignoring the above error, or of
substituting a "default" character for the bytes that cause it?

I'm using Xalan to prettify an RSS feed, but one of the posts was
corrupted during a database upgrade on the weblog. Characters that used
to be single quotes are now the bytes 0x91 and 0x92, which are not valid
UTF-8.

Xalan is using Xerces to parse the RSS XML. Is there any way to
substitute default characters for the corrupted ones?

thanks,
Alistair


---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org


RE: "Unconvertible UTF-8 character beginning with 0x91"

Posted by "F. Andy Seidl" <fa...@myst-technology.com>.
One approach would be to write your own Reader that "fixes" invalid UTF-8
sequences as it encounters them.  The fix could be to ignore the sequence,
or, probably better, to replace the sequence with a best-guess character.
For example, you could assume that each invalid byte is really a
Windows-1252 character (where 0x91 and 0x92 are the left and right curly
single quotes) and make the appropriate substitution.  It is important that
this reader also handle *valid* UTF-8 sequences correctly.
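To make that concrete, here is a minimal sketch of such a repairing decoder.
The class and method names (LenientUtf8, open, decode, utf8Length) are
hypothetical and not part of Xerces or Xalan, and the fallback to the
Windows-1252 code page is an assumption about where the stray bytes came
from.  For brevity it buffers the whole stream rather than decoding
incrementally.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Xerces or Xalan.
class LenientUtf8 {
    // Assumption: stray bytes are Windows-1252 (0x91 -> U+2018, 0x92 -> U+2019).
    private static final Charset FALLBACK = Charset.forName("windows-1252");

    /** Reads the whole stream and returns a Reader over the repaired text. */
    static Reader open(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        for (int n; (n = in.read(chunk)) != -1; ) {
            buf.write(chunk, 0, n);
        }
        return new StringReader(decode(buf.toByteArray()));
    }

    /** Keeps valid UTF-8 sequences; maps every other byte through FALLBACK. */
    static String decode(byte[] in) {
        StringBuilder out = new StringBuilder(in.length);
        int i = 0;
        while (i < in.length) {
            int len = utf8Length(in, i);
            if (len > 0) {
                out.append(new String(in, i, len, StandardCharsets.UTF_8));
                i += len;
            } else {
                // Invalid here: reinterpret this single byte and move on.
                out.append(new String(in, i, 1, FALLBACK));
                i++;
            }
        }
        return out.toString();
    }

    /** Length of the valid UTF-8 sequence starting at i, or 0 if there is none. */
    static int utf8Length(byte[] b, int i) {
        int b0 = b[i] & 0xFF;
        if (b0 < 0x80) return 1;               // plain ASCII
        int len;
        if (b0 >= 0xC2 && b0 <= 0xDF) len = 2;
        else if (b0 >= 0xE0 && b0 <= 0xEF) len = 3;
        else if (b0 >= 0xF0 && b0 <= 0xF4) len = 4;
        else return 0;                         // 0x80-0xC1 and 0xF5-0xFF cannot start a sequence
        if (i + len > b.length) return 0;      // truncated at end of input
        int cp = b0 & (0xFF >> (len + 1));     // payload bits of the lead byte
        for (int k = 1; k < len; k++) {
            int bk = b[i + k] & 0xFF;
            if ((bk & 0xC0) != 0x80) return 0; // not a continuation byte
            cp = (cp << 6) | (bk & 0x3F);
        }
        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        if (len == 3 && (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF))) return 0;
        if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return 0;
        return len;
    }
}
```

A few Windows-1252 byte values (0x81, for instance) have no assigned
character, so they will most likely come out as U+FFFD rather than as
something readable.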

Armed with such a reader, you could attempt to parse a document normally.
If that fails, and if the document claims to be UTF-8 encoded, then you
would retry the parse using your custom Reader (rather than letting the
parser create its own reader).
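The retry logic might look roughly like the sketch below, using the standard
JAXP DocumentBuilder.  RetryingParser, claimsUtf8 and the "repair" function
are hypothetical names; "repair" stands in for whatever custom Reader you
end up writing.  The encoding check is a deliberately crude heuristic that
only looks at the XML declaration and does not try to detect UTF-16.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: parse normally, then retry through a repairing Reader.
class RetryingParser {
    private static final Pattern ENCODING =
        Pattern.compile("^<\\?xml[^>]*encoding\\s*=\\s*[\"']([^\"']+)[\"']");

    /** Crude check: no encoding declaration, or one that says UTF-8. */
    static boolean claimsUtf8(byte[] raw) {
        int n = Math.min(raw.length, 200);
        String head = new String(raw, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(head);
        return !m.find() || m.group(1).equalsIgnoreCase("UTF-8");
    }

    private static DocumentBuilder newBuilder() throws ParserConfigurationException {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    /**
     * @param raw    the complete document bytes
     * @param repair caller-supplied function that turns the bytes into a
     *               Reader with invalid sequences already replaced
     */
    static Document parse(byte[] raw, Function<byte[], Reader> repair)
            throws IOException, SAXException, ParserConfigurationException {
        try {
            // First attempt: let the parser create its own reader.
            return newBuilder().parse(new ByteArrayInputStream(raw));
        } catch (IOException | SAXException e) {
            // This retries on any parse failure, not only encoding errors;
            // an unrelated error will simply be thrown again by the retry.
            if (!claimsUtf8(raw)) {
                throw e;
            }
            // Second attempt: hand the parser a character stream instead.
            return newBuilder().parse(new InputSource(repair.apply(raw)));
        }
    }
}
```

Because the retry gives the parser a character stream, the encoding named in
the XML declaration no longer matters: SAX parsers read a supplied Reader
directly and disregard the declared encoding.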

  -- fas
 F. Andy Seidl, Co-founder
MyST Technology Partners
http://myst-technology.com | http://blogsite.com 
 
 
