Posted to j-users@xerces.apache.org by Jon Shoberg <js...@cbd.net> on 2002/01/08 22:54:50 UTC
Help - Other options - An invalid XML character (Unicode: 0x6) was found in the element content of the document
Is there another XML parser worth trying that is a little more "dumb"?
I am trying to parse the content feed from dmoz.com
(http://www.dmoz.com/rdf). I just happened to hit a bad Unicode character in
the element content of the document. Is there a way to override a fatal
error and keep parsing? At this point I'm starting to look for a C++ parser
that can work past this. Expat? IBM C++ Parser? Others? Suggestions?
Maybe time for (gasp) Perl? :)
It's taken me forever and a day to conclude that it's likely not the encoding
it's complaining about but rather the specific character in the element
content. I've tried reading the document once and converting each character
to UTF-8. That process worked fine, but on reading it back through Xerces I
hit the same problem ....
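The conclusion above can be checked without any parser. XML 1.0 only permits the characters #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF, so 0x6 is illegal whatever encoding the document declares. The following is a minimal Java sketch and not something posted in this thread: the class name InvalidXmlCharScan and its methods are hypothetical, and it assumes the file really is UTF-8. It reports the first offending character as line:column:hex. The column is counted in UTF-16 units, so it may not match the number Xerces prints.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class InvalidXmlCharScan {

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns "line:column:0xHEX" for the first invalid character, or null if there is none. */
    static String firstInvalid(Reader in) throws IOException {
        int line = 1, col = 0, c;
        while ((c = in.read()) != -1) {
            col++;
            int cp = c;
            if (Character.isHighSurrogate((char) c)) {
                // A high surrogate is only legal when a low surrogate follows it.
                int low = in.read();
                if (low == -1 || !Character.isLowSurrogate((char) low)) {
                    return line + ":" + col + ":0x" + Integer.toHexString(c);
                }
                cp = Character.toCodePoint((char) c, (char) low);
                col++;
            }
            if (!isXml10Char(cp)) {
                return line + ":" + col + ":0x" + Integer.toHexString(cp);
            }
            if (cp == '\n') {
                line++;
                col = 0;
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java InvalidXmlCharScan <file>");
            return;
        }
        // Assumption: the input is UTF-8. Malformed byte sequences are silently
        // replaced by the decoder and are not reported by this scan.
        try (Reader in = new BufferedReader(new InputStreamReader(
                new FileInputStream(args[0]), StandardCharsets.UTF_8))) {
            String hit = firstInvalid(in);
            System.out.println(hit == null
                ? "no invalid characters found"
                : "first invalid character at " + hit);
        }
    }
}
```

Because the check works on decoded characters, re-encoding the file to UTF-8 cannot make the error go away: a 0x6 stays a 0x6 in any encoding.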
Any thoughts would be appreciated, more info below ...
Jon
So here is the error I am getting .....
##############################
[Fatal Error] :2064390:158: An invalid XML character (Unicode: 0x6) was
found in the element content of the document.
org.xml.sax.SAXException: Stopping after fatal error: An invalid XML
character (Unicode: 0x6) was found in the element content of the document.
at org.apache.xerces.framework.XMLParser.reportError(XMLParser.java:1245)
at org.apache.xerces.framework.XMLDocumentScanner.reportFatalXMLError(XMLDocumentScanner.java:588)
at org.apache.xerces.framework.XMLDocumentScanner$ContentDispatcher.dispatch(XMLDocumentScanner.java:1304)
at org.apache.xerces.framework.XMLDocumentScanner.parseSome(XMLDocumentScanner.java:381)
at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:1098)
at dmoz.main(dmoz.java:134)
##############################
The XML document I am dealing with does start properly with ....
##############################
<?xml version='1.0' encoding='UTF-8' ?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
xmlns:d="http://purl.org/dc/elements/1.0/"
xmlns="http://dmoz.org/rdf">
<!-- Generated at 2002-01-03 04:01:53 GMT on -->
<Topic r:id="Top">
<catid>1</catid>
</Topic>
...
...
... etc
##############################
---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org
Re: Help - Other options
Posted by Andy Clark <an...@apache.org>.
Jon Shoberg wrote:
> You have a unicode 0x6 hanging in there with two other oddball symbols
> (E?). When I get back to work in the morning I'll give this a shot ...
> Once again, I certainly appreciate the replies.
Strange. I tried various encodings (including Japanese since
the text mentions kanji characters) with no luck. My guess is
that the characters output are either a) in an unknown encoding
that is not specified in the document, or b) there is garbage
in the database. Neither of which helps you much, though... :(
--
Andy Clark * andyc@apache.org
RE: Help - Other options
Posted by Jon Shoberg <js...@cbd.net>.
Andy,
Thanks for your help, and thanks to the others who have replied. I'd agree
I'm certainly between a rock and a hard place. :( I've already sent a reply
back to the editor to fix the listing. Since this is a 900+ MB XML/RDF file,
I can't exactly remove all of the offending characters by hand.
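For a file too large to clean by hand, one option is to strip the illegal characters on the fly while the parser reads. What follows is a rough Java sketch under stated assumptions, not something proposed in this thread: the names XmlCharFilterDemo, XmlCharFilterReader and clean are hypothetical, the filter only drops C0 control characters other than tab, LF and CR plus U+FFFE and U+FFFF, and it passes surrogates through untouched. Dropping characters silently changes the data, so it is a workaround rather than a fix.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class XmlCharFilterDemo {

    /** Reader that silently drops characters XML 1.0 does not allow. */
    static class XmlCharFilterReader extends FilterReader {
        XmlCharFilterReader(Reader in) {
            super(in);
        }

        static boolean isDroppable(char c) {
            return (c < 0x20 && c != 0x9 && c != 0xA && c != 0xD)
                || c == 0xFFFE || c == 0xFFFF;
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            int n;
            // Loop so a chunk made up only of dropped characters is not
            // mistaken for end of stream.
            while ((n = super.read(buf, off, len)) > 0) {
                int kept = off;
                for (int i = off; i < off + n; i++) {
                    if (!isDroppable(buf[i])) {
                        buf[kept++] = buf[i];
                    }
                }
                if (kept > off) {
                    return kept - off;
                }
            }
            return n; // -1 at end of stream, 0 if len was 0
        }

        @Override
        public int read() throws IOException {
            int c;
            do {
                c = super.read();
            } while (c != -1 && isDroppable((char) c));
            return c;
        }
    }

    /** Helper for small inputs: returns s with the illegal characters removed. */
    static String clean(String s) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new XmlCharFilterReader(new StringReader(s))) {
            char[] buf = new char[4096];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<a>x\u0006y</a>"; // contains the 0x6 from the error message
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                System.out.println("text: " + new String(ch, start, length));
            }
        });
        // The parser only ever sees the filtered character stream.
        reader.parse(new InputSource(new XmlCharFilterReader(new StringReader(xml))));
    }
}
```

For the real feed the StringReader would be replaced by a buffered InputStreamReader over the file. Note that handing the parser a Reader means the characters are already decoded, so the encoding has to be chosen when the Reader is created.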
As a reference point, take a look at the first listing and description for
"Crystal Clear Characters." The name is a bit ironic ...
http://search.dmoz.org/cgi-bin/search?search=ccc&all=no&cat=Arts
You have a unicode 0x6 hanging in there with two other oddball symbols
(E?). When I get back to work in the morning I'll give this a shot ...
Once again, I certainly appreciate the replies.
Jon
PS: Andy, I'll start reading the docs better too :)
-----Original Message-----
From: Andy Clark [mailto:andyc@apache.org]
Sent: Tuesday, January 08, 2002 7:41 PM
To: xerces-j-user@xml.apache.org
Subject: Re: Help - Other options - An invalid XML character (Unicode:
0x6) was found in the element content of the document
Jon Shoberg wrote:
> Is there another XML parser worth trying that is a little more "dumb"?
>
> I am trying to parse the content feed from dmoz.com
> (http://www.dmoz.com/rdf). I just happened to hit a bad Unicode character in
> the element content of the document. Is there a way to override a fatal
> error and keep parsing? At this point I'm starting to look for a C++ parser
There is a Xerces feature that lets you continue after fatal
errors. I don't recommend its use but you're stuck between a
rock and a hard place if the document you have to deal with
is not well-formed. The featureId is the following:
http://apache.org/xml/features/continue-after-fatal-error
Check the following page for documentation:
http://xml.apache.org/xerces2-j/features.html
I'm pointing you to the Xerces2 documentation but the feature
also exists in Xerces 1.x, as well.
But you really should try to attack this problem at the source --
go to the folks at DMOZ and get them to produce valid and
well-formed XML.
--
Andy Clark * andyc@apache.org
Re: Help - Other options - An invalid XML character (Unicode: 0x6) was
found in the element content of the document
Posted by Andy Clark <an...@apache.org>.
Jon Shoberg wrote:
> Is there another XML parser worth trying that is a little more "dumb"?
>
> I am trying to parse the content feed from dmoz.com
> (http://www.dmoz.com/rdf). I just happened to hit a bad Unicode character in
> the element content of the document. Is there a way to override a fatal
> error and keep parsing? At this point I'm starting to look for a C++ parser
There is a Xerces feature that lets you continue after fatal
errors. I don't recommend its use but you're stuck between a
rock and a hard place if the document you have to deal with
is not well-formed. The featureId is the following:
http://apache.org/xml/features/continue-after-fatal-error
Check the following page for documentation:
http://xml.apache.org/xerces2-j/features.html
I'm pointing you to the Xerces2 documentation but the feature
also exists in Xerces 1.x, as well.
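Setting the feature from SAX code could look roughly like the sketch below. It is an illustration under assumptions rather than code from this thread: the class ContinueAfterFatalDemo and the method countFatalErrors are hypothetical names, and it obtains the parser through JAXP, assuming the parser that JAXP returns recognizes the Xerces feature ID (setFeature throws SAXNotRecognizedException if it does not). The error handler records fatal errors instead of rethrowing them, since a handler that throws would stop the parse anyway.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ContinueAfterFatalDemo {

    // Feature ID quoted in the message above.
    static final String CONTINUE_AFTER_FATAL =
        "http://apache.org/xml/features/continue-after-fatal-error";

    /** Parses the given XML text and returns how many fatal errors were reported. */
    static int countFatalErrors(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // Throws SAXNotRecognizedException if the parser does not know the feature.
        reader.setFeature(CONTINUE_AFTER_FATAL, true);

        final int[] fatal = {0};
        reader.setErrorHandler(new DefaultHandler() {
            @Override
            public void fatalError(SAXParseException e) {
                // Record and report the problem, but do not rethrow.
                fatal[0]++;
                System.err.println("fatal at " + e.getLineNumber() + ":"
                    + e.getColumnNumber() + ": " + e.getMessage());
            }
        });

        reader.parse(new InputSource(new StringReader(xml)));
        return fatal[0];
    }

    public static void main(String[] args) throws Exception {
        // "\u0006" is the invalid character from the original error message.
        int n = countFatalErrors("<a>x\u0006y</a>");
        System.out.println("fatal errors reported: " + n);
    }
}
```

Whatever the parser delivers after the first fatal error should be treated with suspicion, because the document is by definition not well-formed at that point.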
But you really should try to attack this problem at the source --
go to the folks at DMOZ and get them to produce valid and
well-formed XML.
--
Andy Clark * andyc@apache.org