Posted to j-users@xerces.apache.org by Jon Shoberg <js...@cbd.net> on 2002/01/08 22:54:50 UTC

Help - Other options - An invalid XML character (Unicode: 0x6) was found in the element content of the document

Is there another XML parser worth trying that is a little more "dumb"?

I am trying to parse the content feed from dmoz.com
(http://www.dmoz.com/rdf).  I just happened to hit a bad Unicode character in
the element content of the document.  Is there a way to override a fatal
error and keep parsing?  At this point I'm starting to look for a C++ parser
that can work past this. Expat? IBM C++ Parser? Others? Suggestions?
Maybe time for (gasp) Perl? :)

It's taken me forever and a day to conclude that it's likely not the encoding
it's complaining about but the specific character in the element content.
I've tried reading the document once and converting each character to UTF-8.
That process worked fine, but on reading the result back through Xerces,
same problem ....

Any thoughts would be appreciated, more info below ...

Jon

So here is the error I am getting .....
##############################
[Fatal Error] :2064390:158: An invalid XML character (Unicode: 0x6) was found in the element content of the document.
org.xml.sax.SAXException: Stopping after fatal error: An invalid XML character (Unicode: 0x6) was found in the element content of the document.
        at org.apache.xerces.framework.XMLParser.reportError(XMLParser.java:1245)
        at org.apache.xerces.framework.XMLDocumentScanner.reportFatalXMLError(XMLDocumentScanner.java:588)
        at org.apache.xerces.framework.XMLDocumentScanner$ContentDispatcher.dispatch(XMLDocumentScanner.java:1304)
        at org.apache.xerces.framework.XMLDocumentScanner.parseSome(XMLDocumentScanner.java:381)
        at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:1098)
        at dmoz.main(dmoz.java:134)
##############################

The XML document I am dealing with does start properly with ....
##############################
<?xml version='1.0' encoding='UTF-8' ?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf">

<!-- Generated at 2002-01-03 04:01:53 GMT on  -->

<Topic r:id="Top">
  <catid>1</catid>
</Topic>
...
...
... etc
##############################




---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org


Re: Help - Other options

Posted by Andy Clark <an...@apache.org>.
Jon Shoberg wrote:
>         You have a unicode 0x6 hanging in there with two other oddball symbols
> (E?).  When I get back to work in the morning I'll give this a shot ...
> Once again, I certainly appreciate the replies.

Strange. I tried various encodings (including Japanese since
the text mentions kanji characters) with no luck. My guess is
that the characters output are either a) in an unknown encoding
that is not specified in the document, or b) there is garbage
in the database. Neither of which helps you much, though... :(

-- 
Andy Clark * andyc@apache.org



RE: Help - Other options

Posted by Jon Shoberg <js...@cbd.net>.
Andy,

	Thanks for your help, and thanks to the others who have replied.  I'd agree
it's certainly between a rock and a hard place. :(  I've already sent a reply
back to the editor to fix the listing.  Since it's a 900+MB XML/RDF file, I
can't exactly remove all of the offending characters by hand.
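
One way around hand-editing a file that size is to strip the offending characters as the file is read, before the parser sees them. The sketch below is only an illustration, not something proposed in this thread: `XmlCharFilterReader` and its `clean` helper are made-up names, and it assumes that silently dropping anything outside the XML 1.0 character range (which excludes U+0006) is acceptable for this feed.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical helper: drops characters that XML 1.0 does not allow. */
final class XmlCharFilterReader extends FilterReader {

    XmlCharFilterReader(Reader in) {
        super(in);
    }

    /** True for UTF-16 code units that may appear in an XML 1.0 document. */
    static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD)
            // Surrogate halves encode U+10000..U+10FFFF; pairing is not checked here.
            || Character.isSurrogate(c);
    }

    @Override
    public int read() throws IOException {
        int c;
        while ((c = super.read()) != -1) {
            if (isAllowed((char) c)) {
                return c;
            }
        }
        return -1;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(buf, off, len);
            if (n <= 0) {
                return n; // end of stream (or nothing read)
            }
            int kept = 0;
            for (int i = 0; i < n; i++) {
                char c = buf[off + i];
                if (isAllowed(c)) {
                    buf[off + kept++] = c; // compact the allowed characters in place
                }
            }
            if (kept > 0) {
                return kept;
            }
            // Every character in this chunk was dropped; read another chunk.
        }
    }

    /** Convenience wrapper for small strings, mainly useful for trying the filter out. */
    static String clean(String s) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new XmlCharFilterReader(new StringReader(s))) {
            char[] buf = new char[4096];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

Wrapping the file's Reader in this filter and handing it to the parser through an `org.xml.sax.InputSource` would keep the 900+MB file streaming instead of loading it into memory. Note that the dropped characters are lost, so the listing text may still need fixing at the source.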

	As a reference point, take a look at the first listing and description for
"Crystal Clear Characters." The name is a bit ironic ...
	http://search.dmoz.org/cgi-bin/search?search=ccc&all=no&cat=Arts

	You have a unicode 0x6 hanging in there with two other oddball symbols
(E?).  When I get back to work in the morning I'll give this a shot ...
Once again, I certainly appreciate the replies.

Jon

PS: Andy, I'll start reading the docs better too :)


-----Original Message-----
From: Andy Clark [mailto:andyc@apache.org]
Sent: Tuesday, January 08, 2002 7:41 PM
To: xerces-j-user@xml.apache.org
Subject: Re: Help - Other options - An invalid XML character (Unicode:
0x6) was found in the element content of the document


Jon Shoberg wrote:
> Is there another XML parser worth trying that is a little more "dumb"?
>
> I am trying to parse the content feed from dmoz.com
> (http://www.dmoz.com/rdf).  I just happened to hit a bad Unicode character in
> the element content of the document.  Is there a way to override a fatal
> error and keep parsing?  At this point I'm starting to look for a C++ parser

There is a Xerces feature that lets you continue after fatal
errors. I don't recommend its use but you're stuck between a
rock and a hard place if the document you have to deal with
is not well-formed. The featureId is the following:

  http://apache.org/xml/features/continue-after-fatal-error

Check the following page for documentation:

  http://xml.apache.org/xerces2-j/features.html

I'm pointing you to the Xerces2 documentation but the feature
also exists in Xerces 1.x, as well.

But you really should try to attack this problem at the source --
go to the folks at DMOZ and get them to produce valid and
well-formed XML.

--
Andy Clark * andyc@apache.org



Re: Help - Other options - An invalid XML character (Unicode: 0x6) was found in the element content of the document

Posted by Andy Clark <an...@apache.org>.
Jon Shoberg wrote:
> Is there another XML parser worth trying that is a little more "dumb"?
> 
> I am trying to parse the content feed from dmoz.com
> (http://www.dmoz.com/rdf).  I just happened to hit a bad Unicode character in
> the element content of the document.  Is there a way to override a fatal
> error and keep parsing?  At this point I'm starting to look for a C++ parser

There is a Xerces feature that lets you continue after fatal
errors. I don't recommend its use but you're stuck between a
rock and a hard place if the document you have to deal with 
is not well-formed. The featureId is the following:

  http://apache.org/xml/features/continue-after-fatal-error

Check the following page for documentation:

  http://xml.apache.org/xerces2-j/features.html

I'm pointing you to the Xerces2 documentation but the feature
also exists in Xerces 1.x, as well.

But you really should try to attack this problem at the source --
go to the folks at DMOZ and get them to produce valid and
well-formed XML.
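
For reference, turning the feature on through the standard SAX API might look roughly like the sketch below. `LenientParse` and `Recorder` are names made up for this illustration; only the feature URI comes from the message above. It assumes the default JAXP parser is Xerces-based and recognizes the feature (if not, `setFeature` throws and the sketch rethrows it). The Xerces features page describes parser behavior with this feature enabled as undetermined, so anything reported after the first fatal error should be treated with suspicion.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical example class; not part of Xerces. */
final class LenientParse {

    static final String CONTINUE_AFTER_FATAL =
        "http://apache.org/xml/features/continue-after-fatal-error";

    /** Records fatal errors instead of rethrowing them, and collects character data. */
    static final class Recorder extends DefaultHandler {
        final List<String> fatals = new ArrayList<>();
        final StringBuilder text = new StringBuilder();

        @Override
        public void fatalError(SAXParseException e) {
            // Not throwing here is what lets the parser carry on.
            fatals.add(e.getLineNumber() + ":" + e.getColumnNumber() + " " + e.getMessage());
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    static Recorder parse(String xml) {
        Recorder rec = new Recorder();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setFeature(CONTINUE_AFTER_FATAL, true);
            reader.setContentHandler(rec);
            reader.setErrorHandler(rec);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse could not be set up or completed", e);
        }
        return rec;
    }

    public static void main(String[] args) {
        // A tiny document with a U+0006 in element content, like the DMOZ listing.
        Recorder rec = parse("<a>x" + (char) 6 + "y</a>");
        System.out.println("fatal errors recorded: " + rec.fatals.size());
        for (String f : rec.fatals) {
            System.out.println("  " + f);
        }
    }
}
```

Both the feature and a non-throwing `fatalError` handler are needed here: if the handler rethrows, parsing stops regardless of the feature setting.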

-- 
Andy Clark * andyc@apache.org
