Posted to c-users@xerces.apache.org by Pierre Belzile <pi...@gmail.com> on 2006/07/03 00:12:55 UTC

Handling entities with partial DOCTYPE

Hi,

I grabbed a web page from a news web site, ran it through "tidy" to obtain
xhtml and attempted to parse it using SAX2. It throws an exception on
DOCTYPE and (if I remove it) the first "&nbsp;".

The document DOCTYPE is:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="en" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">

This fails to parse in Xerces-C. I suspect that is because the publicId is
present but the systemId is missing. Is there any way to make this work
without editing the doc?

If I remove the DOCTYPE line or choose the DG Scanner, parsing then fails
because there is no DTD and "nbsp" is never declared as an entity. I
suspect a browser does entity replacement like this automatically. Is there
a good way for me to add standard entities to the grammar (i.e., beyond the
5 basic ones that it already knows about)? Or is it time to switch to
another approach (recommendations?)

Cheers, Pierre

RE: Handling entities with partial DOCTYPE

Posted by Bill <wi...@comcast.net>.
Hi Pierre,

I took a similar approach to reading web pages with Xerces-C.  As far as
entities are concerned, the only one that gave me a similar problem was
&nbsp;, and I ended up scanning for it myself and replacing it. Perhaps
not the most savvy approach, but it works fine now.  I don't remember having
any problem with the DOCTYPE.
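
For illustration, a pre-pass of the kind described above could be sketched
in C++ as follows. The helper names replaceAll and fixNbsp are hypothetical
(they are not part of Xerces-C), and the sketch assumes the tidied page is
already held in a std::string. It substitutes the numeric character
reference &#160; (U+00A0, no-break space), which an XML parser resolves
without needing any DTD.

```cpp
#include <string>

// Replace every occurrence of `from` in `text` with `to`.
// Hypothetical helper for illustration; not part of Xerces-C.
std::string replaceAll(std::string text,
                       const std::string& from,
                       const std::string& to) {
    if (from.empty()) return text;  // nothing sensible to search for
    std::string::size_type pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();  // continue after the inserted text
    }
    return text;
}

// Swap the HTML-only named entity for its numeric character reference.
std::string fixNbsp(const std::string& xhtml) {
    return replaceAll(xhtml, "&nbsp;", "&#160;");
}
```

Note that this blind substitution also rewrites any "&nbsp;" that happens
to sit inside a comment or CDATA section, which is usually harmless for
scraped pages but worth knowing.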

Bill

-----Original Message-----
From: David Bertoni [mailto:dbertoni@apache.org] 
Sent: Monday, July 03, 2006 11:52 AM
To: c-users@xerces.apache.org
Subject: Re: Handling entities with partial DOCTYPE

Pierre Belzile wrote:
> Hi,
> 
> I grabbed a web page from a news web site, ran it through "tidy" to obtain
> xhtml and attempted to parse it using SAX2. It throws an exception on
> DOCTYPE and (if I remove it) the first "&nbsp;".

Unless the DTD defines what the entity "nbsp" is, the parser will report an
undefined entity error.

> 
> The document DOCTYPE is:
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
> <html lang="en" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
> 
> This fails to parse in Xercesc. I suspect because the publicId is found but
> the systemId is missing. Any way to make this work without editing the doc?

The XML recommendation requires a System ID if a public ID is specified:

http://www.w3.org/TR/REC-xml/#NT-ExternalID
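
The production in question reads:

    ExternalID ::= 'SYSTEM' S SystemLiteral
                 | 'PUBLIC' S PubidLiteral S SystemLiteral

As a rough sketch, a check for the HTML-style form that omits the system
literal might look like this in C++. The function names countLiterals and
missingSystemId are made up for illustration, and the code assumes it is
handed just the text of the DOCTYPE declaration. It is a heuristic, not a
full parser for the declaration.

```cpp
#include <string>

// Count the quoted literals ("..." or '...') in `decl`, starting at `start`
// and stopping at the end of the declaration ('>') or at the start of an
// internal subset ('['). Hypothetical helper for illustration.
int countLiterals(const std::string& decl, std::string::size_type start) {
    int count = 0;
    for (std::string::size_type i = start; i < decl.size(); ++i) {
        const char c = decl[i];
        if (c == '>' || c == '[') break;
        if (c == '"' || c == '\'') {
            const std::string::size_type close = decl.find(c, i + 1);
            if (close == std::string::npos) break;  // unterminated literal
            ++count;
            i = close;  // skip past the literal's contents
        }
    }
    return count;
}

// True when the declaration uses PUBLIC but supplies fewer than the two
// literals (public ID, then system ID) that the XML grammar requires.
bool missingSystemId(const std::string& doctype) {
    const std::string::size_type pos = doctype.find("PUBLIC");
    if (pos == std::string::npos) return false;
    return countLiterals(doctype, pos + 6) < 2;
}
```

Applied to the DOCTYPE quoted above, missingSystemId returns true, since
only the public ID literal follows the PUBLIC keyword.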

Xerces-C is an XML parser, not an HTML parser, so who knows whether it could
even parse the HTML DTD?

> 
> If I remove the DOCTYPE line or choose the DG Scanner, now this fails
> because there is no DTD and "nbsp" is never specified as an entity. I
> suspect a browser does entity replacement like this automatically. Is there
> a good way for me to add standard entities to the grammar (i.e., beyond the
> 5 basic ones that it already knows about)? Or is it time to switch to
> another approach (recommendations?)

Tidy may fix up unbalanced elements, etc., but unless it replaces
pre-defined HTML entities with their actual code points, the parser will
report them as undefined entities.

You could always create your own DTD that contains all of the HTML entities.
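
One possible variant of that idea, sketched here under stated assumptions,
is to leave the file on disk alone and swap the DOCTYPE in memory for one
whose internal subset declares the entities. The function name
withEntityDoctype and the constant kEntityDecls are hypothetical, the four
entities are only a sample of the full HTML set, and the code assumes the
original DOCTYPE has no internal subset and no '>' inside its literals.

```cpp
#include <string>

// A small, illustrative sample of HTML entity declarations. A real list
// would cover the full set defined by the HTML/XHTML DTDs.
const char* const kEntityDecls =
    "<!ENTITY nbsp \"&#160;\">\n"
    "<!ENTITY copy \"&#169;\">\n"
    "<!ENTITY eacute \"&#233;\">\n"
    "<!ENTITY mdash \"&#8212;\">\n";

// Return a copy of `doc` whose DOCTYPE is replaced (or added, if absent)
// by a well-formed one carrying the entity declarations above.
// Hypothetical helper for illustration; not part of Xerces-C.
std::string withEntityDoctype(const std::string& doc) {
    const std::string doctype =
        std::string("<!DOCTYPE html [\n") + kEntityDecls + "]>";

    const std::string::size_type start = doc.find("<!DOCTYPE");
    if (start == std::string::npos) {
        // No DOCTYPE: insert one after the XML declaration, if any.
        std::string::size_type insertAt = 0;
        if (doc.compare(0, 5, "<?xml") == 0) {
            const std::string::size_type declEnd = doc.find("?>");
            if (declEnd != std::string::npos) insertAt = declEnd + 2;
        }
        return doc.substr(0, insertAt) + doctype + "\n" + doc.substr(insertAt);
    }

    const std::string::size_type end = doc.find('>', start);
    if (end == std::string::npos) return doc;  // malformed; leave untouched
    return doc.substr(0, start) + doctype + doc.substr(end + 1);
}
```

The resulting string could then be handed to the parser from memory, for
example through a MemBufInputSource, rather than from the original file.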

Dave

