Posted to j-users@xerces.apache.org by Jose Emanuel Palmeiro <je...@students.fct.unl.pt> on 2002/04/02 00:56:56 UTC

Error parsing HTML with Sax

Hi!

I'm parsing an HTML file with SAX, and an error is thrown when an
entity &nbsp is found. The error thrown is "The entity 'nbsp' was
referenced, but not declared". How can I extract this expression
(&nbsp or any other starting with &) without an error being thrown?

Best regards

-- 
José Emanuel Gomes das Neves Palmeiro
Consultor
EB-Focus, SA
Grupo Tecnidata

telefone: 91 6610830
e-mail: jose.palmeiro@eb-focus.pt

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org


Re: Error parsing HTML with Sax

Posted by Andy Clark <an...@apache.org>.
Jose Emanuel Palmeiro wrote:
> I'm parsing an HTML file with SAX, and an error is thrown when an
> entity &nbsp is found. The error thrown is "The entity 'nbsp' was
> referenced, but not declared". How can I extract this expression
> (&nbsp or any other starting with &) without an error being thrown?

XML parsers can *not* be used to read most HTML documents. HTML
is an application of SGML, and XML is a restricted subset of SGML.
Things like optional end tags and unquoted attribute values are
allowed in HTML but not in XML. Likewise, the HTML DTD declares
entities such as nbsp, while XML predefines only five (lt, gt,
amp, apos and quot). But there is hope!
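
To make the failure concrete, here is a minimal sketch that uses only
the SAX parser bundled with the JDK. The class name UndeclaredEntityDemo
and the helper parseText are hypothetical, made up for this
illustration. The second input declares nbsp in an internal DTD subset;
that is an assumption of the sketch to show where the error comes from,
and it does not make ordinary HTML parseable as XML.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of Xerces, JTidy or NekoHTML.
public class UndeclaredEntityDemo {

    // Parses the given markup as XML and returns the character data seen.
    // DefaultHandler rethrows fatal errors, so an undeclared entity
    // surfaces as a SAXParseException.
    public static String parseText(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // nbsp is not one of XML's predefined entities, so this fails.
        try {
            parseText("<p>a&nbsp;b</p>");
        } catch (SAXParseException e) {
            System.out.println("Fatal error: " + e.getMessage());
        }

        // Assumption for illustration: declare nbsp ourselves in the
        // internal DTD subset, mapping it to U+00A0 (no-break space).
        String declared =
            "<!DOCTYPE p [<!ENTITY nbsp \"&#160;\">]><p>a&nbsp;b</p>";
        String text = parseText(declared);
        System.out.println("Parsed " + text.length() + " characters");
    }
}
```

The first call fails because the input references an entity the XML
parser has never been told about. The second succeeds only because the
input is otherwise well-formed XML, which most HTML found in the wild
is not.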

There are a variety of HTML parsers available that you can use.
Probably the oldest and most widely used is JTidy (which is the
Java port of the W3C Tidy HTML fixer). Check out the following
link to download the code: http://sourceforge.net/projects/jtidy

I also wrote a simple HTML parser and tag balancer specifically
for Xerces2, called NekoHTML. It doesn't do everything that JTidy
can do but is quite useful nonetheless. Check out the following
page to download the code: http://www.apache.org/~andyc/

Both solutions come with a "nice" license so that the code can
be used in commercial applications, if needed.

Hope this helps. Good luck!

-- 
Andy Clark * andyc@apache.org
