Posted to dev@nutch.apache.org by Doug Cook <na...@candiru.com> on 2007/10/15 22:44:53 UTC

Anyone looked for a better HTML parser?


I've spent quite a bit of time working with both Neko and Tagsoup, and they
both have some fairly serious bugs:

Neko occasionally hangs, and it doesn't deal very well with a fair
amount of "bad" HTML that displays just fine in a browser.

TagSoup handles "bad" HTML better, but it has a pretty serious bug:
HTML character entities are expanded in inappropriate places, e.g.
inside of hrefs. A dynamic URL of the form
http://www.foo.com/bar?x=1&sub=5 breaks, because the &sub is
interpreted as the HTML character entity &sub; (SUBSET OF, U+2282),
and an invalid href is created. John Cowan, the author of TagSoup,
more or less said "yeah, I know, everybody mentions that, but that's
done at such a low level in the code it's not likely to get fixed
any time soon". (See a discussion of this and other issues at
http://tech.groups.yahoo.com/group/tagsoup-friends/message/838).
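
To make the failure mode concrete, here is a minimal, self-contained
demo (my own sketch, not Nutch code; class name is made up, and the
exact output may vary with the TagSoup version) that feeds an href
containing a bare &sub to TagSoup and prints what comes out:

import java.io.StringReader;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EntityBugDemo {
    public static void main(String[] args) throws Exception {
        // Href with a bare "&sub=" in the query string; HTML 4 defines
        // a &sub; entity (SUBSET OF, U+2282), and TagSoup expands it
        // even without the trailing semicolon.
        String html =
            "<a href=\"http://www.foo.com/bar?x=1&sub=5\">link</a>";

        Parser parser = new Parser();
        parser.setContentHandler(new DefaultHandler() {
            public void startElement(String uri, String localName,
                    String qName, Attributes atts) {
                if ("a".equals(localName)) {
                    // Expected: http://www.foo.com/bar?x=1&sub=5
                    // Actual:   http://www.foo.com/bar?x=1\u2282=5
                    System.out.println(atts.getValue("href"));
                }
            }
        });
        parser.parse(new InputSource(new StringReader(html)));
    }
}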

The TagSoup bug affects some 3-4% of the sites in my index, so I
consider it fatal. And I *know* Neko misses some text, sometimes
entire documents, because it can't deal with pathological HTML.

Has anyone (a) got local fixes for any of these problems, or (b) found a
superior Java HTML parser out there?

Doug


Re: Anyone looked for a better HTML parser?

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
I looked at the TagSoup sources and it seems it could be fixed quite easily. See here:

https://issues.apache.org/jira/browse/NUTCH-567
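
Until a fix like that lands, one pre-processing workaround (just a
sketch of mine, not the patch in NUTCH-567) is to escape every
ampersand that does not start a well-formed, semicolon-terminated
entity reference before the page reaches TagSoup:

import java.util.regex.Pattern;

public class BareAmpersandEscaper {
    // "&" not followed by a named, decimal, or hex entity
    // reference that ends with a semicolon.
    private static final Pattern BARE_AMP = Pattern.compile(
        "&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);)");

    public static String escape(String html) {
        return BARE_AMP.matcher(html).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        String in = "<a href=\"http://www.foo.com/bar?x=1&sub=5\">x</a>";
        // Prints: <a href="http://www.foo.com/bar?x=1&amp;sub=5">x</a>
        // TagSoup then decodes &amp; back to "&", so the query
        // string survives parsing intact.
        System.out.println(escape(in));
    }
}

This is a blunt instrument (it also rewrites bare ampersands inside
script blocks and comments), so fixing the entity handling in TagSoup
itself, as the issue suggests, is the cleaner route.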

D.

Re: Anyone looked for a better HTML parser?

Posted by Doug Cook <na...@candiru.com>.


Sami Siren-2 wrote:
> 
> 
> Do you have URLs of such bad content available to look at?
> 
> 

Thousands. Here is one:

http://www.valtravieso.com/ver_finca.phtml?idioma=1

The hrefs that have &sub in them get mangled, because TagSoup
interprets the &sub as the subset character, and thus become broken
links. With a few sites (and I think this is one) the number of URLs
will grow ad infinitum if the site handles the "broken" link by
returning something that works and uses the input link as a base.

I believe I have some examples of Neko problems around as well; I've
been gathering test cases...

 -Doug


Re: Anyone looked for a better HTML parser?

Posted by Sami Siren <ss...@gmail.com>.
Doug Cook wrote:
> The TagSoup bug affects some 3-4% of the sites in my index, so I
> consider it fatal. And I *know* Neko misses some text, sometimes
> entire documents, because it can't deal with pathological HTML.

Do you have URLs of such bad content available to look at?

-- 
 Sami Siren