Posted to dev@nutch.apache.org by Doug Cook <na...@candiru.com> on 2007/10/15 22:44:53 UTC
Anyone looked for a better HTML parser?
I've spent quite a bit of time working with both Neko and Tagsoup, and they
both have some fairly serious bugs:
Neko has some occasional hangs, and it doesn't deal very well with a fair
amount of "bad" HTML that displays just fine in a browser.
Tagsoup handles "bad" HTML better, but it has a pretty serious bug: HTML
character entities are expanded in inappropriate places, e.g. inside of
hrefs. A dynamic URL of the form http://www.foo.com/bar?x=1&sub=5 then
breaks, because the &sub is interpreted as an HTML character entity and an
invalid href is created. John Cowan, the
author of Tagsoup, more or less said "yeah, I know, everybody mentions that,
but that's done at such a low level in the code it's not likely to get fixed
any time soon". (See a discussion of this and other issues at
http://tech.groups.yahoo.com/group/tagsoup-friends/message/838).
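To make the failure mode concrete, here is a minimal sketch of the kind of lenient entity decoding that causes it. This is my own illustrative code, not TagSoup's; the EntityDemo class and its three-entry entity table are hypothetical. The key behavior is expanding a named entity even when no trailing semicolon is present:

```java
import java.util.Map;

public class EntityDemo {
    // Tiny illustrative entity table; a real parser knows hundreds,
    // including "sub" -> U+2282 (SUBSET OF), the culprit in &sub=5.
    static final Map<String, String> ENTITIES = Map.of(
            "amp", "&",
            "lt", "<",
            "sub", "\u2282");

    // Expand "&name" wherever a known name matches, with or without
    // the trailing ";" -- the lenient behavior described above.
    static String lenientDecode(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            if (s.charAt(i) == '&') {
                int j = i + 1;
                while (j < s.length() && Character.isLetter(s.charAt(j))) j++;
                String name = s.substring(i + 1, j);
                if (ENTITIES.containsKey(name)) {
                    out.append(ENTITIES.get(name));
                    if (j < s.length() && s.charAt(j) == ';') j++; // optional ";"
                    i = j;
                    continue;
                }
            }
            out.append(s.charAt(i));
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The query parameter "sub" is silently turned into U+2282,
        // yielding an invalid href.
        System.out.println(lenientDecode("http://www.foo.com/bar?x=1&sub=5"));
        // prints http://www.foo.com/bar?x=1\u2282=5 (the subset character)
    }
}
```

A decoder that required the closing semicolon would leave &sub=5 alone, which is why the semicolon-less expansion is the crux of the bug.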
The tagsoup bug affects some 3-4% of the sites in my index, so I consider it
fatal, and I *know* Neko misses some text, sometimes entire documents,
because it can't deal with pathological HTML.
Has anyone (a) got local fixes for any of these problems, or (b) found a
superior Java HTML parser out there?
Doug
--
View this message in context: http://www.nabble.com/Anyone-looked-for-a-better-HTML-parser--tf4630266.html#a13221500
Sent from the Nutch - Dev mailing list archive at Nabble.com.
Re: Anyone looked for a better HTML parser?
Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
I looked at TagSoup sources and it seems it could be quite easily fixed. See here:
https://issues.apache.org/jira/browse/NUTCH-567
D.
Re: Anyone looked for a better HTML parser?
Posted by Doug Cook <na...@candiru.com>.
Sami Siren-2 wrote:
> Do you have urls of such bad content available to look at?
Thousands. Here is one:
http://www.valtravieso.com/ver_finca.phtml?idioma=1
The hrefs that have &sub in them get interpreted by Tagsoup as containing
the subset character, and thus become broken links. With a few sites (and I
think this is one), the number of URLs grows ad infinitum if the site
handles the "broken" link by returning something that works and uses the
input link as a base.
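The trap mechanism can be sketched with java.net.URI. The "broken/" relative href below is hypothetical (the real site's responses may differ); the point is that if each mangled URL returns a working page whose links are resolved relative to it, every crawl round produces a URL the crawler has never seen:

```java
import java.net.URI;

public class TrapDemo {
    // One crawl round: resolve the (hypothetical) relative href served
    // by the site against the current, already-broken URL.
    static URI next(URI url) {
        return url.resolve("broken/ver_finca.phtml");
    }

    public static void main(String[] args) {
        URI url = URI.create("http://www.valtravieso.com/ver_finca.phtml");
        // Each round yields a longer, previously unseen URL, so the
        // crawl frontier grows without bound -- a classic crawler trap.
        for (int round = 1; round <= 3; round++) {
            url = next(url);
            System.out.println(url);
        }
    }
}
```

Running this prints /broken/, then /broken/broken/, and so on: relative resolution against an ever-growing base never converges.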
I believe I have some examples of Neko problems around as well; I've been
gathering test cases...
-Doug
--
View this message in context: http://www.nabble.com/Anyone-looked-for-a-better-HTML-parser--tf4630266.html#a13235164
Re: Anyone looked for a better HTML parser?
Posted by Sami Siren <ss...@gmail.com>.
Doug Cook wrote:
> The tagsoup bug affects some 3-4% of the sites in my index, so I consider it
> fatal, and I *know* Neko misses some text, sometimes entire documents,
> because it can't deal with pathological HTML.
Do you have urls of such bad content available to look at?
--
Sami Siren