Posted to user@nutch.apache.org by Benjamin Higgins <bh...@gmail.com> on 2006/08/16 00:23:56 UTC

Neko parsing fix inadvertently reverted?

Didn't get any response on nutch-dev (probably not a priority), so I thought
I'd share here, too...

I was taking a look at HtmlParser.java, and I think the fix to NUTCH-17 was
accidentally removed.  See:

http://svn.apache.org/viewvc/lucene/nutch/tags/release-0.8/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?view=log

Specifically, in revision 160319, among other changes, DOMFragmentParser was
replaced with DOMParser. The commit message for that revision reads:

Changed to use NekoHTML's DOMParser instead of its DOMFragmentParser.
For some reason, the DOMFragmentParser can be very slow with large
documents while the DOMParser has no problems with these.  Also added
a main() that permits easier debugging.


However, in revision 179436, a large patch that added TagSoup support among
other things, the change to DOMParser seems to have been lost.
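To make the difference concrete, here is a minimal sketch of the two NekoHTML entry points involved. This is illustrative only, not the actual Nutch HtmlParser code; the class name NekoParserSketch and both helper methods are my own, and it assumes the nekohtml and xercesImpl jars are on the classpath:

```java
import java.io.StringReader;

import org.apache.html.dom.HTMLDocumentImpl;
import org.cyberneko.html.parsers.DOMFragmentParser;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.xml.sax.InputSource;

public class NekoParserSketch {

    // Fragment-based style: parses into a caller-supplied DocumentFragment.
    // This is the variant NUTCH-17 reported as very slow on large documents.
    static DocumentFragment parseFragment(String html) throws Exception {
        DOMFragmentParser parser = new DOMFragmentParser();
        DocumentFragment fragment =
            new HTMLDocumentImpl().createDocumentFragment();
        parser.parse(new InputSource(new StringReader(html)), fragment);
        return fragment;
    }

    // Full-document style: the NUTCH-17 fix switched to this parser, which
    // handles the same large documents without the slowdown.
    static Document parseDocument(String html) throws Exception {
        DOMParser parser = new DOMParser();
        parser.parse(new InputSource(new StringReader(html)));
        return parser.getDocument();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><pre>lots of escaped text</pre></body></html>";
        // NekoHTML upper-cases element names by default, so this prints HTML.
        System.out.println(
            parseDocument(html).getDocumentElement().getNodeName());
    }
}
```

The call signatures are otherwise nearly interchangeable, which is presumably how the revert slipped through a large merge unnoticed.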

I bring this up because I am having the exact same problem as described in
NUTCH-17.  I am using Neko 0.9.4.  It occurs on some particularly long
documents.  The fetcher simply hangs.  If I wait a few hours it will resume
again.  The HTML is nothing special; in fact, it's just a bunch of text
(HTML-escaped, i.e. with <, >, and & converted to entities) inside a <pre> tag.

Comments?

Ben

P.S. I switched to DOMParser in my own source tree, and it resolved the
problem with the long documents.