Posted to dev@nutch.apache.org by Benjamin Higgins <bh...@gmail.com> on 2006/08/11 19:51:41 UTC

Neko parsing fix inadvertently reverted?

I was taking a look at HtmlParser.java, and I think the fix to NUTCH-17 was
accidentally removed.  See:

http://svn.apache.org/viewvc/lucene/nutch/tags/release-0.8/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?view=log

Specifically, in revision 160319, among other things, DOMFragmentParser was
changed to DOMParser, because, in the comment to that revision:

Changed to use NekoHTML's DOMParser instead of its DOMFragmentParser.
For some reason, the DOMFragmentParser can be very slow with large
documents while the DOMParser has no problems with these.  Also added
a main() that permits easier debugging.

However, in revision 179436, a big patch that included TagSoup among
other things, the change to DOMParser seems to have been lost.

I bring this up because I am having the exact same problem as described in
NUTCH-17.  I am using Neko 0.9.4.  It occurs on some particularly long
documents.  The fetcher simply hangs; if I wait a few hours, it resumes.
The HTML is nothing special; in fact, it's just a bunch of text
(HTML-escaped, i.e. the < > & chars are converted to entities) inside a
<pre> tag.
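
For reference, the whole-Document path from revision 160319 would look
roughly like the sketch below.  This is written against NekoHTML's
public API, not the exact Nutch code, and the markup string is just a
stand-in for my real page:

import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

import java.io.StringReader;

public class DomParserSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for a very long page of escaped text in a <pre> tag.
        String html =
            "<html><body><pre>&lt;escaped text&gt;</pre></body></html>";

        DOMParser parser = new DOMParser();
        parser.parse(new InputSource(new StringReader(html)));

        // DOMParser always builds a full Document with a single root.
        Document doc = parser.getDocument();
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}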

Comments?

Ben

Re: Neko parsing fix inadvertently reverted?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Sami Siren wrote:
> Benjamin Higgins wrote:
>> Comments?
>
> I cannot comment on the issue itself, but if you can submit a patch 
> (perhaps with a testcase that demonstrates this) then it will be easier
> to act on.

Benjamin,

Could you please send me a copy of the offending HTML for testing (off 
the list)?

A little background: I knew of this issue when I changed the API to use 
DocumentFragment. However, as far as I was able to test it with the most 
recent version of Neko at that time, it didn't exhibit this problem.

The main motivation for this was to enable better parsing of broken 
documents with multiple <html> tags (or no <html> at all, but <head> and 
<body> as "root" elements). While this is not possible using a Document, 
it is possible to do this using a DocumentFragment (which doesn't 
necessarily have to represent any well-formed XML tree; and 
specifically, it doesn't require that there is a single root node - 
please see the Javadoc of org.w3c.dom.DocumentFragment for a longer
explanation).
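
For illustration, the fragment-based parse looks roughly like this (a
sketch, not the exact Nutch code; it assumes Neko's DOMFragmentParser
with Xerces' HTMLDocumentImpl supplying the owner document):

import org.apache.html.dom.HTMLDocumentImpl;
import org.cyberneko.html.parsers.DOMFragmentParser;
import org.w3c.dom.DocumentFragment;
import org.xml.sax.InputSource;

import java.io.StringReader;

public class FragmentParseSketch {
    public static void main(String[] args) throws Exception {
        // Broken markup with two "pseudo-root" elements and no <html>.
        String html = "<body>first part</body><body>second part</body>";

        DOMFragmentParser parser = new DOMFragmentParser();
        DocumentFragment frag =
            new HTMLDocumentImpl().createDocumentFragment();
        parser.parse(new InputSource(new StringReader(html)), frag);

        // A fragment has no single-root requirement, so content from
        // every pseudo-root can survive the parse.
        System.out.println("top-level nodes: "
            + frag.getChildNodes().getLength());
    }
}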

So, if we change it back to Document we will lose this functionality, 
and some pages will be severely truncated, because in such cases 
NekoHTML takes only the first "pseudo-root" node and discards all 
others. However, if you are dealing mostly with well-formed documents 
you may not need this ...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Neko parsing fix inadvertently reverted?

Posted by Sami Siren <ss...@gmail.com>.
Benjamin Higgins wrote:
> I was taking a look at HtmlParser.java, and I think the fix to NUTCH-17 was
> accidentally removed.  See:
> 
> http://svn.apache.org/viewvc/lucene/nutch/tags/release-0.8/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?view=log 
> 
> Specifically, in revision 160319, among other things, DOMFragmentParser was
> changed to DOMParser, because, in the comment to that revision:
> 
> Changed to use NekoHTML's DOMParser instead of its DOMFragmentParser.
> For some reason, the DOMFragmentParser can be very slow with large
> documents while the DOMParser has no problems with these.  Also added
> a main() that permits easier debugging.
> 
> However, in revision 179436, a big patch that included TagSoup among
> other things, the change to DOMParser seems to have been lost.
> 
> I bring this up because I am having the exact same problem as described in
> NUTCH-17.  I am using Neko 0.9.4.  It occurs on some particularly long
> documents.  The fetcher simply hangs; if I wait a few hours, it resumes.
> The HTML is nothing special; in fact, it's just a bunch of text
> (HTML-escaped, i.e. the < > & chars are converted to entities) inside a
> <pre> tag.
> 
> Comments?

I cannot comment on the issue itself, but if you can submit a patch 
(perhaps with a testcase that demonstrates this) then it will be easier
to act on.

--
  Sami Siren