Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2014/01/23 22:11:40 UTC

[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions

     [ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1253:
----------------------------------------

    Attachment: TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt

Actually, I can confirm that this upgrade breaks tests in TestDOMContentUtils (see attached). The document fragment appears to contain some markup, which is incorrect. Having spent ages debugging this in Eclipse, I am now over on the nekohtml user list trying to sort it out. If you guys have any ideas, please chip in.
I am not sure whether the cause is the way we use Xerces2, Neko itself, or a bug in DOMContentUtils, but either way the behavior is undesired.

> Incompatible neko and xerces versions
> -------------------------------------
>
>                 Key: NUTCH-1253
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1253
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.4
>         Environment: Ubuntu 10.04
>            Reporter: Dennis Spathis
>            Assignee: Lewis John McGibbney
>             Fix For: 2.3, 1.8
>
>         Attachments: NUTCH-1253-2.x-v2.patch, NUTCH-1253-nutchgora.patch, NUTCH-1253.patch, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt
>
>
> The Nutch 1.4 distribution includes
>  - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
>  - xercesImpl-2.9.1.jar (under .../runtime/local/lib)
> These two JARs appear to be incompatible versions. When the HtmlParser (configured to use neko) is invoked during a local-mode crawl, the parse fails due to an AbstractMethodError. (Note: To see the AbstractMethodError, rebuild the HtmlParser plugin and add a
> catch(Throwable) clause in the getParse method to log the stacktrace.)
> I found that substituting a later, compatible version of nekohtml (1.9.11)
> fixes the problem.
> Curiously, and in support of the above, the nekohtml plugin.xml file in
> Nutch 1.4 contains the following:
> <plugin
>    id="lib-nekohtml"
>    name="CyberNeko HTML Parser"
>    version="1.9.11"
>    provider-name="org.cyberneko">
>    <runtime>
>        <library name="nekohtml-0.9.5.jar">
>            <export name="*"/>
>        </library>
>    </runtime>
> </plugin>
> Note the conflicting version numbers (version tag is "1.9.11" but the
> specified library is "nekohtml-0.9.5.jar").
> Was the 0.9.5 version included by mistake? Was the intention rather to
> include 1.9.11?
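
The note in the description about adding a catch(Throwable) clause matters because AbstractMethodError is a java.lang.Error, not an Exception, so a handler that catches only Exception would never see it. A minimal sketch of the idea follows. The class and method names are hypothetical, and the failing parse step is simulated rather than taken from the actual HtmlParser code.

```java
public class ThrowableLoggingSketch {

    /**
     * Runs a parse step and returns a one-line description of any
     * Throwable it raises, or null if the step completes normally.
     * (Hypothetical helper, not part of Nutch.)
     */
    static String runAndDescribe(Runnable parseStep) {
        try {
            parseStep.run();
            return null;
        } catch (Throwable t) {
            // Deliberately broad: AbstractMethodError extends Error,
            // so catch (Exception e) would not intercept it.
            return t.getClass().getName() + ": " + t.getMessage();
        }
    }

    public static void main(String[] args) {
        // Simulated failure standing in for the incompatible-JAR case.
        String failure = runAndDescribe(() -> {
            throw new AbstractMethodError("simulated");
        });
        System.out.println(failure);
    }
}
```

In a real plugin the description would go to the logger along with the full stack trace. Catching Throwable this broadly is usually reasonable only as a temporary diagnostic aid.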


