Posted to dev@nutch.apache.org by Ar...@csiro.au on 2018/06/12 06:46:01 UTC

Nutch 1.14 issues

Hi guys,

I am porting Arch (https://www.atnf.csiro.au/computing/software/arch/) to Nutch 1.14 and Solr 7.2, and I have come across a few serious issues, of which you should be aware:


1.       NUTCH-2071 is still an issue in 1.14, because the returned parseResult is never null. If a parser fails to parse a document, it returns an empty result rather than null. This means that, in a chain of parser candidates, only the first one ever gets a chance to parse the document.

2.       Nutch adopted Tika as its general parsing tool and stopped supporting the "legacy" parsing plugins (OO, MS). I continued using them and hoped to drop them in the next version of Arch, which I am preparing for release, but I still can't, because Tika fails to parse too many documents on our site. When I reinforce Tika with the legacy parsers, however, I achieve an almost 100% parsing success rate. This is why NUTCH-2071 is important for Arch. I think you should bring the legacy parsers back to Nutch, because the quality of parsing of "real life" data, such as ours, is not great without them.

3.       The lines defining the fall-back (*) plugin in parse-plugins.xml have no effect: they are ignored as long as at least one plugin claims * in its plugin.xml file. In some cases, Nutch assigns the * capability to plugins that don't even claim it. For example, I can't understand why the Arch content blocking plugin gets it.

4.       In earlier versions of Nutch, using the native libraries really helped: it reduced the time to crawl our site from a couple of days to 6-7 hours. In Nutch 1.14, I don't see this effect. I've obtained the Hadoop libraries, placed them where they are expected, and even inserted an explicit load-library call in my code, but I still don't notice any significant time savings.

5.       The Feed plugin seems to have a major problem. Line 102 of FeedIndexingFilter.java threw a NumberFormatException (which caused the entire crawling process to fail!) because it was trying to parse a date in string format as a number. Given that this piece of metadata was generated by the feed parser (same plugin), it seems that the plugin is in disagreement with itself.

6.       This is less important, but when Tika fails to parse a document, it generates a scary error message and an ugly stack trace. I think this should be a one-line warning, because other parsers may still parse the document successfully.
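
To illustrate item 1, here is a minimal Java sketch of the fall-through behaviour described there. It is an illustration under stated assumptions, not Nutch code: ParseOutcome, NamedParser and parseWithFallback are hypothetical names invented for this example. The only point is that an empty or unsuccessful result is treated the same way as null, so the next candidate in the chain gets its turn. Failures are reported as one-line warnings, along the lines of item 6.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

public class ParserChainSketch {

    /** Hypothetical stand-in for a parse result; NOT a Nutch class. */
    static final class ParseOutcome {
        final String text;
        final boolean success;

        ParseOutcome(String text, boolean success) {
            this.text = text;
            this.success = success;
        }

        /** An "empty but not null" result, as described in item 1. */
        static ParseOutcome failed() {
            return new ParseOutcome("", false);
        }

        boolean isUsable() {
            return success && !text.isEmpty();
        }
    }

    /** Hypothetical parser candidate: a name plus a parse function. */
    static final class NamedParser {
        final String name;
        final Function<byte[], ParseOutcome> parseFn;

        NamedParser(String name, Function<byte[], ParseOutcome> parseFn) {
            this.name = name;
            this.parseFn = parseFn;
        }
    }

    /** Tries each candidate in order; returns the first usable result, if any. */
    static Optional<ParseOutcome> parseWithFallback(List<NamedParser> chain, byte[] content) {
        for (NamedParser parser : chain) {
            ParseOutcome outcome;
            try {
                outcome = parser.parseFn.apply(content);
            } catch (RuntimeException e) {
                // One-line warning instead of a stack trace; later parsers may still succeed.
                System.err.println("WARN: " + parser.name + " failed: " + e.getMessage());
                continue;
            }
            // Treat null, failed and empty results alike: move on to the next candidate.
            if (outcome != null && outcome.isUsable()) {
                return Optional.of(outcome);
            }
            System.err.println("WARN: " + parser.name + " produced no usable result");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        NamedParser first = new NamedParser("first", c -> ParseOutcome.failed());
        NamedParser second = new NamedParser("second",
                c -> new ParseOutcome(new String(c, StandardCharsets.UTF_8), true));

        Optional<ParseOutcome> result = parseWithFallback(
                Arrays.asList(first, second), "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(result.map(o -> o.text).orElse("<unparsed>"));
    }
}
```

In this sketch the first candidate returns an empty result, so the second one is tried and its text is returned. If the loop checked only for null, the chain would stop at the first candidate.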
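
For item 5, a small Java sketch of tolerant timestamp handling may make the problem clearer. This is not the FeedIndexingFilter code: parseTimestamp is a hypothetical helper, and the set of accepted formats (epoch milliseconds, ISO-8601 instants, RFC 1123 dates) is an assumption made for illustration. The idea is that a date stored as a string is never passed blindly to a numeric parser, and that an unparseable value is reported as absent instead of failing the whole crawl.

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.OptionalLong;

public class FeedDateSketch {

    /**
     * Hypothetical helper: converts a metadata value to epoch milliseconds.
     * Returns an empty OptionalLong instead of throwing when the value
     * matches none of the assumed formats.
     */
    static OptionalLong parseTimestamp(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return OptionalLong.empty();
        }
        String value = raw.trim();

        // Case 1: the value is already a number (epoch milliseconds).
        try {
            return OptionalLong.of(Long.parseLong(value));
        } catch (NumberFormatException ignored) {
            // Not a number; fall through to the date formats.
        }

        // Case 2: ISO-8601 instant, e.g. 2018-06-12T06:46:01Z.
        try {
            return OptionalLong.of(Instant.parse(value).toEpochMilli());
        } catch (DateTimeParseException ignored) {
            // Not an ISO-8601 instant.
        }

        // Case 3: RFC 1123 date, e.g. Tue, 12 Jun 2018 06:46:01 GMT.
        try {
            ZonedDateTime parsed =
                    ZonedDateTime.parse(value, DateTimeFormatter.RFC_1123_DATE_TIME);
            return OptionalLong.of(parsed.toInstant().toEpochMilli());
        } catch (DateTimeParseException ignored) {
            // Not an RFC 1123 date either.
        }

        return OptionalLong.empty();
    }

    public static void main(String[] args) {
        System.out.println(parseTimestamp("1528785961000"));
        System.out.println(parseTimestamp("2018-06-12T06:46:01Z"));
        System.out.println(parseTimestamp("Tue, 12 Jun 2018 06:46:01 GMT"));
        System.out.println(parseTimestamp("not a date"));
    }
}
```

A caller can then skip the field, or log a one-line warning, when the result is empty, so that one malformed date does not stop the crawl.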

Hope this helps.

Regards,

Arkadi

Re: Nutch 1.14 issues

Posted by Ar...@csiro.au.
Hi Sebastian,

Sorry, clarifying my objectives:

I am not frustrated, just trying to help. I did not write this message to request fixes for Arch. All these issues have been fixed in Arch, except perhaps the native library issue, but I may fix it as well, if lucky enough. I wrote that message to contribute back to Nutch, because I consider these issues (at least, some of them) very important for Nutch.

I do understand that Nutch is supported by volunteers, and I really appreciate the work you are doing.

I will open JIRA issues.

Regards,

Arkadi   
________________________________________
From: Sebastian Nagel <wa...@googlemail.com>
Sent: Wednesday, 13 June 2018 12:24 AM
To: dev@nutch.apache.org
Subject: Re: Nutch 1.14 issues

Hi Arkadi,

thanks for your feedback and suggestions.
I can understand your frustration but I also want to clarify:

- Arch is a nice project, for sure. But Arch is GPL licensed,
  which makes contributions a one-way route (Nutch -> Arch)
  and keeps me from even looking into the Arch sources. Sorry.

- Please take the time to split your list of issues into separate
  requests on the mailing list or open separate Jira issues.
  Also take care that the problems are reproducible by sharing
  documents that failed to parse, log snippets, config files, etc.

- Sorry about NUTCH-2071, I took this mainly as a classpath issue
  in the parse-tika plugin (which is solved). Now I understand better
  what your objective is and I'll review and try to fix it
  (in combination with NUTCH-1993). But again: please take the time
  to explain your objectives, ping committers if fixes make no progress,
  etc.

- Nutch is a community project. There are no "paid" committers. This
  means that, although some of us are paid to configure/operate/adapt
  crawlers, nobody is delegated to fix issues, support Nutch users, etc.
  That's voluntary work.

- Everybody is welcome to contribute (patches, documentation, support
  on the mailing list, etc.). Because Nutch is a small project, this
  will definitely help us.


Thanks,
Sebastian





Re: Nutch 1.14 issues

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Arkadi,

thanks for your feedback and suggestions.
I can understand your frustration but I also want to clarify:

- Arch is a nice project, for sure. But Arch is GPL licensed,
  which makes contributions a one-way route (Nutch -> Arch)
  and keeps me from even looking into the Arch sources. Sorry.

- Please take the time to split your list of issues into separate
  requests on the mailing list or open separate Jira issues.
  Also take care that the problems are reproducible by sharing
  documents that failed to parse, log snippets, config files, etc.

- Sorry about NUTCH-2071, I took this mainly as a classpath issue
  in the parse-tika plugin (which is solved). Now I understand better
  what your objective is and I'll review and try to fix it
  (in combination with NUTCH-1993). But again: please take the time
  to explain your objectives, ping committers if fixes make no progress,
  etc.

- Nutch is a community project. There are no "paid" committers. This
  means that, although some of us are paid to configure/operate/adapt
  crawlers, nobody is delegated to fix issues, support Nutch users, etc.
  That's voluntary work.

- Everybody is welcome to contribute (patches, documentation, support
  on the mailing list, etc.). Because Nutch is a small project, this
  will definitely help us.


Thanks,
Sebastian


