Posted to dev@nutch.apache.org by "Michael Coffey (JIRA)" <ji...@apache.org> on 2017/05/03 17:49:04 UTC

[jira] [Commented] (NUTCH-2379) crawl script dedup's crawldb update is slow

    [ https://issues.apache.org/jira/browse/NUTCH-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15995309#comment-15995309 ] 

Michael Coffey commented on NUTCH-2379:
---------------------------------------

In my private version, I provided $commonOptions as an argument to dedup, and that made it run faster.

I did not check to see which of the options made the difference.
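
For reference, a minimal sketch of that change against the bin/crawl conventions in Nutch 1.11 (the __bin_nutch helper, CRAWL_PATH, and the exact contents of $commonOptions come from that script and may vary between versions):

    # $commonOptions bundles per-job Hadoop settings such as the reduce
    # count and map-output compression; updatedb already receives it:
    __bin_nutch updatedb $commonOptions "$CRAWL_PATH"/crawldb "$CRAWL_PATH"/segments/$SEGMENT

    # the stock dedup call omits it; passing it as well lets the crawldb
    # update launched by dedup inherit the same job settings:
    __bin_nutch dedup $commonOptions "$CRAWL_PATH"/crawldb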

Should we open an issue for reviewing $commonOptions and its usage in the crawl script? The invertlinks step (which also does not use $commonOptions) can be sped up by setting an appropriate number of reduce tasks (see the sketch below). I am not a fan of $commonOptions in its current form, since some of the options arguably should be different for different steps of the crawl.
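
As a hedged illustration of that invertlinks speed-up (Nutch commands run through Hadoop's ToolRunner, so a -D generic option works; NUM_TASKS is a placeholder for a cluster-appropriate reduce count, not a variable the script defines):

    # set the reduce count for the LinkDb invert job explicitly,
    # independently of what $commonOptions would supply:
    __bin_nutch invertlinks -D mapreduce.job.reduces=$NUM_TASKS \
        "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT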

> crawl script dedup's crawldb update is slow 
> --------------------------------------------
>
>                 Key: NUTCH-2379
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2379
>             Project: Nutch
>          Issue Type: Bug
>          Components: bin
>    Affects Versions: 1.11
>         Environment: shell
>            Reporter: Michael Coffey
>            Priority: Minor
>
> In the standard crawl script, there is a _bin_nutch updatedb command and, soon after it, a _bin_nutch dedup command. Both launch Hadoop jobs with "crawldb /path/to/crawl/db" in their names (dedup launches its crawldb job in addition to the actual deduplication job).
> In my situation, the "crawldb" job launched by dedup takes twice as long as the one launched by updatedb.
> I notice that the script passes $commonOptions to updatedb but not to dedup. I suspect that the crawldb update launched by dedup may not be compressing its output.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)