Posted to user@nutch.apache.org by Michael Coffey <mc...@yahoo.com.INVALID> on 2017/04/28 00:54:37 UTC

crawlDb speed around deduplication

In the standard crawl script, there is a _bin_nutch updatedb command and, soon after that, a _bin_nutch dedup command. Both of them launch hadoop jobs with "crawldb /path/to/crawl/db" in their names (in addition to the actual deduplication job).
In my situation, the "crawldb" job launched by dedup takes twice as long as the one launched by updatedb. Why should that be? Is it doing something different?
I notice that the script passes $commonOptions to updatedb but not to dedup.

Re: crawlDb speed around deduplication

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Michael,

both "crawldb" jobs are similar: they merge status information into the CrawlDb,
either fetch status and newly found links (updatedb) or detected duplicates (dedup).
I can think of two situations in which the second job takes longer:
 - if there are many duplicates, significantly more than status updates and additions in the
   preceding updatedb job
 - if the CrawlDb has grown significantly (the preceding updatedb added many new URLs)

But you're right. I can see no reason why $commonOptions is not used for the dedup job.
Please open an issue on https://issues.apache.org/jira/browse/NUTCH. The other jobs
which are not run with $commonOptions should be checked as well.
If possible, please test whether running the dedup job with the common options fixes your problem.
That's easily done: just edit src/bin/crawl and run "ant runtime".
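
A minimal sketch of what that edit could look like, not the actual contents of src/bin/crawl. The values in commonOptions and the crawl path are illustrative assumptions, and _bin_nutch is stubbed here (named as in the message above) so the snippet runs without Nutch and only prints the command line that would be launched. It also assumes the dedup tool accepts Hadoop generic -D options in this position, as updatedb does.

```shell
# Illustrative values only; the real script defines its own commonOptions.
commonOptions="-D mapreduce.job.reduces=2 -D mapreduce.map.output.compress=true"
CRAWL_PATH="/path/to/crawl"

# Stub standing in for the script's wrapper around bin/nutch:
# it prints the command instead of launching a Hadoop job.
_bin_nutch() {
  echo "bin/nutch $*"
}

# Before: dedup is called without the common options.
before=$(_bin_nutch dedup "$CRAWL_PATH"/crawldb)

# After: dedup gets the same options as updatedb.
# $commonOptions is left unquoted on purpose so it splits into separate arguments.
after=$(_bin_nutch dedup $commonOptions "$CRAWL_PATH"/crawldb)

echo "$before"
echo "$after"
```

After making the corresponding change in src/bin/crawl and running "ant runtime", comparing the run time of the "crawldb" job launched by dedup before and after would show whether the missing options explain the difference.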

Thanks,
Sebastian

