Posted to dev@nutch.apache.org by Sebastian Nagel <wa...@googlemail.com> on 2017/01/31 21:06:28 UTC

[ANNOUNCE] New Nutch committer and PMC - Furkan Kamaci

Dear all,

it is my pleasure to announce that Furkan Kamacı has joined
the Nutch team as committer and PMC member. Furkan, please
feel free to introduce yourself and to tell the Nutch community
about your interests and your relation to Nutch.

Congratulations and welcome on board!

Regards,
Sebastian (on behalf of the Nutch PMC)

Re: crawlDb speed around deduplication

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Michael,

the easiest way is probably to check the actual job configuration as shown by the Hadoop resource
manager webapp (see screenshot). It also indicates from where each configuration property is set.
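
A rough command-line equivalent, assuming the MapReduce job history server's REST API is
reachable on the default port 19888 (host and job id below are placeholders):

  # Dump the job's effective configuration from the history server and
  # filter for compression-related properties (crude, but order-independent).
  curl -s -H 'Accept: application/json' \
    "http://historyserver:19888/ws/v1/history/mapreduce/jobs/job_1493000000000_0042/conf" \
    | tr ',' '\n' | grep -i 'compress'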

Best,
Sebastian

On 05/02/2017 12:57 AM, Michael Coffey wrote:
> Thanks, I will do some testing with $commonOptions applied to dedup. I suspect that the dedup-update is not compressing its output. Any easy way to check for just that?
> 
> 
> 
> Hi Michael,
> 
> both "crawldb" jobs are similar - they merge status information into the CrawlDb: fetch status and
> newly found links, or detected duplicates, respectively. There are two situations I can think of
> where the second job takes longer:
> - if there are many duplicates, significantly more than status updates and additions in the preceding updatedb job
> - if the CrawlDb has grown significantly (the preceding updatedb added many new URLs)
> But you're right, I can see no reason why $commonOptions is not used for the dedup job.
> Please open an issue on https://issues.apache.org/jira/browse/NUTCH; it should also be
> checked for the other jobs which are not run with $commonOptions.
> If possible, please test whether running the dedup job with the common options fixes your
> problem. That's easily done: just edit src/bin/crawl and run "ant runtime".
> 
> Thanks,
> Sebastian
> 
> On 04/28/2017 02:54 AM, Michael Coffey wrote:
>> In the standard crawl script, there is a _bin_nutch updatedb command and, soon after
>> that, a _bin_nutch dedup command. Both of them launch hadoop jobs with "crawldb /path/to/crawl/db"
>> in their names (in addition to the actual deduplication job).
>> In my situation, the "crawldb" job launched by dedup takes twice as long as the one launched
>> by updatedb. Why should that be? Is it doing something different?
>> I notice that the script passes $commonOptions to updatedb but not to dedup.
>>


Re: crawlDb speed around deduplication

Posted by Michael Coffey <mc...@yahoo.com.INVALID>.
Thanks, I will do some testing with $commonOptions applied to dedup. I suspect that the dedup-update is not compressing its output. Any easy way to check for just that?
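
For reference, the change under test would look roughly like this in src/bin/crawl; the option
string and function name below are illustrative (recalled from a recent 1.x script) and may
differ between Nutch versions:

  # $commonOptions in src/bin/crawl carries, among others, the map-output
  # compression and speculative-execution settings (illustrative excerpt):
  #   commonOptions="-D mapreduce.job.reduces=$numTasks ... -D mapreduce.map.output.compress=true"

  # Pass it to the dedup step the same way updatedb already receives it:
  #   before: __bin_nutch dedup "$CRAWL_PATH"/crawldb
  #   after:  __bin_nutch dedup $commonOptions "$CRAWL_PATH"/crawldb

  # Rebuild so runtime/local picks up the edited script:
  ant runtime

Comparing the dedup job's configuration (e.g. mapreduce.map.output.compress) before and after
the change should then answer the compression question directly.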



Hi Michael,

both "crawldb" jobs are similar - they merge status information into the CrawlDb: fetch status and
newly found links, or detected duplicates, respectively. There are two situations I can think of
where the second job takes longer:
- if there are many duplicates, significantly more than status updates and additions in the preceding updatedb job
- if the CrawlDb has grown significantly (the preceding updatedb added many new URLs); a quick way to check this is sketched at the end of this message
But you're right, I can see no reason why $commonOptions is not used for the dedup job.
Please open an issue on https://issues.apache.org/jira/browse/NUTCH; it should also be
checked for the other jobs which are not run with $commonOptions.
If possible, please test whether running the dedup job with the common options fixes your
problem. That's easily done: just edit src/bin/crawl and run "ant runtime".

Thanks,
Sebastian

On 04/28/2017 02:54 AM, Michael Coffey wrote:
> In the standard crawl script, there is a _bin_nutch updatedb command and, soon after
> that, a _bin_nutch dedup command. Both of them launch hadoop jobs with "crawldb /path/to/crawl/db"
> in their names (in addition to the actual deduplication job).
> In my situation, the "crawldb" job launched by dedup takes twice as long as the one launched
> by updatedb. Why should that be? Is it doing something different?
> I notice that the script passes $commonOptions to updatedb but not to dedup.
>
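
As a rough check of the second point - whether the CrawlDb has grown significantly between the
updatedb and dedup steps - its on-disk size and URL counts can be compared before and after
each step (the path below is a placeholder):

  # Total on-disk size of the CrawlDb
  hadoop fs -du -s -h /path/to/crawl/db

  # URL counts, broken down by status (db_unfetched, db_fetched, db_gone, ...)
  bin/nutch readdb /path/to/crawl/db -stats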