Posted to user@nutch.apache.org by "Chaushu, Shani" <sh...@intel.com> on 2014/12/28 13:35:05 UTC

Nutch stopped after 5 segments

Hi all,
I ran Nutch on a distributed Hadoop cluster (1 master node and 3 workers).
It crawled ~10k links over 8 hours with numberOfRounds=30, but it stopped after the 5th segment. I know the crawl goes deeper than depth 5, because when I ran it on just one of the URLs it created many more segments.
The last segment folder contained only these subfolders: content, crawl_fetch, and crawl_generate.
The last job that ran was the fetcher; there were no errors in the log and everything completed successfully. I don't know why it didn't proceed to the next depth.

Anyone have an idea why it stopped? And how can I know for sure?
Thanks.





RE: Nutch stopped after 5 segments

Posted by "Chaushu, Shani" <sh...@intel.com>.
But it works perfectly when I run it on a small number of links, and I run the nutch/crawl command, which should handle the whole process.
I thought it might be related to a configuration setting that I can't find.
When I ran it on 2K links it stopped after 7 iterations, on 10K links it stopped after 5 iterations, and on a single page it worked perfectly.
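
For reference, a minimal sketch of that kind of invocation, assuming a Nutch 1.x bin/crawl script whose last argument is the number of rounds (the exact arguments, including an optional Solr URL, vary between releases); the seed and crawl directory paths here are hypothetical placeholders:

  # seedDir, crawlDir, numberOfRounds (30, as in the run described above)
  bin/crawl urls/ crawl/ 30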


-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
Sent: Sunday, December 28, 2014 14:38
To: user@nutch.apache.org
Subject: RE: Nutch stopped after 5 segments

The segment isn't parsed and didn't write its hyperlinks back to the DB. Parse the segment and then run updatedb on it.
 

RE: Nutch stopped after 5 segments

Posted by Markus Jelsma <ma...@openindex.io>.
The segment isn't parsed and didn't write its hyperlinks back to the DB. Parse the segment and then run updatedb on it.
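
A quick check: a fully processed segment also contains crawl_parse, parse_data, and parse_text subfolders, so a segment that stops at content/crawl_fetch/crawl_generate was fetched but never parsed. A minimal sketch of the two steps, assuming a Nutch 1.x install; the crawldb path and segment name are hypothetical placeholders:

  # parse the fetched segment (adds crawl_parse, parse_data, parse_text)
  bin/nutch parse crawl/segments/20141228120000

  # write the parsed outlinks back into the crawldb so the next
  # generate round has new URLs to pick up
  bin/nutch updatedb crawl/crawldb crawl/segments/20141228120000

Once the crawldb knows about the new links, the next generate/fetch round should produce a further segment.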
 