Posted to user@nutch.apache.org by vivekvl <vi...@yahoo.com> on 2013/05/14 12:19:54 UTC

What would happen when a Hadoop TaskTracker and DataNode fail during a Nutch crawl?

I am in the process of setting up a production-ready environment for the Nutch
crawler. I am trying to make the environment fault tolerant to Hadoop node
failures, typically a TaskTracker and DataNode failing together due to a
network issue or an OS crash.

I tried simulating the scenario by stopping one node during a crawl. I stopped
the node that was running a fetch reducer task in the 5th cycle. The task was
marked complete after hanging for a few minutes, and the NameNode UI and
MapReduce admin UI started showing a reduced number of nodes. The crawl
continued for the configured 6 cycles and ended. However, the total number of
URLs crawled was lower than in previous runs, so I suspect the interrupted
fetch task was never retried.

I want to understand this behavior and find a solution for node failures during
a crawl. I welcome suggestions on this.

I am using Nutch 2.1 with HBase 0.90.6 and Hadoop-0.20.2.
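
For reference, these are the Hadoop settings I believe govern how a lost
TaskTracker and its running tasks are handled (a sketch from my reading of the
Hadoop 0.20 documentation, with the default values shown; I have not tuned
them yet):

  <!-- mapred-site.xml (sketch, Hadoop 0.20 defaults) -->
  <property>
    <name>mapred.tasktracker.expiry.interval</name>
    <value>600000</value>  <!-- ms without heartbeats before a TaskTracker is declared lost -->
  </property>
  <property>
    <name>mapred.reduce.max.attempts</name>
    <value>4</value>  <!-- attempts for a reduce task, e.g. the fetch reducer -->
  </property>
  <property>
    <name>mapred.map.max.attempts</name>
    <value>4</value>  <!-- attempts for a map task -->
  </property>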

Thanks,
Raja



--
View this message in context: http://lucene.472066.n3.nabble.com/What-would-happen-when-Hadoop-tasktracker-and-data-node-fails-during-Nutch-Crawl-tp4063189.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: What would happen when a Hadoop TaskTracker and DataNode fail during a Nutch crawl?

Posted by feng lu <am...@gmail.com>.
Hi vivekvl

I see that if tasktracker node failed, you can use bin/fetch resume option
to resume interrupted job. [0] . and nutch 2.x will no use HDFS to store
any data. so data node failure will not effect the crawl.

[0] http://wiki.apache.org/nutch/bin/nutch%20fetch
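
For example (a rough sketch; <batchId> is a placeholder, and the exact flags
may differ between versions, so please check the usage output of bin/nutch
fetch for your release):

  # resume the interrupted fetch for a given batch; as I understand it,
  # pages already fetched in that batch are skipped rather than refetched
  bin/nutch fetch <batchId> -resume

  # or resume fetching across all batches
  bin/nutch fetch -all -resume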





-- 
Don't Grow Old, Grow Up... :-)