Posted to common-user@hadoop.apache.org by Ion Badita <io...@mcr.ro> on 2007/03/15 06:42:25 UTC

Stalled M/R task

Hi,

My configuration: hadoop 0.12.0, 19 computers, jdk 6.

I ran an M/R job with 133 maps and one reduce. One of the nodes was 
reported as "Lost task tracker" and the job stalled at 99% on map and 32% 
on reduce. It stayed like this for hours with no activity. The tasks 
from the lost task tracker were not moved to another TT.
In the console where I started the job I saw this stack trace:


07/03/14 21:33:30 INFO mapred.JobClient:  map 98% reduce 22%
07/03/14 21:34:20 INFO mapred.JobClient:  map 98% reduce 23%
07/03/14 21:34:30 INFO mapred.JobClient:  map 98% reduce 24%
07/03/14 21:35:13 INFO mapred.JobClient: Task Id : task_0007_m_000034_0, Status : FAILED
07/03/14 21:35:13 INFO mapred.JobClient: Communication problem with server: java.net.MalformedURLException: no protocol: null&filter=stdout
        at java.net.URL.<init>(URL.java:567)
        at java.net.URL.<init>(URL.java:464)
        at java.net.URL.<init>(URL.java:413)
        at org.apache.hadoop.mapred.JobClient.displayTaskLogs(JobClient.java:621)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:583)
        at com.genzen.crawler.utils.ToolBase.runJob(ToolBase.java:103)
        at com.genzen.crawler.indexer.Indexer.doMain(Indexer.java:87)
        at com.genzen.crawler.utils.ToolBase.doMain0(ToolBase.java:96)
        at com.genzen.crawler.utils.ToolBase.execute(ToolBase.java:117)
        at com.genzen.crawler.indexer.Indexer.main(Indexer.java:92)

07/03/14 21:35:21 INFO mapred.JobClient:  map 97% reduce 24%
07/03/14 21:35:23 INFO mapred.JobClient:  map 98% reduce 24%
07/03/14 21:35:47 INFO mapred.JobClient:  map 99% reduce 24%
07/03/14 21:37:50 INFO mapred.JobClient:  map 99% reduce 25%
07/03/14 21:38:20 INFO mapred.JobClient:  map 99% reduce 26%
07/03/14 21:38:40 INFO mapred.JobClient:  map 99% reduce 27%
07/03/14 21:39:11 INFO mapred.JobClient:  map 99% reduce 28%
07/03/14 21:39:41 INFO mapred.JobClient:  map 99% reduce 29%
07/03/14 21:40:11 INFO mapred.JobClient:  map 99% reduce 30%
07/03/14 21:40:30 INFO mapred.JobClient:  map 99% reduce 31%
07/03/14 21:41:11 INFO mapred.JobClient:  map 99% reduce 32%


I have had task trackers crash in the past with the same configuration. Some 
of the tasks got rescheduled on different machines, others did not, and 
because of that the whole M/R job never recovered.

This is a "simulation" of a real environment, where computers crash.

Any help will be appreciated.


John



Re: Stalled M/R task

Posted by Andrzej Bialecki <ab...@getopt.org>.
Devaraj Das wrote:
> Maybe you should upgrade to the current trunk, or wait for 0.12.1 to be
> released (it won't be long). There are major fixes in trunk, and I suspect
> you may be hitting HADOOP-1060.
>

I had a similar situation with 0.10.1. As a temporary workaround, shutting 
down the offending tasktracker usually helped the job to complete.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: Stalled M/R task

Posted by Devaraj Das <dd...@yahoo-inc.com>.
Maybe you should upgrade to the current trunk, or wait for 0.12.1 to be
released (it won't be long). There are major fixes in trunk, and I suspect
you may be hitting HADOOP-1060.
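
A side note on the exception in the original post: the message
"no protocol: null&filter=stdout" is what java.net.URL reports when the
string it is given starts with the text "null", which typically happens when
a null base string is concatenated with a suffix. That suggests the client had
no log URL for the failed task (plausibly because its task tracker was lost),
so the exception may be a secondary symptom rather than the cause of the
stall. The thread does not confirm this. The sketch below only reproduces the
message with the plain JDK; the class name, the helper and the example URL are
made up for illustration and are not Hadoop code.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class NullTaskLogUrl {

    // Hypothetical helper (not part of Hadoop): appends a filter to a base
    // URL string and returns either the resulting URL or the error message.
    static String describe(String baseUrl) {
        try {
            // If baseUrl is null, the concatenation yields "null&filter=stdout",
            // which has no protocol, so the URL constructor throws.
            return new URL(baseUrl + "&filter=stdout").toString();
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    // Guarded variant: refuses to build a URL when the base is missing.
    static String describeSafely(String baseUrl) {
        if (baseUrl == null) {
            return "no log URL available for this task";
        }
        return describe(baseUrl);
    }

    public static void main(String[] args) {
        // Made-up address, used only to show the non-null case.
        String example = "http://tracker.example/tasklog?taskid=task_0007_m_000034_0";
        System.out.println(describe(null));
        System.out.println(describe(example));
        System.out.println(describeSafely(null));
    }
}
```

A null check like the one in describeSafely would only hide the stack trace
on the client side; it would not make the lost tasks get rescheduled.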
