Posted to common-user@hadoop.apache.org by Jason Venner <ja...@attributor.com> on 2008/09/01 15:38:27 UTC

Re: Timeouts at reduce stage

We have trouble with that also, particularly when we have JMX enabled in 
our jobs.
We have modified the /main/ that launches the children of the task 
tracker to explicitly exit in its finally block. That helps substantially.
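
The shape of the change is roughly this (a minimal sketch of the idea, 
not our exact patch; runTask() is an illustrative stand-in for the real 
task code):

    public class Child {
      public static void main(String[] args) {
        int rc = 0;
        try {
          runTask(args); // stand-in for the actual map/reduce task logic
        } catch (Throwable t) {
          t.printStackTrace();
          rc = 1;
        } finally {
          // Force the JVM down even when non-daemon threads (e.g. JMX
          // connector threads) would otherwise keep it alive.
          System.exit(rc);
        }
      }

      private static void runTask(String[] args) {
        // ... run the task ...
      }
    }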

We also have some jobs that do not seem to be killable by the 
Process.destroy method; we suspect badly behaved external libraries 
being used via JNI.
This is in 0.16.
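
For context, the kill path amounts to something like the following (a 
sketch with an illustrative command, not the actual TaskTracker code):

    import java.io.IOException;

    public class KillSketch {
      public static void main(String[] args)
          throws IOException, InterruptedException {
        Process child = new ProcessBuilder("/bin/sleep", "600").start();
        // destroy() sends SIGTERM on Unix. A child wedged inside native
        // (JNI) code may never reach a point where the signal is acted
        // on, so waitFor() can block indefinitely.
        child.destroy();
        int rc = child.waitFor();
        System.out.println("child exited with " + rc);
      }
    }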

Иван wrote:
> Thank you, this suggestion seems to be very close to the real situation. The cluster had already been left looping these (relatively) frequently failing MapReduce jobs for a long period of time to produce a clearer picture of the problem, and I tried to investigate this suggestion more closely when I read it. A look at the Ganglia monitoring system running on that same cluster made it clear that the cluster's computing resources were exhausted. The next step was simple and straightforward - log in to one random node and find out what was consuming the server's resources. The answer became clear almost instantly, because the top and jps commands produced a huge list of orphaned TaskTracker$Child processes consuming tons of CPU time and RAM (in fact, almost all of it). Some nodes had even run out of their 16 GB of RAM plus a few GB of swap and stopped responding at all. 
>
> This situation apparently isn't normal. I am going to try to repeat the test with some simpler jobs (probably something from the Hadoop distribution, to make sure the code itself is fine) to find out more definitely whether this orphaning of forked processes depends on the exact MR job being run or not (theoretically it could still be something wrong with the Hadoop/HBase configuration, or maybe even with the operating system, some additionally installed software or, as was suggested earlier, the hardware).
>
> I would be glad if someone could help me along with some advice (googling this topic has already proved hard because $ is treated as a separator, so searches usually turn up material about real children). Maybe this situation is quite common and there is a definite reason or solution?
>
> Thanks!
>
> Ivan Blinkov
>
> -----Original Message-----
> From: Karl Anderson <kr...@monkey.org>
> To: core-user@hadoop.apache.org
> Date: Fri, 29 Aug 2008 13:17:18 -0700
> Subject: Re: Timeouts at reduce stage
>
>   
>> On 29-Aug-08, at 3:53 AM, Иван wrote:
>>
>>     
>>> Thanks for the fast reply, but in fact it sometimes fails even on  
>>> default MR jobs like, for example, the rowcounter job from the HBase  
>>> 0.2.0 distribution. Hardware problems are theoretically possible, but  
>>> they don't seem to be the case because everything else is operating  
>>> fine on the same set of servers. It seems that all major components  
>>> of each server are fine; even the disk arrays are regularly checked  
>>> by datacenter staff.
>>>       
>> It could be due to a resource problem; I've found these hard to debug  
>> at times.  Tasks or parts of the framework can fail due to other tasks  
>> using up resources, and sometimes the errors you see don't make the  
>> cause easy to find.  I've had memory consumption in a mapper cause  
>> errors in other mappers, reducers, and fetching HDFS blocks, as well  
>> as job infrastructure failures that I don't really understand (for  
>> example, one task unable to find a file that was put in a job jar and  
>> found by other tasks).  I think all of my timeouts have been  
>> straightforward, but I could imagine resource consumption causing that  
>> in an otherwise unrelated task - IO blocking, swap, etc.
>>
>>     
>
>   
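
One mitigation for the kind of memory pressure Karl describes is to cap 
the heap of each child JVM. A minimal sketch, assuming the stock 
mapred.child.java.opts property (the 512 MB figure is illustrative):

    import org.apache.hadoop.mapred.JobConf;

    public class MemoryCapSketch {
      public static JobConf configure() {
        JobConf conf = new JobConf(MemoryCapSketch.class);
        // Cap each TaskTracker child JVM so one hungry mapper cannot
        // exhaust the node's RAM and starve other tasks and daemons.
        conf.set("mapred.child.java.opts", "-Xmx512m");
        return conf;
      }
    }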

Re: Timeouts at reduce stage

Posted by 叶双明 <ye...@gmail.com>.
:)

2008/9/5, Doug Cutting <cu...@apache.org>:
>
> Jason Venner wrote:
>
>> We have modified the /main/ that launches the children of the task tracker
>> to explicitly exit in its finally block. That helps substantially.
>>
>
> Have you submitted this as a patch?
>
> Doug
>

Re: Timeouts at reduce stage

Posted by Doug Cutting <cu...@apache.org>.
Jason Venner wrote:
> We have modified the /main/ that launches the children of the task 
> tracker to explicitly exit in its finally block. That helps substantially.

Have you submitted this as a patch?

Doug