Posted to common-user@hadoop.apache.org by Calvin Yu <cs...@gmail.com> on 2007/06/01 17:20:15 UTC

Bad concurrency bug in 0.12.3?

I've been experiencing some issues where my mapred tasks have been
hanging after a lengthy period of execution.  I believe I've found the
problem and wanted to get others' thoughts about it.

The problem seems to be with the MapTask's (MapTask.java) sort
progress thread (line #196) not stopping after the sort is completed,
and hence the call to join() (line# 190) never returns.  This is
because that thread is only catching the InterruptedException, and not
checking the thread's interrupted flag as well.  According to the
Javadocs, an InterruptedException is thrown only if the Thread is in
the middle of the sleep(), wait(), join(), etc. calls, and during
normal operations only the interrupted flag is set.  Can someone
confirm this?  I'm going to patch my install to see if this is my
problem, but I only seem to run into it after several hours of
processing and would like earlier confirmation.
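
To make the failure mode I suspect concrete, here's a rough sketch of the
pattern I mean -- illustrative only, not the actual MapTask code; the class
name, the progress call, and the interval are made up:

public class SortProgressSketch {
  public static void main(String[] args) throws InterruptedException {
    Thread sortProgress = new Thread() {
      public void run() {
        while (true) {                          // never checks isInterrupted()
          try {
            System.out.println("sorting...");   // stand-in for reporting progress
            Thread.sleep(3000);
          } catch (InterruptedException e) {
            return;                             // exits only if interrupted inside sleep()
          }
        }
      }
    };
    sortProgress.start();
    // ... the sort itself would run here ...
    sortProgress.interrupt();  // if the flag were merely set while the thread is between
    sortProgress.join();       // sleeps, and sleep() ignored a pending flag, this would hang
  }
}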

I did a search in JIRA and it looks like there are patches
(HADOOP-1431) that might inadvertently solve this problem, but I didn't
see any one ticket that specifically details this scenario.

Calvin

Re: Bad concurrency bug in 0.12.3?

Posted by Owen O'Malley <oo...@yahoo-inc.com>.
I've looked over the code and it looks right. I like using
InterruptedException for telling threads to stop. The only gotcha is
that a lot of the old Hadoop code ignores InterruptedException. But  
looking at the code in that thread, there is only one handler and it  
re-interrupts the thread. So it should be fine. If you can get a  
stack trace, I would certainly like to see it.

One side note is that all of the servers have a servlet such that if  
you do http://<node>:<port>/stacks you'll get a stack trace of all  
the threads in the server. I find that useful for remote debugging.  
*smile* Although if it is a task JVM that has the problem, then there
isn't a server for it.
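
(If you'd rather grab that page programmatically than in a browser, something
like the following works; the host and port below are just placeholders --
substitute whichever daemon you're debugging:)

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class FetchStacks {
  public static void main(String[] args) throws Exception {
    // Placeholder address; point it at the web port of the daemon in question.
    URL stacks = new URL("http://localhost:50030/stacks");
    BufferedReader in = new BufferedReader(new InputStreamReader(stacks.openStream()));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);   // print each line of the response
      }
    } finally {
      in.close();
    }
  }
}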

-- Owen

Re: Bad concurrency bug in 0.12.3?

Posted by Calvin Yu <cs...@gmail.com>.
Here's a thread dump of the problem.  When I kicked off a job this
weekend, it actually completed.  I kicked off another one yesterday
and got the problem.

Calvin


On 6/1/07, Doug Cutting <cu...@apache.org> wrote:
> Calvin Yu wrote:
> > public class Test {
> >  public static void main(String[] args) {
> >    System.out.println("interrupting..");
> >    Thread.currentThread().interrupt();
> >    try {
> >      Thread.sleep(100);
> >      System.out.println("done.");
> >    } catch (InterruptedException e) {
> >      e.printStackTrace();
> >    }
> >  }
> > }
> >
> > Granted, this is an over-simplified test, and won't test for JVM bugs.
>
> Yes, but it does show that's probably the intended behavior: an
> interrupt should be sufficient, even if it doesn't arrive during the
> call to sleep().  We still don't know why join() hung, if it's a JVM
> bug, or if there's some bug in Hadoop.  In either case, I think we can
> defensively code this without the use of join().
>
> Doug
>

Re: Bad concurrency bug in 0.12.3?

Posted by Doug Cutting <cu...@apache.org>.
Calvin Yu wrote:
> public class Test {
>  public static void main(String[] args) {
>    System.out.println("interrupting..");
>    Thread.currentThread().interrupt();
>    try {
>      Thread.sleep(100);
>      System.out.println("done.");
>    } catch (InterruptedException e) {
>      e.printStackTrace();
>    }
>  }
> }
> 
> Granted, this is an over-simplified test, and won't test for JVM bugs.

Yes, but it does show that's probably the intended behavior: an 
interrupt should be sufficient, even if it doesn't arrive during the 
call to sleep().  We still don't know why join() hung, if it's a JVM 
bug, or if there's some bug in Hadoop.  In either case, I think we can 
defensively code this without the use of join().

Doug

Re: Bad concurrency bug in 0.12.3?

Posted by Calvin Yu <cs...@gmail.com>.
public class Test {
  public static void main(String[] args) {
    System.out.println("interrupting..");
    Thread.currentThread().interrupt();
    try {
      Thread.sleep(100);
      System.out.println("done.");
    } catch (InterruptedException e) {
      e.printStackTrace();
    }
  }
}

Granted, this is an over-simplified test, and won't test for JVM bugs.

Calvin


On 6/1/07, Doug Cutting <cu...@apache.org> wrote:
> Are you certain that interrupt() is called before sleep()?  If
> interrupt() is called during the sleep() then it should clearly throw
> the InterruptedException.  The question is whether it is thrown if the
> call to interrupt() precedes the call to sleep().  Please feel free to
> post your test program.  And, yes, the thread dump would be most welcome.
>
> Cheers,
>
> Doug
>
> Calvin Yu wrote:
> > You're right Doug, I ran a simple test to verify that interrupt() will
> > result in an InterruptedException on a call to sleep(), so my hang-up
> > problem is something else.  I'm going to rerun my job and post a
> > thread dump of the hang-up.
> >
> > Calvin
> >
> >
> > On 6/1/07, Doug Cutting <cu...@apache.org> wrote:
> >> Calvin Yu wrote:
> >> > The problem seems to be with the MapTask's (MapTask.java) sort
> >> > progress thread (line #196) not stopping after the sort is completed,
> >> > and hence the call to join() (line# 190) never returns.  This is
> >> > because that thread is only catching the InterruptedException, and not
> >> > checking the thread's interrupted flag as well.  According to the
> >> > Javadocs, an InterruptedException is thrown only if the Thread is in
> >> > the middle of the sleep(), wait(), join(), etc. calls, and during
> >> > normal operations only the interrupted flag is set.
> >>
> >> I think that, if a thread is interrupted, and its interrupt flag is set,
> >> and sleep() is called, then sleep() should immediately throw an
> >> InterruptedException.  That's what the javadoc implies to me:
> >>
> >>    Throws: InterruptedException - if another thread has interrupted
> >>    the current thread.
> >>
> >> So this could be a JVM bug, or perhaps that's not the contract.
> >>
> >> I think we should fix this as a part of HADOOP-1431.  We should change
> >> that to use the mechanism we use elsewhere.  We should have a 'running'
> >> flag that's checked in the thread's main loop, and a method to stop the
> >> thread that sets this flag to false and interrupts it.  That works
> >> reliably in many places.
> >>
> >> Perhaps for the 0.14 release this logic should be abstracted into a base
> >> class for Daemon threads, so that we don't re-invent it everywhere.
> >>
> >> Doug
> >>
>

Re: Bad concurrency bug in 0.12.3?

Posted by Doug Cutting <cu...@apache.org>.
Are you certain that interrupt() is called before sleep()?  If 
interrupt() is called during the sleep() then it should clearly throw 
the InterruptedException.  The question is whether it is thrown if the 
call to interrupt() precedes the call to sleep().  Please feel free to 
post your test program.  And, yes, the thread dump would be most welcome.

Cheers,

Doug

Calvin Yu wrote:
> You're right Doug, I ran a simple test to verify that interrupt() will
> result in an InterruptedException on a call to sleep(), so my hang-up
> problem is something else.  I'm going to rerun my job and post a
> thread dump of the hang-up.
> 
> Calvin
> 
> 
> On 6/1/07, Doug Cutting <cu...@apache.org> wrote:
>> Calvin Yu wrote:
>> > The problem seems to be with the MapTask's (MapTask.java) sort
>> > progress thread (line #196) not stopping after the sort is completed,
>> > and hence the call to join() (line# 190) never returns.  This is
>> > because that thread is only catching the InterruptedException, and not
>> > checking the thread's interrupted flag as well.  According to the
>> > Javadocs, an InterruptedException is thrown only if the Thread is in
>> > the middle of the sleep(), wait(), join(), etc. calls, and during
>> > normal operations only the interrupted flag is set.
>>
>> I think that, if a thread is interrupted, and its interrupt flag is set,
>> and sleep() is called, then sleep() should immediately throw an
>> InterruptedException.  That's what the javadoc implies to me:
>>
>>    Throws: InterruptedException - if another thread has interrupted
>>    the current thread.
>>
>> So this could be a JVM bug, or perhaps that's not the contract.
>>
>> I think we should fix this as a part of HADOOP-1431.  We should change
>> that to use the mechanism we use elsewhere.  We should have a 'running'
>> flag that's checked in the thread's main loop, and a method to stop the
>> thread that sets this flag to false and interrupts it.  That works
>> reliably in many places.
>>
>> Perhaps for the 0.14 release this logic should be abstracted into a base
>> class for Daemon threads, so that we don't re-invent it everywhere.
>>
>> Doug
>>

Re: Bad concurrency bug in 0.12.3?

Posted by Calvin Yu <cs...@gmail.com>.
You're right Doug, I ran a simple test to verify that interrupt() will
result in an InterruptedException on a call to sleep(), so my hang-up
problem is something else.  I'm going to rerun my job and post a
thread dump of the hang-up.

Calvin


On 6/1/07, Doug Cutting <cu...@apache.org> wrote:
> Calvin Yu wrote:
> > The problem seems to be with the MapTask's (MapTask.java) sort
> > progress thread (line #196) not stopping after the sort is completed,
> > and hence the call to join() (line# 190) never returns.  This is
> > because that thread is only catching the InterruptedException, and not
> > checking the thread's interrupted flag as well.  According to the
> > Javadocs, an InterruptedException is thrown only if the Thread is in
> > the middle of the sleep(), wait(), join(), etc. calls, and during
> > normal operations only the interrupted flag is set.
>
> I think that, if a thread is interrupted, and its interrupt flag is set,
> and sleep() is called, then sleep() should immediately throw an
> InterruptedException.  That's what the javadoc implies to me:
>
>    Throws: InterruptedException - if another thread has interrupted
>    the current thread.
>
> So this could be a JVM bug, or perhaps that's not the contract.
>
> I think we should fix this as a part of HADOOP-1431.  We should change
> that to use the mechanism we use elsewhere.  We should have a 'running'
> flag that's checked in the thread's main loop, and a method to stop the
> thread that sets this flag to false and interrupts it.  That works
> reliably in many places.
>
> Perhaps for the 0.14 release this logic should be abstracted into a base
> class for Daemon threads, so that we don't re-invent it everywhere.
>
> Doug
>

Re: Bad concurrency bug in 0.12.3?

Posted by Doug Cutting <cu...@apache.org>.
Calvin Yu wrote:
> The problem seems to be with the MapTask's (MapTask.java) sort
> progress thread (line #196) not stopping after the sort is completed,
> and hence the call to join() (line# 190) never returns.  This is
> because that thread is only catching the InterruptedException, and not
> checking the thread's interrupted flag as well.  According to the
> Javadocs, an InterruptedException is thrown only if the Thread is in
> the middle of the sleep(), wait(), join(), etc. calls, and during
> normal operations only the interrupted flag is set.

I think that, if a thread is interrupted, and its interrupt flag is set, 
and sleep() is called, then sleep() should immediately throw an 
InterruptedException.  That's what the javadoc implies to me:

   Throws: InterruptedException - if another thread has interrupted
   the current thread.

So this could be a JVM bug, or perhaps that's not the contract.

I think we should fix this as a part of HADOOP-1431.  We should change 
that to use the mechanism we use elsewhere.  We should have a 'running' 
flag that's checked in the thread's main loop, and a method to stop the 
thread that sets this flag to false and interrupts it.  That works 
reliably in many places.

Perhaps for the 0.14 release this logic should be abstracted into a base 
class for Daemon threads, so that we don't re-invent it everywhere.
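
Roughly something of this shape -- just a sketch, and the class and method
names here are made up, not existing Hadoop code:

public abstract class StoppableDaemon extends Thread {
  private volatile boolean running = true;

  public StoppableDaemon() {
    setDaemon(true);              // don't keep the JVM alive on our account
  }

  /** Ask the thread to stop: clear the flag, then interrupt any blocking call. */
  public void shutdown() {
    running = false;
    interrupt();
  }

  public void run() {
    while (running) {             // the flag, not the interrupt, decides when to exit
      try {
        doWork();
      } catch (InterruptedException e) {
        // fall through; the while condition determines whether we actually stop
      }
    }
  }

  /** One iteration of the thread's main loop, supplied by the subclass. */
  protected abstract void doWork() throws InterruptedException;
}

Callers would then invoke shutdown() rather than a bare interrupt()/join(),
and any join() afterwards should return promptly once the current iteration
finishes.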

Doug

RE: Bad concurrency bug in 0.12.3?

Posted by Devaraj Das <dd...@yahoo-inc.com>.
Looks like you have a good point! I think you are right.
Let me raise a JIRA to handle this issue more generally, i.e., fix all
the places where this kind of check needs to be done.

-----Original Message-----
From: Calvin Yu [mailto:csyu77@gmail.com] 
Sent: Friday, June 01, 2007 8:50 PM
To: hadoop-user@lucene.apache.org
Subject: Bad concurrency bug in 0.12.3?

I've been experiencing some issues where my mapred tasks have been hanging
after a lengthy period of execution.  I believe I've found the problem and
wanted to get others' thoughts about it.

The problem seems to be with the MapTask's (MapTask.java) sort progress
thread (line #196) not stopping after the sort is completed, and hence the
call to join() (line# 190) never returns.  This is because that thread is
only catching the InterruptedException, and not checking the thread's
interrupted flag as well.  According to the Javadocs, an
InterruptedException is thrown only if the Thread is in the middle of the
sleep(), wait(), join(), etc. calls, and during normal operations only the
interrupted flag is set.  Can someone confirm this?  I'm going to patch my
install to see if this is my problem, but I only seem to run into it after
several hours of processing and would like earlier confirmation.

I did a search in JIRA and it looks like there are patches
(HADOOP-1431) that might inadvertently solve this problem, but I didn't see
any one ticket that specifically details this scenario.

Calvin