Posted to common-user@hadoop.apache.org by Calvin Yu <cs...@gmail.com> on 2007/06/01 17:20:15 UTC
Bad concurrency bug in 0.12.3?
I've been experiencing some issues where my mapred tasks have been
hanging after a lengthy period of execution. I believe I've found the
problem and wanted to get others' thoughts about it.
The problem seems to be with the MapTask's (MapTask.java) sort
progress thread (line #196) not stopping after the sort is completed,
and hence the call to join() (line #190) never returns. This is
because that thread is only catching the InterruptedException, and not
checking the thread's interrupted flag as well. According to the
Javadocs, an InterruptedException is thrown only if the Thread is in
the middle of the sleep(), wait(), join(), etc. calls, and during
normal operations only the interrupted flag is set. Can someone
confirm this? I'm going to patch my install to see if this is my
problem, but I seem to only run into this problem after several hours
of processing and would like to get earlier confirmation.
I did a search in JIRA and it looks like there are patches
(HADOOP-1431) that might inadvertently solve this problem, but I didn't
see any one ticket that specifically details this scenario.
Calvin
Re: Bad concurrency bug in 0.12.3?
Posted by Owen O'Malley <oo...@yahoo-inc.com>.
I've looked over the code and it looks right. I like the
InterruptedException for telling threads to stop. The only gotcha is
that a lot of the old Hadoop code ignores InterruptedException. But
looking at the code in that thread, there is only one handler and it
re-interrupts the thread. So it should be fine. If you can get a
stack trace, I would certainly like to see it.
One side note is that all of the servers have a servlet such that if
you do http://<node>:<port>/stacks you'll get a stack trace of all
the threads in the server. I find that useful for remote debugging.
*smile* Although if it is a task JVM that has the problem, then there
isn't a server for it.
-- Owen
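For a task JVM that has no /stacks servlet, one option is to produce the dump from inside the process. The following is a minimal sketch, assuming Java 5 or later, that uses the standard Thread.getAllStackTraces() call. The class name StackDump and the method dumpAllThreads() are hypothetical and are not part of Hadoop.

```java
import java.util.Map;

public class StackDump {

    /** Formats the name, state, and stack frames of every live thread in this JVM. */
    public static String dumpAllThreads() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = entry.getKey();
            out.append("Thread \"").append(t.getName())
               .append("\" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical usage: print the dump to stdout.
        System.out.print(dumpAllThreads());
    }
}
```

On Unix-like systems, sending SIGQUIT to a HotSpot JVM (kill -QUIT <pid>) should also print a thread dump to the process's standard output without any code changes.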
Re: Bad concurrency bug in 0.12.3?
Posted by Calvin Yu <cs...@gmail.com>.
Here's a thread dump of the problem. When I kicked off a job this
weekend, it actually completed. I kicked off another one yesterday
and got the problem.
Calvin
Re: Bad concurrency bug in 0.12.3?
Posted by Doug Cutting <cu...@apache.org>.
Calvin Yu wrote:
> public class Test {
>     public static void main(String[] args) {
>         System.out.println("interrupting..");
>         Thread.currentThread().interrupt();
>         try {
>             Thread.sleep(100);
>             System.out.println("done.");
>         } catch (InterruptedException e) {
>             e.printStackTrace();
>         }
>     }
> }
>
> Granted, this is an over-simplified test, and won't test for JVM bugs.
Yes, but it does show that's probably the intended behavior: an
interrupt should be sufficient, even if it doesn't arrive during the
call to sleep(). We still don't know why join() hung, if it's a JVM
bug, or if there's some bug in Hadoop. In either case, I think we can
defensively code this without the use of join().
Doug
Re: Bad concurrency bug in 0.12.3?
Posted by Calvin Yu <cs...@gmail.com>.
public class Test {
    public static void main(String[] args) {
        System.out.println("interrupting..");
        Thread.currentThread().interrupt();
        try {
            Thread.sleep(100);
            System.out.println("done.");
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
Granted, this is an over-simplified test, and won't test for JVM bugs.
Calvin
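A slightly extended variant of the test above may help show what happens to the interrupt flag itself. This is only a sketch: the class name InterruptFlagDemo and its helper method are hypothetical. It relies on the documented behaviour of Thread.sleep(), which clears the interrupted status when it throws InterruptedException, and that is why a handler that wants the interrupt to remain visible has to re-interrupt the thread.

```java
public class InterruptFlagDemo {

    /**
     * Sets the interrupt flag on the current thread, then calls sleep().
     * Returns true if sleep() threw InterruptedException.
     */
    public static boolean sleepThrowsWhenPreInterrupted() {
        Thread.currentThread().interrupt();
        try {
            Thread.sleep(100);
            return false;
        } catch (InterruptedException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        boolean threw = sleepThrowsWhenPreInterrupted();
        System.out.println("sleep() threw: " + threw);

        // Throwing InterruptedException clears the interrupted status.
        System.out.println("flag after catch: "
                + Thread.currentThread().isInterrupted());

        // A handler that cannot fully handle the interrupt restores the flag
        // so that code further up the stack can still see it.
        Thread.currentThread().interrupt();
        System.out.println("flag after re-interrupt: "
                + Thread.currentThread().isInterrupted());

        // Clear the flag again so the demo exits with a clean state.
        Thread.interrupted();
    }
}
```

On a conforming JVM this should report that sleep() threw, that the flag was cleared by the exception, and that the flag is set again after the explicit re-interrupt.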
Re: Bad concurrency bug in 0.12.3?
Posted by Doug Cutting <cu...@apache.org>.
Are you certain that interrupt() is called before sleep()? If
interrupt() is called during the sleep() then it should clearly throw
the InterruptedException. The question is whether it is thrown if the
call to interrupt() precedes the call to sleep(). Please feel free to
post your test program. And, yes, the thread dump would be most welcome.
Cheers,
Doug
Re: Bad concurrency bug in 0.12.3?
Posted by Calvin Yu <cs...@gmail.com>.
You're right, Doug. I ran a simple test to verify that interrupt() will
result in an InterruptedException on a call to sleep(), so my hang
is caused by something else. I'm going to rerun my job and post a
thread dump of the hang.
Calvin
Re: Bad concurrency bug in 0.12.3?
Posted by Doug Cutting <cu...@apache.org>.
Calvin Yu wrote:
> The problem seems to be with the MapTask's (MapTask.java) sort
> progress thread (line #196) not stopping after the sort is completed,
> and hence the call to join() (line #190) never returns. This is
> because that thread is only catching the InterruptedException, and not
> checking the thread's interrupted flag as well. According to the
> Javadocs, an InterruptedException is thrown only if the Thread is in
> the middle of the sleep(), wait(), join(), etc. calls, and during
> normal operations only the interrupted flag is set.
I think that, if a thread is interrupted, and its interrupt flag is set,
and sleep() is called, then sleep() should immediately throw an
InterruptedException. That's what the javadoc implies to me:
    Throws: InterruptedException - if another thread has interrupted
    the current thread.
So this could be a JVM bug, or perhaps that's not the contract.
I think we should fix this as a part of HADOOP-1431. We should change
that to use the mechanism we use elsewhere. We should have a 'running'
flag that's checked in the thread's main loop, and a method to stop the
thread that sets this flag to false and interrupts it. That works
reliably in many places.
Perhaps for the 0.14 release this logic should be abstracted into a base
class for Daemon threads, so that we don't re-invent it everywhere.
Doug
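The stop mechanism Doug describes above, together with the suggested base class, could look roughly like the following. This is a minimal sketch under stated assumptions, not the code that went into Hadoop: the names StoppableDaemon, doWork(), and shutdown() are hypothetical, and the 'running' flag is declared volatile so that the write in shutdown() is visible to the worker thread.

```java
public abstract class StoppableDaemon extends Thread {

    // Checked on every pass of the main loop; volatile for cross-thread visibility.
    private volatile boolean running = true;
    private final long intervalMillis;

    protected StoppableDaemon(String name, long intervalMillis) {
        super(name);
        this.intervalMillis = intervalMillis;
        setDaemon(true);
    }

    /** One unit of periodic work, for example reporting sort progress. */
    protected abstract void doWork();

    @Override
    public void run() {
        while (running) {
            doWork();
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Woken early; the loop condition decides whether to exit.
            }
        }
    }

    /** Clears the flag first, then interrupts to wake the thread from sleep(). */
    public void shutdown() {
        running = false;
        interrupt();
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical usage: a progress reporter that ticks every 100 ms.
        StoppableDaemon progress = new StoppableDaemon("sort-progress", 100) {
            @Override
            protected void doWork() {
                System.out.println("reporting progress");
            }
        };
        progress.start();
        Thread.sleep(350);
        progress.shutdown();
        progress.join(5000); // bounded wait, so the caller cannot hang forever
        System.out.println("stopped: " + !progress.isAlive());
    }
}
```

Because the loop exits on the flag and not on the exception, the thread stops whether the interrupt arrives during sleep() or during doWork(). Using join() with a timeout in the caller is one way to code defensively against a worker that never terminates.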
RE: Bad concurrency bug in 0.12.3?
Posted by Devaraj Das <dd...@yahoo-inc.com>.
Looks like you have a good point! I think you are right.
Let me raise a JIRA issue to handle this more generally, i.e., to fix
every place where this kind of check needs to be done.
-----Original Message-----
From: Calvin Yu [mailto:csyu77@gmail.com]
Sent: Friday, June 01, 2007 8:50 PM
To: hadoop-user@lucene.apache.org
Subject: Bad concurrency bug in 0.12.3?
I've been experiencing some issues where my mapred tasks have been hanging
after a lengthy period of execution. I believe I've found the problem and
wanted to get others' thoughts about it.
The problem seems to be with the MapTask's (MapTask.java) sort progress
thread (line #196) not stopping after the sort is completed, and hence the
call to join() (line #190) never returns. This is because that thread is
only catching the InterruptedException, and not checking the thread's
interrupted flag as well. According to the Javadocs, an
InterruptedException is thrown only if the Thread is in the middle of the
sleep(), wait(), join(), etc. calls, and during normal operations only the
interrupted flag is set. Can someone confirm this? I'm going to patch my
install to see if this is my problem, but I seem to only run into this
problem after several hours of processing and would like to get earlier
confirmation.
I did a search in JIRA and it looks like there are patches
(HADOOP-1431) that might inadvertently solve this problem, but didn't see
any one ticket that specifically details this scenario.
Calvin