Posted to user@giraph.apache.org by Sebastian Schelter <ss...@apache.org> on 2013/01/21 13:59:30 UTC

Deadlock when running on Hadoop 1.0.4

Hi,

I'm testing a custom PageRank implementation using trunk on Hadoop
1.0.4. I seem to run into a deadlock after the input superstep.

The workers report "finishSuperstep: (all workers done) WORKER_ONLY -
Attempt=0, Superstep=0" and the master reports that all workers are done
with superstep -1.

I reproduced this using a local setup, and it seems to me that the
BspServiceMaster hangs indefinitely in the cleanUpZooKeeper method.

I'm not using a dedicated ZK instance; I just have Giraph start one. Any
ideas what can be done to fix my problem?

Best,
Sebastian


excerpt from jstack

"org.apache.giraph.master.MasterThread" prio=10 tid=0x00007f29fc385000 nid=0x29d1 waiting on condition [0x00007f2a09a5f000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000f38967d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2116)
        at org.apache.giraph.zk.PredicateLock.waitMsecs(PredicateLock.java:112)
        at org.apache.giraph.zk.PredicateLock.waitForever(PredicateLock.java:138)
        at org.apache.giraph.master.BspServiceMaster.cleanUpZooKeeper(BspServiceMaster.java:1602)
        at org.apache.giraph.master.BspServiceMaster.cleanup(BspServiceMaster.java:1692)
        at org.apache.giraph.master.MasterThread.run(MasterThread.java:144)
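
The trace above shows waitForever calling waitMsecs, which parks in a timed
condition wait. That is consistent with a loop that retries a timed wait until
some event is signalled, so a thread dump would usually catch the thread in
TIMED_WAITING even though the wait as a whole never ends. A minimal Java sketch
of that pattern follows. The class SketchPredicateLock, its fields, and the
one-second retry interval are hypothetical stand-ins for illustration, not
Giraph's actual PredicateLock code.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for a predicate lock; not Giraph's implementation. */
class SketchPredicateLock {
  private final Lock lock = new ReentrantLock();
  private final Condition condition = lock.newCondition();
  private boolean eventOccurred = false;

  /** Marks the event as having happened and wakes every waiter. */
  void signal() {
    lock.lock();
    try {
      eventOccurred = true;
      condition.signalAll();
    } finally {
      lock.unlock();
    }
  }

  /** Waits up to msecs for the event; returns true if it occurred in time. */
  boolean waitMsecs(long msecs) {
    long nanosLeft = TimeUnit.MILLISECONDS.toNanos(msecs);
    lock.lock();
    try {
      while (!eventOccurred) {
        if (nanosLeft <= 0) {
          return false; // timed out without a signal
        }
        // The thread parks here; a thread dump reports TIMED_WAITING (parking).
        nanosLeft = condition.awaitNanos(nanosLeft);
      }
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    } finally {
      lock.unlock();
    }
  }

  /** Retries timed waits until the event occurs; never returns if nobody signals. */
  void waitForever() {
    while (!waitMsecs(1000)) {
      // Assumed retry interval; a real implementation might log progress here.
    }
  }
}
```

In this sketch, a caller of waitForever() makes progress only once some other
thread calls signal(). If the signalling side has died or never runs, the caller
loops through timed waits indefinitely, which would look like the hang
described here.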



Re: Deadlock when running on Hadoop 1.0.4

Posted by Claudio Martella <cl...@gmail.com>.
Quite honestly, I do not believe it is connected with ZK. It is quite weird
that it does not pass tests in pseudo-distributed mode... I think it is
quite serious that we cannot run tests on 1.0, not even in
pseudo-distributed mode. I honestly do not know when the bug was
introduced. Looking at the logs, I think it MIGHT be connected with
multi-threading, but I cannot say for sure. What happens is that one worker
dies due to a Child Error during the computation of the last superstep (number
20), while the others succeed and idle at the barrier. The last log entry
for the failing worker is in the GraphMapper, when the worker
announces the number of threads and partitions it is going to compute. That
is right before the compute thread is created and started. But this is
mostly speculation, pending a thorough analysis.
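
To illustrate why the surviving workers would sit idle in that scenario, here is
a small Java sketch of a barrier that one participant never reaches. It uses a
CountDownLatch purely as an analogy; Giraph coordinates its barriers through
ZooKeeper, and the class BarrierHangSketch, the method runSuperstep, and its
parameters are all hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration only; not how Giraph implements its barriers. */
class BarrierHangSketch {
  /**
   * Starts `workers` threads. The worker whose index equals `failing` dies
   * before reaching the barrier (pass -1 for no failure). Returns true if
   * all workers reached the barrier within the timeout.
   */
  static boolean runSuperstep(int workers, int failing, long timeoutMsecs) {
    CountDownLatch barrier = new CountDownLatch(workers);
    for (int i = 0; i < workers; i++) {
      final int id = i;
      Thread worker = new Thread(() -> {
        if (id == failing) {
          return; // simulated crash: this worker never reports in
        }
        barrier.countDown(); // worker finished its share of the superstep
      });
      worker.setDaemon(true);
      worker.start();
    }
    try {
      // Without the timeout, this call would block for as long as one worker is missing.
      return barrier.await(timeoutMsecs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

With three workers and no failure, the barrier completes; if any one of them
dies first, the wait only ends because of the timeout. A barrier without a
timeout or failure detection would leave the healthy participants waiting.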


On Sat, Jan 26, 2013 at 12:02 AM, Eli Reisman <ap...@gmail.com> wrote:

> Interesting. Dedicated zk instance doesn't work with hadoop-2.0.x or trunk
> either when running Giraph on YARN/MRv2. I would like to look into this
> more if I have time. Anyone have any ideas? And does anyone have a definite
> timeline on how long this has been broken? Most of my work with Giraph last
> summer was on a cluster with its own ZK, so I have not used the feature
> much. I do remember it working on a 1.0.something Hadoop profile at maybe
> Christmas of 2011? But that was a long time ago...


-- 
   Claudio Martella
   claudio.martella@gmail.com

Re: Deadlock when running on Hadoop 1.0.4

Posted by Eli Reisman <ap...@gmail.com>.
Interesting. Dedicated zk instance doesn't work with hadoop-2.0.x or trunk
either when running Giraph on YARN/MRv2. I would like to look into this
more if I have time. Anyone have any ideas? And does anyone have a definite
timeline on how long this has been broken? Most of my work with Giraph last
summer was on a cluster with its own ZK, so I have not used the feature
much. I do remember it working on a 1.0.something Hadoop profile at maybe
Christmas of 2011? But that was a long time ago...


On Fri, Jan 25, 2013 at 3:07 AM, Sebastian Schelter <ss...@apache.org> wrote:

> Hi,
>
> I get exactly the same deadlock when using a dedicated (non-distributed)
> ZK instance. I tried 3.3.6 and 3.4.5.
>
> I haven't used Giraph for a while, so I can't say whether this has
> worked recently...
>
> Best,
> Sebastian

Re: Deadlock when running on Hadoop 1.0.4

Posted by Sebastian Schelter <ss...@apache.org>.
Hi,

I get exactly the same deadlock when using a dedicated (non-distributed)
ZK instance. I tried 3.3.6 and 3.4.5.

I haven't used Giraph for a while, so I can't say whether this has
worked recently...

Best,
Sebastian



On 23.01.2013 05:14, Eli Reisman wrote:
> Hi Sebastian,
> 
> This seems to be a new issue related to our recent upgrade to
> multithreading. I have not seen this before. All my other emails were
> related to the array index out of bounds error you found over the weekend.
>
> However, I have had trouble with the local ZK instance for some time now on
> a number of Giraph profiles and pretty much exclusively use a separate ZK
> instance of my own. Last summer I was running a lot of jobs on a 1.0.x
> Hadoop cluster with Giraph, and I was told to use the on-cluster dedicated
> ZK quorum due to "problems" with Giraph's local ZK instantiation. No one
> got more specific with me than that. I also can't get the local ZK
> instances to come up on Hadoop-2.0.x -- perhaps this feature of Giraph has
> had problems for a while. Was it working for you recently?
>
> If you notice any other clues as to the cause, please post them. I'm hoping
> to do some work around this soon.


Re: Deadlock when running on Hadoop 1.0.4

Posted by Eli Reisman <ap...@gmail.com>.
Hi Sebastian,

This seems to be a new issue related to our recent upgrade to
multithreading. I have not seen this before. All my other emails were
related to the array index out of bounds error you found over the weekend.

However, I have had trouble with the local ZK instance for some time now on
a number of Giraph profiles and pretty much exclusively use a separate ZK
instance of my own. Last summer I was running a lot of jobs on a 1.0.x
Hadoop cluster with Giraph, and I was told to use the on-cluster dedicated
ZK quorum due to "problems" with Giraph's local ZK instantiation. No one
got more specific with me than that. I also can't get the local ZK
instances to come up on Hadoop-2.0.x -- perhaps this feature of Giraph has
had problems for a while. Was it working for you recently?

If you notice any other clues as to the cause, please post them. I'm hoping
to do some work around this soon.

On Tue, Jan 22, 2013 at 11:05 AM, Claudio Martella <
claudio.martella@gmail.com> wrote:

> Hi Sebastian,
>
> I do not know what is happening; I am also having problems with jobs
> blocking while waiting to set up the ZooKeeper instance.
> We should look into this.
>
> Best,
> Claudio

Re: Deadlock when running on Hadoop 1.0.4

Posted by Claudio Martella <cl...@gmail.com>.
Hi Sebastian,

I do not know what is happening; I am also having problems with jobs blocking
while waiting to set up the ZooKeeper instance.
We should look into this.

Best,
Claudio


On Mon, Jan 21, 2013 at 1:59 PM, Sebastian Schelter <ss...@apache.org> wrote:

> Hi,
>
> I'm testing a custom PageRank implementation using trunk on Hadoop
> 1.0.4. I seem to run into a deadlock after the input superstep.
>
> The workers report "finishSuperstep: (all workers done) WORKER_ONLY -
> Attempt=0, Superstep=0" and the master reports that all workers are done
> with superstep -1.
>
> I reproduced this using a local setup, and it seems to me that the
> BspServiceMaster hangs indefinitely in the cleanUpZooKeeper method.
>
> I'm not using a dedicated ZK instance; I just have Giraph start one. Any
> ideas what can be done to fix my problem?
>
> Best,
> Sebastian
>
>
> excerpt from jstack
>
> "org.apache.giraph.master.MasterThread" prio=10 tid=0x00007f29fc385000 nid=0x29d1 waiting on condition [0x00007f2a09a5f000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00000000f38967d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2116)
>         at org.apache.giraph.zk.PredicateLock.waitMsecs(PredicateLock.java:112)
>         at org.apache.giraph.zk.PredicateLock.waitForever(PredicateLock.java:138)
>         at org.apache.giraph.master.BspServiceMaster.cleanUpZooKeeper(BspServiceMaster.java:1602)
>         at org.apache.giraph.master.BspServiceMaster.cleanup(BspServiceMaster.java:1692)
>         at org.apache.giraph.master.MasterThread.run(MasterThread.java:144)
>
>
>


-- 
   Claudio Martella
   claudio.martella@gmail.com