Posted to dev@giraph.apache.org by Eli Reisman <ap...@gmail.com> on 2014/02/09 19:33:59 UTC

Re: BspServiceMaster: cleanedUpZooKeeper infinite loop

Hi Eric,

Can you take a look at GIRAPH-747? I think it is related. We had some issues in
the original pre-Hadoop-2.2 GA YARN profile that I had to shim. I think
something changed in the switchover to 2.2 or on the non-YARN side that
made this an issue again. We produce an ApplicationMaster _and_ a master
node while non-YARN only needs an extra master node, and as I recall this
is the underlying issue. The current proposed fix (as I read it) breaks
non-YARN Giraph.
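
For anyone following along, here is a rough sketch of the comparison Eric
describes below. It is not the actual BspServiceMaster code: the class and
method names are made up for illustration, and the real loop blocks on a
ZooKeeper watch rather than walking a list. It only replays the child counts
from Eric's log (1, 2, 5, 6, 9) against a target of 8 to show why an exact
match never fires while an at-least check does.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Giraph.
public class CleanupCheckSketch {

    // Index of the first observation where count == maxTasks, or -1 if the
    // exact-match condition is never met (the "stuck forever" case).
    static int exitIndexExact(List<Integer> observedCounts, int maxTasks) {
        for (int i = 0; i < observedCounts.size(); i++) {
            if (observedCounts.get(i) == maxTasks) {
                return i;
            }
        }
        return -1;
    }

    // Index of the first observation where count >= maxTasks, or -1 if the
    // count never reaches maxTasks.
    static int exitIndexAtLeast(List<Integer> observedCounts, int maxTasks) {
        for (int i = 0; i < observedCounts.size(); i++) {
            if (observedCounts.get(i) >= maxTasks) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Child counts as they appear in the quoted log, with 8 desired.
        List<Integer> counts = Arrays.asList(1, 2, 5, 6, 9);
        System.out.println("exact match exits at index: "
            + exitIndexExact(counts, 8));
        System.out.println("at-least exits at index: "
            + exitIndexAtLeast(counts, 8));
    }
}
```

The sketch says nothing about why 9 children show up for 8 tasks, or whether
relaxing the comparison is safe on the non-YARN side, which is the part I
would like confirmed.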

Mohammed, any input on this? If any non-YARN committers could take a peek
at this email thread and Eric's error and at GIRAPH-747 and confirm my
suspicion that this solution won't work as-is, that would be great.

Thanks all,

Eli




On Thu, Jan 30, 2014 at 9:49 AM, Eric Kimbrel <le...@gmail.com> wrote:

> Hello, I am currently not a contributor to this project but have noticed
> an issue I wanted to report here instead of on the users mailing list.
>
> I am using 1.1.0-SNAPSHOT built for PURE YARN and cdh5.0.0.
>
> I have an intermittent problem that, when it occurs, causes the job to
> stall after completion (but prior to vertices writing their output).
> Looking into the logs (posted below) I see that I go from 7 of 8 workers
> reporting completion to 9 of 8.  The code in BspServiceMaster:1740 uses
> cleanedUpChildrenList.size() == maxTasks inside of a while (true) loop, so
> the job gets stuck here forever and will never progress again.
>
> I plan on changing this locally to a >= for my own use to prevent this
> problem, but I don't know how 9 of 8 is being reported or how this problem
> is really happening.
>
> Thanks for any ideas,
> Eric
>
>
> 14/01/30 09:35:37 INFO master.BspServiceMaster: cleanUpZooKeeper: Got 1 of
> 8 desired children from
> /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir
> 14/01/30 09:35:37 INFO master.BspServiceMaster: cleanedUpZooKeeper:
> Waiting for the children of
> /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir to
> change since only got 1 nodes.
> 14/01/30 09:35:38 INFO bsp.BspService: process: cleanedUpChildrenChanged
> signaled
> 14/01/30 09:35:38 INFO master.BspServiceMaster: cleanUpZooKeeper: Got 2 of
> 8 desired children from
> /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir
> 14/01/30 09:35:38 INFO master.BspServiceMaster: cleanedUpZooKeeper:
> Waiting for the children of
> /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir to
> change since only got 2 nodes.
> 14/01/30 09:35:38 INFO bsp.BspService: process: cleanedUpChildrenChanged
> signaled
> 14/01/30 09:35:38 INFO master.BspServiceMaster: cleanUpZooKeeper: Got 5 of
> 8 desired children from
> /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir
> 14/01/30 09:35:38 INFO master.BspServiceMaster: cleanedUpZooKeeper:
> Waiting for the children of
> /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir to
> change since only got 5 nodes.
> 14/01/30 09:35:38 INFO bsp.BspService: process: cleanedUpChildrenChanged
> signaled
> 14/01/30 09:35:38 INFO master.BspServiceMaster: cleanUpZooKeeper: Got 6 of
> 8 desired children from
> /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir
> 14/01/30 09:35:38 INFO master.BspServiceMaster: cleanedUpZooKeeper:
> Waiting for the children of
> /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir to
> change since only got 6 nodes.
> 14/01/30 09:35:38 INFO bsp.BspService: process: cleanedUpChildrenChanged
> signaled
> 14/01/30 09:35:38 INFO master.BspServiceMaster: cleanUpZooKeeper: Got 9 of
> 8 desired children from
> /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir
> 14/01/30 09:35:38 INFO master.BspServiceMaster: cleanedUpZooKeeper:
> Waiting for the children of
> /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir to
> change since only got 9 nodes.