Posted to dev@giraph.apache.org by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org> on 2011/09/04 07:28:09 UTC
[jira] [Created] (GIRAPH-25) NPE in BspServiceMaster when failing a job
NPE in BspServiceMaster when failing a job
------------------------------------------
Key: GIRAPH-25
URL: https://issues.apache.org/jira/browse/GIRAPH-25
Project: Giraph
Issue Type: Bug
Reporter: Dmitriy V. Ryaboy
Priority: Minor
When BspServiceMaster times out waiting for all workers to check in, it dies with a NullPointerException.
This can perhaps be handled a bit more gracefully.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Issue Comment Edited] (GIRAPH-25) NPE in BspServiceMaster when failing a job
Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100667#comment-13100667 ]
Avery Ching edited comment on GIRAPH-25 at 9/8/11 8:31 PM:
-----------------------------------------------------------
Yup, I added you and Jake to the contributors list and assigned to you. I agree with your commit message description to not fill up the svn logs.
was (Author: aching):
Yup, I added you and Jakob to the contributors list and assigned to you. I agree with your commit message description to not fill up the svn logs.
[jira] [Commented] (GIRAPH-25) NPE in BspServiceMaster when failing a job
Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096832#comment-13096832 ]
Avery Ching commented on GIRAPH-25:
-----------------------------------
Definitely should be handled more gracefully. Thanks for filing the issue.
[jira] [Commented] (GIRAPH-25) NPE in BspServiceMaster when failing a job
Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101962#comment-13101962 ]
Avery Ching commented on GIRAPH-25:
-----------------------------------
Thanks for the advice. I'll be doing the same this weekend =).
[jira] [Commented] (GIRAPH-25) NPE in BspServiceMaster when failing a job
Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096816#comment-13096816 ]
Dmitriy V. Ryaboy commented on GIRAPH-25:
-----------------------------------------
Here's the log I saw on a timed out master:
{code}
2011-09-04 05:22:11,115 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 182 responses of 186 needed to start superstep -1. Sleeping for 30000 msecs and used 9 of 10 attempts.
2011-09-04 05:22:11,115 WARN org.apache.giraph.graph.BspServiceMaster: checkWorkers: Did not receive enough processes in time (only 182 of 186 required)
2011-09-04 05:22:11,120 INFO org.apache.giraph.graph.BspServiceMaster: setJobState: {"_stateKey":"FAILED","_applicationAttemptKey":-1,"_superstepKey":-1} on superstep -1
2011-09-04 05:22:11,129 FATAL org.apache.giraph.graph.BspServiceMaster: failJob: Killing job job_201109012213_17306
2011-09-04 05:22:11,159 ERROR org.apache.giraph.graph.MasterThread: masterThread: Master algorithm failed:
java.lang.NullPointerException
at org.apache.giraph.graph.BspServiceMaster.createInputSplits(BspServiceMaster.java:486)
at org.apache.giraph.graph.MasterThread.run(MasterThread.java:94)
2011-09-04 05:22:11,160 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.graph.MasterThread, msg = java.lang.NullPointerException, exiting...
java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.giraph.graph.MasterThread.run(MasterThread.java:177)
Caused by: java.lang.NullPointerException
at org.apache.giraph.graph.BspServiceMaster.createInputSplits(BspServiceMaster.java:486)
at org.apache.giraph.graph.MasterThread.run(MasterThread.java:94)
2011-09-04 05:22:11,161 WARN org.apache.giraph.zk.ZooKeeperManager: onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper process.
{code}
[jira] [Commented] (GIRAPH-25) NPE in BspServiceMaster when failing a job
Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100590#comment-13100590 ]
Dmitriy V. Ryaboy commented on GIRAPH-25:
-----------------------------------------
Thanks Avery!
Mind adding me to the contributors list on the project so I can post-factum "assign" this one to myself?
FYI the way we've done the attribution in Pig (and Hadoop, I think) in the commit message is the more succinct "JIRA-123: description. $patch_author via $committer."
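As a small illustration of that layout, here is a Java sketch that assembles such a message. The CommitMessage class and its format method are hypothetical helpers made up for this example; only the "KEY: description. author via committer." shape comes from the comment above.

```java
/** Hypothetical helper; only the message layout comes from the comment above. */
final class CommitMessage {
    /** Builds "KEY: description. patchAuthor via committer." */
    static String format(String issueKey, String description,
                         String patchAuthor, String committer) {
        return String.format("%s: %s. %s via %s.",
                issueKey, description, patchAuthor, committer);
    }
}
```

For this issue, CommitMessage.format("GIRAPH-25", "NPE in BspServiceMaster when failing a job", "dvryaboy", "aching") would give "GIRAPH-25: NPE in BspServiceMaster when failing a job. dvryaboy via aching."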
[jira] [Assigned] (GIRAPH-25) NPE in BspServiceMaster when failing a job
Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Avery Ching reassigned GIRAPH-25:
---------------------------------
Assignee: Dmitriy V. Ryaboy
[jira] [Resolved] (GIRAPH-25) NPE in BspServiceMaster when failing a job
Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Avery Ching resolved GIRAPH-25.
-------------------------------
Resolution: Fixed
Not sure if I am supposed to close this issue, or the reporter should, but I'll close it since it's been committed. Please reopen if there is an issue.
[jira] [Commented] (GIRAPH-25) NPE in BspServiceMaster when failing a job
Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101711#comment-13101711 ]
Dmitriy V. Ryaboy commented on GIRAPH-25:
-----------------------------------------
I think the committer usually resolves the issue.
Thanks for taking the patch! I'm going to try and break Giraph in a few more ways this weekend :-)
[jira] [Updated] (GIRAPH-25) NPE in BspServiceMaster when failing a job
Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Avery Ching updated GIRAPH-25:
------------------------------
Attachment: GIRAPH-25.2.patch
Minor changes to the original (unittest, error message).
[jira] [Updated] (GIRAPH-25) NPE in BspServiceMaster when failing a job
Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dmitriy V. Ryaboy updated GIRAPH-25:
------------------------------------
Attachment: GIRAPH-25.patch
Attached a basic fix.
The problem was that failing the job did everything correctly but did not stop BspServiceMaster from proceeding.
There are two choices here: declare an exception, throw it in this case, and deal with it upstream; or, C-style, return a -1. I chose the latter because it makes the code that deals with this more succinct and it didn't change a public API. But I can rewrite it if you prefer to throw an exception.
No test as I wasn't sure how best to fit this into the way the tests are set up.
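The patch itself is not quoted in this thread, so the following is only a minimal Java sketch of the "return -1" option described above. MasterSketch, JOB_FAILED, runMaster, and all method bodies are hypothetical stand-ins rather than Giraph's actual code, and the idea that the NPE came from dereferencing a missing worker list is an assumption made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the master service; not the real BspServiceMaster. */
final class MasterSketch {
    /** Sentinel returned when the job has been failed and the caller must stop. */
    static final int JOB_FAILED = -1;

    private final int workersNeeded;
    private final int workersFound;
    private boolean jobFailed;

    MasterSketch(int workersNeeded, int workersFound) {
        this.workersNeeded = workersNeeded;
        this.workersFound = workersFound;
    }

    /** Returns the workers that checked in, or null if too few did (assumed behavior). */
    private List<String> checkWorkers() {
        if (workersFound < workersNeeded) {
            return null;
        }
        List<String> workers = new ArrayList<>();
        for (int i = 0; i < workersFound; i++) {
            workers.add("worker-" + i);
        }
        return workers;
    }

    private void failJob() {
        jobFailed = true;
    }

    boolean isJobFailed() {
        return jobFailed;
    }

    /** Returns the number of splits created, or JOB_FAILED if the job was failed. */
    int createInputSplits() {
        List<String> workers = checkWorkers();
        if (workers == null) {
            failJob();
            // Returning here is the point of the sketch: without it, execution
            // would fall through and workers.size() would throw an NPE.
            return JOB_FAILED;
        }
        return workers.size(); // one split per worker, purely for illustration
    }

    /** Caller side: check the sentinel and stop instead of proceeding. */
    static String runMaster(MasterSketch master) {
        int splits = master.createInputSplits();
        if (splits == JOB_FAILED) {
            return "stopped";
        }
        return "created " + splits + " splits";
    }
}
```

The exception-based alternative would replace the sentinel with a declared exception thrown from createInputSplits and handled upstream, at the cost of changing the method signature.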
[jira] [Commented] (GIRAPH-25) NPE in BspServiceMaster when failing a job
Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100564#comment-13100564 ]
Avery Ching commented on GIRAPH-25:
-----------------------------------
Patch worked nicely. I added a unittest and tweaked an error message. Here's some example output I got (looks much better).
{code}
...
2011-09-08 11:20:35,203 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 0 responses of 32767 needed to start superstep -1. Sleeping for 1 msecs and used 0 of 1 attempts.
2011-09-08 11:20:35,203 ERROR org.apache.giraph.graph.BspServiceMaster: checkWorkers: Did not receive enough processes in time (only 0 of 32767 required). This occurs if you do not have enough map tasks available simultaneously on your Hadoop instance to fulfill the number of requested workers.
2011-09-08 11:20:35,276 INFO org.apache.giraph.graph.BspServiceMaster: setJobState: {"_stateKey":"FAILED","_applicationAttemptKey":-1,"_superstepKey":-1} on superstep -1
2011-09-08 11:20:35,333 FATAL org.apache.giraph.graph.BspServiceMaster: failJob: Killing job job_201109080935_0009
2011-09-08 11:20:35,619 INFO org.apache.giraph.graph.BspServiceMaster: cleanup: Notifying master its okay to cleanup with /_hadoopBsp/job_201109080935_0009/_cleanedUpDir/0_master
2011-09-08 11:20:35,620 INFO org.apache.giraph.graph.BspServiceMaster: cleanUpZooKeeper: Node /_hadoopBsp/job_201109080935_0009/_cleanedUpDir already exists, no need to create.
2011-09-08 11:20:35,621 INFO org.apache.giraph.graph.BspServiceMaster: cleanUpZooKeeper: Got 1 of 32768 desired children from /_hadoopBsp/job_201109080935_0009/_cleanedUpDir
2011-09-08 11:20:35,621 INFO org.apache.giraph.graph.BspServiceMaster: cleanedUpZooKeeper: Waiting for the children of /_hadoopBsp/job_201109080935_0009/_cleanedUpDir to change since only got 1 nodes.
2011-09-08 11:20:38,182 WARN org.apache.giraph.zk.ZooKeeperManager: onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper process.
{code}
I'll upload the minor changes and then commit it on your behalf. I ran unittests in local mode and also on a small Hadoop instance. Thanks!
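The checkWorkers log lines in this thread suggest a bounded poll: count responses, sleep, and give up after a fixed number of attempts. The Java sketch below shows that general pattern only. CheckInPoller, waitForWorkers, and their parameters are hypothetical names, and the loop is an assumption inferred from the log wording, not Giraph's actual implementation.

```java
import java.util.function.IntSupplier;

/** Hypothetical bounded poll, loosely modeled on the checkWorkers log lines. */
final class CheckInPoller {
    /**
     * Polls the response count up to maxAttempts times.
     * Returns true as soon as enough workers have responded, false on timeout.
     */
    static boolean waitForWorkers(IntSupplier responses, int needed,
                                  int maxAttempts, long sleepMsecs) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            int found = responses.getAsInt();
            if (found >= needed) {
                return true;
            }
            System.out.printf(
                    "Only found %d responses of %d needed. "
                    + "Sleeping for %d msecs and used %d of %d attempts.%n",
                    found, needed, sleepMsecs, attempt, maxAttempts);
            try {
                Thread.sleep(sleepMsecs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and treat the wait as failed.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false; // caller should fail the job and stop, not continue
    }
}
```

A caller would treat a false result as the signal to fail the job and return immediately, which is the behavior the patch in this issue was meant to restore.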
[jira] [Commented] (GIRAPH-25) NPE in BspServiceMaster when failing a job
Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13099055#comment-13099055 ]
Avery Ching commented on GIRAPH-25:
-----------------------------------
Thanks for the patch, Dmitriy! I'll review it, add a unittest, and then commit it if it works as expected.
[jira] [Commented] (GIRAPH-25) NPE in BspServiceMaster when failing a job
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100575#comment-13100575 ]
Hudson commented on GIRAPH-25:
------------------------------
Integrated in Giraph-trunk-Commit #2 (See [https://builds.apache.org/job/Giraph-trunk-Commit/2/])
GIRAPH-25 NPE in BspServiceMaster when failing a job (committed by aching on behalf of dvryaboy).
aching : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1166854
Files :
* /incubator/giraph/trunk/CHANGELOG
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/bsp/CentralizedServiceMaster.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/MasterThread.java
* /incubator/giraph/trunk/src/test/java/org/apache/giraph/TestNotEnoughMapTasks.java
[jira] [Commented] (GIRAPH-25) NPE in BspServiceMaster when failing a job
Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100667#comment-13100667 ]
Avery Ching commented on GIRAPH-25:
-----------------------------------
Yup, I added you and Jakob to the contributors list and assigned to you. I agree with your commit message description to not fill up the svn logs.