Posted to dev@giraph.apache.org by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org> on 2011/09/04 07:28:09 UTC

[jira] [Created] (GIRAPH-25) NPE in BspServiceMaster when failing a job

NPE in BspServiceMaster when failing a job
------------------------------------------

                 Key: GIRAPH-25
                 URL: https://issues.apache.org/jira/browse/GIRAPH-25
             Project: Giraph
          Issue Type: Bug
            Reporter: Dmitriy V. Ryaboy
            Priority: Minor


When BspServiceMaster times out waiting for all workers to check in, it dies with a NullPointerException.
This can perhaps be handled a bit more gracefully.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Issue Comment Edited] (GIRAPH-25) NPE in BspServiceMaster when failing a job

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100667#comment-13100667 ] 

Avery Ching edited comment on GIRAPH-25 at 9/8/11 8:31 PM:
-----------------------------------------------------------

Yup, I added you and Jake to the contributors list and assigned to you.  I agree with your commit message description to not fill up the svn logs.

      was (Author: aching):
    Yup, I added you and Jakob to the contributors list and assigned to you.  I agree with your commit message description to not fill up the svn logs.
  


        

[jira] [Commented] (GIRAPH-25) NPE in BspServiceMaster when failing a job

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096832#comment-13096832 ] 

Avery Ching commented on GIRAPH-25:
-----------------------------------

Definitely should be handled more gracefully.  Thanks for filing the issue.



        

[jira] [Commented] (GIRAPH-25) NPE in BspServiceMaster when failing a job

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101962#comment-13101962 ] 

Avery Ching commented on GIRAPH-25:
-----------------------------------

Thanks for the advice.  I'll be doing the same this weekend =).



        

[jira] [Commented] (GIRAPH-25) NPE in BspServiceMaster when failing a job

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096816#comment-13096816 ] 

Dmitriy V. Ryaboy commented on GIRAPH-25:
-----------------------------------------

Here's the log I saw on a timed out master:

{code}
2011-09-04 05:22:11,115 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 182 responses of 186 needed to start superstep -1.  Sleeping for 30000 msecs and used 9 of 10 attempts.
2011-09-04 05:22:11,115 WARN org.apache.giraph.graph.BspServiceMaster: checkWorkers: Did not receive enough processes in time (only 182 of 186 required)
2011-09-04 05:22:11,120 INFO org.apache.giraph.graph.BspServiceMaster: setJobState: {"_stateKey":"FAILED","_applicationAttemptKey":-1,"_superstepKey":-1} on superstep -1
2011-09-04 05:22:11,129 FATAL org.apache.giraph.graph.BspServiceMaster: failJob: Killing job job_201109012213_17306
2011-09-04 05:22:11,159 ERROR org.apache.giraph.graph.MasterThread: masterThread: Master algorithm failed: 
java.lang.NullPointerException
	at org.apache.giraph.graph.BspServiceMaster.createInputSplits(BspServiceMaster.java:486)
	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:94)
2011-09-04 05:22:11,160 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.graph.MasterThread, msg = java.lang.NullPointerException, exiting...
java.lang.RuntimeException: java.lang.NullPointerException
	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:177)
Caused by: java.lang.NullPointerException
	at org.apache.giraph.graph.BspServiceMaster.createInputSplits(BspServiceMaster.java:486)
	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:94)
2011-09-04 05:22:11,161 WARN org.apache.giraph.zk.ZooKeeperManager: onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper process.
{code}
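
The checkWorkers lines in this log read like a bounded polling loop: count the responses, sleep, and give up after a fixed number of attempts. Below is a minimal Java sketch of that pattern, not the Giraph implementation. WorkerCheckSketch, waitForWorkers, and all parameter names are hypothetical and chosen only for illustration.

```java
import java.util.function.IntSupplier;

/** Hypothetical sketch of a bounded "wait for workers" loop; not Giraph code. */
final class WorkerCheckSketch {

    /**
     * Polls the response count until at least {@code needed} workers have
     * checked in or {@code maxAttempts} polls have been used.
     *
     * @return true if enough workers responded, false on timeout or interrupt
     */
    static boolean waitForWorkers(IntSupplier responses, int needed,
                                  int maxAttempts, long sleepMsecs) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            int found = responses.getAsInt();
            if (found >= needed) {
                return true;
            }
            System.out.printf(
                "Only found %d responses of %d needed. Sleeping for %d msecs "
                    + "and used %d of %d attempts.%n",
                found, needed, sleepMsecs, attempt, maxAttempts);
            try {
                Thread.sleep(sleepMsecs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and treat it as a failed wait.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // 182 of 186 workers never becomes enough, so the wait times out.
        boolean ok = waitForWorkers(() -> 182, 186, 3, 10L);
        System.out.println(ok
            ? "all workers checked in"
            : "timed out; the caller should fail the job and stop");
    }
}
```

The point of returning a boolean is that the caller has to decide what to do on a timeout. In the trace above, the master apparently kept going after the job was failed.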



        

[jira] [Commented] (GIRAPH-25) NPE in BspServiceMaster when failing a job

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100590#comment-13100590 ] 

Dmitriy V. Ryaboy commented on GIRAPH-25:
-----------------------------------------

Thanks Avery!
Mind adding me to the contributors list on the project so I can post-factum "assign" this one to myself?

FYI the way we've done the attribution in Pig (and Hadoop, I think) in the commit message is the more succinct "JIRA-123: description. $patch_author via $committer."
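
As a small illustration of that format, here is a hedged Java sketch that builds such a message. CommitMessageSketch and its format method are hypothetical helpers, not part of any project tooling. The sample values are the issue key, summary, and usernames that appear in this thread.

```java
/** Hypothetical helper illustrating the "JIRA-123: description. author via committer." style. */
final class CommitMessageSketch {

    /** Builds a one-line commit message with attribution at the end. */
    static String format(String issueKey, String description,
                         String patchAuthor, String committer) {
        return String.format("%s: %s. %s via %s.",
            issueKey, description, patchAuthor, committer);
    }

    public static void main(String[] args) {
        System.out.println(format(
            "GIRAPH-25",
            "NPE in BspServiceMaster when failing a job",
            "dvryaboy",
            "aching"));
    }
}
```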



        

[jira] [Assigned] (GIRAPH-25) NPE in BspServiceMaster when failing a job

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching reassigned GIRAPH-25:
---------------------------------

    Assignee: Dmitriy V. Ryaboy



        

[jira] [Resolved] (GIRAPH-25) NPE in BspServiceMaster when failing a job

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching resolved GIRAPH-25.
-------------------------------

    Resolution: Fixed

Not sure if I am supposed to close this issue, or the reporter should, but I'll close it since it's been committed.  Please reopen if there is an issue.



        

[jira] [Commented] (GIRAPH-25) NPE in BspServiceMaster when failing a job

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101711#comment-13101711 ] 

Dmitriy V. Ryaboy commented on GIRAPH-25:
-----------------------------------------

I think the committer usually resolves the issue.

Thanks for taking the patch! I'm going to try and break Giraph in a few more ways this weekend :-)



        

[jira] [Updated] (GIRAPH-25) NPE in BspServiceMaster when failing a job

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching updated GIRAPH-25:
------------------------------

    Attachment: GIRAPH-25.2.patch

Minor changes to the original (unittest, error message).



        

[jira] [Updated] (GIRAPH-25) NPE in BspServiceMaster when failing a job

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated GIRAPH-25:
------------------------------------

    Attachment: GIRAPH-25.patch

Attached a basic fix.

The problem was that failing the job did everything correctly, but did not stop BspServiceMaster from proceeding.

There are two choices here -- declare an exception and throw it in this case, and deal with that upstream; or, C-style, return a -1. I chose the latter because it makes the code that deals with this more succinct and it didn't change a public API. But I can rewrite it if you prefer to throw an exception.
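
To make the trade-off concrete, here is a minimal Java sketch of both options. It is not the attached patch. MasterSketch, JobFailedException, JOB_FAILED, and the method names are hypothetical stand-ins, and the real createInputSplits signature is not reproduced here.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical checked exception for option 2; not part of Giraph. */
class JobFailedException extends Exception {
    JobFailedException(String message) {
        super(message);
    }
}

/** Hypothetical stand-in for the master service; not the real class. */
final class MasterSketch {
    /** Sentinel returned when the job has already been failed. */
    static final int JOB_FAILED = -1;

    private final BooleanSupplier enoughWorkers;
    private boolean jobFailed = false;

    MasterSketch(BooleanSupplier enoughWorkers) {
        this.enoughWorkers = enoughWorkers;
    }

    /** Option 1: C-style sentinel. The method signature stays the same. */
    int createInputSplits(int requestedSplits) {
        if (!enoughWorkers.getAsBoolean()) {
            failJob();
            return JOB_FAILED; // stop here instead of carrying on
        }
        return requestedSplits;
    }

    /** Option 2: a declared exception. Every caller must now handle it. */
    int createInputSplitsOrThrow(int requestedSplits) throws JobFailedException {
        if (!enoughWorkers.getAsBoolean()) {
            failJob();
            throw new JobFailedException("not enough workers checked in");
        }
        return requestedSplits;
    }

    boolean isJobFailed() {
        return jobFailed;
    }

    private void failJob() {
        jobFailed = true;
    }

    public static void main(String[] args) {
        MasterSketch master = new MasterSketch(() -> false);
        int splits = master.createInputSplits(4);
        if (splits == JOB_FAILED) {
            System.out.println("job failed; master thread exits cleanly");
            return;
        }
        System.out.println("created " + splits + " splits");
    }
}
```

With the sentinel, the caller only needs a single comparison. With the exception, the throws clause propagates into the interface the method belongs to, which is the API change mentioned above.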

No test as I wasn't sure how best to fit this into the way the tests are set up.



        

[jira] [Commented] (GIRAPH-25) NPE in BspServiceMaster when failing a job

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100564#comment-13100564 ] 

Avery Ching commented on GIRAPH-25:
-----------------------------------

Patch worked nicely.  I added a unittest and tweaked an error message.  Here's some example output I got (looks much better).

...
2011-09-08 11:20:35,203 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 0 responses of 32767 needed to start superstep -1.  Sleeping for 1 msecs and used 0 of 1 attempts.
2011-09-08 11:20:35,203 ERROR org.apache.giraph.graph.BspServiceMaster: checkWorkers: Did not receive enough processes in time (only 0 of 32767 required).  This occurs if you do not have enough map tasks available simultaneously on your Hadoop instance to fulfill the number of requested workers.
2011-09-08 11:20:35,276 INFO org.apache.giraph.graph.BspServiceMaster: setJobState: {"_stateKey":"FAILED","_applicationAttemptKey":-1,"_superstepKey":-1} on superstep -1
2011-09-08 11:20:35,333 FATAL org.apache.giraph.graph.BspServiceMaster: failJob: Killing job job_201109080935_0009
2011-09-08 11:20:35,619 INFO org.apache.giraph.graph.BspServiceMaster: cleanup: Notifying master its okay to cleanup with /_hadoopBsp/job_201109080935_0009/_cleanedUpDir/0_master
2011-09-08 11:20:35,620 INFO org.apache.giraph.graph.BspServiceMaster: cleanUpZooKeeper: Node /_hadoopBsp/job_201109080935_0009/_cleanedUpDir already exists, no need to create.
2011-09-08 11:20:35,621 INFO org.apache.giraph.graph.BspServiceMaster: cleanUpZooKeeper: Got 1 of 32768 desired children from /_hadoopBsp/job_201109080935_0009/_cleanedUpDir
2011-09-08 11:20:35,621 INFO org.apache.giraph.graph.BspServiceMaster: cleanedUpZooKeeper: Waiting for the children of /_hadoopBsp/job_201109080935_0009/_cleanedUpDir to change since only got 1 nodes.
2011-09-08 11:20:38,182 WARN org.apache.giraph.zk.ZooKeeperManager: onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper process.

I'll upload the minor changes and then commit it on your behalf.  I ran unittests in local mode and also on a small Hadoop instance.  Thanks!




        

[jira] [Commented] (GIRAPH-25) NPE in BspServiceMaster when failing a job

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13099055#comment-13099055 ] 

Avery Ching commented on GIRAPH-25:
-----------------------------------

Thanks for the patch, Dmitriy!  I'll review it, add a unittest, and then commit it if it works as expected.



        

[jira] [Commented] (GIRAPH-25) NPE in BspServiceMaster when failing a job

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100575#comment-13100575 ] 

Hudson commented on GIRAPH-25:
------------------------------

Integrated in Giraph-trunk-Commit #2 (See [https://builds.apache.org/job/Giraph-trunk-Commit/2/])
    GIRAPH-25 NPE in BspServiceMaster when failing a job (committed by aching on behalf of dvryaboy).

aching : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1166854
Files : 
* /incubator/giraph/trunk/CHANGELOG
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/bsp/CentralizedServiceMaster.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/MasterThread.java
* /incubator/giraph/trunk/src/test/java/org/apache/giraph/TestNotEnoughMapTasks.java




        

[jira] [Commented] (GIRAPH-25) NPE in BspServiceMaster when failing a job

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100667#comment-13100667 ] 

Avery Ching commented on GIRAPH-25:
-----------------------------------

Yup, I added you and Jakob to the contributors list and assigned to you.  I agree with your commit message description to not fill up the svn logs.

