Posted to dev@giraph.apache.org by "Avery Ching (Created) (JIRA)" <ji...@apache.org> on 2011/10/01 20:14:33 UTC

[jira] [Created] (GIRAPH-46) Race condition on superstep 1 with RPC servers not started by the time that requests are sent

Race condition on superstep 1 with RPC servers not started by the time that requests are sent
---------------------------------------------------------------------------------------------

                 Key: GIRAPH-46
                 URL: https://issues.apache.org/jira/browse/GIRAPH-46
             Project: Giraph
          Issue Type: Bug
    Affects Versions: 0.70.0
            Reporter: Avery Ching
            Assignee: Avery Ching
            Priority: Minor
             Fix For: 0.70.0
         Attachments: diff.txt

Hi,

Occasionally (maybe one time in four), my Giraph run fails with the RuntimeException below.
According to the code, this should never happen:

    if (msgMap == null) { // should never happen after constructor
        throw new RuntimeException(
            "sendMessage: msgMap did not exist for " + addr +
            " for vertex " + destVertex);
    }

This happens during superstep 1 (the second superstep). My application actually *adds* edges on superstep 1
(to make every out-edge also an in-edge of the destination), but since I am running on only 3 workers,
I would be surprised if not every worker had been registered in the RPC layer initially.

One hypothesis is that Hadoop does something funny, because one of my servers was under heavy
load. Maybe Hadoop launched another worker to replace a slow worker? Can that happen?

java.lang.RuntimeException: sendMessage: msgMap did not exist for [hostname].ml.cmu.edu:30003 for vertex 875713
        at org.apache.giraph.comm.BasicRPCCommunications.sendMessageReq(BasicRPCCommunications.java:825)
        at org.apache.giraph.graph.BasicVertex.sendMsg(BasicVertex.java:179)
        at edu.cmu.selectlab.BP.BinaryBPVertex.compute(BinaryBPVertex.java:94)
        at org.apache.giraph.graph.GraphMapper.map(GraphMapper.java:624)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
        at org.apache.hadoop.mapred.Child.main(Child.java:253)
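
To illustrate the race named in the issue title, here is a minimal Java sketch. It is not the attached diff.txt and not Giraph's actual code: the class StartupBarrierSketch, its methods, and the StringBuilder buffers are hypothetical stand-ins. It assumes that each worker registers a per-address message buffer once its RPC server is up, and that senders wait on a latch until all workers have registered. A send to an address that has not registered yet fails the same way as in the trace above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: names and structure are assumptions, not Giraph's API.
class StartupBarrierSketch {
    // Stand-in for the per-destination outgoing message maps.
    private final Map<String, StringBuilder> outgoing = new ConcurrentHashMap<>();
    // Reaches zero once every expected worker has registered.
    private final CountDownLatch allRegistered;

    StartupBarrierSketch(int workerCount) {
        this.allRegistered = new CountDownLatch(workerCount);
    }

    // Called once a worker's RPC server is up; repeated calls for the
    // same address do not count down the latch a second time.
    void register(String addr) {
        if (outgoing.putIfAbsent(addr, new StringBuilder()) == null) {
            allRegistered.countDown();
        }
    }

    // Waits until all workers have registered. Returns false on timeout
    // or if the waiting thread is interrupted.
    boolean awaitAllRegistered(long timeout, TimeUnit unit) {
        try {
            return allRegistered.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Mirrors the guard quoted above: an unregistered address has no buffer.
    void sendMessage(String addr, String msg) {
        StringBuilder buf = outgoing.get(addr);
        if (buf == null) {
            throw new RuntimeException(
                "sendMessage: msgMap did not exist for " + addr);
        }
        synchronized (buf) {
            buf.append(msg);
        }
    }
}
```

In this sketch a sender would call awaitAllRegistered(...) before its first sendMessage(...); a false return means some worker did not register within the timeout.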



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (GIRAPH-46) Race condition on superstep 1 with RPC servers not started by the time that requests are sent

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-46?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118871#comment-13118871 ] 

Hudson commented on GIRAPH-46:
------------------------------

Integrated in Giraph-trunk-Commit #12 (See [https://builds.apache.org/job/Giraph-trunk-Commit/12/])
    GIRAPH-46: Race condition on superstep 1 with RPC servers not started
by the time that requests are sent. (aching)

aching : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1178065
Files : 
* /incubator/giraph/trunk/CHANGELOG
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GraphMapper.java

                

[jira] [Updated] (GIRAPH-46) Race condition on superstep 1 with RPC servers not started by the time that requests are sent

Posted by "Avery Ching (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/GIRAPH-46?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching updated GIRAPH-46:
------------------------------

    Attachment: diff.txt

Aapo reported success, and I was able to run the unit tests against LocalJobRunner and my local Hadoop instance.
                

[jira] [Resolved] (GIRAPH-46) Race condition on superstep 1 with RPC servers not started by the time that requests are sent

Posted by "Avery Ching (Resolved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/GIRAPH-46?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching resolved GIRAPH-46.
-------------------------------

    Resolution: Fixed

Committed, thanks Jakob for the review.
                

[jira] [Commented] (GIRAPH-46) Race condition on superstep 1 with RPC servers not started by the time that requests are sent

Posted by "Jakob Homan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-46?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118865#comment-13118865 ] 

Jakob Homan commented on GIRAPH-46:
-----------------------------------

+1
                