Posted to dev@giraph.apache.org by "locker (Created) (JIRA)" <ji...@apache.org> on 2011/10/14 11:50:11 UTC

[jira] [Created] (GIRAPH-53) Unable to read additional data from server session, likely server has closed socket

Unable to read additional data from server session, likely server has closed socket
-----------------------------------------------------------------------------------

                 Key: GIRAPH-53
                 URL: https://issues.apache.org/jira/browse/GIRAPH-53
             Project: Giraph
          Issue Type: Bug
            Reporter: locker


I ran into an error recently. Everything goes well until it reaches the 103rd superstep.

2011-10-14 16:23:38,904 INFO org.apache.giraph.comm.BasicRPCCommunications: prepareSuperstep
2011-10-14 16:23:39,018 WARN org.apache.giraph.graph.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/job_201110141621_0001/_applicationAttemptsDir/0/_superstepDir/101/_vertexRangeAssignments, type=NodeDeleted, state=SyncConnected)
2011-10-14 16:23:39,057 INFO org.apache.giraph.graph.BspServiceWorker: registerHealth: Created my health node for attempt=0, superstep=103 with /_hadoopBsp/job_201110141621_0001/_applicationAttemptsDir/0/_superstepDir/103/_workerHealthyDir/locker-desktop_1 and hostnamePort = ["locker-desktop",30001]
2011-10-14 16:23:39,057 WARN org.apache.giraph.graph.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/job_201110141621_0001/_applicationAttemptsDir/0/_superstepDir/101/_superstepFinished, type=NodeDeleted, state=SyncConnected)
2011-10-14 16:23:39,529 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x1330186cff30001, likely server has closed socket, closing socket connection and attempting reconnect
2011-10-14 16:23:39,630 ERROR org.apache.zookeeper.ClientCnxn: Error while calling watcher 
java.lang.RuntimeException: process: Disconnected from ZooKeeper, cannot recover.
	at org.apache.giraph.graph.BspService.process(BspService.java:995)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:488)
2011-10-14 16:23:41,098 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server locker-desktop/10.13.30.90:22181
2011-10-14 16:23:41,099 WARN org.apache.zookeeper.ClientCnxn: Session 0x1330186cff30001 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1078)
2011-10-14 16:23:41,212 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2011-10-14 16:23:41,306 INFO org.apache.hadoop.io.nativeio.NativeIO: Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2011-10-14 16:23:41,307 INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName dic for UID 1001 from the native implementation
2011-10-14 16:23:41,318 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /_hadoopBsp/job_201110141621_0001/_applicationAttemptsDir/0/_superstepDir/103/_vertexRangeAssignments
	at org.apache.giraph.graph.BspServiceWorker.startSuperstep(BspServiceWorker.java:836)
	at org.apache.giraph.graph.GraphMapper.map(GraphMapper.java:551)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
	at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /_hadoopBsp/job_201110141621_0001/_applicationAttemptsDir/0/_superstepDir/103/_vertexRangeAssignments
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:809)
	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:837)
	at org.apache.giraph.graph.BspServiceWorker.startSuperstep(BspServiceWorker.java:830)
	... 9 more
I don't know whether this should be called a bug or not. Waiting for some help, thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-53) Unable to read additional data from server session, likely server has closed socket

Posted by "Avery Ching (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-53?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134723#comment-13134723 ] 

Avery Ching commented on GIRAPH-53:
-----------------------------------

No, the map tasks are held for the duration of the application (no matter how many supersteps).  That is a huge benefit when compared to implementing iterative graph applications on a traditional MapReduce framework.
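
To illustrate the point about long-lived tasks, here is a minimal sketch. It is not Giraph code: the class `LongLivedWorkersSketch` and the method `run` are hypothetical, plain threads stand in for map tasks, and a `CyclicBarrier` stands in for the synchronization between supersteps. The only thing it tries to show is that the startup cost is paid once per worker, however many supersteps follow.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration only; threads stand in for map tasks. */
final class LongLivedWorkersSketch {

    /**
     * Starts each worker once and lets it run every superstep.
     * Returns {number of worker launches, number of per-superstep compute calls}.
     */
    static int[] run(int workers, int supersteps) {
        AtomicInteger launches = new AtomicInteger();
        AtomicInteger computeCalls = new AtomicInteger();
        // All workers wait here at the end of every superstep.
        CyclicBarrier barrier = new CyclicBarrier(workers);

        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            threads[w] = new Thread(() -> {
                launches.incrementAndGet();          // startup cost, paid once
                for (int s = 0; s < supersteps; s++) {
                    computeCalls.incrementAndGet();  // the work of one superstep
                    try {
                        barrier.await();             // wait for the other workers
                    } catch (Exception e) {
                        throw new IllegalStateException(e);
                    }
                }
            });
            threads[w].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return new int[] {launches.get(), computeCalls.get()};
    }

    public static void main(String[] args) {
        int[] result = run(3, 103);
        System.out.println("launches=" + result[0] + ", computeCalls=" + result[1]);
    }
}
```

An approach that submitted one MapReduce job per iteration would instead pay the launch cost again for every iteration.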
                

        

[jira] [Commented] (GIRAPH-53) Unable to read additional data from server session, likely server has closed socket

Posted by "Jakob Homan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-53?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13135561#comment-13135561 ] 

Jakob Homan commented on GIRAPH-53:
-----------------------------------

At the very least, we should immediately catch this type of exception and report it better.  I've seen the coordinator go down for lots of reasons, and the workers all blow up in a spectacular way.  It would be better if they caught this event and shut down relatively cleanly.  That is a better user experience and provides a cleaner path to debugging what caused the coordinator to go down.
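
A rough sketch of what "catch this event and shut down relatively cleanly" could look like is below. It is only an illustration under stated assumptions, not a patch against Giraph: `DisconnectHandlingSketch`, `ConnState`, `Action`, and `onStateChange` are hypothetical names, and `ConnState` is a simplified stand-in for the connection states a ZooKeeper client reports to its watcher. Instead of throwing a RuntimeException from the watcher callback, the handler records a readable message and asks the worker to stop.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; none of these types exist in Giraph. */
final class DisconnectHandlingSketch {

    /** Simplified stand-in for the states a ZooKeeper client can report. */
    enum ConnState { SYNC_CONNECTED, DISCONNECTED, EXPIRED }

    /** What the worker should do after seeing a state change. */
    enum Action { CONTINUE, SHUTDOWN_CLEANLY }

    private final List<String> report = new ArrayList<>();

    /** Decides how to react to a connection state change during a superstep. */
    Action onStateChange(ConnState state, long superstep) {
        if (state == ConnState.SYNC_CONNECTED) {
            return Action.CONTINUE;
        }
        // Record a message that points at the likely root cause instead of
        // letting an unhandled exception bury it in a stack trace.
        report.add("Lost connection to the coordination service (state=" + state
                + ") during superstep " + superstep
                + "; check the server side to find out why it went away.");
        return Action.SHUTDOWN_CLEANLY;
    }

    /** Messages collected so far, oldest first. */
    List<String> report() {
        return report;
    }

    public static void main(String[] args) {
        DisconnectHandlingSketch handler = new DisconnectHandlingSketch();
        Action action = handler.onStateChange(ConnState.DISCONNECTED, 103);
        System.out.println(action + ": " + handler.report().get(0));
    }
}
```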
                

        

[jira] [Commented] (GIRAPH-53) Unable to read additional data from server session, likely server has closed socket

Posted by "Avery Ching (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-53?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127702#comment-13127702 ] 

Avery Ching commented on GIRAPH-53:
-----------------------------------

Also, I wonder if it's related to the counter issues, see https://issues.apache.org/jira/browse/GIRAPH-52.  You can try to disable the superstep counters with the job option "-Dgiraph.useSuperstepCounters=false", then see if the problem still occurs.
                

        

[jira] [Commented] (GIRAPH-53) Unable to read additional data from server session, likely server has closed socket

Posted by "Avery Ching (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-53?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127626#comment-13127626 ] 

Avery Ching commented on GIRAPH-53:
-----------------------------------

Thanks for reporting the issue.  A few questions:

1)  Is it always the 103rd superstep?

2)  It looks like the task lost its connection to the ZooKeeper service.  It would probably be good to see what happened to that task as well.  Most likely it crashed for some reason.
                

        

[jira] [Commented] (GIRAPH-53) Unable to read additional data from server session, likely server has closed socket

Posted by "locker (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-53?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134719#comment-13134719 ] 

locker commented on GIRAPH-53:
------------------------------

@Avery Ching Thanks for your advice. It works well now; it looks like the problem was caused by the superstep counters.
By the way, I have another question about the Giraph framework: is every superstep a map-only job? I mean, when a new superstep starts, does it incur the same cost as initiating a MapReduce job? Thanks.
                