Posted to dev@giraph.apache.org by Avery Ching <av...@gmail.com> on 2012/10/05 02:45:24 UTC

Review Request: GIRAPH-356: Help debug ZooKeeper issues

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/7453/
-----------------------------------------------------------

Review request for giraph.


Description
-------

Here is an example of a master task failure when an invalid JVM argument is passed to ZooKeeper. The error is much more obvious now.

2012-10-04 15:05:28,916 WARN org.apache.giraph.zk.ZooKeeperManager: logZooKeeperOutput: Dumping up to last 100 lines of the ZooKeeper process STDOUT and STDERR.
2012-10-04 15:05:28,916 WARN org.apache.giraph.zk.ZooKeeperManager$StreamCollector: Unrecognized option: -BadOpt
2012-10-04 15:05:28,916 WARN org.apache.giraph.zk.ZooKeeperManager$StreamCollector: Could not create the Java virtual machine.
2012-10-04 15:05:28,919 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2012-10-04 15:05:28,959 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.IllegalStateException: run: Caught an unrecoverable exception onlineZooKeeperServers: Failed to connect in 5 tries!
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:591)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: java.lang.IllegalStateException: onlineZooKeeperServers: Failed to connect in 5 tries!
at org.apache.giraph.zk.ZooKeeperManager.onlineZooKeeperServers(ZooKeeperManager.java:721)
at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:328)
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:573)
... 7 more
2012-10-04 15:05:28,963 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
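
The "last 100 lines" dump shown above can be approximated with a bounded buffer that discards the oldest line once it is full. The following is only a minimal sketch under that assumption; the class name TailCollector and its methods are hypothetical and are not taken from the patch or from Giraph's actual ZooKeeperManager$StreamCollector.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper: retains only the most recent maxLines lines of a stream. */
class TailCollector {
  private final int maxLines;
  private final Deque<String> lines = new ArrayDeque<>();

  TailCollector(int maxLines) {
    if (maxLines <= 0) {
      throw new IllegalArgumentException("maxLines must be positive");
    }
    this.maxLines = maxLines;
  }

  /** Reads the stream until EOF, keeping only the last maxLines lines. */
  void drain(InputStream in) throws IOException {
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    String line;
    while ((line = reader.readLine()) != null) {
      add(line);
    }
  }

  /** Appends one line, evicting the oldest line when the buffer is full. */
  synchronized void add(String line) {
    if (lines.size() == maxLines) {
      lines.removeFirst();
    }
    lines.addLast(line);
  }

  /** Returns a copy of the retained lines, oldest first. */
  synchronized List<String> snapshot() {
    return new ArrayList<>(lines);
  }

  public static void main(String[] args) throws IOException {
    // Stand-in for the merged stdout/stderr of a child process.
    String output = "Unrecognized option: -BadOpt\n"
        + "Could not create the Java virtual machine.\n";
    TailCollector collector = new TailCollector(100);
    collector.drain(
        new ByteArrayInputStream(output.getBytes(StandardCharsets.UTF_8)));
    for (String line : collector.snapshot()) {
      System.out.println("WARN " + line);
    }
  }
}
```

In a real job, drain would typically run on its own thread against Process.getInputStream() of a process started with ProcessBuilder.redirectErrorStream(true), so that the child never blocks on a full pipe and the snapshot can be logged when startup fails.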


This addresses bug GIRAPH-356.
    https://issues.apache.org/jira/browse/GIRAPH-356


Diffs
-----

  http://svn.apache.org/repos/asf/giraph/trunk/pom.xml 1393852 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/GiraphConfiguration.java 1393852 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java 1393852 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/GraphMapper.java 1393852 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/zk/ZooKeeperManager.java 1393852 

Diff: https://reviews.apache.org/r/7453/diff/


Testing
-------

Ran unit tests and tested several failures on real jobs.


Thanks,

Avery Ching


Re: Review Request: GIRAPH-356: Help debug ZooKeeper issues

Posted by Avery Ching <av...@gmail.com>.

(Updated Oct. 5, 2012, 8:44 p.m.)


Review request for giraph.


Changes
-------

-Made the ZooKeeper connection attempts, min/max session timeout, force sync (off for performance), and skip ACLs (no for performance) configurable
-Do not kill the job on a disconnect event, since the client can still reconnect; only a session expiration is fatal
-Dump the failed workers on the master when a superstep cannot start because worker health is missing in ZooKeeper
-Dump the last 100 lines of the ZooKeeper process stdout/stderr when there is a failure that could be related to ZooKeeper
-Small change for a more descriptive message when the last good checkpoint cannot be found
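
The disconnect-versus-expired distinction in the list above could be sketched roughly as follows. This is an illustration only, not code from the patch: SessionState and SessionPolicy are hypothetical names, and the enum merely mirrors the SyncConnected, Disconnected, and Expired values of ZooKeeper's Watcher.Event.KeeperState so that the example compiles without the ZooKeeper jar.

```java
/** Hypothetical stand-in for the connection states a ZooKeeper client reports. */
enum SessionState {
  SYNC_CONNECTED,
  DISCONNECTED,
  EXPIRED
}

/** Hypothetical policy: decides whether a connection event should fail the job. */
class SessionPolicy {
  /**
   * Returns true only for events the client cannot recover from.
   * A disconnect is tolerated because the client library keeps trying to
   * reconnect within the session timeout; an expired session is gone for good.
   */
  static boolean isFatal(SessionState state) {
    switch (state) {
      case EXPIRED:
        return true;
      case DISCONNECTED:
      case SYNC_CONNECTED:
      default:
        return false;
    }
  }

  public static void main(String[] args) {
    for (SessionState state : SessionState.values()) {
      System.out.println(state + " fatal=" + isFatal(state));
    }
  }
}
```

A watcher built on this idea would log and continue on a disconnect, and throw only when the session has expired.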

Updated to GIRAPH-356.2.patch


Description
-------

Here is an example of a master task failure when an invalid JVM argument is passed to ZooKeeper. The error is much more obvious now.

2012-10-04 15:05:28,916 WARN org.apache.giraph.zk.ZooKeeperManager: logZooKeeperOutput: Dumping up to last 100 lines of the ZooKeeper process STDOUT and STDERR.
2012-10-04 15:05:28,916 WARN org.apache.giraph.zk.ZooKeeperManager$StreamCollector: Unrecognized option: -BadOpt
2012-10-04 15:05:28,916 WARN org.apache.giraph.zk.ZooKeeperManager$StreamCollector: Could not create the Java virtual machine.
2012-10-04 15:05:28,919 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2012-10-04 15:05:28,959 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.IllegalStateException: run: Caught an unrecoverable exception onlineZooKeeperServers: Failed to connect in 5 tries!
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:591)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: java.lang.IllegalStateException: onlineZooKeeperServers: Failed to connect in 5 tries!
at org.apache.giraph.zk.ZooKeeperManager.onlineZooKeeperServers(ZooKeeperManager.java:721)
at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:328)
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:573)
... 7 more
2012-10-04 15:05:28,963 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task


This addresses bug GIRAPH-356.
    https://issues.apache.org/jira/browse/GIRAPH-356


Diffs (updated)
-----

  http://svn.apache.org/repos/asf/giraph/trunk/pom.xml 1393852 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/GiraphConfiguration.java 1393852 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspService.java 1393852 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java 1393852 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/GraphMapper.java 1393852 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/zk/ZooKeeperManager.java 1393852 

Diff: https://reviews.apache.org/r/7453/diff/


Testing
-------

Ran unit tests and tested several failures on real jobs.


Thanks,

Avery Ching