Posted to dev@giraph.apache.org by Arghya Kusum Das <ar...@gmail.com> on 2014/11/01 19:27:46 UTC

Giraph job fails on large data and large number of nodes

Hi,

My Giraph program runs correctly on small data with a small number of nodes
(e.g., 10GB of data on 32 nodes).
I tried to run it on 128 nodes, each with 32GB RAM, 16 cores and a 240GB HDD.
The graph is 91GB, and the job failed with the following exceptions in the
log. Can anyone help me resolve it?

2014-11-01 12:54:43,364 INFO org.apache.giraph.comm.netty.NettyServer: start: Using Netty without authentication.
2014-11-01 12:54:43,386 INFO org.apache.giraph.comm.netty.NettyServer: start: Using Netty without authentication.
2014-11-01 12:54:43,414 INFO org.apache.giraph.comm.netty.NettyServer: start: Using Netty without authentication.
2014-11-01 12:54:43,417 INFO org.apache.giraph.comm.netty.NettyServer: start: Using Netty without authentication.
2014-11-01 12:54:44,363 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server mike070/204.90.45.70:22181
2014-11-01 12:54:44,364 WARN org.apache.zookeeper.ClientCnxn: Session 0x1496c7e01e30002 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
2014-11-01 12:54:44,464 WARN org.apache.giraph.zk.ZooKeeperExt: createExt: Connection loss on attempt 0, waiting 5000 msecs before retrying.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /_hadoopBsp/job_201411011248_0003/_masterJobState
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637)
        at org.apache.giraph.zk.ZooKeeperExt.createExt(ZooKeeperExt.java:152)
        at org.apache.giraph.bsp.BspService.getJobState(BspService.java:670)
        at org.apache.giraph.master.BspServiceMaster.becomeMaster(BspServiceMaster.java:843)
        at org.apache.giraph.master.MasterThread.run(MasterThread.java:98)
2014-11-01 12:54:44,638 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server mike070/204.90.45.70:22181
2014-11-01 12:54:44,639 WARN org.apache.zookeeper.ClientCnxn: Session 0x1496c7e01e30010 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
2014-11-01 12:54:46,159 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server mike070/204.90.45.70:22181
2014-11-01 12:54:46,159 WARN org.apache.zookeeper.ClientCnxn: Session 0x1496c7e01e30002 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
2014-11-01 12:54:46,481 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server mike070/204.90.45.70:22181
2014-11-01 12:54:46,481 WARN org.apache.zookeeper.ClientCnxn: Session 0x1496c7e01e30010 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
2014-11-01 12:54:47,611 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server mike070/204.90.45.70:22181
2014-11-01 12:54:47,611 WARN org.apache.zookeeper.ClientCnxn: Session 0x1496c7e01e30010 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
2014-11-01 12:54:48,234 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server mike070/204.90.45.70:22181
2014-11-01 12:54:48,234 WARN org.apache.zookeeper.ClientCnxn: Session 0x1496c7e01e30002 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
2014-11-01 12:54:49,469 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server mike070/204.90.45.70:22181
2014-11-01 12:54:49,469 WARN org.apache.zookeeper.ClientCnxn: Session 0x1496c7e01e30010 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
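
In case it helps narrow things down: a java.net.ConnectException "Connection
refused" from ClientCnxn usually means nothing was accepting TCP connections
on the ZooKeeper address at that moment (here mike070:22181), or a firewall
rejected them. One way to check from a worker node is a plain TCP probe like
the sketch below. The class name ZkPortProbe, its isReachable method and the
default host/port are illustrative assumptions (the defaults are copied from
the log above); it only checks that the port accepts TCP connections and does
not verify that a healthy ZooKeeper server is behind it.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/**
 * Hypothetical helper, not part of Giraph or ZooKeeper: checks whether
 * anything accepts a TCP connection on host:port within a timeout.
 * It does not speak the ZooKeeper protocol.
 */
public class ZkPortProbe {

    /** Returns true if a TCP connect to host:port succeeds within timeoutMs. */
    public static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // ConnectException ("Connection refused"), SocketTimeoutException
            // and UnknownHostException are all subclasses of IOException.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults taken from the log above; pass other values as arguments.
        String host = args.length > 0 ? args[0] : "mike070";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 22181;
        boolean ok = isReachable(host, port, 2000);
        System.out.println(host + ":" + port
                + (ok ? " is accepting connections" : " is NOT accepting connections"));
    }
}
```

Usage: compile with "javac ZkPortProbe.java", then run
"java ZkPortProbe mike070 22181" on a few worker nodes while the job is
starting up.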
-- 
Thanks and regards,
Arghya Kusum Das
(225-362-4031)