Posted to user@hama.apache.org by Sascha Jonas <sa...@student.HTW-Berlin.de> on 2013/06/16 14:22:09 UTC

Zookeeper - Problems?

Hey,

I am using Apache Hama on a small cluster with two computers. It works
fine with a small number of supersteps, but every time I try a large
number of iterations, e.g. 10000, it crashes.

Right now it stopped working after 4600 supersteps. 8 of 16 tasks are
still running while the log shows some errors.

I am using Apache Hama 0.6 and the built-in ZooKeeper. Should I go with a
newer Hama or ZooKeeper version?

13/06/16 00:14:14 ERROR sync.ZKSyncClient: Error creating zk path /bsp/job_201306091733_0009/sync/4276
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /bsp/job_201306091733_0009/sync/4276
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
	at org.apache.hama.bsp.sync.ZKSyncClient.createZnode(ZKSyncClient.java:138)
	at org.apache.hama.bsp.sync.ZKSyncClient.writeNode(ZKSyncClient.java:290)
	at org.apache.hama.bsp.sync.ZooKeeperSyncClientImpl.enterBarrier(ZooKeeperSyncClientImpl.java:99)
	at org.apache.hama.bsp.BSPPeerImpl.enterBarrier(BSPPeerImpl.java:474)
	at org.apache.hama.bsp.BSPPeerImpl.sync(BSPPeerImpl.java:428)
	at de.distMLP.Base_MLP_Trainer.calculateAndWriteCost(Base_MLP_Trainer.java:90)
	at de.distMLP.Train_MultilayerPerceptron$MultilayerPerceptron_Trainer.bsp(Train_MultilayerPerceptron.java:57)
	at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:168)
	at org.apache.hama.bsp.BSPTask.run(BSPTask.java:144)
	at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1262)
13/06/16 00:14:15 ERROR distMLP.Train_MultilayerPerceptron$MultilayerPerceptron_Trainer: org.apache.hama.bsp.sync.SyncException
org.apache.hama.bsp.sync.SyncException
	at org.apache.hama.bsp.sync.ZooKeeperSyncClientImpl.enterBarrier(ZooKeeperSyncClientImpl.java:137)
	at org.apache.hama.bsp.BSPPeerImpl.enterBarrier(BSPPeerImpl.java:474)
	at org.apache.hama.bsp.BSPPeerImpl.sync(BSPPeerImpl.java:428)
	at de.distMLP.Base_MLP_Trainer.calculateAndWriteCost(Base_MLP_Trainer.java:90)
	at de.distMLP.Train_MultilayerPerceptron$MultilayerPerceptron_Trainer.bsp(Train_MultilayerPerceptron.java:57)
	at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:168)
	at org.apache.hama.bsp.BSPTask.run(BSPTask.java:144)
	at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1262)
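
When comparing the logs of several tasks, it can help to pull the job ID and the last segment of the sync path out of each error line. The following is only a sketch: `SyncPathParser` and `parse` are hypothetical names, the pattern is taken from the single path shown in the log above, and the meaning of the trailing number (4276 here) is not confirmed anywhere in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SyncPathParser {
    // Pattern derived from the one path in the log: /bsp/<jobId>/sync/<number>
    private static final Pattern SYNC_PATH =
            Pattern.compile("/bsp/(job_\\d+_\\d+)/sync/(\\d+)");

    /** Returns {jobId, lastSegment} for the first sync path in the line, or null if none. */
    public static String[] parse(String logLine) {
        Matcher m = SYNC_PATH.matcher(logLine);
        return m.find() ? new String[] {m.group(1), m.group(2)} : null;
    }

    public static void main(String[] args) {
        // The error line from the log, joined onto one line.
        String line = "13/06/16 00:14:14 ERROR sync.ZKSyncClient: Error creating zk path "
                + "/bsp/job_201306091733_0009/sync/4276";
        String[] parts = parse(line);
        System.out.println(parts[0] + " " + parts[1]);
    }
}
```

Run over every task log, this shows whether all tasks failed at the same sync path or at different ones.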


Re: Zookeeper - Problems?

Posted by Chia-Hung Lin <cl...@googlemail.com>.
Network load depends on bandwidth, packet size, frequency of
communication, etc., even when the machines are reserved instances. For
example, with only 2 servers on a network, if server A floods its peer
server B with messages (large packets or a high sending frequency),
server B may become unresponsive or unable to respond in time.



On 17 June 2013 18:24, Edward J. Yoon <ed...@apache.org> wrote:
> Please check the ZooKeeper logs to find the cause of the
> ConnectionLossException. There are many possibilities, such as a full GC,
> heavy swap usage, or an expired session.
>
> I guess the answer lies in the sentence "stopped working after
> 4600 supersteps".

Re: Zookeeper - Problems?

Posted by "Edward J. Yoon" <ed...@apache.org>.
Please check the ZooKeeper logs to find the cause of the
ConnectionLossException. There are many possibilities, such as a full GC,
heavy swap usage, or an expired session.

I guess the answer lies in the sentence "stopped working after
4600 supersteps".
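
One general way to tolerate a short connection loss is to retry the failed operation a few times with an increasing delay. The sketch below is not Hama's or ZooKeeper's code: `RetryOnConnectionLoss`, `TransientException` and `callWithRetry` are hypothetical names, and `TransientException` merely stands in for an error like `ConnectionLossException`. With a real ZooKeeper client, a create that fails with a connection loss may still have been applied on the server, so a retry would also have to cope with a node that already exists.

```java
import java.util.concurrent.Callable;

public class RetryOnConnectionLoss {

    /** Hypothetical stand-in for a transient error such as a lost connection. */
    public static class TransientException extends Exception {
        public TransientException(String message) {
            super(message);
        }
    }

    /**
     * Runs op, retrying when it fails with TransientException. The delay
     * doubles after each failed attempt; the last failure is rethrown.
     */
    public static <T> T callWithRetry(Callable<T> op, int maxAttempts, long initialDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (TransientException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up after the configured number of attempts
                }
                Thread.sleep(delay);
                delay *= 2; // exponential backoff
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated operation: fails twice, then succeeds.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new TransientException("connection lost");
            }
            return "created";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints "created after 3 attempts"
    }
}
```

Retrying only hides brief interruptions. If the ZooKeeper logs point to long GC pauses, swapping, or expired sessions, that cause still has to be fixed.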

On Mon, Jun 17, 2013 at 6:11 PM, Sascha Jonas
<sa...@student.htw-berlin.de> wrote:
> The servers are reserved for Apache Hama, so there is no other network
> traffic. I tested it on three other PCs at another location but with the
> same configuration and got the same errors :(



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: Zookeeper - Problems?

Posted by Sascha Jonas <sa...@student.HTW-Berlin.de>.
The servers are reserved for Apache Hama, so there is no other network
traffic. I tested it on three other PCs at another location but with the
same configuration and got the same errors :(

On Sun, 16.06.2013, 16:44, Chia-Hung Lin wrote:
> Have you checked whether the underlying network is busy when the error
> happens?
>
> I can't be sure, but the symptom suggests that heavy network traffic
> is causing the ZooKeeper connection loss.



Re: Zookeeper - Problems?

Posted by Chia-Hung Lin <cl...@googlemail.com>.
Have you checked whether the underlying network is busy when the error
happens?

I can't be sure, but the symptom suggests that heavy network traffic
is causing the ZooKeeper connection loss.


