Posted to dev@storm.apache.org by "Rick Kellogg (JIRA)" <ji...@apache.org> on 2015/10/09 02:38:28 UTC

[jira] [Updated] (STORM-109) Deploying topology with 540 workers caused nimbus to crash.

     [ https://issues.apache.org/jira/browse/STORM-109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rick Kellogg updated STORM-109:
-------------------------------
    Component/s: storm-core

> Deploying topology with 540 workers caused nimbus to crash.
> -----------------------------------------------------------
>
>                 Key: STORM-109
>                 URL: https://issues.apache.org/jira/browse/STORM-109
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>            Reporter: James Xu
>            Priority: Minor
>
> https://github.com/nathanmarz/storm/issues/604
> When deploying a topology to a Storm cluster and requesting 540 workers, nimbus entered a continuous loop of throwing an exception, dying, and restarting, printing this in the logs:
> 2013-06-21 02:14:23,551 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] - Exception causing close of session 0x13d7f3c28867c9f due to java.io.IOException: Len error 1277489
> When the topology was killed by nuking the local disk state for that topology, nimbus recovered. When the topology was redeployed with fewer workers, nimbus did not fail.
> Probably what's happening is that nimbus is trying to create a znode larger than 1MB, which is the default maximum size for a zk node.
> One solution is to raise this threshold in ZooKeeper to a value larger than 1MB.
> ---------
> d2r: This is exactly what we had to do. We did it by setting the jute.maxbuffer property to some higher power of 2.
> I think it happened to us because we launched more than 12k workers, and all of the assignments were serialized at once and written to a zk node at once, which constantly exceeded the 1MB buffer. I think we ended up using jute.maxbuffer=4097150.
> EDIT: I should also note that workers and supervisors read this zk node after it is written, and the same buffer issue applies. So the childopts for supervisors and workers need to be set in the same manner, in addition to nimbus.
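
The arithmetic behind the comments above can be sketched roughly as follows. This is only an illustration, not code from Storm or ZooKeeper: the helper names fits_in_buffer and childopts_lines are hypothetical, the default of 0xfffff bytes is taken from the ZooKeeper documentation for jute.maxbuffer, and the storm.yaml keys shown are assumed to be the ones meant by "childopts" in the comment. Setting a childopts key to only this flag would likely replace any JVM options already configured there, so in practice the flag would be appended to the existing value.

```python
# Default jute.maxbuffer according to the ZooKeeper docs: 0xfffff bytes (just under 1MB).
DEFAULT_JUTE_MAXBUFFER = 0xFFFFF


def fits_in_buffer(payload_len, max_buffer=DEFAULT_JUTE_MAXBUFFER):
    """Return True if a packet of payload_len bytes is within the buffer limit."""
    return 0 <= payload_len <= max_buffer


def childopts_lines(max_buffer):
    """Build illustrative storm.yaml lines that pass jute.maxbuffer to each daemon.

    Assumption: these lines show only the added flag; real deployments usually
    carry other JVM options (heap size, etc.) in the same values.
    """
    flag = f"-Djute.maxbuffer={max_buffer}"
    keys = ("nimbus.childopts", "supervisor.childopts", "worker.childopts")
    return [f'{key}: "{flag}"' for key in keys]


if __name__ == "__main__":
    observed_len = 1277489   # from the "Len error 1277489" log line
    raised_limit = 4097150   # value mentioned in the comment above

    print(fits_in_buffer(observed_len))                # under the default limit?
    print(fits_in_buffer(observed_len, raised_limit))  # under the raised limit?
    for line in childopts_lines(raised_limit):
        print(line)
```

The ZooKeeper servers themselves would presumably need the same property, since the ZooKeeper documentation says jute.maxbuffer must be set consistently on servers and clients.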
