Posted to dev@storm.apache.org by "Rick Kellogg (JIRA)" <ji...@apache.org> on 2015/10/09 02:29:27 UTC
[jira] [Updated] (STORM-145) worker and supervisor heartbeat to nimbus using socket instead of write zookeeper node

     [ https://issues.apache.org/jira/browse/STORM-145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rick Kellogg updated STORM-145:
-------------------------------
    Component/s: storm-core

> worker and supervisor heartbeat to nimbus using socket instead of write zookeeper node
> --------------------------------------------------------------------------------------
>
>                 Key: STORM-145
>                 URL: https://issues.apache.org/jira/browse/STORM-145
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>            Reporter: James Xu
>            Priority: Minor
>
> https://github.com/nathanmarz/storm/issues/732
> When a Storm cluster manages thousands of executors, ZooKeeper comes under great pressure and becomes very slow to respond to reads and writes. Nimbus then judges executors to have timed out and reassigns them, and the whole cluster falls into a dead loop.
> If workers and supervisors sent their heartbeats to nimbus over a socket, this problem would be resolved.
> ----------
> d2r: Yes, ZK has trouble keeping up sometimes. See #620.
> Hopefully #706 would help. We have well over 1k workers on a number of storm clusters with this patch, and we no longer see this mass-reassignment happening.
> ----------
> revans2: We also had to tune the filesystem ZK uses to not be as safe in power failures as it otherwise would be. On ext4 we turned off the barrier by remounting the disk with -o nobarrier. You also want to make sure that your disk's cache is enabled. Unless you have a high-end RAID controller with a battery-backed cache, most admins disable the disk cache on DB boxes to be sure that data is not lost in the case of a power outage. For us the added performance is worth the risk of data loss.
> ----------
> xiaokang: Storm daemons (nimbus, supervisor, worker, etc.) are designed to be stateless, so all state, including heartbeats, is stored in ZK. Workers will continue to work on nimbus failure. If supervisors and workers heartbeat directly to nimbus, it may be hard to keep this nice feature.
> ----------
> viceyang: @xiaokang Socket heartbeats do not break the stateless feature. If a worker's (or supervisor's) heartbeat to nimbus fails, the worker catches the exception and goes on working. When nimbus restarts, heartbeats will succeed again.
> PS: nimbus HA is another feature; I am also working on it now.
> @d2r #706 seems to work well, but using ZK for heartbeats does not seem necessary. Heartbeat and stats information is only a snapshot and does not need to be persisted; socket heartbeats, with stats information held in nimbus memory, meet the need. In addition, our cluster has exceeded 100,000 executors, and a 5-node ZK ensemble can't keep up.
> PS: The key reason ZK load is too high is that executor stats information becomes too large when there are too many executors. Storing heartbeat and stats information in nimbus' memory is a good way to address this.
> ----------
> d2r: "@d2r #706 seems to work well, but using ZK for heartbeats does not seem necessary,"
> @vinceyang Yes, I agree: There is discussion about this already --> #620.
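
As a rough illustration of the socket-heartbeat idea discussed in the comments above (this is not code from Storm, nor from #620 or #706), the sketch below keeps the latest heartbeat per executor in memory on the nimbus side and lets a worker swallow a failed send so it keeps running. Every name here (HeartbeatRegistry, HeartbeatTransport, WorkerHeartbeater) is hypothetical, and the transport is left abstract rather than tied to a real socket.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical nimbus-side store: heartbeats live only in memory, not in ZK.
class HeartbeatRegistry {
    private final long timeoutMs;
    // executor id -> timestamp (ms) of the most recent heartbeat
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    HeartbeatRegistry(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Overwrite the previous snapshot; no history is kept.
    void record(String executorId, long nowMs) {
        lastSeen.put(executorId, nowMs);
    }

    // Executors whose last heartbeat is older than the timeout, sorted by id.
    List<String> timedOut(long nowMs) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMs - e.getValue() > timeoutMs) {
                expired.add(e.getKey());
            }
        }
        Collections.sort(expired);
        return expired;
    }
}

// Hypothetical transport; a real one would write to a socket.
interface HeartbeatTransport {
    void send(String executorId, long nowMs) throws IOException;
}

// Hypothetical worker-side sender.
class WorkerHeartbeater {
    private final HeartbeatTransport transport;

    WorkerHeartbeater(HeartbeatTransport transport) {
        this.transport = transport;
    }

    // Returns true if the heartbeat was delivered. A failed send is caught
    // here, so the worker carries on and simply tries again on the next beat.
    boolean beat(String executorId, long nowMs) {
        try {
            transport.send(executorId, nowMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

Under these assumptions, a restarted nimbus would start with an empty registry and repopulate it as heartbeats arrive, so it would presumably need a grace period before treating missing executors as timed out.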



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)