You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by joyyoj <gi...@git.apache.org> on 2014/08/01 07:58:36 UTC

[GitHub] spark pull request: SPARK-2201 Improve FlumeInputDStream's stabili...

Github user joyyoj commented on the pull request:

    https://github.com/apache/spark/pull/1310#issuecomment-50851611
  
    Sorry, I'll soon send a PR. 
    The problem of the original implementation is that the config(host:port) is static and allows only one host:port. Once host or port changed, the flume agent should be restarted to reload the conf.
    To solve it, one solution is to set a virtual address instead of a real address in the flume conf. Meanwhile, a address router was introduced that can tell us all the real addresses are bound to a virtual address and notify such events that a real address is added to or removed from the virtual address.
    I found the router can be easily implemented by the zookeeper. In such scenario:
    1. A spark receiver selects a free port and creates a tmp node with the path /path/to/logicalhost/host:port to zookeeper when started. 
    If three receivers started, three nodes (host1:port1, host2:port2, host3:port3) will be created under /path/to/logicalhost;
    2. On the side of flume agent, the flume sink gets the children nodes (host1:port1, host2:port2, host3:port3) from /path/to/logicalhost and buffers them into a ClientPool.
    When append called, it selects a client from ClientPool in a round-robin manner and call client.append to send events.
    3. If any receiver crashed/started, the tmp zk node will be removed/added, and then ClientPool will remove/add the client from the buffer since it watched those zk children events.
    In my implementation:
    LogicalHostRouter is the implementation of the address router. You know, the spark or flume should not know the existence of zk. 
    The ZkProxy is an encapsulation of the zk curator client.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---