Posted to reviews@spark.apache.org by joyyoj <gi...@git.apache.org> on 2014/07/06 16:09:56 UTC

[GitHub] spark pull request: SPARK-2201 Improve FlumeInputDStream's stabili...

GitHub user joyyoj opened a pull request:

    https://github.com/apache/spark/pull/1310

    SPARK-2201 Improve FlumeInputDStream's stability and make it scalable

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/joyyoj/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1310.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1310
    
----
commit f4660c5cb41d9b5ef737b38e7e38abf3b2f2e31c
Author: joyyoj <su...@gmail.com>
Date:   2014-06-03T13:15:11Z

    [SPARK-1998] SparkFlumeEvent with body bigger than 1020 bytes are not read properly

commit fd07a39bcb5e27f3731bfa28b7a1562540a5e6f5
Author: joyyoj <su...@gmail.com>
Date:   2014-07-06T14:01:52Z

    SPARK-2201 Improve FlumeInputDStream's stability and make it scalable

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2201 Improve FlumeInputDStream's stabili...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1310#issuecomment-64910935
  
    Can one of the admins verify this patch?


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-2201 Improve FlumeInputDStream's stabili...

Posted by joyyoj <gi...@git.apache.org>.
Github user joyyoj commented on the pull request:

    https://github.com/apache/spark/pull/1310#issuecomment-50851611
  
    Sorry, I'll send a PR soon.
    The problem with the original implementation is that the config (host:port) is static and allows only one host:port. Whenever the host or port changes, the Flume agent has to be restarted to reload the conf.
    One solution is to put a virtual address instead of a real address in the Flume conf. An address router is then introduced that reports all the real addresses bound to a virtual address and sends a notification whenever a real address is added to or removed from the virtual address.
    I found that the router can be implemented easily with ZooKeeper. In that scenario:
    1. When a Spark receiver starts, it selects a free port and creates a tmp node in ZooKeeper at the path /path/to/logicalhost/host:port.
    If three receivers start, three nodes (host1:port1, host2:port2, host3:port3) are created under /path/to/logicalhost.
    2. On the Flume agent side, the Flume sink gets the child nodes (host1:port1, host2:port2, host3:port3) from /path/to/logicalhost and buffers them in a ClientPool.
    When append is called, it selects a client from the ClientPool in a round-robin manner and calls client.append to send events.
    3. If any receiver crashes or starts, its tmp zk node is removed or added, and the ClientPool removes or adds the corresponding client, since it watches those zk child events.
    In my implementation:
    LogicalHostRouter is the implementation of the address router. Spark and Flume should not need to know that zk exists.
    The ZkProxy is an encapsulation of the zk Curator client.
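
The three steps above can be sketched roughly as follows. This is only an in-memory illustration, not the code from the pull request: `InMemoryRouter` is a hypothetical stand-in for the ZooKeeper-backed address router, the method names `register`, `unregister`, `watch`, and `select` are assumptions, and the `ClientPool` here stores plain address strings rather than real RPC clients.

```python
from typing import Callable, Dict, List, Set

# Callback signature: (event, address), where event is "added" or "removed".
Listener = Callable[[str, str], None]


class InMemoryRouter:
    """Hypothetical stand-in for the address router (no ZooKeeper involved)."""

    def __init__(self) -> None:
        self._addresses: Dict[str, Set[str]] = {}
        self._listeners: Dict[str, List[Listener]] = {}

    def register(self, logical_host: str, address: str) -> None:
        # A receiver announces its real host:port under the logical host.
        members = self._addresses.setdefault(logical_host, set())
        if address not in members:
            members.add(address)
            self._notify(logical_host, "added", address)

    def unregister(self, logical_host: str, address: str) -> None:
        # Models the node disappearing when a receiver crashes or stops.
        members = self._addresses.get(logical_host, set())
        if address in members:
            members.remove(address)
            self._notify(logical_host, "removed", address)

    def watch(self, logical_host: str, listener: Listener) -> List[str]:
        # Subscribe to membership changes and return the current members.
        self._listeners.setdefault(logical_host, []).append(listener)
        return sorted(self._addresses.get(logical_host, set()))

    def _notify(self, logical_host: str, event: str, address: str) -> None:
        for listener in self._listeners.get(logical_host, []):
            listener(event, address)


class ClientPool:
    """Sink-side pool that picks a receiver address in round-robin order."""

    def __init__(self, router: InMemoryRouter, logical_host: str) -> None:
        self._next = 0
        self._clients: List[str] = router.watch(logical_host, self._on_change)

    def _on_change(self, event: str, address: str) -> None:
        if event == "added" and address not in self._clients:
            self._clients.append(address)
        elif event == "removed" and address in self._clients:
            self._clients.remove(address)

    def select(self) -> str:
        if not self._clients:
            raise RuntimeError("no receiver registered for this logical host")
        # The counter is not rebalanced after a removal; it simply keeps cycling
        # over whatever addresses remain.
        address = self._clients[self._next % len(self._clients)]
        self._next += 1
        return address


router = InMemoryRouter()
for addr in ("host1:9001", "host2:9002", "host3:9003"):
    router.register("/path/to/logicalhost", addr)

pool = ClientPool(router, "/path/to/logicalhost")
print([pool.select() for _ in range(4)])

# Simulate one receiver crashing: the pool stops handing out its address.
router.unregister("/path/to/logicalhost", "host2:9002")
print(sorted({pool.select() for _ in range(4)}))
```

The point of the indirection is that the sink only ever names the logical host; which real addresses sit behind it can change while the agent keeps running.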


[GitHub] spark pull request: SPARK-2201 Improve FlumeInputDStream's stabili...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1310#issuecomment-57574730
  
    Hi @joyyoj,
    
    Since this pull request doesn't show any code / changes, do you mind closing it?  Feel free to update / re-open if you have code that you'd like us to review.  Thanks!


[GitHub] spark pull request: SPARK-2201 Improve FlumeInputDStream's stabili...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/1310#issuecomment-64916501
  
    Let's close this issue


[GitHub] spark pull request: SPARK-2201 Improve FlumeInputDStream's stabili...

Posted by joyyoj <gi...@git.apache.org>.
Github user joyyoj commented on the pull request:

    https://github.com/apache/spark/pull/1310#issuecomment-50708050
  
    @tdas, Thanks for noticing the PR. I’m pleased to share my design idea. I'll update it this weekend.


[GitHub] spark pull request: SPARK-2201 Improve FlumeInputDStream's stabili...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/1310#issuecomment-50730746
  
    @harishreedharan Can you take a look? This looks really interesting for Flume.


[GitHub] spark pull request: SPARK-2201 Improve FlumeInputDStream's stabili...

Posted by harishreedharan <gi...@git.apache.org>.
Github user harishreedharan commented on the pull request:

    https://github.com/apache/spark/pull/1310#issuecomment-50787904
  
    Hmm, I don't see any code. Shows +0, -0 lines. Something went wrong in the last merge?


[GitHub] spark pull request: SPARK-2201 Improve FlumeInputDStream's stabili...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/1310#issuecomment-50794944
  
    @joyyoj Something went wrong in your last merge. It's an empty patch now!


[GitHub] spark pull request: SPARK-2201 Improve FlumeInputDStream's stabili...

Posted by harishreedharan <gi...@git.apache.org>.
Github user harishreedharan commented on the pull request:

    https://github.com/apache/spark/pull/1310#issuecomment-54735995
  
    I still don't see any code. Did a merge fail somewhere?


[GitHub] spark pull request: SPARK-2201 Improve FlumeInputDStream's stabili...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1310


[GitHub] spark pull request: SPARK-2201 Improve FlumeInputDStream's stabili...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1310#issuecomment-48112801
  
    Can one of the admins verify this patch?


[GitHub] spark pull request: SPARK-2201 Improve FlumeInputDStream's stabili...

Posted by joyyoj <gi...@git.apache.org>.
Github user joyyoj commented on the pull request:

    https://github.com/apache/spark/pull/1310#issuecomment-50996150
  
    Regarding PR #807: if a Flume agent crashes and is restarted on another host, does Spark have to be restarted to reload the conf?


[GitHub] spark pull request: SPARK-2201 Improve FlumeInputDStream's stabili...

Posted by harishreedharan <gi...@git.apache.org>.
Github user harishreedharan commented on the pull request:

    https://github.com/apache/spark/pull/1310#issuecomment-50996774
  
    @joyyoj I will take a look at it in the next couple of days. As far as #807 is concerned: yes, if the Flume agent's location changes, the config needs to change. In my experience (I work for a company that has a large number of Flume customers), Flume agents are usually deployed on specific nodes, and if they crash they are restarted on the same node, since Flume has no concept of workers (every agent is a worker). So that was not a concern in my design.
    
    The ZK-based config seems interesting. I will take a look at it soon. Thanks!


[GitHub] spark pull request: SPARK-2201 Improve FlumeInputDStream's stabili...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1310#issuecomment-54694627
  
    Can one of the admins verify this patch?


[GitHub] spark pull request: SPARK-2201 Improve FlumeInputDStream's stabili...

Posted by joyyoj <gi...@git.apache.org>.
Github user joyyoj commented on the pull request:

    https://github.com/apache/spark/pull/1310#issuecomment-50995940
  
    @harishreedharan At the time I ran into this problem, PR #807 had not been merged to trunk. I think PR #807 is another solution to the same problem, and quite a good one.
    I think it is still valuable to solve it by introducing a host router layer, since this seems to be a common issue and a host router layer can be reused.
    I have resubmitted this as PR #1755.
    What do you think of it?



[GitHub] spark pull request: SPARK-2201 Improve FlumeInputDStream's stabili...

Posted by harishreedharan <gi...@git.apache.org>.
Github user harishreedharan commented on the pull request:

    https://github.com/apache/spark/pull/1310#issuecomment-50915250
  
    @joyyoj Thanks for the explanation. This makes quite a lot of sense. I recently added a new DStream and an associated Flume sink to fix the issue of receivers being hard-coded in the Flume config. It solves basically the same issue by telling the Spark receiver where the Flume agents are running, so even if the executors die, they can come back and simply poll the same Flume agents for data. In my experience, the hosts on which the agents are running rarely change, so this solution works nicely. See PR #807 and let me know what you think.
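
The pull-based arrangement described in this comment can be illustrated with a small in-process sketch. It is not the code from PR #807: `BufferingSink` and `PollingReceiver` are hypothetical names, the sinks are plain Python objects rather than network endpoints, and acknowledgements or transactions are not modelled. It only shows why a restarted receiver can carry on when it is configured with the agent addresses rather than the other way round.

```python
from collections import deque
from typing import Deque, Dict, List


class BufferingSink:
    """Hypothetical agent-side sink that holds events until a receiver polls."""

    def __init__(self) -> None:
        self._events: Deque[str] = deque()

    def append(self, event: str) -> None:
        self._events.append(event)

    def poll(self, max_events: int) -> List[str]:
        # Hand over at most max_events buffered events, oldest first.
        batch: List[str] = []
        while self._events and len(batch) < max_events:
            batch.append(self._events.popleft())
        return batch


class PollingReceiver:
    """Receiver that is told the agent addresses up front and pulls from them."""

    def __init__(self, agents: Dict[str, BufferingSink]) -> None:
        self._agents = agents

    def poll_once(self, max_events: int = 10) -> List[str]:
        received: List[str] = []
        for address in sorted(self._agents):
            received.extend(self._agents[address].poll(max_events))
        return received


# The keys play the role of the configured agent addresses (made-up values).
agents = {"agent1:4141": BufferingSink(), "agent2:4141": BufferingSink()}
agents["agent1:4141"].append("event-a")

first = PollingReceiver(agents)
print(first.poll_once())

# The first receiver "dies"; events keep accumulating on the agent side.
agents["agent2:4141"].append("event-b")

# A replacement receiver with the same agent list picks up where things stand.
second = PollingReceiver(agents)
print(second.poll_once())
```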


[GitHub] spark pull request: SPARK-2201 Improve FlumeInputDStream's stabili...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/1310#issuecomment-50663623
  
    Hey @joyyoj 
    This is a very interesting patch, and can be very useful! But it's a little hard to understand the architecture from the code. Could you provide us with a simple design doc that explains what the architecture is and how the classes and architectural components are related?
    
    PS: Apologies for not having commented on this earlier. It fell through the cracks, I guess.

