You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@storm.apache.org by hmcl <gi...@git.apache.org> on 2017/06/24 02:42:07 UTC

[GitHub] storm pull request #2174: STORM-2554: Trident Kafka Spout Refactoring to Inc...

GitHub user hmcl opened a pull request:

    https://github.com/apache/storm/pull/2174

    STORM-2554: Trident Kafka Spout Refactoring to Include Manual Partition Assignment

      - Support manual partition assignment with changes in STORM-2541
      - Improve rebalance pause/resume logic

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hmcl/storm-apache Apache_master_STORM-2554_KTSMPA

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/2174.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2174
    
----
commit 35f2c0c71a28a065e64f523b861acd883b2f2101
Author: Hugo Louro <hm...@gmail.com>
Date:   2017-05-27T17:02:21Z

    STORM-2554: Trident Kafka Spout Refactoring to Include Manual Partition Assignment
      - Support manual partition assignment with changes in STORM-2541
      - Improve rebalance pause/resume logic

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm issue #2174: STORM-2554: Trident Kafka Spout Refactoring to Include Ma...

Posted by kristopherkane <gi...@git.apache.org>.
Github user kristopherkane commented on the issue:

    https://github.com/apache/storm/pull/2174
  
    I had a parallel effort which arrived at most of the same conclusions before discovering this JIRA. 
    
    There is a bug with activate/deactivate where the spout continues to call nextTuple() but never emits new messages.  I'm not sure if it is scoped to the manual partition work or is a general bug.  Would you consider a fix here for that or a new JIRA?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm issue #2174: STORM-2554: Trident Kafka Spout Refactoring to Include Ma...

Posted by srdo <gi...@git.apache.org>.
Github user srdo commented on the issue:

    https://github.com/apache/storm/pull/2174
  
    @hmcl Neither https://issues.apache.org/jira/browse/STORM-2691 nor https://issues.apache.org/jira/browse/STORM-2473 is assigned to you, and https://issues.apache.org/jira/browse/STORM-2554 doesn't list any of the problems pointed out in either of those issues. 2554's description says it will "do some refactoring to internal state partition management to make it cleaner and more properly handle partitions reassignment", and given the contents of this PR I assumed you were intending for this to make roughly these changes. I was reinforced in that belief by your comments here https://github.com/apache/storm/pull/2174#issuecomment-321627375, which indicated that we were discussing either merging this to 1.x and then another solution to master, or providing a different solution for both branches and closing this PR.
    
    I did speak with you, three weeks ago when we discussed how to fix the Trident spout https://github.com/apache/storm/pull/2174#issuecomment-321623377. Your last comment was https://github.com/apache/storm/pull/2174#issuecomment-322223764, which I took to mean that you basically agreed that we should make the changes, but that you were considering whether we should keep this PR for 1.x. I had no way to know that you were working on implementing fixes for the 2691/2473 issues as part of 2554. I feel like I communicated pretty clearly that I was beginning work on an implementation, both by notifying you here that I thought we should consider a broad rewrite that would possibly make this PR redundant, and by assigning https://issues.apache.org/jira/browse/STORM-2691 to myself (this is visible to everyone subscribed to the storm-issues mailing list). I'd be happy to explicitly mention you in an issue comment before I begin work if we get into a situation like this again.
    
    Could you elaborate on the other instances you're talking about? The only one I can recall is https://github.com/apache/storm/pull/2147, and we discussed the alternative fix a bunch on the mailing list before I touched the code, and you gave no indication that you were intending to work on manual partition assignment. 
    
    I'm sorry about the misunderstanding, and I'm happy to close https://github.com/apache/storm/pull/2300 if you have a better solution. I'm not trying to step on your toes or "steal" your issues.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm issue #2174: STORM-2554: Trident Kafka Spout Refactoring to Include Ma...

Posted by srdo <gi...@git.apache.org>.
Github user srdo commented on the issue:

    https://github.com/apache/storm/pull/2174
  
    @hmcl Sorry for not looking at this earlier, I thought it made sense to finish up 2541 first. Could you fix the conflicts? Then I'd be happy to help review :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm issue #2174: STORM-2554: Trident Kafka Spout Refactoring to Include Ma...

Posted by hmcl <gi...@git.apache.org>.
Github user hmcl commented on the issue:

    https://github.com/apache/storm/pull/2174
  
    @srdo I think that you should have spoken with me before implementing this. This task is assigned to me, and I have an implementation that is pending addressing a few changes. There was no timeframe for it, and although I have been busy with other things, I had all the intention to do it. Furthermore, there have also been other instances that I have bounced some ideas in pull request discussions, and you went ahead and implemented them without asking me if I was planing on working on them. I would appreciate that this does not keep happening. Furthermore, I would like to ask what motivated you to provide this implementation without first speaking with me.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm issue #2174: STORM-2554: Trident Kafka Spout Refactoring to Include Ma...

Posted by srdo <gi...@git.apache.org>.
Github user srdo commented on the issue:

    https://github.com/apache/storm/pull/2174
  
    @hmcl I went ahead and implemented the changes I suggested here https://github.com/apache/storm/pull/2300. As you noted, it isn't possible to do without breaking the API. Please take a look when you get the chance.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm issue #2174: STORM-2554: Trident Kafka Spout Refactoring to Include Ma...

Posted by srdo <gi...@git.apache.org>.
Github user srdo commented on the issue:

    https://github.com/apache/storm/pull/2174
  
    I think we should consider if we can improve the way this spout implements the Trident API.
    
    The Trident API is expecting that the Coordinator figures out which partitions exist for a batch. The partitions are passed to the spout executors ("coordinatorMeta"), and it is expected that the Emitter filters that list to get the partitions assigned to itself. Please see https://github.com/apache/storm/blob/64e29f365c9b5d3e15b33f33ab64e200345333e4/storm-client/src/jvm/org/apache/storm/trident/spout/OpaquePartitionedTridentSpoutExecutor.java#L112.
    
    Deciding which partitions exist is the responsibility of the Coordinator, but this implementation puts that responsibility in the Emitter, which causes us to have to hack around with e.g. the enum instance or having to drop an emit because the partition is no longer assigned to this task. It will break if the scheduler happens to put the Coordinator in a different worker from any of the Emitter tasks. The Emitter code ends up being confusing, e.g. refreshPartitions does nothing, but logs based on the coordinatorMeta, which may be different from what the spout is actually assigned. I also think we make it easier to maintain OpaquePartitionedTridentSpoutExecutor if we don't have to keep in mind that the Kafka spout doesn't implement the API in the expected way.
    
    We already have the clean separation we need to implement the API as specified, but they're conflated in the Subscription interface.  The Coordinator should use PartitionFilter and its own KafkaConsumer instance (which we should add) to get the list of batch partitions instead of asking the KafkaTridentSpoutManager, so we get a nice decoupling from the Emitter. We should put the refresh subscription timer in the Coordinator as well. The Emitter should receive these partitions in getPartitionsForTask, use ManualPartitioner to decide which partitions are assigned to the task, and assign them on the consumer. 
    
    We'd need to change either KafkaSpoutConfig or Subscription a bit so we can get at the partitioner and filter classes, but I think it should be doable.
    
    Somewhat related: I noticed that OpaquePartitionedSpoutExecutor was changed in https://github.com/apache/storm/pull/1995 to fix this spout. It might be good to deprecate the getOrderedPartitions/refreshPartitions methods in 1.x so we can remove them from master. Right now the functionality seems duplicated on the Emitter interface, since getPartitionsForTask has the same purpose.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm issue #2174: STORM-2554: Trident Kafka Spout Refactoring to Include Ma...

Posted by srdo <gi...@git.apache.org>.
Github user srdo commented on the issue:

    https://github.com/apache/storm/pull/2174
  
    @hmcl Sure, take your time. 
    
    Assuming we want to make the changes I suggested, then I'd be in favor of just doing those if we can avoid too much API breakage. Otherwise we could make this change on 1.x and then break the API for 2.0?
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm issue #2174: STORM-2554: Trident Kafka Spout Refactoring to Include Ma...

Posted by hmcl <gi...@git.apache.org>.
Github user hmcl commented on the issue:

    https://github.com/apache/storm/pull/2174
  
    @srdo I am evaluating if we can do the change without breaking the API. If so we can go ahead with it. Otherwise, as you suggested, we can go with this change for 1.x-branch and then refactor for Storm 2.0.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm issue #2174: STORM-2554: Trident Kafka Spout Refactoring to Include Ma...

Posted by hmcl <gi...@git.apache.org>.
Github user hmcl commented on the issue:

    https://github.com/apache/storm/pull/2174
  
    @srdo thanks for your comments. Let me think about it. This change builds on the existing code and fixes some of the problems. I was also planning on revisiting this implementation. The way to go about it is either:
    1. Submit this fix and then an improvement with the new design (if we agree it's the right thing to do)
    2. Submit the new design (if we agree it's the right thing to do) without this fix, i.e. on top of the existing code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---