You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@storm.apache.org by kevpeek <gi...@git.apache.org> on 2017/10/19 15:58:14 UTC

[GitHub] storm pull request #2379: STORM-2782 - refactor partial key grouping to make...

GitHub user kevpeek opened a pull request:

    https://github.com/apache/storm/pull/2379

    STORM-2782 - refactor partial key grouping to make it more flexible a…

    …nd reusable

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kevpeek/storm STORM-2782

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/2379.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2379
    
----
commit 5d2b30e1656942c3624c63fd5fefbd6b5729dac6
Author: Kevin Peek <kp...@salesforce.com>
Date:   2017-08-31T13:27:49Z

    STORM-2782 - refactor partial key grouping to make it more flexible and reusable

----


---

[GitHub] storm issue #2379: STORM-2782 - refactor partial key grouping to make it mor...

Posted by kevpeek <gi...@git.apache.org>.

Github user kevpeek commented on the issue:

    https://github.com/apache/storm/pull/2379
  
    I went ahead and replaced the original hash function based logic with something faster. This code now runs in about 60% of the time taken by the original.
    
    As for changing the hash logic, I ran tests to measure the resulting tuple distribution across downstream tasks. There is not a huge difference, but the new code produces a slightly more even distribution. This is primarily because it avoids the case where the two hash functions choose the same task, thus eliminating the power of two choices.
    
    @revans2


---

[GitHub] storm pull request #2379: STORM-2782 - refactor partial key grouping to make...

Posted by revans2 <gi...@git.apache.org>.

Github user revans2 commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2379#discussion_r145983226
  
    --- Diff: storm-client/test/jvm/org/apache/storm/grouping/partialKeyGrouping/PartialKeyGroupingTest.java ---
    @@ -53,7 +62,7 @@ public void testChooseTasksFields() {
             PartialKeyGrouping pkg = new PartialKeyGrouping(new Fields("test"));
             WorkerTopologyContext context = mock(WorkerTopologyContext.class);
             when(context.getComponentOutputFields(any(GlobalStreamId.class))).thenReturn(new Fields("test"));
    -        pkg.prepare(context, mock(GlobalStreamId.class), Lists.newArrayList(0, 1, 2, 3, 4, 5));
    +        pkg.prepare(context, mock(GlobalStreamId.class), Lists.newArrayList(9, 8, 7, 1, 2, 3));
    --- End diff --
    
    Sounds reasonable.  Just FYI in reality they are likely to be consecutive increasing numbers because we assign the IDs by component in a consecutive increasing fashion.


---

[GitHub] storm pull request #2379: STORM-2782 - refactor partial key grouping to make...

Posted by revans2 <gi...@git.apache.org>.

Github user revans2 commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2379#discussion_r145976356
  
    --- Diff: storm-client/test/jvm/org/apache/storm/grouping/partialKeyGrouping/PartialKeyGroupingTest.java ---
    @@ -53,7 +62,7 @@ public void testChooseTasksFields() {
             PartialKeyGrouping pkg = new PartialKeyGrouping(new Fields("test"));
             WorkerTopologyContext context = mock(WorkerTopologyContext.class);
             when(context.getComponentOutputFields(any(GlobalStreamId.class))).thenReturn(new Fields("test"));
    -        pkg.prepare(context, mock(GlobalStreamId.class), Lists.newArrayList(0, 1, 2, 3, 4, 5));
    +        pkg.prepare(context, mock(GlobalStreamId.class), Lists.newArrayList(9, 8, 7, 1, 2, 3));
    --- End diff --
    
    Why are we changing this?  The test should still pass just fine with the original values, and if not it is a bug.  So why change them?


---

[GitHub] storm pull request #2379: STORM-2782 - refactor partial key grouping to make...

Posted by kevpeek <gi...@git.apache.org>.

Github user kevpeek closed the pull request at:

    https://github.com/apache/storm/pull/2379


---

[GitHub] storm issue #2379: STORM-2782 - refactor partial key grouping to make it mor...

Posted by kevpeek <gi...@git.apache.org>.

Github user kevpeek commented on the issue:

    https://github.com/apache/storm/pull/2379
  
    I will go collect some performance numbers and get back to you.


---

[GitHub] storm issue #2379: STORM-2782 - refactor partial key grouping to make it mor...

Posted by kevpeek <gi...@git.apache.org>.

Github user kevpeek commented on the issue:

    https://github.com/apache/storm/pull/2379
  
    @revans2 I ran some perf tests, and this refactor did slow things down considerably. I have made a small change that eliminates the main source of slowness, but it's still not great.
    
    The current version of the code runs in about 150% of the time the original code does, so certainly not good. Most of the increase in runtime comes from the more general task selection algorithm. I have been able to speed things up a little bit more, but I suspect the benefits are not really worth it. Perhaps that means this code simply doesn't need to be merged in, or it could be added as a separate grouping, but here are a few observations I made along the way:
    
    * I did not notice the slowdown in our deployment of this code because we know our key set is only a few thousand elements, allowing us to wrap the AssignmentCreator in a caching decorator. This trades a larger memory footprint for better performance. With the caching AssignmentCreator decorator, this grouping runs in 80% of the time the original takes. This is sort of the point of the refactor; customizations like this become possible with just a few lines of code.
    * By far, the slowest part of this grouping is the hashing functions in createAssignment. Perhaps this is one of the optimizations you mention above, but replacing these with something faster would be a bigger win than optimizing this refactor.


---

[GitHub] storm issue #2379: STORM-2782 - refactor partial key grouping to make it mor...

Posted by kevpeek <gi...@git.apache.org>.

Github user kevpeek commented on the issue:

    https://github.com/apache/storm/pull/2379
  
    heh. It crossed my mind to change those to arrays, but I figured the jvm would render the difference negligible. A quick test proves that I was very wrong. I'll tweak that tomorrow and run some more legitimate performance numbers, but it looks like that change alone closes the gap to just a few percent.


---

[GitHub] storm pull request #2379: STORM-2782 - refactor partial key grouping to make...

Posted by kevpeek <gi...@git.apache.org>.

Github user kevpeek commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2379#discussion_r145979224
  
    --- Diff: storm-client/test/jvm/org/apache/storm/grouping/partialKeyGrouping/PartialKeyGroupingTest.java ---
    @@ -53,7 +62,7 @@ public void testChooseTasksFields() {
             PartialKeyGrouping pkg = new PartialKeyGrouping(new Fields("test"));
             WorkerTopologyContext context = mock(WorkerTopologyContext.class);
             when(context.getComponentOutputFields(any(GlobalStreamId.class))).thenReturn(new Fields("test"));
    -        pkg.prepare(context, mock(GlobalStreamId.class), Lists.newArrayList(0, 1, 2, 3, 4, 5));
    +        pkg.prepare(context, mock(GlobalStreamId.class), Lists.newArrayList(9, 8, 7, 1, 2, 3));
    --- End diff --
    
    I changed these because at one point during my refactoring the test passed purely by coincidence with consecutive, increasing numbers. I wanted to ensure that was not the case, although I don't remember the specifics of which code caused that.


---

[GitHub] storm issue #2379: STORM-2782 - refactor partial key grouping to make it mor...

Posted by kevpeek <gi...@git.apache.org>.

Github user kevpeek commented on the issue:

    https://github.com/apache/storm/pull/2379
  
    current performance is between 3% and 12% slower than the old version.
    
    @revans2 Would you elaborate on what you meant by, "pre-allocate the possible return values as lists?"
    
    Thanks.


---

[GitHub] storm pull request #2379: STORM-2782 - refactor partial key grouping to make...

Posted by kevpeek <gi...@git.apache.org>.

Github user kevpeek commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2379#discussion_r145984652
  
    --- Diff: storm-client/test/jvm/org/apache/storm/grouping/partialKeyGrouping/PartialKeyGroupingTest.java ---
    @@ -53,7 +62,7 @@ public void testChooseTasksFields() {
             PartialKeyGrouping pkg = new PartialKeyGrouping(new Fields("test"));
             WorkerTopologyContext context = mock(WorkerTopologyContext.class);
             when(context.getComponentOutputFields(any(GlobalStreamId.class))).thenReturn(new Fields("test"));
    -        pkg.prepare(context, mock(GlobalStreamId.class), Lists.newArrayList(0, 1, 2, 3, 4, 5));
    +        pkg.prepare(context, mock(GlobalStreamId.class), Lists.newArrayList(9, 8, 7, 1, 2, 3));
    --- End diff --
    
    I can add another test that does use consecutive numbers. I simply wanted to ensure the code did not require consecutive numbers.


---

[GitHub] storm issue #2379: STORM-2782 - refactor partial key grouping to make it mor...

Posted by revans2 <gi...@git.apache.org>.

Github user revans2 commented on the issue:

    https://github.com/apache/storm/pull/2379
  
    @kevpeek yes the JVM can be very dumb when it comes to arrays vs Lists.
    
    That sounds great if we can be within a few percentage points I would be happy to merge this in.


---

[GitHub] storm issue #2379: STORM-2782 - refactor partial key grouping to make it mor...

Posted by revans2 <gi...@git.apache.org>.

Github user revans2 commented on the issue:

    https://github.com/apache/storm/pull/2379
  
    @kevpeek I agree that it might not be worth merging it right now.  But there is room for improvement.
    
    The first thing I would do is to pre-allocate the possible return values as lists.  We should avoid as much memory allocation as possible within chooseTasks.
    
    assignmentCreator.createAssignment is also returning a newly created object.  We could at least turn that into an int array, so there is only one object allocated instead of 3.
    
    You might also consider refactoring createAssignment to take a reference to targetSelector so you don't need to create a class that is an intermediary between the two.  Just have targetSelector take two ints and return one of them, no object allocation at all.
    
    If you are still having performance issues after that I can probably help with some profiling.  


---

[GitHub] storm issue #2379: STORM-2782 - refactor partial key grouping to make it mor...

Posted by revans2 <gi...@git.apache.org>.

Github user revans2 commented on the issue:

    https://github.com/apache/storm/pull/2379
  
    I am a conditional +1 on this change assuming that the performance numbers are within a few percent of the original.


---

[GitHub] storm issue #2379: STORM-2782 - refactor partial key grouping to make it mor...

Posted by HeartSaVioR <gi...@git.apache.org>.

Github user HeartSaVioR commented on the issue:

    https://github.com/apache/storm/pull/2379
  
    @kevpeek 
    You can refer another implementation to see how we optimize it.
    
    https://github.com/apache/storm/blob/master/storm-client/src/jvm/org/apache/storm/grouping/LoadAwareShuffleGrouping.java
    
    For starting point you can see how we preallocate possible return values for `chooseTasks` and return it. That's what @revans2 pointed out.


---

[GitHub] storm issue #2379: STORM-2782 - refactor partial key grouping to make it mor...

Posted by kevpeek <gi...@git.apache.org>.

Github user kevpeek commented on the issue:

    https://github.com/apache/storm/pull/2379
  
    new PR with this squashed to one commit: https://github.com/apache/storm/pull/2395


---