You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@storm.apache.org by revans2 <gi...@git.apache.org> on 2018/04/26 20:11:17 UTC

[GitHub] storm pull request #2647: STORM-3040: Improve scheduler performance

GitHub user revans2 opened a pull request:

    https://github.com/apache/storm/pull/2647

    STORM-3040: Improve scheduler performance

    There are a lot of different scheduler improvements.  Mostly these are either caching, storing data in multiple ways so we can look it up quickly, and finally lazily sorting nodes in a rack only when it is needed, instead of all ahead of time.
    
    I also added in performance tests.  They currently pass on travis, but I would like to hear from others on if this solution looks good or if there is a better way for us to do performance testing.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/revans2/incubator-storm STORM-3040

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/2647.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2647
    
----
commit 8d3e5cf0e6f7f90d3007159d30c3456bd9749b1f
Author: Robert (Bobby) Evans <ev...@...>
Date:   2018-04-24T20:19:32Z

    STORM-3040: Improve scheduler performance

----


---

[GitHub] storm issue #2647: STORM-3040: Improve scheduler performance

Posted by danny0405 <gi...@git.apache.org>.
Github user danny0405 commented on the issue:

    https://github.com/apache/storm/pull/2647
  
    @revans2
    Thx for your work, the storm-core travis check still fails, we should fix that.


---

[GitHub] storm issue #2647: STORM-3040: Improve scheduler performance

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the issue:

    https://github.com/apache/storm/pull/2647
  
    @danny0405 @kishorvpatil With some recent changes to master my patch started to fail with some checkstyle issues.  I have rebased and fixed all of the issues.  Please take a look again, specifically the second commit and let me know.


---

[GitHub] storm issue #2647: STORM-3040: Improve scheduler performance

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the issue:

    https://github.com/apache/storm/pull/2647
  
    @danny0405 In my tests `TestResourceAwareScheduler.testLargeTopologiesCommon` went from about 7 mins to about 7 seconds. For `TestResourceAwareScheduler.testLargeTopologiesOnLargeClustersGras` I don't have a before value because I killed it after an hour.  The after is about 7 seconds per topology, or about a min and a half.


---

[GitHub] storm issue #2647: STORM-3040: Improve scheduler performance

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the issue:

    https://github.com/apache/storm/pull/2647
  
    @danny0405 the failure is a known race condition around netty and is not related to this change. 


---

[GitHub] storm issue #2647: STORM-3040: Improve scheduler performance

Posted by danny0405 <gi...@git.apache.org>.
Github user danny0405 commented on the issue:

    https://github.com/apache/storm/pull/2647
  
    @revans2 
    I approve with you promotion totally.
    
    The only concern are all kinds of cache we use here, now storm has many caches not just for scheduling. I just think if we can make a disk-storage-backend for all of such caches master needs. Disk cache has better fault-tolerance and is much cheeper than memory, but this is another promotion direction, and has nothing to do with this patch.
    
    BTW: thx for your nice work.



---

[GitHub] storm issue #2647: STORM-3040: Improve scheduler performance

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the issue:

    https://github.com/apache/storm/pull/2647
  
    @danny0405 I added in the comments about thread safety like you suggested.


---

[GitHub] storm pull request #2647: STORM-3040: Improve scheduler performance

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2647#discussion_r186754134
  
    --- Diff: storm-server/src/main/java/org/apache/storm/scheduler/Cluster.java ---
    @@ -48,6 +49,9 @@
     
     public class Cluster implements ISchedulingState {
         private static final Logger LOG = LoggerFactory.getLogger(Cluster.class);
    +    private static final Function<String, Set<WorkerSlot>> MAKE_SET = (x) -> new HashSet<>();
    +    private static final Function<String, Map<WorkerSlot, NormalizedResourceRequest>> MAKE_MAP = (x) -> new HashMap<>();
    --- End diff --
    
    I am happy to add in a comment to Cluster itself about it, as none of Cluster is currently thread safe.
    
    As for parallel scheduling the plan that we had been thinking about was more around scheduling multiple topologies in parallel, rather then trying to make a single scheduler strategy multi-threaded, but both have advantages and disadvantages.


---

[GitHub] storm issue #2647: STORM-3040: Improve scheduler performance

Posted by danny0405 <gi...@git.apache.org>.
Github user danny0405 commented on the issue:

    https://github.com/apache/storm/pull/2647
  
    @revans2 
    In total, this is a good promotion direction.
    
    So this promotion mainly focus on repetitive computation and some eagerly computed data structure, right?
    
    What is the average promotion percentage when applied this patch?


---

[GitHub] storm pull request #2647: STORM-3040: Improve scheduler performance

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2647#discussion_r186756808
  
    --- Diff: storm-server/src/main/java/org/apache/storm/scheduler/Cluster.java ---
    @@ -763,6 +773,7 @@ public void setAssignments(
                 assertValidTopologyForModification(assignment.getTopologyId());
             }
             assignments.clear();
    +        totalResourcesPerNodeCache.clear();
    --- End diff --
    
    I tried that, but it didn't have the performance boost I was hoping for.  The vast majority of the performance problem came from recomputing the value each time we wanted to sort, with for GRAS is once per executor.  So without the cache for a large topology we were recomputing things hundreds of thousands of times.  With the cache it is only how many nodes are in the cluster, which ends up being relatively small.  In reality the noise between runs drowned out any improvement, so I opted to not do the change.


---

[GitHub] storm pull request #2647: STORM-3040: Improve scheduler performance

Posted by danny0405 <gi...@git.apache.org>.
Github user danny0405 commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2647#discussion_r186665919
  
    --- Diff: storm-server/src/main/java/org/apache/storm/scheduler/Cluster.java ---
    @@ -763,6 +773,7 @@ public void setAssignments(
                 assertValidTopologyForModification(assignment.getTopologyId());
             }
             assignments.clear();
    +        totalResourcesPerNodeCache.clear();
    --- End diff --
    
    Actually we can reuse these cache for the next scheduling round, when we bring in central master cache in.


---

[GitHub] storm pull request #2647: STORM-3040: Improve scheduler performance

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/storm/pull/2647


---

[GitHub] storm issue #2647: STORM-3040: Improve scheduler performance

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the issue:

    https://github.com/apache/storm/pull/2647
  
    Oh I forgot I also added back in something I messed up before and added back in anti-affinity to GRAS.


---

[GitHub] storm pull request #2647: STORM-3040: Improve scheduler performance

Posted by danny0405 <gi...@git.apache.org>.
Github user danny0405 commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2647#discussion_r186665266
  
    --- Diff: storm-server/src/main/java/org/apache/storm/scheduler/Cluster.java ---
    @@ -48,6 +49,9 @@
     
     public class Cluster implements ISchedulingState {
         private static final Logger LOG = LoggerFactory.getLogger(Cluster.class);
    +    private static final Function<String, Set<WorkerSlot>> MAKE_SET = (x) -> new HashSet<>();
    +    private static final Function<String, Map<WorkerSlot, NormalizedResourceRequest>> MAKE_MAP = (x) -> new HashMap<>();
    --- End diff --
    
    HashSet is ok for now single daemon scheduling, we may make it thread safe when we want to support parallel scheduling, so can we add a comment about thread safe here?


---