You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/01/15 20:54:57 UTC

[GitHub] ptrendx opened a new pull request #13890: Improve bulking in Gluon

ptrendx opened a new pull request #13890: Improve bulking in Gluon
URL: https://github.com/apache/incubator-mxnet/pull/13890
 
 
   ## Description ##
   This PR improves performance of hybridized Gluon models on the GPU by better bulking of ops (running GPU ops without synchronization).
   
   ## Checklist ##
   ### Essentials ###
   Please feel free to remove inapplicable items for your PR.
   - [x] Changes are complete (i.e. I finished coding on this PR)
   - [x] All changes have test coverage
   - [x] Code is well-documented: 
   - For new C++ functions in header files, their functionalities and arguments are documented. 
   - [x] To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
   
   ### Changes ###
   - [x] For models hybridized with `hybridize()` engine currently creates `ImperativeBulk` op that combines multiple operations into 1. However, this `ImperativeBulk` op captures entire ops, including synchronization at the end. Since the engine dependencies are only updated at the very end of the `ImperativeBulk` op, having this synchronization is wasteful. This PR changes the RunContext structure so that `ImperativeBulk` op is able to inform the bulked ops that it will handle the synchronization itself. Imperative ops check that argument to determine whether they need to perform their own synchronization.
   - [x] For models hybridized with `hybridize(static_alloc=True, static_Shape=True)` there is currently another mechanism of bulking, similar to the one used in symbolic API. However, it very aggressively  excludes ops from being eligible for bulking - any op that produces output from forward pass (which means most of the ops) cannot be bulked. This is artificial limitation and this PR lifts it.
   
   ## Comments ##
   - With this PR `hybridize(static_alloc=True)` and `hybridize(static_alloc=True, static_shape=True)` become pretty similar in performance. The main 2 differences seem to be:
      - a difference in the time it takes to start the forward pass (which I did not yet investigate, but non-static shape seems to take more time to start)
      - the fact that `ImperativeBulk` is a temporary `ThreadedEngineOpr` destroyed at the end of its invocation, which seems to be a relatively expensive operation (hybridized model with everything static mostly does not use `ImperativeBulk` so does not pay the cost of destruction of the object).
   - I will post some performance comparisons in a future comment. 
   
   @eric-haibin-lin @piiswrong 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services