Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/09/29 05:08:40 UTC

[GitHub] [incubator-mxnet] haojin2 commented on issue #16294: Add CMake flag `CMAKE_BUILD_TYPE=Release`

haojin2 commented on issue #16294: Add CMake flag `CMAKE_BUILD_TYPE=Release`
URL: https://github.com/apache/incubator-mxnet/pull/16294#issuecomment-536249603
 
 
   @marcoabreu IMHO the default settings for building from source should be as close as possible to the ones used for the release versions. `DEBUG` is more of a choice for our developers, isn't it?
   Also, considering that our CI does not even test the `DEBUG` build mode and most other builds do not use `DEBUG` as the default, I would consider this PR a step toward more consistent settings across the different builds.
   I would agree that we need to fix "bugs", so let's take a look at this "bug" now. I've personally encountered and fixed several instances of this "bug" myself. From my previous experience, the error @hgt312 encountered could be caused by using too-large blocks, too many registers, or too much shared memory, according to this [thread](https://devtalk.nvidia.com/default/topic/452688/cuda-programming-and-performance/error-too-many-resources-requested-for-launch-/post/3220516/#3220516) on the CUDA user forum. So now let's examine each of them:
   1. Could we have a different number of GPU thread blocks because we changed the build? NO. The number of blocks is determined by the input size, which did not change between the two test runs with the same test code. Also, we mostly use very small tensors for testing, so this could not be the cause.
   2. Could we use a different number of registers because we changed the build? YES! Without proper optimization, a vanilla compilation of a complicated GPU kernel can lead to excessive register usage (see the sketch after this list).
   3. Could we use too much shared memory because we changed the build? NO, the amount of shared memory needed depends only on the input data's shape and type.
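
   As a concrete (hypothetical) illustration of point 2, here is a minimal CUDA sketch of how one could check whether a kernel's register footprint is what pushes a launch over the limit. `toy_kernel` and the 512-thread block size are made-up stand-ins, not actual MXNet code; `nvcc --ptxas-options=-v` reports the same per-kernel numbers at compile time:
   ```cpp
   #include <cstdio>
   #include <cuda_runtime.h>

   // Hypothetical stand-in for a "complicated" kernel.
   __global__ void toy_kernel(float* out, const float* in, int n) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n) out[i] = in[i] * 2.0f + 1.0f;
   }

   int main() {
     cudaFuncAttributes attr;
     cudaFuncGetAttributes(&attr, toy_kernel);

     cudaDeviceProp prop;
     cudaGetDeviceProperties(&prop, 0);

     const int intended_block_size = 512;  // whatever the launch code would request
     printf("registers/thread                  : %d\n", attr.numRegs);
     printf("static shared mem/block           : %zu bytes\n", attr.sharedSizeBytes);
     printf("max threads/block for this kernel : %d\n", attr.maxThreadsPerBlock);
     printf("device registers/block            : %d\n", prop.regsPerBlock);

     // If an unoptimized build inflates numRegs, attr.maxThreadsPerBlock shrinks;
     // launching with a larger block then fails with
     // "too many resources requested for launch".
     if (intended_block_size > attr.maxThreadsPerBlock) {
       printf("a launch with %d threads/block would fail here\n", intended_block_size);
     }
     return 0;
   }
   ```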
   
   So the most likely cause is the second one: we may have some complicated GPU kernels that require rewrites (perhaps splitting them into two or more smaller kernels with multiple launches, or optimizations to lower their register usage; one generic option is sketched below). However, technically this is not a "bug" so much as us hitting limits imposed by the tools and the hardware we are using. So the choice is: either we make everything work for everyone (especially the few developers who use DEBUG and run on older GPUs) even when compiler optimizations are poorly done or the code is run on ancient hardware, at a cost; or we aim for working, performant code compiled with proper optimizations for nearly all of our users (who may not even be aware of the build settings). Which one would you prefer? Or do you think this needs a community consensus?
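
   To make the "lower register usage" option concrete, one generic CUDA technique is `__launch_bounds__`, sketched below. This is only an illustration under assumed names (`bounded_square`, a 256-thread cap), not a claim about how any particular MXNet kernel should be rewritten:
   ```cpp
   #include <cstdio>
   #include <cuda_runtime.h>

   // Illustrative only: __launch_bounds__(256) promises the compiler this kernel
   // is never launched with more than 256 threads per block, so it can cap the
   // per-thread register allocation (spilling to local memory if necessary)
   // instead of failing at launch time with
   // "too many resources requested for launch".
   __global__ void __launch_bounds__(256)
   bounded_square(float* out, const float* in, int n) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n) out[i] = in[i] * in[i];
   }

   int main() {
     const int n = 1024;
     float *in, *out;
     cudaMallocManaged(&in, n * sizeof(float));
     cudaMallocManaged(&out, n * sizeof(float));
     for (int i = 0; i < n; ++i) in[i] = static_cast<float>(i);

     bounded_square<<<(n + 255) / 256, 256>>>(out, in, n);
     cudaDeviceSynchronize();
     printf("out[10] = %f\n", out[10]);  // expect 100.0

     cudaFree(in);
     cudaFree(out);
     return 0;
   }
   ```
   The tradeoff is the one discussed above: capping registers keeps the launch legal even for poorly optimized builds or older GPUs, but may cost performance through register spills.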
   @hgt312 Would you please share the failures so that we know which operators to take a look at? Then, once the decision on the tradeoff has been made, we can take further action accordingly.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services