You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/08/23 19:47:00 UTC

[GitHub] [incubator-mxnet] szha opened a new pull request #18987: [ENV] update runtime setting default values

szha opened a new pull request #18987:
URL: https://github.com/apache/incubator-mxnet/pull/18987

## Description ##
update runtime setting default values for resource copies, mem pool type

## Checklist ##
### Essentials ###
Please feel free to remove inapplicable items for your PR.
- [ ] Changes are complete (i.e. I finished coding on this PR)
- [ ] All changes have test coverage:
- Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
- Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
- Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
- [ ] Code is well-documented:
- For user-facing API changes, API doc string has been updated.
- For new C++ functions in header files, their functionalities and arguments are documented.
- For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
- Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
- [ ] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

### Changes ###
- [ ] Feature1, tests, (and when applicable, API doc)
- [ ] Feature2, tests, (and when applicable, API doc)

## Comments ##
- If this change is a backward incompatible change, why must this change be made.
- Interesting edge cases to note here

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] szha commented on pull request #18987: [ENV] update runtime setting default values

Posted by GitBox <gi...@apache.org>.

szha commented on pull request #18987:
URL: https://github.com/apache/incubator-mxnet/pull/18987#issuecomment-681523835


   thanks for the reviews. @zhreshold how about setting the pool strategy to be naive when shapes are static?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] szha commented on pull request #18987: [ENV] update runtime setting default values

Posted by GitBox <gi...@apache.org>.

szha commented on pull request #18987:
URL: https://github.com/apache/incubator-mxnet/pull/18987#issuecomment-678932225


   @sxjscience @zhreshold it would be great to have some data on how such defaults work for the cv and nlp models.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] szha commented on a change in pull request #18987: [ENV] update runtime setting default values

Posted by GitBox <gi...@apache.org>.

szha commented on a change in pull request #18987:
URL: https://github.com/apache/incubator-mxnet/pull/18987#discussion_r476040201



##########
File path: src/storage/storage.cc
##########
@@ -67,8 +67,9 @@ StorageManager *CreateStorageManager(const Context &ctx, const char *context,
                                      int num_gpu_device, std::string *pStrategy) {
   const auto env_var = env_var_name(context, pool_type);
   const char *type = getenv(env_var.c_str());
-  if (type == nullptr)
-    type = "Naive";   // default pool
+  if (type == nullptr) {
+    type = "Round";   // default pool

Review comment:
       My hope is of course to provide a good out-of-the-box usage experience to mxnet users. From what I observed, there seems to be more models with dynamic shape inputs than static ones, and many of the static-shape models can still run in this setting, hence the proposal.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] szha commented on pull request #18987: [ENV] update runtime setting default values

Posted by GitBox <gi...@apache.org>.

szha commented on pull request #18987:
URL: https://github.com/apache/incubator-mxnet/pull/18987#issuecomment-679455893


   > I think CNNs are generally static shape while models in NLP are generally dynamic shape.
   
   I don't think we can generalize like this. For example, object detection and segmentation are based on CNN and are usually not static-shaped.
   
   > Do we have any plan for improving the memory usage?
   
   Of course we do. I think @ArmageddonKnight is currently fixing some missed allocation entries in memory profiler, and plans on developing a memory usage visualization tool later this week to help narrow down the focus for memory optimization. We also intend to add mirror option to cached op to allow training for larger model.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] yzhliu commented on a change in pull request #18987: [ENV] update runtime setting default values

Posted by GitBox <gi...@apache.org>.

yzhliu commented on a change in pull request #18987:
URL: https://github.com/apache/incubator-mxnet/pull/18987#discussion_r475940272



##########
File path: src/resource.cc
##########
@@ -96,9 +96,9 @@ class ResourceManagerImpl : public ResourceManager {
     cpu_temp_space_copy_ = dmlc::GetEnv("MXNET_CPU_TEMP_COPY", 4);
     gpu_temp_space_copy_ = dmlc::GetEnv("MXNET_GPU_TEMP_COPY", 1);
     cpu_native_rand_copy_ = dmlc::GetEnv("MXNET_CPU_PARALLEL_RAND_COPY", 1);
-    gpu_native_rand_copy_ = dmlc::GetEnv("MXNET_GPU_PARALLEL_RAND_COPY", 4);
+    gpu_native_rand_copy_ = dmlc::GetEnv("MXNET_GPU_PARALLEL_RAND_COPY", 1);
 #if MXNET_USE_CUDNN == 1
-    gpu_cudnn_dropout_state_copy_ = dmlc::GetEnv("MXNET_GPU_CUDNN_DROPOUT_STATE_COPY", 4);
+    gpu_cudnn_dropout_state_copy_ = dmlc::GetEnv("MXNET_GPU_CUDNN_DROPOUT_STATE_COPY", 1);

Review comment:
       should be fine as long as there's no much competition for gpu rnd seed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] szha merged pull request #18987: [ENV] update runtime setting default values

Posted by GitBox <gi...@apache.org>.

szha merged pull request #18987:
URL: https://github.com/apache/incubator-mxnet/pull/18987


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #18987: [ENV] update runtime setting default values

Posted by GitBox <gi...@apache.org>.

mxnet-bot commented on pull request #18987:
URL: https://github.com/apache/incubator-mxnet/pull/18987#issuecomment-678816378


   Hey @szha , Thanks for submitting the PR 
   All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands: 
   - To trigger all jobs: @mxnet-bot run ci [all] 
   - To trigger specific jobs: @mxnet-bot run ci [job1, job2] 
   *** 
   **CI supported jobs**: [centos-cpu, windows-gpu, centos-gpu, unix-cpu, sanity, windows-cpu, miscellaneous, clang, website, edge, unix-gpu]
   *** 
   _Note_: 
    Only following 3 categories can trigger CI :PR Author, MXNet Committer, Jenkins Admin. 
   All CI tests must pass before the PR can be merged. 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] sxjscience commented on pull request #18987: [ENV] update runtime setting default values

Posted by GitBox <gi...@apache.org>.

sxjscience commented on pull request #18987:
URL: https://github.com/apache/incubator-mxnet/pull/18987#issuecomment-679454243


   I think CNNs are generally static shape while models in NLP are generally dynamic shape. Do we have any plan for improving the memory usage?
   
   Get Outlook for iOS<https://aka.ms/o0ukef>
   ________________________________
   From: Sheng Zha <no...@github.com>
   Sent: Monday, August 24, 2020 6:30:41 PM
   To: apache/incubator-mxnet <in...@noreply.github.com>
   Cc: Xingjian SHI <xs...@connect.ust.hk>; Mention <me...@noreply.github.com>
   Subject: Re: [apache/incubator-mxnet] [ENV] update runtime setting default values (#18987)
   
   
   @szha commented on this pull request.
   
   ________________________________
   
   In src/storage/storage.cc<https://github.com/apache/incubator-mxnet/pull/18987#discussion_r476040201>:
   
   > @@ -67,8 +67,9 @@ StorageManager *CreateStorageManager(const Context &ctx, const char *context,
                                         int num_gpu_device, std::string *pStrategy) {
      const auto env_var = env_var_name(context, pool_type);
      const char *type = getenv(env_var.c_str());
   -  if (type == nullptr)
   -    type = "Naive";   // default pool
   +  if (type == nullptr) {
   +    type = "Round";   // default pool
   
   
   My hope is of course to provide a good out-of-the-box usage experience to mxnet users. From what I observed, there seems to be more models with dynamic shape inputs than static ones, and many of the static-shape models can still run in this setting, hence the proposal.
   
   —
   You are receiving this because you were mentioned.
   Reply to this email directly, view it on GitHub<https://github.com/apache/incubator-mxnet/pull/18987#discussion_r476040201>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ABHQH3TMASLXDEBXVJXNACLSCMH4DANCNFSM4QI2O3KQ>.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] zhreshold commented on a change in pull request #18987: [ENV] update runtime setting default values

Posted by GitBox <gi...@apache.org>.

zhreshold commented on a change in pull request #18987:
URL: https://github.com/apache/incubator-mxnet/pull/18987#discussion_r475953426



##########
File path: src/storage/storage.cc
##########
@@ -67,8 +67,9 @@ StorageManager *CreateStorageManager(const Context &ctx, const char *context,
                                      int num_gpu_device, std::string *pStrategy) {
   const auto env_var = env_var_name(context, pool_type);
   const char *type = getenv(env_var.c_str());
-  if (type == nullptr)
-    type = "Naive";   // default pool
+  if (type == nullptr) {
+    type = "Round";   // default pool

Review comment:
       it's obvious that round can help speed up certain dynamic input workloads, but tends to oom more frequently. I suggest we much very cautious about changing the default to round, unless there's a good fallback for oom handling.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] szha commented on a change in pull request #18987: [ENV] update runtime setting default values

Posted by GitBox <gi...@apache.org>.

szha commented on a change in pull request #18987:
URL: https://github.com/apache/incubator-mxnet/pull/18987#discussion_r475947573



##########
File path: src/resource.cc
##########
@@ -96,9 +96,9 @@ class ResourceManagerImpl : public ResourceManager {
     cpu_temp_space_copy_ = dmlc::GetEnv("MXNET_CPU_TEMP_COPY", 4);
     gpu_temp_space_copy_ = dmlc::GetEnv("MXNET_GPU_TEMP_COPY", 1);
     cpu_native_rand_copy_ = dmlc::GetEnv("MXNET_CPU_PARALLEL_RAND_COPY", 1);
-    gpu_native_rand_copy_ = dmlc::GetEnv("MXNET_GPU_PARALLEL_RAND_COPY", 4);
+    gpu_native_rand_copy_ = dmlc::GetEnv("MXNET_GPU_PARALLEL_RAND_COPY", 1);
 #if MXNET_USE_CUDNN == 1
-    gpu_cudnn_dropout_state_copy_ = dmlc::GetEnv("MXNET_GPU_CUDNN_DROPOUT_STATE_COPY", 4);
+    gpu_cudnn_dropout_state_copy_ = dmlc::GetEnv("MXNET_GPU_CUDNN_DROPOUT_STATE_COPY", 1);

Review comment:
       the resource is per-GPU so contention would only happen if there are multiple random sampling ops happening at the same time.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] szha commented on a change in pull request #18987: [ENV] update runtime setting default values

Posted by GitBox <gi...@apache.org>.

szha commented on a change in pull request #18987:
URL: https://github.com/apache/incubator-mxnet/pull/18987#discussion_r475991763



##########
File path: src/storage/storage.cc
##########
@@ -67,8 +67,9 @@ StorageManager *CreateStorageManager(const Context &ctx, const char *context,
                                      int num_gpu_device, std::string *pStrategy) {
   const auto env_var = env_var_name(context, pool_type);
   const char *type = getenv(env_var.c_str());
-  if (type == nullptr)
-    type = "Naive";   // default pool
+  if (type == nullptr) {
+    type = "Round";   // default pool

Review comment:
       I share the concern on this. This change will impact those static-size static-graph models that were at the boundary of GPU memory limit. Which of the current GluonCV model training scripts fall in this category?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] zhreshold commented on pull request #18987: [ENV] update runtime setting default values

Posted by GitBox <gi...@apache.org>.

zhreshold commented on pull request #18987:
URL: https://github.com/apache/incubator-mxnet/pull/18987#issuecomment-682271576


   > thanks for the reviews. @zhreshold how about setting the pool strategy to be naive when shapes are static?
   
   This is actually fine when shapes are static, my major concern is that with round enabled by default, in most use cases mxnet can be faster but consumes more memory than expected


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] szha commented on a change in pull request #18987: [ENV] update runtime setting default values

Posted by GitBox <gi...@apache.org>.

szha commented on a change in pull request #18987:
URL: https://github.com/apache/incubator-mxnet/pull/18987#discussion_r475991763



##########
File path: src/storage/storage.cc
##########
@@ -67,8 +67,9 @@ StorageManager *CreateStorageManager(const Context &ctx, const char *context,
                                      int num_gpu_device, std::string *pStrategy) {
   const auto env_var = env_var_name(context, pool_type);
   const char *type = getenv(env_var.c_str());
-  if (type == nullptr)
-    type = "Naive";   // default pool
+  if (type == nullptr) {
+    type = "Round";   // default pool

Review comment:
       I share the concern on this. This change will impact those static-size static-graph models that were at the boundary of GPU memory limit. How many of the current GluonCV models fall in this category?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] szha commented on pull request #18987: [ENV] update runtime setting default values

Posted by GitBox <gi...@apache.org>.

szha commented on pull request #18987:
URL: https://github.com/apache/incubator-mxnet/pull/18987#issuecomment-683213514


   @zhreshold we could consider adding an interface to allocate the exact size and use it in cached op for static shape only.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] szha commented on pull request #18987: [ENV] update runtime setting default values

Posted by GitBox <gi...@apache.org>.

szha commented on pull request #18987:
URL: https://github.com/apache/incubator-mxnet/pull/18987#issuecomment-683387906


   I reverted the round pool change first to merge the rest of the changes. I will work on a cached op path to enable exact size allocation to avoid the memory waste in the static graph case.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] szha commented on a change in pull request #18987: [ENV] update runtime setting default values

Posted by GitBox <gi...@apache.org>.

szha commented on a change in pull request #18987:
URL: https://github.com/apache/incubator-mxnet/pull/18987#discussion_r475991763



##########
File path: src/storage/storage.cc
##########
@@ -67,8 +67,9 @@ StorageManager *CreateStorageManager(const Context &ctx, const char *context,
                                      int num_gpu_device, std::string *pStrategy) {
   const auto env_var = env_var_name(context, pool_type);
   const char *type = getenv(env_var.c_str());
-  if (type == nullptr)
-    type = "Naive";   // default pool
+  if (type == nullptr) {
+    type = "Round";   // default pool

Review comment:
       I share the concern on this. This change will impact those static-size static-graph models that were at the boundary of GPU memory limit. How many of the current GluonCV model training scripts fall in this category?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org