Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/07/14 03:22:27 UTC

[GitHub] [incubator-mxnet] ArmageddonKnight opened a new pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

ArmageddonKnight opened a new pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704


   ## Description ##
   
   This PR is supposed to fix the issue described in #18564 
   
   ## Checklist ##
   ### Essentials ###
   Please feel free to remove inapplicable items for your PR.
   - [x] Changes are complete (i.e. I finished coding on this PR)
   - [x] All changes have test coverage:
   - Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
   - Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
   - Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
   - [x] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
   
   ### Changes ###
   - [x] Removed the setting of `continuous_dump` when setting the profiler config.
   
   ## Comments ##
   This PR fixes the issue described in #18564. Specifically, the error happens because the profiler config sets the `continuous_dump` flag, which propagates to the core function `Profiler::SetContinuousProfileDump` (shown below) and raises an error on line 268.
   
   https://github.com/apache/incubator-mxnet/blob/9d623926d4857a2cfa32515b58cd1398371f97f3/src/profiler/profiler.cc#L258-L268
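For context, the fix amounts to leaving `continuous_dump` at its default (`False`) when configuring the profiler. A minimal sketch of such a configuration (the `filename` here is illustrative, not the one used in the test):

```python
import mxnet as mx

# Configure the profiler without continuous_dump, so the config never
# propagates a continuous-dump request to Profiler::SetContinuousProfileDump.
mx.profiler.set_config(profile_all=True,
                       aggregate_stats=True,
                       filename='profile_output.json')
mx.profiler.set_state('run')
# ... run the workload under test ...
mx.profiler.set_state('stop')
mx.profiler.dump(True)
```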
   
   FYI, @szha 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] leezu commented on pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

Posted by GitBox <gi...@apache.org>.
leezu commented on pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#issuecomment-678888452


   @szha the gpu memory profiler tests have never been run. They were disabled until https://github.com/apache/incubator-mxnet/pull/18701 because they were declared in the wrong file. I'm not sure what the following lines refer to, or if/why they were recorded differently before. As per the comment in the test file, the test was designed to capture 26 lines, but we now capture 33. In particular, the following lines were not recorded before and appear to be the problematic ones. I don't think that recording more events in the memory profiler is related to the name-scope refactor, as the profiler is enabled and disabled separately from the name-scope.
   
   ```
   [2020-07-15T20:15:02.236Z] <unk>:_random_uniform,67108864,0,67108864,0
   
   [2020-07-15T20:15:02.236Z] <unk>:_random_uniform,67108864,0,67108864,0
   
   [2020-07-15T20:15:02.236Z] <unk>:_zeros,67108864,0,67108864,0
   
   [2020-07-15T20:15:02.236Z] <unk>:_zeros,67108864,0,67108864,0
   
   [2020-07-15T20:15:02.236Z] <unk>:in_arg:data,640,0,640,0
   
   [2020-07-15T20:15:02.236Z] <unk>:unknown,67108864,0,67108864,0
   
   [2020-07-15T20:15:02.236Z] <unk>:unknown,67108864,0,67108864,0
   ```
   
   Also note that the profiler scope implementation has some fundamental threading issues. The scope in the backend is not thread-local, but the frontend claims thread safety by using a thread-local variable. The same problem applies to single-threaded but asynchronous code.
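To illustrate the kind of race this allows, here is a hypothetical stand-alone sketch (not MXNet code): a thread-local wrapper in the frontend does not help when the state it guards lives in a single global slot in the backend:

```python
import threading

# Hypothetical stand-ins: the "backend" scope is one global slot,
# while the "frontend" keeps a thread-local copy.
backend_scope = None          # shared by all threads (like the C++ backend)
frontend = threading.local()  # per-thread (like the Python frontend)

def set_scope(name):
    global backend_scope
    frontend.scope = name     # thread-local: looks safe from the frontend
    backend_scope = name      # global: silently shared across threads

observed = {}
barrier = threading.Barrier(2)

def worker(name):
    set_scope(name)
    barrier.wait()            # both threads have now set their scope
    # The frontend view is consistent, but the backend slot holds whichever
    # thread wrote last, so at most one worker sees its own name there.
    observed[name] = (frontend.scope, backend_scope)

threads = [threading.Thread(target=worker, args=(n,)) for n in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Each thread's frontend scope is its own name...
assert observed["a"][0] == "a" and observed["b"][0] == "b"
# ...but both read the same backend value, so one of them is mislabeled.
assert observed["a"][1] == observed["b"][1]
```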
   
   @ArmageddonKnight the profiler scope is automatically set in Block and Optimizer. Would you be able to debug the issue and fix it in this PR?
   
   https://github.com/apache/incubator-mxnet/blob/0de7484884292eb028342b1e5669233792429af0/python/mxnet/gluon/block.py#L951-L952
   
   https://github.com/apache/incubator-mxnet/blob/0de7484884292eb028342b1e5669233792429af0/python/mxnet/optimizer/updater.py#L58-L59
   
   Parameter initialization happens separately from Block, and if you want to set the profiler scope there, you can add `with _profiler.scope(self._uuid + ':'):` in `_finish_deferred_init` inside `parameter.py`.





[GitHub] [incubator-mxnet] Zha0q1 commented on pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

Zha0q1 commented on pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#issuecomment-658565496


   Any ideas on what might have caused the error in ```Profiler::SetContinuousProfileDump```?
   
   This same error seems to also cause issues in ```tests/python/unittest/test_profiler.py```, as seen in http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18701/1/pipeline/378/
   
   





[GitHub] [incubator-mxnet] szha commented on pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

szha commented on pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#issuecomment-679394994


   > The unknown allocation entries actually come from the symbolic execution backend, rather than Gluon
   
   On master branch, we no longer have a symbolic execution backend as graph executor was removed.
   
   > The parameters in the Gluon backend have weird tags
   
   That's the problem I mentioned in the above comment about the usage of uuid in naming.





[GitHub] [incubator-mxnet] ArmageddonKnight commented on pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

ArmageddonKnight commented on pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#issuecomment-658492294


   @szha I have updated the changes accordingly.





[GitHub] [incubator-mxnet] leezu commented on a change in pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

leezu commented on a change in pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#discussion_r485059844



##########
File path: tests/python/gpu/test_profiler_gpu.py
##########
@@ -117,31 +129,64 @@ def test_gpu_memory_profiler_gluon():
     profiler.set_state('stop')
     profiler.dump(True)
 
+    # Sample gpu_memory_profile.csv:
+    # "Attribute Name","Requested Size","Device","Actual Size","Reuse?"
+    # <unk>:in_arg:data,640,0,4096,0
+    # hybridsequential:activation0:hybridsequential_activation0_fwd,2048,0,4096,0
+    # hybridsequential:activation0:hybridsequential_activation0_fwd_backward,8192,0,8192,0
+    # hybridsequential:activation0:hybridsequential_activation0_fwd_head_grad,2048,0,4096,0
+    # hybridsequential:dense0:activation0:hybridsequential_dense0_activation0_fwd,8192,0,8192,0
+    # hybridsequential:dense0:arg_grad:bias,512,0,4096,0
+    # hybridsequential:dense0:arg_grad:weight,5120,0,8192,0
+    # hybridsequential:dense0:hybridsequential_dense0_fwd,8192,0,8192,0
+    # hybridsequential:dense0:in_arg:bias,512,0,4096,0
+    # hybridsequential:dense0:in_arg:weight,5120,0,8192,0
+    # hybridsequential:dense1:activation0:hybridsequential_dense1_activation0_fwd,4096,0,4096,0
+    # hybridsequential:dense1:arg_grad:bias,256,0,4096,0
+    # hybridsequential:dense1:arg_grad:weight,32768,0,32768,0
+    # hybridsequential:dense1:hybridsequential_dense1_fwd,4096,0,4096,0
+    # hybridsequential:dense1:in_arg:bias,256,0,4096,0
+    # hybridsequential:dense1:in_arg:weight,32768,0,32768,0
+    # hybridsequential:dense2:arg_grad:bias,128,0,4096,0
+    # hybridsequential:dense2:arg_grad:weight,8192,0,8192,0
+    # hybridsequential:dense2:hybridsequential_dense2_fwd_backward,4096,0,4096,1
+    # hybridsequential:dense2:in_arg:bias,128,0,4096,0
+    # hybridsequential:dense2:in_arg:weight,8192,0,8192,0
+    # hybridsequential:dropout0:hybridsequential_dropout0_fwd,8192,0,8192,0
+    # hybridsequential:dropout0:hybridsequential_dropout0_fwd,8192,0,8192,0
+    # resource:cudnn_dropout_state (dropout-inl.h +256),1474560,0,1474560,0
+    # resource:temp_space (fully_connected-inl.h +316),15360,0,16384,0
+
     # We are only checking for weight parameters here, also making sure that
     # there is no unknown entries in the memory profile.
     with open('gpu_memory_profile-pid_%d.csv' % (os.getpid()), mode='r') as csv_file:
         csv_reader = csv.DictReader(csv_file)
+        # TODO: Remove this print statement later on.
         for row in csv_reader:
             print(",".join(list(row.values())))
-        for scope in ['in_arg', 'arg_grad']:
-            for key, nd in model.collect_params().items():
-                expected_arg_name = "%s:%s:" % (model.name, scope) + nd.name
-                expected_arg_size = str(4 * np.prod(nd.shape))
-                csv_file.seek(0)
-                entry_found = False
-                for row in csv_reader:
-                    if row['Attribute Name'] == expected_arg_name:
-                        assert row['Requested Size'] == expected_arg_size, \
-                            "requested size={} is not equal to the expected size={}" \
-                            .format(row['Requested Size'], expected_arg_size)
-                        entry_found = True
-                        break
-                assert entry_found, \
-                    "Entry for attr_name={} has not been found" \
-                    .format(expected_arg_name)
+        for param in model.collect_params().values():
+            expected_arg_name = "%sin_arg:" % param.var().attr('__profiler_scope__') + \
+                                param.name
+            expected_arg_size = str(4 * np.prod(param.shape))
+            csv_file.seek(0)
+            entry_found = False
+            for row in csv_reader:
+                if row['Attribute Name'] == expected_arg_name and \
+                   row['Requested Size'] == expected_arg_size:
+                    entry_found = True
+                    break
+            assert entry_found, \
+                    "Entry for (attr_name={}, alloc_size={}) has not been found" \
+                        .format(expected_arg_name,
+                                expected_arg_size)
         # Make sure that there is no unknown allocation entry.
         csv_file.seek(0)
         for row in csv_reader:
             if row['Attribute Name'] == "<unk>:unknown" or \
                row['Attribute Name'] == "<unk>:":
                 assert False, "Unknown allocation entry has been encountered"
+
+
+if __name__ == "__main__":
+    import nose
+    nose.runmodule()

Review comment:
       @ArmageddonKnight please remove this

##########
File path: tests/python/gpu/test_profiler_gpu.py
##########
@@ -62,41 +62,53 @@ def test_gpu_memory_profiler_symbolic():
             {'Attribute Name' : 'tensordot:in_arg:B',
              'Requested Size' : str(4 * b.size)},
             {'Attribute Name' : 'tensordot:dot',
-             'Requested Size' : str(4 * c.size)}]
+             'Requested Size' : str(4 * c.size)},
+            {'Attribute Name' : 'init:_random_uniform',
+             'Requested Size' : str(4 * a.size)},
+            {'Attribute Name' : 'init:_random_uniform',
+             'Requested Size' : str(4 * b.size)}]
 
     # Sample gpu_memory_profile.csv:
     # "Attribute Name","Requested Size","Device","Actual Size","Reuse?"
-    # "<unk>:_zeros","67108864","0","67108864","0"
-    # "<unk>:_zeros","67108864","0","67108864","0"
-    # "tensordot:dot","67108864","0","67108864","1"
-    # "tensordot:dot","67108864","0","67108864","1"
-    # "tensordot:in_arg:A","67108864","0","67108864","0"
-    # "tensordot:in_arg:B","67108864","0","67108864","0"
-    # "nvml_amend","1074790400","0","1074790400","0"
+    # init:_random_uniform,33554432,0,33554432,1
+    # init:_random_uniform,8388608,0,8388608,1
+    # resource:temp_space (sample_op.h +365),8,0,4096,0
+    # symbol:arg_grad:unknown,8388608,0,8388608,0
+    # symbol:arg_grad:unknown,33554432,0,33554432,0
+    # tensordot:dot,16777216,0,16777216,0
+    # tensordot:dot_backward,33554432,0,33554432,0
+    # tensordot:dot_backward,8388608,0,8388608,0
+    # tensordot:dot_head_grad,16777216,0,16777216,0
+    # tensordot:in_arg:A,8388608,0,8388608,0
+    # tensordot:in_arg:B,33554432,0,33554432,0
 
     with open('gpu_memory_profile-pid_%d.csv' % (os.getpid()), mode='r') as csv_file:
         csv_reader = csv.DictReader(csv_file)
+        # TODO: Remove this print statement later on.
+        for row in csv_reader:
+            print(",".join(list(row.values())))

Review comment:
       Fix TODO

##########
File path: src/imperative/imperative.cc
##########
@@ -233,7 +233,7 @@ void Imperative::RecordOp(
 
   nnvm::ObjectPtr node = nnvm::Node::Create();
   node->attrs = std::move(attrs);
-  node->attrs.name = "node_" + std::to_string(node_count_++);
+  node_count_ += 1;

Review comment:
       Why do you disable naming the ops in autograd recording?

##########
File path: src/profiler/storage_profiler.cc
##########
@@ -61,11 +62,17 @@ void GpuDeviceStorageProfiler::DumpProfile() const {
   std::multimap<std::string, AllocEntryDumpFmt> gpu_mem_ordered_alloc_entries;
   // map the GPU device ID to the total amount of allocations
   std::unordered_map<int, size_t> gpu_dev_id_total_alloc_map;
+  std::regex gluon_param_regex("([0-9a-fA-F]{8})_([0-9a-fA-F]{4})_"
+                               "([0-9a-fA-F]{4})_([0-9a-fA-F]{4})_"
+                               "([0-9a-fA-F]{12})_([^ ]*)");
+
   for (const std::pair<void *const, AllocEntry>& alloc_entry :
        gpu_mem_alloc_entries_) {
+    std::string alloc_entry_name
+        = std::regex_replace(alloc_entry.second.name, gluon_param_regex, "$6");

Review comment:
       You are relying on `self._var_name = self._uuid.replace('-', '_') + '_' + self._name` from parameter.py here. This isn't true in the case of SymbolBlock import. Have you tested the profiler feature with SymbolBlock? The current assumption may be a bit fragile.
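For reference, the rewrite this regex performs can be reproduced in plain Python (the UUID in the sample name below is made up; only the `$6` capture group, i.e. the trailing human-readable name, is kept):

```python
import re

# Same pattern as the C++ std::regex in storage_profiler.cc: five groups of
# hex digits (the '-'-to-'_' converted parameter UUID) followed by the name.
gluon_param_regex = re.compile(
    r'([0-9a-fA-F]{8})_([0-9a-fA-F]{4})_'
    r'([0-9a-fA-F]{4})_([0-9a-fA-F]{4})_'
    r'([0-9a-fA-F]{12})_([^ ]*)')

# Hypothetical allocation-entry name, "<uuid>_weight", as produced by
# self._var_name = self._uuid.replace('-', '_') + '_' + self._name
name = 'a1b2c3d4_e5f6_7a8b_9c0d_a1b2c3d4e5f6_weight'

# Keep only group 6 (the parameter name), as "$6" does in the C++ code.
print(gluon_param_regex.sub(r'\6', name))  # -> weight

# Names without the UUID prefix pass through unchanged.
print(gluon_param_regex.sub(r'\6', 'tensordot:in_arg:A'))  # -> tensordot:in_arg:A
```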

##########
File path: src/imperative/imperative.cc
##########
@@ -322,7 +322,7 @@ void Imperative::RecordDeferredCompute(nnvm::NodeAttrs &&attrs,
   }
   node->attrs = std::move(attrs);
   // Need to support NameManager in imperative API to better name node->attrs.name
-  node->attrs.name = "node_" + std::to_string(node_count_++);
+  node_count_ += 1;

Review comment:
       Why do you disable naming the ops in dc recording?

##########
File path: python/mxnet/gluon/parameter.py
##########
@@ -644,7 +644,7 @@ def var(self):
         """Returns a symbol representing this parameter."""
         if self._var is None:
             if self._var_name is None:  # _var_name is set manually in SymbolBlock.import
-                self._var_name = self._uuid
+                self._var_name = self._uuid.replace('-', '_') + '_' + self._name

Review comment:
       You need to document that `src/profiler/storage_profiler.cc` depends on this scheme.

##########
File path: tests/python/gpu/test_profiler_gpu.py
##########
@@ -15,44 +15,44 @@
 # specific language governing permissions and limitations
 # under the License.
 
+import csv
 import os
 import sys
 
+import numpy as np
 import mxnet as mx
 mx.test_utils.set_default_context(mx.gpu(0))
 
 curr_path = os.path.dirname(os.path.abspath(os.path.expanduser(__file__)))
 sys.path.insert(0, os.path.join(curr_path, '../unittest'))
-# We import all tests from ../unittest/test_profiler.py
-# They will be detected by test framework, as long as the current file has a different filename
-from test_profiler import *

Review comment:
       Why do you disable the `./unittest/test_profiler.py` tests on GPU?







[GitHub] [incubator-mxnet] ArmageddonKnight commented on pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

ArmageddonKnight commented on pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#issuecomment-688516458


   @szha FYI, I have cleaned up the debugging traces and applied the fixes that we have discussed previously. The sample GPU memory profile for Gluon now looks like the following:
   
   ```
   "Attribute Name","Requested Size","Device","Actual Size","Reuse?"
   <unk>:in_arg:data,640,0,4096,0
   hybridsequential:activation0:hybridsequential_activation0_fwd,2048,0,4096,0
   hybridsequential:activation0:hybridsequential_activation0_fwd_backward,8192,0,8192,0
   hybridsequential:activation0:hybridsequential_activation0_fwd_head_grad,2048,0,4096,0
   hybridsequential:dense0:activation0:hybridsequential_dense0_activation0_fwd,8192,0,8192,0
   hybridsequential:dense0:arg_grad:bias,512,0,4096,0
   hybridsequential:dense0:arg_grad:weight,5120,0,8192,0
   hybridsequential:dense0:hybridsequential_dense0_fwd,8192,0,8192,0
   hybridsequential:dense0:in_arg:bias,512,0,4096,0
   hybridsequential:dense0:in_arg:weight,5120,0,8192,0
   hybridsequential:dense1:activation0:hybridsequential_dense1_activation0_fwd,4096,0,4096,0
   hybridsequential:dense1:arg_grad:bias,256,0,4096,0
   hybridsequential:dense1:arg_grad:weight,32768,0,32768,0
   hybridsequential:dense1:hybridsequential_dense1_fwd,4096,0,4096,0
   hybridsequential:dense1:in_arg:bias,256,0,4096,0
   hybridsequential:dense1:in_arg:weight,32768,0,32768,0
   hybridsequential:dense2:arg_grad:bias,128,0,4096,0
   hybridsequential:dense2:arg_grad:weight,8192,0,8192,0
   hybridsequential:dense2:hybridsequential_dense2_fwd_backward,4096,0,4096,1
   hybridsequential:dense2:in_arg:bias,128,0,4096,0
   hybridsequential:dense2:in_arg:weight,8192,0,8192,0
   hybridsequential:dropout0:hybridsequential_dropout0_fwd,8192,0,8192,0
   hybridsequential:dropout0:hybridsequential_dropout0_fwd,8192,0,8192,0
   resource:cudnn_dropout_state (dropout-inl.h +256),1474560,0,1474560,0
   resource:temp_space (fully_connected-inl.h +316),15360,0,16384,0
   ```
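As a stdlib-only illustration of what the re-enabled test checks against such a dump (the two-row CSV below is a made-up miniature, not the actual file):

```python
import csv
import io

# Hypothetical miniature of gpu_memory_profile.csv; the real file is written
# by profiler.dump() and read back in test_gpu_memory_profiler_gluon.
sample = """\
"Attribute Name","Requested Size","Device","Actual Size","Reuse?"
hybridsequential:dense0:in_arg:weight,5120,0,8192,0
resource:temp_space (fully_connected-inl.h +316),15360,0,16384,0
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# The test's final pass: no allocation may be tagged as unknown.
for row in rows:
    assert row['Attribute Name'] not in ('<unk>:unknown', '<unk>:'), \
        "Unknown allocation entry has been encountered"

# Looking up one expected in_arg entry, as the test does per parameter.
match = [r for r in rows
         if r['Attribute Name'] == 'hybridsequential:dense0:in_arg:weight']
assert match and match[0]['Requested Size'] == '5120'
```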
   
   Please let me know if you have any questions or concerns. Thank you.





[GitHub] [incubator-mxnet] leezu commented on pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

leezu commented on pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#issuecomment-692151568


   @ArmageddonKnight it would be great if you can help address the issues introduced by this PR as discussed above
   
   If you don't have time to address them, please mention it here and either someone else can fix it or we can revert the PR and resubmit with fixes later. WDYT?





[GitHub] [incubator-mxnet] szha commented on pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

szha commented on pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#issuecomment-658486411


   FYI I merged #18701 





[GitHub] [incubator-mxnet] leezu commented on a change in pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

leezu commented on a change in pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#discussion_r485192790



##########
File path: src/imperative/imperative.cc
##########
@@ -233,7 +233,7 @@ void Imperative::RecordOp(
 
   nnvm::ObjectPtr node = nnvm::Node::Create();
   node->attrs = std::move(attrs);
-  node->attrs.name = "node_" + std::to_string(node_count_++);
+  node_count_ += 1;

Review comment:
       Can you reference the prior name assignment you have in mind?







[GitHub] [incubator-mxnet] szha merged pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

szha merged pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704


   





[GitHub] [incubator-mxnet] szha commented on pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

szha commented on pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#issuecomment-678740030


   Looks like the parameter initialization allocation is missing. @leezu looks like this part of tagging is missing as a result of the gluon refactor. Could you help suggest a way of naming these allocations in the new Gluon implementation? https://github.com/apache/incubator-mxnet/pull/17656/files#diff-29da832c2145752f3906a2f71c7b63ba





[GitHub] [incubator-mxnet] leezu commented on pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

leezu commented on pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#issuecomment-692159708


   No worries. Thank you!





[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

mxnet-bot commented on pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#issuecomment-658950245


   Jenkins CI successfully triggered : [unix-cpu, centos-cpu, unix-gpu, centos-gpu]





[GitHub] [incubator-mxnet] leezu commented on a change in pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

leezu commented on a change in pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#discussion_r485222910



##########
File path: src/imperative/imperative.cc
##########
@@ -233,7 +233,7 @@ void Imperative::RecordOp(
 
   nnvm::ObjectPtr node = nnvm::Node::Create();
   node->attrs = std::move(attrs);
-  node->attrs.name = "node_" + std::to_string(node_count_++);
+  node_count_ += 1;

Review comment:
       The example you give relies on the NameManager feature, which is only available in the symbolic API. Note that autograd and dc do not apply to the symbolic API, thus they will not overwrite names set by the NameManager. That's why the names were generated via `node_count_`, which you have now deleted.







[GitHub] [incubator-mxnet] ArmageddonKnight commented on pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

ArmageddonKnight commented on pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#issuecomment-692158112


   > @ArmageddonKnight it would be great if you can help address the issues introduced by this PR as discussed above
   > 
   > If you don't have time to address them, please mention it here and either someone else can fix it or we can revert the PR and resubmit with fixes later. WDYT?
   
   @leezu I have been traveling and just came back. Addressing those comments now. Please give me some time. Thanks.





[GitHub] [incubator-mxnet] szha commented on pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

szha commented on pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#issuecomment-688526701


   @ArmageddonKnight thanks for the update! the change looks good to me. There are a couple of tests that failed due to the `storage_handle()` calls. Once they are fixed I think we should be good to go.





[GitHub] [incubator-mxnet] szha commented on pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

szha commented on pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#issuecomment-688687610


   thanks for the fix!





[GitHub] [incubator-mxnet] ArmageddonKnight commented on pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

ArmageddonKnight commented on pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#issuecomment-658554003


   @mxnet-bot run ci [unix-cpu, unix-gpu, centos-cpu, centos-gpu]





[GitHub] [incubator-mxnet] leezu commented on pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

leezu commented on pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#issuecomment-678890299


   A parameter may belong to multiple blocks or to no Block at all, thus there is no 1-1 mapping. I think this is a separate issue and we may first need to fix the memory profiler.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

mxnet-bot commented on pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#issuecomment-657943644


   Hey @ArmageddonKnight , Thanks for submitting the PR 
   All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands: 
   - To trigger all jobs: @mxnet-bot run ci [all] 
   - To trigger specific jobs: @mxnet-bot run ci [job1, job2] 
   *** 
   **CI supported jobs**: [windows-gpu, windows-cpu, unix-cpu, sanity, centos-gpu, centos-cpu, website, edge, miscellaneous, clang, unix-gpu]
   *** 
   _Note_: 
    Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin. 
   All CI tests must pass before the PR can be merged. 
   





[GitHub] [incubator-mxnet] ArmageddonKnight commented on a change in pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

Posted by GitBox <gi...@apache.org>.
ArmageddonKnight commented on a change in pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#discussion_r488903487



##########
File path: tests/python/gpu/test_profiler_gpu.py
##########
@@ -15,44 +15,44 @@
 # specific language governing permissions and limitations
 # under the License.
 
+import csv
 import os
 import sys
 
+import numpy as np
 import mxnet as mx
 mx.test_utils.set_default_context(mx.gpu(0))
 
 curr_path = os.path.dirname(os.path.abspath(os.path.expanduser(__file__)))
 sys.path.insert(0, os.path.join(curr_path, '../unittest'))
-# We import all tests from ../unittest/test_profiler.py
-# They will be detected by test framework, as long as the current file has a different filename
-from test_profiler import *
+from mxnet import profiler
+from mxnet.gluon import nn
+from mxnet.gluon.block import _block_scope
+from test_profiler import enable_profiler
 
-# Test seen to crash pytest worker during development of https://github.com/apache/incubator-mxnet/pull/18694
-del test_aggregate_duplication

Review comment:
       @leezu As we can see here, the test case was not working before.







[GitHub] [incubator-mxnet] ArmageddonKnight edited a comment on pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

Posted by GitBox <gi...@apache.org>.
ArmageddonKnight edited a comment on pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#issuecomment-658554003


   @mxnet-bot run ci [centos-cpu, centos-gpu, unix-cpu, unix-gpu]





[GitHub] [incubator-mxnet] szha commented on pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

Posted by GitBox <gi...@apache.org>.
szha commented on pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#issuecomment-678889309


   > you can add `with _profiler.scope(self._uuid + ':'):` in `_finish_deferred_init` inside `parameter.py`.
   
   The problem with this is that the result will not be meaningful. The memory profiler output should reflect which block a parameter belongs to in order to properly capture the memory usage of a block. We will need some way of mapping the initialization back to the block and not rely on uuid.
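   
   As an illustration of the difference (a hedged sketch; these helper functions are hypothetical, not actual MXNet APIs):
   
   ```python
   import uuid

   def uuid_scope_tag(param_uuid, param_name):
       # roughly what `with _profiler.scope(self._uuid + ':')` would yield:
       # an opaque tag that cannot be attributed to any block
       return "param_%s_%s" % (param_uuid, param_name)

   def block_scope_tag(block_path, param_name):
       # what a block-aware mapping would yield instead: a tag that encodes
       # the owning block hierarchy
       return ":".join(list(block_path) + [param_name])

   u = uuid.uuid4().hex
   print(uuid_scope_tag(u, "weight"))  # opaque, not attributable to a block
   print(block_scope_tag(["hybridsequential0", "dense0"], "weight"))  # hybridsequential0:dense0:weight
   ```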





[GitHub] [incubator-mxnet] ArmageddonKnight commented on a change in pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

Posted by GitBox <gi...@apache.org>.
ArmageddonKnight commented on a change in pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#discussion_r485194096



##########
File path: src/imperative/imperative.cc
##########
@@ -233,7 +233,7 @@ void Imperative::RecordOp(
 
   nnvm::ObjectPtr node = nnvm::Node::Create();
   node->attrs = std::move(attrs);
-  node->attrs.name = "node_" + std::to_string(node_count_++);
+  node_count_ += 1;

Review comment:
       As an example:
   
   ```
   C = mx.symbol.dot(A, B, name='dot')
   ```
   
   Here the name for C would be `dot`; however, autograd will overwrite it with `node_0` (or something like that).
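   
   The overwrite can be mimicked in a few lines of plain Python (a toy sketch, not the actual C++ `RecordOp`; the class and function names here are illustrative):
   
   ```python
   class NodeAttrs:
       def __init__(self, name=""):
           self.name = name

   node_count = 0

   def record_op_old(attrs):
       # old behavior: unconditionally overwrite with a generic counter name
       global node_count
       attrs.name = "node_%d" % node_count
       node_count += 1
       return attrs

   def record_op_new(attrs):
       # behavior after this PR: keep the previously assigned name and only
       # advance the counter
       global node_count
       node_count += 1
       return attrs

   print(record_op_old(NodeAttrs("dot")).name)  # node_0 -- descriptive name lost
   print(record_op_new(NodeAttrs("dot")).name)  # dot -- descriptive name kept
   ```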







[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#issuecomment-658554039


   Jenkins CI successfully triggered : [centos-gpu, unix-cpu, unix-gpu, centos-cpu]





[GitHub] [incubator-mxnet] ArmageddonKnight commented on a change in pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

Posted by GitBox <gi...@apache.org>.
ArmageddonKnight commented on a change in pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#discussion_r485152035



##########
File path: src/imperative/imperative.cc
##########
@@ -233,7 +233,7 @@ void Imperative::RecordOp(
 
   nnvm::ObjectPtr node = nnvm::Node::Create();
   node->attrs = std::move(attrs);
-  node->attrs.name = "node_" + std::to_string(node_count_++);
+  node_count_ += 1;

Review comment:
       Because this naming overwrites the name previously assigned to `node->attrs.name` with a less descriptive one.

##########
File path: src/imperative/imperative.cc
##########
@@ -322,7 +322,7 @@ void Imperative::RecordDeferredCompute(nnvm::NodeAttrs &&attrs,
   }
   node->attrs = std::move(attrs);
   // Need to support NameManager in imperative API to better name node->attrs.name
-  node->attrs.name = "node_" + std::to_string(node_count_++);
+  node_count_ += 1;

Review comment:
       Ditto







[GitHub] [incubator-mxnet] ArmageddonKnight commented on pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

Posted by GitBox <gi...@apache.org>.
ArmageddonKnight commented on pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#issuecomment-679390783


   @szha  I am looking into this issue:
   
   - The unknown allocation entries actually come from the symbolic execution backend, rather than Gluon.
   - The parameters in the Gluon backend have weird tags, similar to the following:
   
   ```
   "hybridsequential0:arg_grad:param_277c1286_334d_46dc_bbb2_0850c497c7d0_weight","8192","0","8192","0"
   "hybridsequential0:arg_grad:param_9c9c45b2_2e7f_4af9_8b60_b80ef0f2dd71_weight","5120","0","8192","0"
   ```
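   
   For what it's worth, the uuid portion of such tags follows a recognizable 8-4-4-4-12 hex pattern, so a post-processing script could strip it when grouping rows (a hedged sketch that assumes the quoted CSV layout above):
   
   ```python
   import csv
   import io
   import re

   rows = io.StringIO(
       '"hybridsequential0:arg_grad:param_277c1286_334d_46dc_bbb2_0850c497c7d0_weight","8192","0","8192","0"\n'
       '"hybridsequential0:arg_grad:param_9c9c45b2_2e7f_4af9_8b60_b80ef0f2dd71_weight","5120","0","8192","0"\n'
   )

   # a uuid4 rendered with underscores: 8-4-4-4-12 hex groups
   uuid_re = re.compile(r"param_[0-9a-f]{8}(?:_[0-9a-f]{4}){3}_[0-9a-f]{12}_")

   cleaned = []
   for tag, *counts in csv.reader(rows):
       cleaned.append((uuid_re.sub("param_", tag), counts[0]))

   for tag, size in cleaned:
       print(tag, size)  # hybridsequential0:arg_grad:param_weight 8192 (then 5120)
   ```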





[GitHub] [incubator-mxnet] leezu edited a comment on pull request #18704: Re-enable the test_gpu_memory_profiler_gluon test case

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on pull request #18704:
URL: https://github.com/apache/incubator-mxnet/pull/18704#issuecomment-678890299


   A parameter may belong to multiple blocks or to no Block at all, thus there is no 1-1 mapping. I think this is a separate issue, and we may first need to fix the memory profiler. The first question here is why the `_random_uniform` calls are recorded now but were not recorded when the test file was written (i.e. 26 vs. 33 recorded calls).

