Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/09/22 16:48:13 UTC

[GitHub] [incubator-mxnet] ptrendx opened a new pull request #19209: [BUGFIX]Fix cuDNN dropout reproducibility

ptrendx opened a new pull request #19209:
URL: https://github.com/apache/incubator-mxnet/pull/19209


   ## Description ##
   Fixes #15662
   
   This PR seeds the dropout states inside the `ResourceManager` when the `mx.random.seed` method is called, and makes the Dropout and RNN ops reuse those states in their dropout descriptors.
   
   @szha @eric-haibin-lin @DickJC123 
   
   ## Checklist ##
   ### Essentials ###
   - [x] PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
   - [x] Changes are complete (i.e. I finished coding on this PR)
   - [x] All changes have test coverage
   - [x] Code is well-documented
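
   The intended behavior can be illustrated with a small pure-Python sketch (a conceptual model, not MXNet code; `SharedDropoutStates` and its methods are hypothetical stand-ins for the state buffer owned by the `ResourceManager`):

```python
import random

class SharedDropoutStates:
    """Hypothetical stand-in for the cuDNN dropout state buffer
    owned by the ResourceManager."""
    def __init__(self):
        self._rng = random.Random()

    def seed(self, seed):
        # mx.random.seed() would reseed the single shared state buffer.
        self._rng.seed(seed)

    def dropout_mask(self, n, p):
        # Every Dropout/RNN op draws from the same state stream, so
        # the sequence of masks is reproducible after seeding.
        return [1.0 if self._rng.random() >= p else 0.0 for _ in range(n)]

states = SharedDropoutStates()
states.seed(42)
first = [states.dropout_mask(4, 0.5) for _ in range(3)]
states.seed(42)
second = [states.dropout_mask(4, 0.5) for _ in range(3)]
assert first == second  # same seed, same masks, in the same order
```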


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] szha commented on a change in pull request #19209: [BUGFIX]Fix cuDNN dropout reproducibility

Posted by GitBox <gi...@apache.org>.
szha commented on a change in pull request #19209:
URL: https://github.com/apache/incubator-mxnet/pull/19209#discussion_r492979988



##########
File path: tests/python/gpu/test_gluon_gpu.py
##########
@@ -647,3 +647,31 @@ def test_gemms_true_fp16():
     assert_almost_equal(ref_results.asnumpy(), results_trueFP16.asnumpy(),
                         atol=atol, rtol=rtol)
 
+@with_seed()
+def test_cudnn_dropout_reproducibility():

Review comment:
       @assert_raises_cudnn_not_satisfied?







[GitHub] [incubator-mxnet] szha commented on pull request #19209: [BUGFIX]Fix cuDNN dropout reproducibility

Posted by GitBox <gi...@apache.org>.
szha commented on pull request #19209:
URL: https://github.com/apache/incubator-mxnet/pull/19209#issuecomment-698556256


   That's a good point. It's unclear to me what the performance impact would be if we don't allow multiple copies of the random states when one doesn't care about reproducibility. Also, I'm not sure whether `waitall` completely resolves the race condition in the case where multi-threaded execution happens and we use the thread-safe cached op.





[GitHub] [incubator-mxnet] DickJC123 merged pull request #19209: [BUGFIX]Fix cuDNN dropout reproducibility

Posted by GitBox <gi...@apache.org>.
DickJC123 merged pull request #19209:
URL: https://github.com/apache/incubator-mxnet/pull/19209


   





[GitHub] [incubator-mxnet] ptrendx commented on a change in pull request #19209: [BUGFIX]Fix cuDNN dropout reproducibility

Posted by GitBox <gi...@apache.org>.
ptrendx commented on a change in pull request #19209:
URL: https://github.com/apache/incubator-mxnet/pull/19209#discussion_r494569625



##########
File path: tests/python/gpu/test_gluon_gpu.py
##########
@@ -647,3 +647,31 @@ def test_gemms_true_fp16():
     assert_almost_equal(ref_results.asnumpy(), results_trueFP16.asnumpy(),
                         atol=atol, rtol=rtol)
 
+@with_seed()
+def test_cudnn_dropout_reproducibility():

Review comment:
       Ok, so this does not actually work, since the test does not throw when cuDNN is not supported. I will revert that.







[GitHub] [incubator-mxnet] ptrendx commented on pull request #19209: [BUGFIX]Fix cuDNN dropout reproducibility

Posted by GitBox <gi...@apache.org>.
ptrendx commented on pull request #19209:
URL: https://github.com/apache/incubator-mxnet/pull/19209#issuecomment-698519737


   @szha Over the course of testing this PR I discovered a different source of nondeterminism: if there is more than one copy of the dropout resource (or the random resource, for that matter), you end up with a race condition between getting a new resource and the call to `seed` setting the counter to 0. We could either drop support for more than one copy (and also remove the parallel random resource, which has the same problem), or add a `waitall` to the `seed` call. Which one do you think is the better solution?
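
   The counter race can be modeled with a deterministic pure-Python sketch (hypothetical; `run` and the round-robin counter below only mimic how the resource pool hands out copies, and the `seed` event here resets only the counter, to isolate that race):

```python
import random

def run(order):
    # Two identically seeded copies of the random resource; copy 1 has
    # been advanced further by earlier ops, as happens in real runs.
    copies = [random.Random(0), random.Random(0)]
    copies[1].random()
    copies[1].random()
    counter = [0]

    def next_copy():
        # round-robin hand-out; seeding resets this counter to 0
        c = copies[counter[0] % len(copies)]
        counter[0] += 1
        return c

    results = []
    for event in order:
        if event == 'seed':
            counter[0] = 0          # the reset that races with the ops
        else:
            results.append(next_copy().random())
    return results

# Whether an op's resource request lands before or after the seed call
# depends on thread scheduling, and the results differ:
assert run(['op', 'seed', 'op']) != run(['seed', 'op', 'op'])
```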





[GitHub] [incubator-mxnet] szha commented on a change in pull request #19209: [BUGFIX]Fix cuDNN dropout reproducibility

Posted by GitBox <gi...@apache.org>.
szha commented on a change in pull request #19209:
URL: https://github.com/apache/incubator-mxnet/pull/19209#discussion_r494764854



##########
File path: tests/python/gpu/test_gluon_gpu.py
##########
@@ -647,3 +647,31 @@ def test_gemms_true_fp16():
     assert_almost_equal(ref_results.asnumpy(), results_trueFP16.asnumpy(),
                         atol=atol, rtol=rtol)
 
+@with_seed()
+def test_cudnn_dropout_reproducibility():

Review comment:
       you can use runtime feature detection to determine whether to skip the test.
   https://github.com/apache/incubator-mxnet/commit/bd0846a69addf18529d5dcf856b28bed1a38cdbc#diff-e1b66bb36e566bfd108d2ef312defbd9R139
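
   The suggested pattern looks roughly like this (a generic sketch; `RUNTIME_FEATURES` and `requires_feature` are hypothetical stand-ins for MXNet's runtime feature query, not its actual API):

```python
import unittest

# Hypothetical stand-in for a runtime feature query such as the one
# linked above; the real MXNet helper has a different interface.
RUNTIME_FEATURES = {'CUDNN': False}

def requires_feature(name):
    """Skip the decorated test when the build lacks the feature."""
    def decorator(test_fn):
        def wrapper(*args, **kwargs):
            if not RUNTIME_FEATURES.get(name, False):
                raise unittest.SkipTest(name + ' is not enabled in this build')
            return test_fn(*args, **kwargs)
        return wrapper
    return decorator

@requires_feature('CUDNN')
def test_cudnn_dropout_reproducibility():
    # body would run cuDNN dropout twice with the same seed
    # and compare the outputs
    return 'ran'
```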







[GitHub] [incubator-mxnet] szha commented on a change in pull request #19209: [BUGFIX]Fix cuDNN dropout reproducibility

Posted by GitBox <gi...@apache.org>.
szha commented on a change in pull request #19209:
URL: https://github.com/apache/incubator-mxnet/pull/19209#discussion_r492979125



##########
File path: src/resource.cc
##########
@@ -428,37 +486,28 @@ void* Resource::get_host_space_internal(size_t size) const {
 void Resource::get_cudnn_dropout_desc(
     cudnnDropoutDescriptor_t *dropout_desc,
     mshadow::Stream<gpu> *stream,
-    const float dropout, uint64_t seed,
+    const float dropout,
     const std::string &name) const {
 
   CHECK_EQ(req.type, ResourceRequest::kCuDNNDropoutDesc);
   auto state_space = static_cast<resource::SpaceAllocator*>(ptr_);
   CHECK_EQ(state_space->ctx.dev_id, stream->dev_id)
-    << "The device id of cuDNN dropout state space doesn't match that from stream.";
-  CUDNN_CALL(cudnnCreateDropoutDescriptor(dropout_desc));
+    << "The device id of cudnn dropout state space doesn't match that from stream.";
   if (dropout <= 0) {
     CUDNN_CALL(cudnnSetDropoutDescriptor(*dropout_desc, stream->dnn_handle_,
                                          dropout,
                                          nullptr,
                                          0, seed));
-  } else if (!state_space->handle.size) {
-    // not initialized yet.
-    size_t dropout_state_size;
-    CUDNN_CALL(cudnnDropoutGetStatesSize(stream->dnn_handle_, &dropout_state_size));
-    // reserve GPU space
-    Storage::Get()->DirectFree(Storage::Get()->Alloc(dropout_state_size, state_space->ctx));
-    CUDNN_CALL(cudnnSetDropoutDescriptor(*dropout_desc, stream->dnn_handle_,
-                                         dropout,
-                                         state_space->GetSpace(dropout_state_size, name),
-                                         dropout_state_size, seed));
   } else {
+    CHECK(state_space->handle.size > 0)
+      << "CUDNN dropout descriptor was not initialized yet!";
     // cudnnRestoreDropoutDescriptor() introduced with cuDNN v7
     STATIC_ASSERT_CUDNN_VERSION_GE(7000);
     CUDNN_CALL(cudnnRestoreDropoutDescriptor(*dropout_desc, stream->dnn_handle_,
                                              dropout,
                                              state_space->handle.dptr,
                                              state_space->handle.size,
-                                             seed));
+                                             0));

Review comment:
       does the seed not matter?







[GitHub] [incubator-mxnet] ptrendx commented on a change in pull request #19209: [BUGFIX]Fix cuDNN dropout reproducibility

Posted by GitBox <gi...@apache.org>.
ptrendx commented on a change in pull request #19209:
URL: https://github.com/apache/incubator-mxnet/pull/19209#discussion_r492986716



##########
File path: src/resource.cc
##########
@@ -428,37 +486,28 @@ void* Resource::get_host_space_internal(size_t size) const {
 void Resource::get_cudnn_dropout_desc(
     cudnnDropoutDescriptor_t *dropout_desc,
     mshadow::Stream<gpu> *stream,
-    const float dropout, uint64_t seed,
+    const float dropout,
     const std::string &name) const {
 
   CHECK_EQ(req.type, ResourceRequest::kCuDNNDropoutDesc);
   auto state_space = static_cast<resource::SpaceAllocator*>(ptr_);
   CHECK_EQ(state_space->ctx.dev_id, stream->dev_id)
-    << "The device id of cuDNN dropout state space doesn't match that from stream.";
-  CUDNN_CALL(cudnnCreateDropoutDescriptor(dropout_desc));
+    << "The device id of cudnn dropout state space doesn't match that from stream.";
   if (dropout <= 0) {
     CUDNN_CALL(cudnnSetDropoutDescriptor(*dropout_desc, stream->dnn_handle_,
                                          dropout,
                                          nullptr,
                                          0, seed));
-  } else if (!state_space->handle.size) {
-    // not initialized yet.
-    size_t dropout_state_size;
-    CUDNN_CALL(cudnnDropoutGetStatesSize(stream->dnn_handle_, &dropout_state_size));
-    // reserve GPU space
-    Storage::Get()->DirectFree(Storage::Get()->Alloc(dropout_state_size, state_space->ctx));
-    CUDNN_CALL(cudnnSetDropoutDescriptor(*dropout_desc, stream->dnn_handle_,
-                                         dropout,
-                                         state_space->GetSpace(dropout_state_size, name),
-                                         dropout_state_size, seed));
   } else {
+    CHECK(state_space->handle.size > 0)
+      << "CUDNN dropout descriptor was not initialized yet!";
     // cudnnRestoreDropoutDescriptor() introduced with cuDNN v7
     STATIC_ASSERT_CUDNN_VERSION_GE(7000);
     CUDNN_CALL(cudnnRestoreDropoutDescriptor(*dropout_desc, stream->dnn_handle_,
                                              dropout,
                                              state_space->handle.dptr,
                                              state_space->handle.size,
-                                             seed));
+                                             0));

Review comment:
       It does not matter. From cuDNN API reference:
   ```
   seed
   Input. Seed used in prior call to cudnnSetDropoutDescriptor() that initialized states buffer. Using a different seed from this has no effect. A change of seed, and subsequent update to random number generator states can be achieved by calling cudnnSetDropoutDescriptor().
   ```
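
   In other words, once the states buffer has been initialized by a `set` call, a `restore` only re-attaches the existing buffer, and its seed argument is inert, so passing `0` is safe. A toy model of that contract (plain Python, not the real cuDNN API):

```python
import random

class DropoutDescriptor:
    """Toy model of the set/restore contract quoted above."""
    def __init__(self):
        self.states = None

    def set_descriptor(self, seed):
        # set: (re)initializes the states buffer from the seed
        self.states = random.Random(seed)

    def restore_descriptor(self, states, seed):
        # restore: re-attaches an existing states buffer; the seed
        # argument has no effect, per the cuDNN docs quoted above.
        self.states = states

owner = DropoutDescriptor()
owner.set_descriptor(1234)          # states seeded once, up front

view = DropoutDescriptor()
view.restore_descriptor(owner.states, seed=0)   # 0 is harmless here
assert view.states is owner.states
```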



















[GitHub] [incubator-mxnet] ptrendx commented on pull request #19209: [BUGFIX]Fix cuDNN dropout reproducibility

Posted by GitBox <gi...@apache.org>.
ptrendx commented on pull request #19209:
URL: https://github.com/apache/incubator-mxnet/pull/19209#issuecomment-698635453


   Ok, I now understand the reproducibility problem I saw: `cudnnSetDropoutDescriptor` is asynchronous and there was no proper synchronization of the CUDA stream, so if `cudnnSetDropoutDescriptor` was picked up by one thread and the dropout itself by another, there was a race condition on the CUDA side. I fixed that in the latest commit.
   
   I still believe there is a potential problem with resource assignment, although it is not something that would typically be hit, since ops are usually launched from a single thread. The thread-safe cached op would be the main way to trigger it, and even then somebody would have to seed frequently during execution, so that is also not a very common scenario.
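
   The synchronization fix can be sketched with ordinary threads standing in for CUDA streams (a hypothetical model; `threading.Event` plays the role of the stream synchronization added in the commit):

```python
import threading

states = {'initialized': False}
set_done = threading.Event()

def async_set_dropout_descriptor():
    # stands in for the asynchronous cudnnSetDropoutDescriptor work
    # running on its own stream/thread
    states['initialized'] = True
    set_done.set()

threading.Thread(target=async_set_dropout_descriptor).start()

# Without this wait, a consumer on another thread could use the dropout
# states before they are written -- the race described above.
set_done.wait()
assert states['initialized']
```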





[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #19209: [BUGFIX]Fix cuDNN dropout reproducibility

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #19209:
URL: https://github.com/apache/incubator-mxnet/pull/19209#issuecomment-696844580


   Hey @ptrendx , Thanks for submitting the PR 
   All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands: 
   - To trigger all jobs: @mxnet-bot run ci [all] 
   - To trigger specific jobs: @mxnet-bot run ci [job1, job2] 
   *** 
   **CI supported jobs**: [sanity, centos-cpu, windows-gpu, centos-gpu, edge, windows-cpu, unix-cpu, miscellaneous, website, unix-gpu, clang]
   *** 
   _Note_: 
    Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin. 
   All CI tests must pass before the PR can be merged. 
   














[GitHub] [incubator-mxnet] ptrendx commented on a change in pull request #19209: [BUGFIX]Fix cuDNN dropout reproducibility

Posted by GitBox <gi...@apache.org>.
ptrendx commented on a change in pull request #19209:
URL: https://github.com/apache/incubator-mxnet/pull/19209#discussion_r492991215



##########
File path: tests/python/gpu/test_gluon_gpu.py
##########
@@ -647,3 +647,31 @@ def test_gemms_true_fp16():
     assert_almost_equal(ref_results.asnumpy(), results_trueFP16.asnumpy(),
                         atol=atol, rtol=rtol)
 
+@with_seed()
+def test_cudnn_dropout_reproducibility():

Review comment:
       :+1:







[GitHub] [incubator-mxnet] ptrendx commented on pull request #19209: [BUGFIX]Fix cuDNN dropout reproducibility

Posted by GitBox <gi...@apache.org>.
ptrendx commented on pull request #19209:
URL: https://github.com/apache/incubator-mxnet/pull/19209#issuecomment-698556584


   Actually, it seems that calling `waitall` when seeding does not really help, while having only a single copy works every time. Let me dig deeper into this.




