You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/08/08 07:33:54 UTC

[GitHub] jinhuang415 edited a comment on issue #11998: Flaky test MKLDNN_BASE.MKLDNNSum

jinhuang415 edited a comment on issue #11998: Flaky test MKLDNN_BASE.MKLDNNSum
URL: https://github.com/apache/incubator-mxnet/issues/11998#issuecomment-411313201
 
 
   @marcoabreu @zheng-da @azai91 @pengzhao-intel 
   We have found the root cause of this issue and here is the analysis:
   The test failed at MKLDNNSum in place test case, and failed due to delay_alloc check failure in SetMKLMem(), 
   ```
   void NDArray::Chunk::SetMKLMem(const TShape &shape, int dtype) {
   ...
     if (shandle.dptr == nullptr) {
       CHECK(delay_alloc);         <<< check failure here
       CheckAndAlloc();
     }
   ...
   ```
   it indicates the related NDArray's shandle is nullptr but delay_alloc is set to false,  usually delay_alloc will only be set to false after shandle is allocated, so most probably there is race condition here (this issue is very hard to reproduce, only happens occasionally), and I tried to print delay_alloc 2 times before `CHECK(delay_alloc) `and it is found when issue happens delay_alloc is true at first but then become false for the second time! This confirms that delay_alloc must be set to false by another thread asynchronously and cause thread race condition (print will make the race condition easier to happen). By checking the test code it is found there may exist race condition for below 2 calls (commented lines)
   
   ```
   TEST(MKLDNN_BASE, MKLDNNSum) {
    ...
       NDArrayAttrs orig_arr(in_arr.arr.Copy(in_arr.arr.ctx()), "In Place Copy");    <<< Copy will push operation to engine asynchronously which will alloc shandle for orig_arr.arr
       PrintVerifyMsg(orig_arr, in_arr);
       InitMKLDNNArray(&orig_arr.arr, input_mem->get_primitive_desc());     <<< may alloc shandle for orig_arr.arr which cause conflict
       orig_arr.arr.CopyFrom(*input_mem);
   ...
   }
   ```
   
   The detailed function calls is as below:
   ```
   NDArray::Copy()
     --> CopyFromTo(*this, ret)
       --> Engine::Get()->PushAsync(...,   
              CopyFromToImpl<cpu, cpu>(from, to, ctx, requested) ...)
                 --> CopyFromToDnsImpl()
                     --> to.GetMKLDNNData()
                        --> ptr_->SetMKLMem()
                          --> CHECK(delay_alloc) 
                          -->  CheckAndAlloc()
                             --> alloc shandle and set delay_alloc to false
                            
   
   InitMKLDNNArray()
     --> InitDefaultArray(arr, is_rand)
       --> arr->data()
         --> CheckAndAlloc()
            --> alloc shandle and set delay_alloc to false
   ```
   
   InitMKLDNNArray() happens in test main thread while CopyFromToImpl() happens asynchronously in Engine thread, so there may exist thread race condition here that InitMKLDNNArray() set delay_alloc to false and cause the CHECK failed for CopyFromToImpl().
   
   A simple fix is to add barrier to wait for Copy operation to finish before further operations to get rid of the race condition. The proposed change is as below:
   ```
   
   diff --git a/tests/cpp/operator/mkldnn.cc b/tests/cpp/operator/mkldnn.cc
   index 4f8f6f1..59bd3a5 100644
   --- a/tests/cpp/operator/mkldnn.cc
   +++ b/tests/cpp/operator/mkldnn.cc
   @@ -1308,6 +1308,7 @@ TEST(MKLDNN_BASE, MKLDNNSum) {
        auto input_mem = in_arr.arr.GetMKLDNNData();
        auto input_mem2 = in_arr2.arr.GetMKLDNNData();
        NDArrayAttrs orig_arr(in_arr.arr.Copy(in_arr.arr.ctx()), "In Place Copy");
   +    orig_arr.arr.WaitToRead();
        PrintVerifyMsg(orig_arr, in_arr);
        InitMKLDNNArray(&orig_arr.arr, input_mem->get_primitive_desc());
        orig_arr.arr.CopyFrom(*input_mem);
   
   
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services