Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/03/06 18:46:12 UTC

[GitHub] [incubator-mxnet] leezu opened a new issue #17782: cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0)

leezu opened a new issue #17782: cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0)
URL: https://github.com/apache/incubator-mxnet/issues/17782
 
 
   ## Description
   On OSX, `test_gluon_data.test_multi_worker_dataloader_release_pool` frequently fails with ` ... libc++abi.dylib: terminating with uncaught exception of type dmlc::Error: [06:16:36] ../src/storage/./cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0) : `
   
   https://github.com/apache/incubator-mxnet/blob/db2295b0f39852c3af2878932f48a672982f42d4/src/storage/cpu_shared_storage_manager.h#L216-L242
   
   CC: @eric-haibin-lin 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-mxnet] zhreshold commented on issue #17782: cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0)

Posted by GitBox <gi...@apache.org>.
zhreshold commented on issue #17782: cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0)
URL: https://github.com/apache/incubator-mxnet/issues/17782#issuecomment-596263666
 
 
   I don't think I had seen this error log before Leo sent it to me.


[GitHub] [incubator-mxnet] zixuanweeei commented on issue #17782: cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0)

Posted by GitBox <gi...@apache.org>.
zixuanweeei commented on issue #17782: cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0)
URL: https://github.com/apache/incubator-mxnet/issues/17782#issuecomment-599968539
 
 
   We also see this problem with MXNet built from source on Linux using the following command:
   ```
   make -j DEBUG=1 ENABLE_TESTCOVERAGE=0 USE_CPP_PACKAGE=0 USE_MKLDNN=1 USE_BLAS=blas USE_SIGNAL_HANDLER=1
   ```


[GitHub] [incubator-mxnet] leezu commented on issue #17782: cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0)

Posted by GitBox <gi...@apache.org>.
leezu commented on issue #17782: cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0)
URL: https://github.com/apache/incubator-mxnet/issues/17782#issuecomment-596827563
 
 
   Need to revert https://github.com/apache/incubator-mxnet/pull/17797/ when fixing this issue


[GitHub] [incubator-mxnet] zixuanweeei edited a comment on issue #17782: cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0)

Posted by GitBox <gi...@apache.org>.
zixuanweeei edited a comment on issue #17782: cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0)
URL: https://github.com/apache/incubator-mxnet/issues/17782#issuecomment-599968539
 
 
   We also see this problem with MXNet built from source on Linux using the following command:
   ```
   make -j DEBUG=1 ENABLE_TESTCOVERAGE=0 USE_CPP_PACKAGE=0 USE_MKLDNN=1 USE_BLAS=blas USE_SIGNAL_HANDLER=1
   ```


[GitHub] [incubator-mxnet] eric-haibin-lin commented on issue #17782: cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0)

Posted by GitBox <gi...@apache.org>.
eric-haibin-lin commented on issue #17782: cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0)
URL: https://github.com/apache/incubator-mxnet/issues/17782#issuecomment-596024481
 
 
   @zhreshold have you seen this before? 


[GitHub] [incubator-mxnet] zixuanweeei commented on issue #17782: cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0)

Posted by GitBox <gi...@apache.org>.
zixuanweeei commented on issue #17782: cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0)
URL: https://github.com/apache/incubator-mxnet/issues/17782#issuecomment-600371267
 
 
   Tested at 1efb747ad5cf695f181c64bfb16daccd6ca4c456 using the following command:
   ```
   MXNET_SUBGRAPH_VERBOSE=0 nosetests -s -v --logging-level=DEBUG tests/python/unittest/test_gluon_data.py
   ```
   
   Its output:
   ```
   [INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1673400754 to reproduce.
   test_gluon_data.test_array_dataset ... [DEBUG] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1615701232 to reproduce.
   ok
   test_gluon_data.test_recordimage_dataset ... [DEBUG] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=187503193 to reproduce.
   ok
   test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker ... [DEBUG] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1225456503 to reproduce.
   ok
   test_gluon_data.test_sampler ... [DEBUG] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1061742723 to reproduce.
   ok
   test_gluon_data.test_datasets ... [DEBUG] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1762025483 to reproduce.
   ok
   test_gluon_data.test_image_folder_dataset ... [DEBUG] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=731150274 to reproduce.
   ok
   test_gluon_data.test_list_dataset ... [DEBUG] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1768153160 to reproduce.
   ok
   test_gluon_data.test_multi_worker ... [DEBUG] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1350251679 to reproduce.
   ok
   test_gluon_data.test_multi_worker_shape ... [DEBUG] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1650045375 to reproduce.
   ok
   test_gluon_data.test_multi_worker_forked_data_loader ... [DEBUG] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=589614560 to reproduce.
   terminate called after throwing an instance of 'dmlc::Error'
     what():  [08:54:22] src/storage/./cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0) :
   
   
   [1]    63713 abort      MXNET_SUBGRAPH_VERBOSE=0 nosetests -s -v --logging-level=DEBUG
   ```


[GitHub] [incubator-mxnet] leezu commented on issue #17782: cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0)

Posted by GitBox <gi...@apache.org>.
leezu commented on issue #17782: cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0)
URL: https://github.com/apache/incubator-mxnet/issues/17782#issuecomment-600267777
 
 
   @zixuanweeei can you share further instructions on how to reproduce this on Linux? I was unable to reproduce the failure with the cmake build on Linux.


[GitHub] [incubator-mxnet] ptrendx commented on issue #17782: cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0)

Posted by GitBox <gi...@apache.org>.
ptrendx commented on issue #17782:
URL: https://github.com/apache/incubator-mxnet/issues/17782#issuecomment-732443750


   We encountered this issue when moving to Ubuntu 20.04 and Python 3.8. The issue does not happen during any single test, but running multiple data worker tests in a row triggers it. The repro that we used for investigation was:
   ```
   nosetests --verbose -s tests/python/unittest/test_gluon_data.py:test_multi_worker{,_shape,_forked_data_loader,_dataloader_release_pool}
   ```
   The root cause seems to be how Python memory management interacts with forking. If some shared-memory `NDArray`s are still alive (because garbage collection has not run yet or an operation is still running) when the subsequent test creates its dataloader, the child processes get copies of those `NDArray`s without the usage count being incremented, since the usage counter, like the actual data, lives in the shared memory region rather than in the memory space of the process calling `fork`. After the fork, when Python garbage collection kicks in, all of the processes (the parent that held the `NDArray` in the first place as well as its children) try to destroy the `NDArray`. Two scenarios can then happen:
    - the parent is "lucky" and destroys it first - then the counter becomes 0 and the children then die because of this `CHECK` inside `cpu_shared_storage_manager.h` -> this results in a hang of the dataloader as observed in #17774
    - the parent is "unlucky" and one of the children destroys the `NDArray` first -> parent hits the `CHECK` itself and dies with the error from this issue.
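
To make the two scenarios concrete, here is a minimal pure-Python sketch of the counting problem. It does not call `fork` or touch MXNet; `SharedSegment` and `simulate` are hypothetical names made up for illustration, and the failing check only mimics the `count >= 0` condition from the error message, not the actual code in `cpu_shared_storage_manager.h`.

```python
class SharedSegment:
    """Hypothetical stand-in for a shared-memory block whose usage
    counter lives inside the segment, not in any one process."""

    def __init__(self):
        self.count = 1  # only the creating (parent) process is counted

    def release(self, who):
        # Every process holding a handle decrements the same counter.
        self.count -= 1
        if self.count < 0:
            raise RuntimeError(
                f"{who}: Check failed: count >= 0 ({self.count} vs. 0)")


def simulate(order):
    """Release the segment once per process, in the given order.

    The 'fork' is modelled by every process in `order` sharing one
    segment without the counter having been incremented for it.
    """
    segment = SharedSegment()
    outcome = {}
    for who in order:
        try:
            segment.release(who)
            outcome[who] = "ok"
        except RuntimeError:
            outcome[who] = "abort"
    return outcome, segment.count


# Parent is "lucky": it releases first, the children hit the check.
print(simulate(["parent", "child0", "child1"]))
# Parent is "unlucky": a child releases first, the parent hits the check.
print(simulate(["child0", "parent", "child1"]))
```

Whichever process releases first succeeds and every later release fails the check, which is why the outcome depends on the timing of garbage collection in each process.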
   
   We intend to work around this issue for our upcoming release by inserting
   ```python
   mx.nd.waitall()
   import gc
   gc.collect()
   ```
   in the Gluon `DataLoader` constructor (which made the error disappear in our tests), but a more robust solution should be devised (maybe increment the usage counts of all the shared memory arrays during fork?).
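
One possible reason the explicit `gc.collect()` helps is that objects caught in a reference cycle are not freed when the last name is deleted; they stay alive until the cyclic collector runs. The sketch below shows that behaviour in plain Python. `FakeSharedArray` and `demo` are hypothetical names, and whether the leftover arrays in the failing tests are actually held by cycles is an assumption, not something confirmed in this thread.

```python
import gc


class FakeSharedArray:
    """Hypothetical stand-in for a shared-memory array; records its own release."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.holder = None

    def __del__(self):
        self.log.append(self.name)


def demo():
    log = []
    was_enabled = gc.isenabled()
    gc.disable()  # keep the automatic collector out of the way for the demo
    try:
        arr = FakeSharedArray("batch0", log)
        holder = {"array": arr}
        arr.holder = holder      # reference cycle: arr -> holder -> arr
        del arr, holder          # no names left, but the cycle keeps both alive
        before = list(log)       # nothing has been released yet
        gc.collect()             # the cyclic collector finalizes the array
        after = list(log)
    finally:
        if was_enabled:
            gc.enable()
    return before, after


before, after = demo()
print("released before gc.collect():", before)
print("released after gc.collect(): ", after)
```

Under that assumption, forcing a collection before the workers are forked would release such arrays in the parent while it is still the only process holding them.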

