Posted to issues@mxnet.apache.org by GitBox <gi...@apache.org> on 2021/02/10 02:10:20 UTC

[GitHub] [incubator-mxnet] josephevans opened a new issue #19877: [v1.x] test_gluon_data unit tests failing

josephevans opened a new issue #19877:
URL: https://github.com/apache/incubator-mxnet/issues/19877


   ## Description
   On the v1.x pipeline, we are seeing the following test failures consistently:
   
   in tests/python/unittest/test_gluon_data.py:
   
   test_multi_worker_dataloader_release_pool
   test_multi_worker_forked_data_loader
   
   ## Occurrences
   https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-19872/7/pipeline/293/#step-776-log-1725
   https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-19872/4/pipeline/296
   
   Test failure logs:
   ```
   [2021-02-10T01:39:46.205Z] test_gluon_data.test_multi_worker_dataloader_release_pool ... terminate called after throwing an instance of 'dmlc::Error'
   [2021-02-10T01:39:46.205Z]   what():  [01:39:41] src/storage/./cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0) : 
   [2021-02-10T01:39:46.205Z] Stack trace:
   [2021-02-10T01:39:46.205Z]   [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x61) [0x7f191fc63b61]
   [2021-02-10T01:39:46.205Z]   [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::storage::CPUSharedStorageManager::FreeImpl(mxnet::Storage::Handle const&)+0xd3) [0x7f192522fdf3]
   [2021-02-10T01:39:46.205Z]   [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::storage::CPUSharedStorageManager::Free(mxnet::Storage::Handle)+0x98) [0x7f1925237348]
   [2021-02-10T01:39:46.205Z]   [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::StorageImpl::Free(mxnet::Storage::Handle)+0x69) [0x7f1925232ce9]
   [2021-02-10T01:39:46.205Z]   [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x5ade409) [0x7f1924b21409]
   [2021-02-10T01:39:46.205Z]   [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x61d3c50) [0x7f1925216c50]
   [2021-02-10T01:39:46.205Z]   [bt] (6) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0xa50) [0x7f1925210440]
   [2021-02-10T01:39:46.205Z]   [bt] (7) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)+0x349) [0x7f192522c9d9]
   [2021-02-10T01:39:46.205Z]   [bt] (8) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::Push(mxnet::engine::Opr*, mxnet::Context, int, bool)+0x42b) [0x7f1925219f5b]
   [2021-02-10T01:39:46.205Z]   [bt] (9) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool)+0xd8) [0x7f1925216948]
   [2021-02-10T01:39:46.461Z] /work/runtime_functions.sh: line 1008:     6 Aborted                 (core dumped) nosetests-3.4 $NOSE_COVERAGE_ARGUMENTS $NOSE_TIMER_ARGUMENTS --with-xunit --xunit-file nosetests_unittest.xml --verbose 
   ```
   
   ```
   [2021-02-09T22:11:59.574Z] ======================================================================
   [2021-02-09T22:11:59.574Z] ERROR: test_gluon_data.test_multi_worker_forked_data_loader
   [2021-02-09T22:11:59.574Z] ----------------------------------------------------------------------
   [2021-02-09T22:11:59.574Z] Traceback (most recent call last):
   [2021-02-09T22:11:59.574Z]   File "/usr/local/lib/python3.7/dist-packages/nose/case.py", line 198, in runTest
   [2021-02-09T22:11:59.574Z]     self.test(*self.arg)
   [2021-02-09T22:11:59.574Z]   File "/work/mxnet/tests/python/unittest/common.py", line 226, in test_new
   [2021-02-09T22:11:59.574Z]     mx.nd.waitall()
   [2021-02-09T22:11:59.574Z]   File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 211, in waitall
   [2021-02-09T22:11:59.574Z]     check_call(_LIB.MXNDArrayWaitAll())
   [2021-02-09T22:11:59.574Z]   File "/work/mxnet/python/mxnet/base.py", line 246, in check_call
   [2021-02-09T22:11:59.574Z]     raise get_last_ffi_error()
   [2021-02-09T22:11:59.574Z] mxnet.base.MXNetError: Traceback (most recent call last):
   [2021-02-09T22:11:59.574Z]   [bt] (9) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool)+0xd8) [0x7f0df6da1c48]
   [2021-02-09T22:11:59.574Z]   [bt] (8) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::Push(mxnet::engine::Opr*, mxnet::Context, int, bool)+0x42b) [0x7f0df6da525b]
   [2021-02-09T22:11:59.574Z]   [bt] (7) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)+0x349) [0x7f0df6db7e69]
   [2021-02-09T22:11:59.574Z]   [bt] (6) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0xa50) [0x7f0df6d9b740]
   [2021-02-09T22:11:59.574Z]   [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x63dbf50) [0x7f0df6da1f50]
   [2021-02-09T22:11:59.574Z]   [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x5cde545) [0x7f0df66a4545]
   [2021-02-09T22:11:59.574Z]   [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::StorageImpl::Free(mxnet::Storage::Handle)+0x69) [0x7f0df6dbe0b9]
   [2021-02-09T22:11:59.574Z]   [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::storage::CPUSharedStorageManager::Free(mxnet::Storage::Handle)+0x98) [0x7f0df6dc2718]
   [2021-02-09T22:11:59.574Z]   [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::storage::CPUSharedStorageManager::FreeImpl(mxnet::Storage::Handle const&)+0xcf) [0x7f0df6dbb27f]
   [2021-02-09T22:11:59.574Z]   [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x61) [0x7f0df16c59e1]
   [2021-02-09T22:11:59.574Z]   File "src/storage/./cpu_shared_storage_manager.h", line 218
   [2021-02-09T22:11:59.574Z] MXNetError: Check failed: count >= 0 (-1 vs. 0) : 
   ```
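Both logs fail the same check, `count >= 0`, inside `CPUSharedStorageManager::FreeImpl`, and the message prints the observed value next to the bound (`-2 vs. 0`, `-1 vs. 0`). A toy sketch of how such a value can arise follows. It assumes `count` is a per-segment reference counter, which is only inferred from the stack trace. `SharedSegment` and its methods are hypothetical names, not MXNet code, and the sketch says nothing about the actual root cause.

```python
# Toy model of a reference-counted shared-memory segment.
# NOT MXNet's implementation; all names here are hypothetical.

class SharedSegment:
    def __init__(self):
        self.ref_count = 1  # one owner at creation

    def add_ref(self):
        # A second owner (for example another process) takes a reference.
        self.ref_count += 1

    def free(self):
        # Release one reference and validate the result, mirroring the
        # shape of the failing check: count >= 0.
        self.ref_count -= 1
        count = self.ref_count
        if count < 0:
            raise RuntimeError(f"Check failed: count >= 0 ({count} vs. 0)")
        return count


def demo_extra_free():
    """Free a segment once more than it was referenced; return the error text."""
    seg = SharedSegment()
    seg.free()          # count -> 0: the last owner released the segment
    try:
        seg.free()      # count -> -1: one release too many
    except RuntimeError as err:
        return str(err)
    return None
```

In this model a negative value only appears when a segment is released more often than it was referenced, which is one plausible reading of the logs rather than a confirmed diagnosis.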
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@mxnet.apache.org
For additional commands, e-mail: issues-help@mxnet.apache.org


[GitHub] [incubator-mxnet] access2rohit commented on issue #19877: [v1.x] test_gluon_data unit tests failing

Posted by GitBox <gi...@apache.org>.
access2rohit commented on issue #19877:
URL: https://github.com/apache/incubator-mxnet/issues/19877#issuecomment-777041776


   PR #19748 changed the Gluon data loader, but I could not reproduce the failure on a local machine. CI is unblocked for now; the root cause still needs investigation.




[GitHub] [incubator-mxnet] josephevans commented on issue #19877: [v1.x] test_gluon_data unit tests failing

Posted by GitBox <gi...@apache.org>.
josephevans commented on issue #19877:
URL: https://github.com/apache/incubator-mxnet/issues/19877#issuecomment-784756307


   Thanks @ptrendx for the fix! Closing this issue.




[GitHub] [incubator-mxnet] josephevans commented on issue #19877: [v1.x] test_gluon_data unit tests failing

Posted by GitBox <gi...@apache.org>.
josephevans commented on issue #19877:
URL: https://github.com/apache/incubator-mxnet/issues/19877#issuecomment-777693865


   I don't think it's related, but I created https://github.com/apache/incubator-mxnet/pull/19886 to test whether upgrading Python (from 3.6 to 3.7) caused these unit test failures.




[GitHub] [incubator-mxnet] josephevans commented on issue #19877: [v1.x] test_gluon_data unit tests failing

Posted by GitBox <gi...@apache.org>.
josephevans commented on issue #19877:
URL: https://github.com/apache/incubator-mxnet/issues/19877#issuecomment-776409416


   Since we're trying to unblock the v1.x CI pipeline, I am disabling these 2 tests for now in https://github.com/apache/incubator-mxnet/pull/19872. 
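One common way to disable a Python test temporarily is the standard-library `unittest.skip` decorator, which nose also honors. The sketch below is an assumption about the general approach, not a copy of what PR #19872 does. The test body is a placeholder and `run_skipped` is a hypothetical helper that only demonstrates the skip being recorded.

```python
import unittest

# Hypothetical stand-in for one of the affected tests; the body is a placeholder.
@unittest.skip("Temporarily disabled, see "
               "https://github.com/apache/incubator-mxnet/issues/19877")
def test_multi_worker_forked_data_loader():
    raise AssertionError("should not run while skipped")


def run_skipped():
    """Run the decorated function as a test case.

    Returns (number of skipped tests, number of failures plus errors).
    """
    case = unittest.FunctionTestCase(test_multi_worker_forked_data_loader)
    result = unittest.TestResult()
    case.run(result)
    return len(result.skipped), len(result.failures) + len(result.errors)
```

Keeping the issue URL in the skip reason makes it easier to find and re-enable the test once the root cause is fixed.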




[GitHub] [incubator-mxnet] josephevans closed issue #19877: [v1.x] test_gluon_data unit tests failing

Posted by GitBox <gi...@apache.org>.
josephevans closed issue #19877:
URL: https://github.com/apache/incubator-mxnet/issues/19877


   




[GitHub] [incubator-mxnet] ptrendx commented on issue #19877: [v1.x] test_gluon_data unit tests failing

Posted by GitBox <gi...@apache.org>.
ptrendx commented on issue #19877:
URL: https://github.com/apache/incubator-mxnet/issues/19877#issuecomment-781834904


   See my comment in the other issue about this: https://github.com/apache/incubator-mxnet/issues/19918#issuecomment-781832648




[GitHub] [incubator-mxnet] access2rohit commented on issue #19877: [v1.x] test_gluon_data unit tests failing

Posted by GitBox <gi...@apache.org>.
access2rohit commented on issue #19877:
URL: https://github.com/apache/incubator-mxnet/issues/19877#issuecomment-777068789


   Raised PR #19879 to revert #19748 and re-enable the Gluon data loader tests, to see whether #19748 causes the test failures. I am unable to reproduce the failure on a local instance using master without skipping the tests.




[GitHub] [incubator-mxnet] josephevans commented on issue #19877: [v1.x] test_gluon_data unit tests failing

Posted by GitBox <gi...@apache.org>.
josephevans commented on issue #19877:
URL: https://github.com/apache/incubator-mxnet/issues/19877#issuecomment-777574492


   Getting another test failure that looks related to the Gluon data loader. We really need to dig in and find the root cause of this issue.
   
   ```
   [2021-02-11T06:56:52.649Z] test_gluon_data.test_list_dataset ... terminate called after throwing an instance of 'dmlc::Error'
   [2021-02-11T06:56:52.649Z]   what():  [06:56:48] /work/mxnet/src/storage/./cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-1 vs. 0) : 
   [2021-02-11T06:56:52.649Z] Stack trace:
   [2021-02-11T06:56:52.649Z]   [bt] (0) /work/mxnet/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x61) [0x7f882bad4491]
   [2021-02-11T06:56:52.649Z]   [bt] (1) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::storage::CPUSharedStorageManager::FreeImpl(mxnet::Storage::Handle const&)+0xd3) [0x7f882e523ea3]
   [2021-02-11T06:56:52.649Z]   [bt] (2) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::storage::CPUSharedStorageManager::Free(mxnet::Storage::Handle)+0x98) [0x7f882e527c08]
   [2021-02-11T06:56:52.649Z]   [bt] (3) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::StorageImpl::Free(mxnet::Storage::Handle)+0x69) [0x7f882e526d39]
   [2021-02-11T06:56:52.649Z]   [bt] (4) /work/mxnet/python/mxnet/../../build/libmxnet.so(+0xc747fa) [0x7f882bdc47fa]
   [2021-02-11T06:56:52.649Z]   [bt] (5) /work/mxnet/python/mxnet/../../build/libmxnet.so(+0xac46bf) [0x7f882bc146bf]
   [2021-02-11T06:56:52.649Z]   [bt] (6) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x5c4) [0x7f882bc20234]
   [2021-02-11T06:56:52.649Z]   [bt] (7) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)+0x33e) [0x7f882bc2694e]
   [2021-02-11T06:56:52.649Z]   [bt] (8) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEngine::Push(mxnet::engine::Opr*, mxnet::Context, int, bool)+0x190) [0x7f882bc16ba0]
   [2021-02-11T06:56:52.649Z]   [bt] (9) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool)+0xd7) [0x7f882bc143b7]
   [2021-02-11T06:56:52.649Z] 
   [2021-02-11T06:56:52.649Z] 
   [2021-02-11T06:56:52.649Z] /work/runtime_functions.sh: line 1008:     6 Aborted                 (core dumped) nosetests-3.4 $NOSE_COVERAGE_ARGUMENTS $NOSE_TIMER_ARGUMENTS --with-xunit --xunit-file nosetests_unittest.xml --verbose 
   ```
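To compare occurrences across CI runs, the test name and the failing `count` value can be pulled out of a saved log. The following is a minimal sketch, assuming the log uses the verbose nose line format (`module.test_name ...`) and the check message exactly as quoted in this thread. `summarize` and the two patterns are hypothetical helpers, not part of the MXNet tooling.

```python
import re

# Matches the check-failure text as it appears in the logs quoted in this thread.
CHECK_RE = re.compile(r"Check failed: count >= 0 \((-?\d+) vs\. 0\)")
# Matches nose's verbose "module.test_name ..." prefix.
TEST_RE = re.compile(r"(test_\w+\.test_\w+) \.\.\.")


def summarize(log_text):
    """Return a list of (last seen test name or None, count value) pairs,
    one per check failure found in the log text."""
    results = []
    current_test = None
    for line in log_text.splitlines():
        test_match = TEST_RE.search(line)
        if test_match:
            current_test = test_match.group(1)
        check_match = CHECK_RE.search(line)
        if check_match:
            results.append((current_test, int(check_match.group(1))))
    return results
```

Logs in the nose error-report format (`ERROR: module.test_name`) carry no `...` marker, so for those this sketch reports the count with a test name of `None`.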

