Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/07/16 18:13:36 UTC

[GitHub] [incubator-mxnet] guptaanshul201989 opened a new issue #18734: Multi-gpu Mxnet training in sagemaker gives cuda error when dataloader is using multiprocessing

guptaanshul201989 opened a new issue #18734:
URL: https://github.com/apache/incubator-mxnet/issues/18734


   I am trying to train a transformer seq-to-seq model on SageMaker (the script I am using works fine when I run it on a multi-GPU EC2 instance).
   
   When I start a training job on SageMaker, the training progresses fine, but it logs a CUDA error:
   
   ```
   [03:28:04] src/engine/threaded_engine_perdevice.cc:101: Ignore CUDA Error [03:28:04] /root/pip_build/mxnet-build/3rdparty/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess: CUDA: initialization error
   Stack trace:
   [bt] (0) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x6dfb0b) [0x7f9f2591cb0b]
   [bt] (1) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3898dd2) [0x7f9f28ad5dd2]
   [bt] (2) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38bc49e) [0x7f9f28af949e]
   [bt] (3) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38aee71) [0x7f9f28aebe71]
   [bt] (4) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a4a21) [0x7f9f28ae1a21]
   [bt] (5) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a5974) [0x7f9f28ae2974]
   [bt] (6) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::NDArray::Chunk::~Chunk()+0x48a) [0x7f9f28d1ce1a]
   [bt] (7) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x6e32ba) [0x7f9f259202ba]
   [bt] (8) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >::~vector()+0xc8) [0x7f9f25951818]
   ```
   
   
   I found that I get this error when I initialize the DataLoader with multiprocessing. When I switch `thread_pool` on instead, I don't see the error.
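   
   A minimal sketch of the two configurations with Gluon's `DataLoader` (the toy dataset and worker count below are placeholders, not the actual training setup):
   
   ```python
   import mxnet as mx
   from mxnet.gluon.data import ArrayDataset, DataLoader
   
   # Toy stand-in for the real seq-to-seq dataset.
   dataset = ArrayDataset(mx.nd.random.uniform(shape=(1000, 32)))
   
   # Forked worker processes: this configuration logs the CUDA error when
   # the parent process has already initialized CUDA.
   loader_fork = DataLoader(dataset, batch_size=64, num_workers=4)
   
   # Worker threads instead of forked processes: no error.
   loader_threads = DataLoader(dataset, batch_size=64, num_workers=4,
                               thread_pool=True)
   ```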
   
   



[GitHub] [incubator-mxnet] szha commented on issue #18734: Multi-gpu Mxnet training in sagemaker gives cuda error when dataloader is using multiprocessing

szha commented on issue #18734:
URL: https://github.com/apache/incubator-mxnet/issues/18734#issuecomment-670731267


   The problem is a general one: CUDA doesn't support forking after initialization. Multiprocessing is one way in which this problem is exposed. @ptrendx does CUDA plan on addressing this limitation?



[GitHub] [incubator-mxnet] ndeepesh edited a comment on issue #18734: Multi-gpu Mxnet training in sagemaker gives cuda error when dataloader is using multiprocessing

ndeepesh edited a comment on issue #18734:
URL: https://github.com/apache/incubator-mxnet/issues/18734#issuecomment-703035649


   @szha @sandeep-krishnamurthy Can you link the document with the workaround? We are also seeing similar issues with Python multiprocessing.






[GitHub] [incubator-mxnet] sandeep-krishnamurthy commented on issue #18734: Multi-gpu Mxnet training in sagemaker gives cuda error when dataloader is using multiprocessing

sandeep-krishnamurthy commented on issue #18734:
URL: https://github.com/apache/incubator-mxnet/issues/18734#issuecomment-670722109


   @leezu @szha @eric-haibin-lin @zhreshold - Multiprocessing in MXNet is causing multiple issues like this, as well as non-deterministic hangs. Given that 1.x is heavily used in production and will have customers for quite some time, what do you recommend for moving away from multiprocessing in 1.x?
   
   cc @karan6181 



[GitHub] [incubator-mxnet] ptrendx commented on issue #18734: Multi-gpu Mxnet training in sagemaker gives cuda error when dataloader is using multiprocessing

ptrendx commented on issue #18734:
URL: https://github.com/apache/incubator-mxnet/issues/18734#issuecomment-703261686


   It's hard to answer exactly why you see this without knowing your training script (or at least the part that runs before you start the other processes). That said, the fact that you get the error during `MXNDArrayFree` suggests that you created some NDArrays on the GPU before the fork. Maybe the issue is intermittent because the child only tries to get rid of those particular NDArrays during garbage collection in Python?
   
   The error itself happens in the child process, and I don't believe it should happen in the parent process, so as long as you do not need to do anything CUDA-related in the child processes, you should be OK, I think.
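   
   A toy sketch of that hypothesis (assuming Linux's default `fork` start method; not the actual training script):
   
   ```python
   import gc
   import multiprocessing as mp
   import mxnet as mx
   
   # Parent creates an NDArray on the GPU, initializing CUDA, before forking.
   a = mx.nd.ones((1,), ctx=mx.gpu(0))
   a.wait_to_read()
   
   def child():
       # The forked child inherits `a`; dropping the last reference frees GPU
       # memory from inside the child, whose inherited CUDA context is invalid,
       # producing the "Ignore CUDA Error ... initialization error" log line.
       global a
       del a
       gc.collect()
   
   if __name__ == '__main__':
       p = mp.Process(target=child)
       p.start()
       p.join()
   ```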






[GitHub] [incubator-mxnet] ndeepesh commented on issue #18734: Multi-gpu Mxnet training in sagemaker gives cuda error when dataloader is using multiprocessing

ndeepesh commented on issue #18734:
URL: https://github.com/apache/incubator-mxnet/issues/18734#issuecomment-703312835


   Thanks @ptrendx 
   
   We do load one tensor on the GPU before we start the other processes. We use those processes to prepare and preprocess batches in parallel; the batches are then picked up by the parent process (to be fed to the GPU) via `multiprocessing.Queue`. The child processes are only responsible for loading and preparing batches; they have nothing to do with CUDA.
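   
   As a rough sketch of this pattern (shapes and counts are illustrative; assumes Linux's default `fork` start method):
   
   ```python
   import multiprocessing as mp
   import mxnet as mx
   
   def worker(q):
       # CPU-only batch preparation; the child never touches CUDA.
       for _ in range(10):
           q.put(mx.nd.ones((64, 32)))
       q.put(None)  # sentinel
   
   if __name__ == '__main__':
       # The risky part: the parent initializes CUDA before forking.
       warmup = mx.nd.ones((1,), ctx=mx.gpu(0))
       warmup.wait_to_read()
   
       q = mp.Queue()
       p = mp.Process(target=worker, args=(q,))
       p.start()
   
       ctx = mx.gpu(0)
       while True:
           batch = q.get()
           if batch is None:
               break
           batch = batch.as_in_context(ctx)  # only the parent moves data to the GPU
       p.join()
   ```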




[GitHub] [incubator-mxnet] szha edited a comment on issue #18734: Multi-gpu Mxnet training in sagemaker gives cuda error when dataloader is using multiprocessing

szha edited a comment on issue #18734:
URL: https://github.com/apache/incubator-mxnet/issues/18734#issuecomment-703170228


   @ndeepesh Probably. @ptrendx may know more about the details of the forking issue in CUDA.




[GitHub] [incubator-mxnet] ndeepesh commented on issue #18734: Multi-gpu Mxnet training in sagemaker gives cuda error when dataloader is using multiprocessing

ndeepesh commented on issue #18734:
URL: https://github.com/apache/incubator-mxnet/issues/18734#issuecomment-703184928


   Thanks @szha. @ptrendx Can you help answer the questions below?
   1. We don't see any training failures due to the CUDA initialization error in a multiprocessing environment. Is this expected? Are there any side effects?
   2. We see this issue intermittently; not all training jobs are affected. Why would that be?
   
   Here is the exact warning/error message we get in the logs:
   
   ```
   [23:24:18] src/engine/threaded_engine_perdevice.cc:101: Ignore CUDA Error [23:24:18] /root/pip_build/mxnet-build/3rdparty/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess: CUDA: initialization error
   Stack trace:
   [bt] (0) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x6e305b) [0x7f0d92f4b05b]
   [bt] (1) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a6622) [0x7f0d9610e622]
   [bt] (2) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38c9d1e) [0x7f0d96131d1e]
   [bt] (3) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38bc6e1) [0x7f0d961246e1]
   [bt] (4) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38b2291) [0x7f0d9611a291]
   [bt] (5) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38b31e4) [0x7f0d9611b1e4]
   [bt] (6) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::NDArray::Chunk::~Chunk()+0x48a) [0x7f0d96356e3a]
   [bt] (7) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x6e680a) [0x7f0d92f4e80a]
   [bt] (8) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(MXNDArrayFree+0x54) [0x7f0d96082f44]
   ```




[GitHub] [incubator-mxnet] szha edited a comment on issue #18734: Multi-gpu Mxnet training in sagemaker gives cuda error when dataloader is using multiprocessing

szha edited a comment on issue #18734:
URL: https://github.com/apache/incubator-mxnet/issues/18734#issuecomment-703044661


   @ndeepesh the workaround is to avoid using a GPU context before the processes for multiprocessing are started. This hasn't been documented yet, and I think it would be great to include it in https://github.com/apache/incubator-mxnet/blob/master/docs/static_site/src/pages/api/developer_guide/debugging_and_performance_optimization_tips.md 
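   
   A sketch of that ordering (assuming MXNet 1.x Gluon, where a `DataLoader` with `num_workers > 0` forks its worker pool at construction):
   
   ```python
   import mxnet as mx
   from mxnet.gluon.data import ArrayDataset, DataLoader
   
   dataset = ArrayDataset(mx.nd.random.uniform(shape=(1000, 32)))
   
   # 1. Construct the DataLoader first, while CUDA is still uninitialized in
   #    this process, so the worker processes fork cleanly.
   loader = DataLoader(dataset, batch_size=64, num_workers=4)
   
   # 2. Only then touch the GPU.
   ctx = mx.gpu(0)
   for batch in loader:
       data = batch.as_in_context(ctx)
       # ... forward/backward ...
   ```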




[GitHub] [incubator-mxnet] sandeep-krishnamurthy commented on issue #18734: Multi-gpu Mxnet training in sagemaker gives cuda error when dataloader is using multiprocessing

sandeep-krishnamurthy commented on issue #18734:
URL: https://github.com/apache/incubator-mxnet/issues/18734#issuecomment-670800688


   Thank you @szha and @ptrendx.
   
   @szha I will start a new thread. I was referring to Python multiprocessing in the data loader leading to deadlocks in distributed training.



[GitHub] [incubator-mxnet] ptrendx commented on issue #18734: Multi-gpu Mxnet training in sagemaker gives cuda error when dataloader is using multiprocessing

ptrendx commented on issue #18734:
URL: https://github.com/apache/incubator-mxnet/issues/18734#issuecomment-670760100


   I don't believe support for forking after CUDA initialization is planned. Generally this is handled (as @leezu mentioned) by spawning the worker processes before launching any operations on the GPU.



[GitHub] [incubator-mxnet] szha commented on issue #18734: Multi-gpu Mxnet training in sagemaker gives cuda error when dataloader is using multiprocessing

szha commented on issue #18734:
URL: https://github.com/apache/incubator-mxnet/issues/18734#issuecomment-670789846


   @sandeep-krishnamurthy so I think the root cause of this won't be fixed, and we can document it better to help users avoid issues like this. Maybe we could have a flag that exposes whether CUDA has been initialized yet, and use it to disable forking in the data loader?
   
   By the way, what are the non-deterministic hanging issues?
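   
   A hypothetical sketch of such a guard (no such flag exists in MXNet today; `_cuda_touched`, `gpu`, and `make_loader` are made-up names):
   
   ```python
   import mxnet as mx
   from mxnet.gluon.data import DataLoader
   
   _cuda_touched = False  # hypothetical flag, set by the first CUDA call
   
   def gpu(device_id=0):
       # Made-up wrapper that records that CUDA is about to be initialized.
       global _cuda_touched
       _cuda_touched = True
       return mx.gpu(device_id)
   
   def make_loader(dataset, batch_size, num_workers):
       # Once CUDA is live, forking is unsafe, so fall back to a thread pool.
       if _cuda_touched and num_workers > 0:
           return DataLoader(dataset, batch_size=batch_size,
                             num_workers=num_workers, thread_pool=True)
       return DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
   ```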



[GitHub] [incubator-mxnet] ndeepesh commented on issue #18734: Multi-gpu Mxnet training in sagemaker gives cuda error when dataloader is using multiprocessing

ndeepesh commented on issue #18734:
URL: https://github.com/apache/incubator-mxnet/issues/18734#issuecomment-703051665


   Thanks @szha. Also, is this issue intermittent? We don't see it for all of our training jobs.




[GitHub] [incubator-mxnet] leezu edited a comment on issue #18734: Multi-gpu Mxnet training in sagemaker gives cuda error when dataloader is using multiprocessing

leezu edited a comment on issue #18734:
URL: https://github.com/apache/incubator-mxnet/issues/18734#issuecomment-703756493


   > We do load one tensor on GPU before we start other processes
   
   This is not safe to do in MXNet. For example, if you call `mx.random.seed` in the child process after the parent process has used CUDA, it will cause a segfault.
   
   You can also refer to https://github.com/apache/incubator-mxnet/issues/4659 and https://github.com/apache/incubator-mxnet/issues/19291 (i.e., the current issue is a duplicate of #4659).
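   
   A minimal sketch of that failure (toy code; assumes Linux's default `fork` start method and MXNet 1.x, where `mx.random.seed` seeds the generators on all devices by default):
   
   ```python
   import multiprocessing as mp
   import mxnet as mx
   
   def child():
       # By default this seeds the RNG on every device, touching the CUDA
       # state inherited from the parent -> segfault in the child.
       mx.random.seed(0)
   
   if __name__ == '__main__':
       # The parent uses CUDA before forking.
       mx.nd.ones((1,), ctx=mx.gpu(0)).wait_to_read()
       p = mp.Process(target=child)
       p.start()
       p.join()
   ```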








[GitHub] [incubator-mxnet] leezu commented on issue #18734: Multi-gpu Mxnet training in sagemaker gives cuda error when dataloader is using multiprocessing

leezu commented on issue #18734:
URL: https://github.com/apache/incubator-mxnet/issues/18734#issuecomment-659584345


   As a workaround, try not to use CUDA before starting the worker processes. There are various bugs in the multiprocessing implementation in MXNet.

