
Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/11/21 06:35:58 UTC

[GitHub] YutingZhang edited a comment on issue #13318: Improving multi-processing reliability for gluon DataLoader

URL: https://github.com/apache/incubator-mxnet/pull/13318#issuecomment-440552558
 
 
   @zhreshold It was actually a bit confusing to me too, but that is what happened.
   
   One guess:
   
   Is there a size limit or any get-put synchronization on the `data_queue`? Is it possible that a worker got stuck at `data_queue.put`? The `fetcher_loop` thread can be joined before the workers (this is possible in the original code, and more likely in my PR); if the `data_queue` is then full, the `put` can get stuck. Is there any such possibility?
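
To illustrate the kind of hang I am guessing at, here is a minimal sketch. It is not the DataLoader code: it uses a bounded `queue.Queue` and a thread as stand-ins for the real `data_queue` and worker process, and the names `worker` and `demo_put_blocks` are hypothetical. It only shows that once nothing consumes the queue, a producer cannot get past `maxsize` items.

```python
import queue
import threading

def worker(data_queue, n_items, timeout, result):
    """Try to put n_items; record how many were accepted before a put timed out."""
    sent = 0
    try:
        for i in range(n_items):
            # Without the timeout, this call would block forever on a full queue.
            data_queue.put(i, timeout=timeout)
            sent += 1
    except queue.Full:
        pass
    result["sent"] = sent

def demo_put_blocks(maxsize=2, n_items=5, timeout=0.1):
    # Assumption: the queue is bounded and the consumer (fetcher) is already gone.
    data_queue = queue.Queue(maxsize=maxsize)
    result = {}
    t = threading.Thread(target=worker, args=(data_queue, n_items, timeout, result))
    t.start()
    t.join()
    return result["sent"]

if __name__ == "__main__":
    print(demo_put_blocks())
```

With no consumer, the producer stops after filling the queue; a plain `put()` without a timeout would simply never return at that point.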
   
   By the way, I tried joining the workers before sending `(None, None)` to the `fetcher_loop` thread, but this can cause the `fetcher_loop` to get stuck at `data_queue.get` (the requested data is not in the queue, and no worker will put it there). This either blocks the main thread, if we try to join the `fetcher_loop` thread, or leaves a dangling thread otherwise.
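
The opposite ordering problem can be sketched the same way. Again this is only an illustration with a thread and a plain `queue.Queue`, not the actual DataLoader code, and `demo_get_blocks` is a hypothetical name: the fetcher expects two batches, but the workers were joined after producing only one.

```python
import queue
import threading

def fetcher_loop(data_queue, expected, timeout, out):
    """Collect `expected` batches; give up if none arrives within `timeout` seconds."""
    for _ in range(expected):
        try:
            out.append(data_queue.get(timeout=timeout))
        except queue.Empty:
            # A plain get() would block here forever: no worker is left to put.
            out.append("timed out")
            return

def demo_get_blocks(timeout=0.1):
    data_queue = queue.Queue()
    # The only batch produced before the workers were joined.
    data_queue.put((0, "batch-0"))
    out = []
    t = threading.Thread(target=fetcher_loop, args=(data_queue, 2, timeout, out))
    t.start()
    t.join()
    return out

if __name__ == "__main__":
    print(demo_get_blocks())
```

The timeout is only there so that the sketch terminates; with a blocking `get()` the thread would stay stuck, which matches what I observed.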
   
   If the above guess is correct, a possibly cleaner solution is to move the worker-joining logic into the `fetcher_loop`.
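
A rough sketch of what joining the workers inside the `fetcher_loop` might look like, under several assumptions: threads stand in for worker processes, the workers finish on their own after a fixed number of items (in the real code they would be stopped through their own queue), and `worker`, `received`, and `demo_join_in_fetcher` are hypothetical names. The idea is that after seeing `(None, None)` the fetcher keeps draining the queue until every worker has exited, so no worker can stay blocked in `put`.

```python
import queue
import threading

def worker(data_queue, worker_id, n_items):
    # Stand-in for a DataLoader worker: put() blocks while the queue is full.
    for i in range(n_items):
        data_queue.put((worker_id, i))

def fetcher_loop(data_queue, workers, received):
    # Normal operation: forward batches until the (None, None) sentinel arrives.
    while True:
        idx, batch = data_queue.get()
        if idx is None:
            break
        received.append((idx, batch))
    # Shutdown: keep draining so no worker stays blocked in put(), then join.
    while any(w.is_alive() for w in workers):
        try:
            data_queue.get(timeout=0.05)
        except queue.Empty:
            pass
    for w in workers:
        w.join()

def demo_join_in_fetcher(num_workers=2, n_items=5, maxsize=1):
    data_queue = queue.Queue(maxsize=maxsize)
    received = []
    workers = [threading.Thread(target=worker, args=(data_queue, w, n_items))
               for w in range(num_workers)]
    fetcher = threading.Thread(target=fetcher_loop,
                               args=(data_queue, workers, received))
    for w in workers:
        w.start()
    fetcher.start()
    # Request shutdown while the workers may still be producing.
    data_queue.put((None, None))
    fetcher.join(timeout=5)
    return fetcher.is_alive(), [w.is_alive() for w in workers]

if __name__ == "__main__":
    print(demo_join_in_fetcher())
```

With this ordering the main thread only has to join the `fetcher_loop` thread, and the fetcher takes care of the workers.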

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services