Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/05/11 22:07:15 UTC

[GitHub] moshelooks opened a new issue #10910: Feature Request: loading gluon trainer states with different contexts from saved states

moshelooks opened a new issue #10910: Feature Request: loading gluon trainer states with different contexts from saved states
URL: https://github.com/apache/incubator-mxnet/issues/10910
 
 
   Gluon blocks let you load their parameters in particular context(s) by specifying `ctx` when calling `Block.load_params()`. 
   
   Gluon trainers, however, do not support this: when you call `Trainer.load_states()`, the states are loaded onto whatever devices were in use when `save_states()` was called (except when loading states saved on a GPU into a CPU-only build, in which case they are moved to the CPU).
   
   This limitation breaks deserialization in some cases. For example, there is no way to load saved trainer states from `gpu(N)` on a GPU machine with `<= N` GPUs (device ids are zero-based); attempting to do so aborts the process with a core dump:
   
   ```
   terminate called after throwing an instance of 'dmlc::Error'
     what():  [20:33:57] src/engine/threaded_engine.cc:320: Check failed: exec_ctx.dev_id < device_count_ (2 vs. 2) Invalid GPU Id: 2, Valid device id should be less than device_count: 2
   
   Stack trace returned 10 entries:
   [bt] (0) /home/ubuntu/apprente/ve/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x32d7ea) [0x7f7915a567ea]
   [bt] (1) /home/ubuntu/apprente/ve/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x32de11) [0x7f7915a56e11]
   [bt] (2) /home/ubuntu/apprente/ve/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x246402c) [0x7f7917b8d02c]
   [bt] (3) /home/ubuntu/apprente/ve/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x245e704) [0x7f7917b87704]
   [bt] (4) /home/ubuntu/apprente/ve/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x261a2b1) [0x7f7917d432b1]
   [bt] (5) /home/ubuntu/apprente/ve/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x4672ea) [0x7f7915b902ea]
   [bt] (6) /home/ubuntu/apprente/ve/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x263355d) [0x7f7917d5c55d]
   [bt] (7) /home/ubuntu/apprente/ve/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2635551) [0x7f7917d5e551]
   [bt] (8) /home/ubuntu/apprente/ve/lib/python3.6/site-packages/mxnet/libmxnet.so(MXNDArrayLoadFromRawBytes+0x6d) [0x7f7917b1677d]
   [bt] (9) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f7990e56e40]
   
   Aborted (core dumped)
   ```
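
To make the request concrete, here is a minimal pure-Python sketch of the failing check and of the kind of remapping a `ctx` argument on load could perform. It does not use MXNet at all: `Context`, `is_loadable`, and `remap_contexts` are hypothetical names invented for illustration, and the round-robin assignment is only one possible policy, not something MXNet is known to implement.

```python
from typing import List, NamedTuple


class Context(NamedTuple):
    """Toy stand-in for a device context (hypothetical, not mxnet.Context)."""
    device_type: str  # "cpu" or "gpu"
    device_id: int


def is_loadable(saved: Context, gpu_count: int) -> bool:
    """Mirror the check in the log above: a GPU id must be < the device count."""
    return saved.device_type != "gpu" or saved.device_id < gpu_count


def remap_contexts(saved: List[Context], target: List[Context]) -> List[Context]:
    """Assign each saved state to a caller-chosen context, round-robin.

    This is an assumed policy for illustration only.
    """
    if not target:
        raise ValueError("at least one target context is required")
    return [target[i % len(target)] for i in range(len(saved))]


# States saved on gpu(2) and gpu(3), loaded on a machine with only two GPUs.
saved = [Context("gpu", 2), Context("gpu", 3)]
print(is_loadable(saved[0], gpu_count=2))
print(remap_contexts(saved, [Context("gpu", 0), Context("gpu", 1)]))
```

With a caller-supplied target list, the saved device ids never reach the engine, so the `dev_id < device_count_` check would not be hit.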
   
