Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/10/15 23:02:38 UTC

[GitHub] [incubator-mxnet] anandj91 commented on issue #15124: [MXNET-1294] Priority-based parameter propagation for improved data parallel training throughput

URL: https://github.com/apache/incubator-mxnet/pull/15124#issuecomment-542440290
 
 
   I'm facing some design-level challenges in properly implementing priority-based update (P3) on top of the PushPull API. MXNet does simple load balancing before pushing or pulling key-values by splitting NDArrays equally across the parameter servers (PS). P3 requires a round-robin-style parameter distribution, which means slicing a large NDArray into thousands of smaller ones. That is much more granular than the current default distribution strategy, and each PS would get more than one slice.
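
   To make the difference between the two distribution strategies concrete, here is a minimal sketch in plain Python. It is not MXNet or ps-lite code: the function names `equal_split` and `round_robin_split` and the `slice_size` parameter are hypothetical, chosen only for illustration. Each function returns `(server, start, end)` ranges over a flattened array.

```python
def equal_split(num_elems, num_servers):
    """Default-style split (illustrative): one contiguous range per server."""
    base, rem = divmod(num_elems, num_servers)
    ranges = []
    start = 0
    for server in range(num_servers):
        # Spread any remainder over the first `rem` servers.
        size = base + (1 if server < rem else 0)
        ranges.append((server, start, start + size))
        start += size
    return ranges


def round_robin_split(num_elems, num_servers, slice_size):
    """Round-robin split (illustrative): fixed-size slices dealt out in turn."""
    ranges = []
    for index, start in enumerate(range(0, num_elems, slice_size)):
        end = min(start + slice_size, num_elems)
        # Slice i goes to server i mod num_servers, so a server can own many slices.
        ranges.append((index % num_servers, start, end))
    return ranges
```

   With the equal split every server owns exactly one range, while with the round-robin split the number of ranges grows with the array size and each server ends up owning several of them.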
   
   With the way MXNet and ps-lite are designed right now, ps-lite assumes that a single ZPush/ZPull/ZPushPull belongs to a single layer/NDArray. It also assumes that one slice belongs to only one PS. These assumptions need to be broken to implement P3. What I have done so far is add a round-robin (RR) distribution strategy alongside the default one, with a boolean flag to switch between the two. When the user chooses RR, KVStore treats each slice as a separate key-value pair. Otherwise it falls back to the default mode.
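
   The flag-based switch described above could look roughly like the following sketch. This is plain Python rather than the actual KVStore code, and the names `plan_push`, `use_round_robin`, and `slice_size`, as well as the `(key, slice_index)` key encoding, are assumptions made for illustration only.

```python
def plan_push(key, values, num_servers, use_round_robin=False, slice_size=4):
    """Return a list of (server, wire_key, chunk) tuples for one push.

    Hypothetical helper: in RR mode each slice becomes its own key-value
    pair; in default mode the array keeps one key and is split into one
    contiguous part per server.
    """
    n = len(values)
    if use_round_robin:
        plan = []
        for index, start in enumerate(range(0, n, slice_size)):
            chunk = values[start:start + slice_size]
            # Each slice gets a distinct key and is assigned in turn.
            plan.append((index % num_servers, (key, index), chunk))
        return plan

    # Default mode: equal contiguous parts, all under the original key.
    base, rem = divmod(n, num_servers)
    plan, start = [], 0
    for server in range(num_servers):
        size = base + (1 if server < rem else 0)
        plan.append((server, key, values[start:start + size]))
        start += size
    return plan
```

   Because every slice carries its own key in RR mode, a per-slice request can be issued without the assumption that one request maps to one whole NDArray.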
