Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/03/26 07:01:59 UTC

[GitHub] [incubator-mxnet] shuo-ouyang commented on issue #9558: Signum with grad compression

shuo-ouyang commented on issue #9558: Signum with grad compression
URL: https://github.com/apache/incubator-mxnet/pull/9558#issuecomment-604264502
 
 
   > It's fairly easy to change the compression to use 1 bit. You need to change the bitmasks in the structs quantize_signum and dequantize_signum, and also change the start and end indices to cover 32 values in one call.
   > You could use masks like 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 for the values which are greater than 0. As an optimization, don't encode negative values, and remove the second if statement in quantize; let them stay 0. During dequantization change them to -1.
   > 
   > Yes, you won't lose accuracy if the server forwards the differences to the workers. But the way the mxnet kvstore currently works is that it pushes the full model to the workers. In contrast, if we tried to send differences, we would need to send (num_workers * compressed_size) of data as updates to each worker, because we can't merge the updates from different workers without decompressing, merging, and then compressing them again. And in that case, once num_workers grows beyond the compression factor, gradient compression gives no speedup.
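   
   For reference, a rough standalone sketch of that 1-bit pack/unpack scheme in NumPy (illustration only, not the actual quantize_signum/dequantize_signum kernels, which pack values into 32-bit blocks):
   
   ```python
   import numpy as np
   
   def quantize_1bit(grad):
       # 1 bit per value: bit = 1 if grad > 0, else 0 (negatives/zeros are not encoded)
       bits = (grad > 0).astype(np.uint8)
       # packbits is MSB-first, i.e. masks 0x80, 0x40, ..., 0x01 within each byte
       return np.packbits(bits)
   
   def dequantize_1bit(packed, size):
       # bit 1 -> +1.0, bit 0 -> -1.0 (the un-encoded values come back as -1)
       bits = np.unpackbits(packed)[:size]
       return np.where(bits == 1, 1.0, -1.0).astype(np.float32)
   
   # round-trip check
   g = np.array([0.3, -1.2, 0.0, 2.5], dtype=np.float32)
   print(dequantize_1bit(quantize_1bit(g), g.size))  # [ 1. -1. -1.  1.]
   ```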
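   
   And on the point above about pushing the full model rather than per-worker differences: with 1-bit compression (roughly a 32x reduction of float32 gradients), a quick back-of-envelope check with a hypothetical model size shows the server-to-worker traffic catching up with the full-model pull once the worker count reaches the compression factor:
   
   ```python
   model_size_mb = 100.0        # hypothetical model size
   compression_factor = 32      # 1 bit per float32 value
   for num_workers in (3, 16, 32, 64):
       # server -> worker traffic if compressed per-worker differences were forwarded
       pull_mb = num_workers * model_size_mb / compression_factor
       print(num_workers, pull_mb)  # reaches the 100 MB full-model pull at 32 workers
   ```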
   
   @rahul003 
   
   Hi, I have implemented a similar 1-bit gradient compression algorithm following your advice. However, when I train ResNet-110 on the CIFAR-10 dataset to compare my implementation against 2-bit compression and no compression, the speedup from gradient quantization does not seem significant. The training command and logs are shown below. I deployed the training across four nodes (each equipped with four K80 GPUs): one parameter server and three workers. Is there anything wrong with my training setup?
   
   PS: I use the example code in the directory `example/image-classification`.
   MXNet version: 1.4.0
   CUDA version: 8.0
   
   **training command:**
   ```shell
   python ../../tools/launch.py --launcher ssh -H hosts -s 1 -n 3 python train_cifar10.py --gc-type 2bit --gc-threshold 1 --kv-store dist_sync --num-epochs 200 --batch-size 128 --lr-step-epochs 100,150 --wd 0.0001 --lr 0.1 --lr-factor 0.1 --network resnet --gpus 0,1,2,3
   ```
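   
   In case it helps with reproducing this: as far as I can tell, the `--gc-type`/`--gc-threshold` flags are simply forwarded to the kvstore by the example's `common/fit.py`, roughly like the sketch below (2-bit settings matching the command above):
   
   ```python
   import mxnet as mx
   
   kv = mx.kvstore.create('dist_sync')
   # 2-bit compression with threshold 1.0, matching --gc-type 2bit --gc-threshold 1
   kv.set_gradient_compression({'type': '2bit', 'threshold': 1.0})
   ```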
   
   **100th epoch:**
   |                     | No Quantization | 2-bit Quantization | 1-bit Quantization |
   |:-------------------:|:---------------:|:------------------:|:------------------:|
   | time cost (seconds) | 19.27           | 19.777             | 18.545             |
   | validation accuracy | 0.89122         | 0.887921           | 0.885871           |
   
   **150th epoch:**
   |                     | No Quantization | 2-bit Quantization | 1-bit Quantization |
   |:-------------------:|:---------------:|:------------------:|:------------------:|
   | time cost (seconds) | 18.73           | 22.357             | 20.339             |
   | validation accuracy | 0.92758         | 0.929688           | 0.929109           |
   
   
   **200th epoch:**
   |                     | No Quantization | 2-bit Quantization | 1-bit Quantization |
   |:-------------------:|:---------------:|:------------------:|:------------------:|
   | time cost (seconds) | 19.048          | 18.846             | 19.649             |
   | validation accuracy | 0.929988        | 0.935397           | 0.937500           |
