Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2017/11/15 08:22:51 UTC

[GitHub] rahul003 opened a new pull request #8662: 2bit gradient compression
URL: https://github.com/apache/incubator-mxnet/pull/8662
 
 
   ## Description ##
   Implements 2-bit gradient compression by quantizing each value in the gradient array to 2 bits, using two user-specified thresholds: one for positive and one for negative values.
   
   ### Important files to review
   Gradient compression (GC)
   - gc-inl.h
   - gc.cc
   
   KVStore local
   - comm.h
   
   KVStore dist
   - kvstore_dist.h
   - kvstore_dist_server.h
   
   Documentation about gradient compression
   - kvstore.py
   
   ## Checklist ##
   ### Essentials ###
   - [x] Passed code style checking (`make lint`)
   - [x] Changes are complete (i.e. I finished coding on this PR)
   - [x] All changes have test coverage
   - [x] For user-facing API changes, API doc string has been updated.
   - [x] To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
   
   ### Changes ###
   - [x] Gradient compression class
   - [x] Reduce operation in kvstore_local / comm.h
   - [x] Distributed kvstore changes at worker and server
   - [x] Tests for local kvstore and distributed kvstore, with both predefined and random data. The results were compared against expected values computed by reimplementing this logic in Python.
   - [x] API changes for Kvstore, Module and Trainer in python
   - [x] Addressed comments from last PR
   
   ## Comments ##
   ### Problem 
   When training large-scale deep learning models, especially with distributed training, communication becomes a bottleneck for networks whose computation cost is low relative to their communication cost.
   
   ### Approach
   We can compress the gradients by considering only those elements that exceed a threshold. Only these elements are encoded and sent. Elements of the gradient that are near zero can safely be delayed: they are aggregated in a residual array, and when the accumulated residual plus the gradient of a later iteration exceeds the threshold, those values are sent too. Effectively, such values are updated at a lower frequency.
   On the receiver's end, the data is decompressed and the decompressed gradients are used.
   Specifically, this PR implements 2-bit quantization.
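   As a concrete example of the residual mechanism: suppose the positive threshold is 2.0 and a gradient element is 1.5 in one iteration. Nothing is sent for it, and 1.5 is kept in the residual. If the same element is 0.8 in the next iteration, the accumulated value 1.5 + 0.8 = 2.3 exceeds the threshold, so it is sent (decompressing to +threshold on the receiver), and the residual becomes 2.3 - 2.0 = 0.3.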
   
   ### Two bit quantization
   Any positive value greater than or equal to the threshold is set to one bit pattern (say 11), any negative value whose absolute value is greater than or equal to the threshold is set to a second pattern (say 10), and all other values are set to a third pattern (say 00). We need three values to represent the data in this fashion, and hence two bits. We understand this leaves one bit pattern unused, but that is an optimization to be done later. The error introduced by quantization is accumulated as a residual and carried over to the next iteration, where it is added to the gradient before quantizing.
   An example with thresholds of -2.0 and 2.0 is shown below.
   ![Quantization at work](https://i.imgur.com/AtBVg92.png)
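   Below is a minimal NumPy sketch of this quantize-with-residual logic (illustrative only; the actual implementation in gc.cc / gc-inl.h operates on NDArrays and packs the codes into bits):
   ```python
   import numpy as np

   def two_bit_quantize(grad, residual, threshold=2.0):
       """Illustrative 2-bit quantization with error feedback.

       Codes {+1, -1, 0} stand in for the bit patterns 11, 10 and 00;
       the real implementation packs them into 2 bits each.
       """
       acc = grad + residual               # fold in residual from earlier iterations
       codes = np.zeros_like(acc)
       codes[acc >= threshold] = 1.0       # pattern 11 -> decodes to +threshold
       codes[acc <= -threshold] = -1.0     # pattern 10 -> decodes to -threshold
       decompressed = codes * threshold    # what the receiver reconstructs
       new_residual = acc - decompressed   # quantization error carried forward
       return codes, new_residual

   grad = np.array([1.5, -3.1, 0.2, 2.7])
   codes, residual = two_bit_quantize(grad, np.zeros_like(grad))
   # codes    -> [ 0., -1.,  0.,  1.]
   # residual -> [ 1.5, -1.1,  0.2,  0.7]
   ```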
   
   ### Format of compressed gradient
   Each element of the compressed array represents up to 16 elements of the original array. For the example above, we get an element whose binary representation is
   ```00 11 00 10 11 00 00 10  0000000000000000```
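   For illustration, here is a rough Python sketch of packing two-bit codes into a 32-bit word, most significant bits first as in the example above (the exact bit order used by the C++ code may differ):
   ```python
   def pack_codes(codes):
       """Pack up to 16 two-bit codes into one 32-bit word, filling from
       the most significant bits down, as in the example above."""
       word = 0
       for i, c in enumerate(codes[:16]):
           word |= (c & 0b11) << (30 - 2 * i)
       return word

   # The 8 codes from the example: 00 11 00 10 11 00 00 10
   print(format(pack_codes([0b00, 0b11, 0b00, 0b10, 0b11, 0b00, 0b00, 0b10]), '032b'))
   # -> 00110010110000100000000000000000
   ```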
   
   ### Local kvstore
   When using the local kvstore, gradient compression only happens when device communication is used. When gradients are pushed, quantization and dequantization happen before they are summed up (Reduce).
   Example: say we have 4 GPUs, and the gradients are being summed up on GPU0. Each device quantizes its gradients and sends the quantized gradient to GPU0, which dequantizes this data before merging it with the values from the other GPUs. Note that there is no real need to quantize the gradients residing on GPU0 itself, but it is still done so that there is no bias toward the samples processed on GPU0. A sketch of this path follows.
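   A simplified sketch of this compressed Reduce path, reusing the two_bit_quantize helper from the sketch above (hypothetical structure; the real logic lives in comm.h):
   ```python
   def reduce_with_compression(grads, residuals, threshold=2.0):
       """Quantize every device's gradient (including the root's, to avoid
       bias), dequantize on the root device, and sum the results."""
       total = None
       for i, g in enumerate(grads):
           codes, residuals[i] = two_bit_quantize(g, residuals[i], threshold)
           decompressed = codes * threshold           # dequantize on GPU0
           total = decompressed if total is None else total + decompressed
       return total
   ```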
   
   ### Dist kvstore
   When the kvstore's set_gradient_compression method is called, each worker sets the given compression params, and one worker sends these params to all servers. From then on, each value is quantized before being pushed to the server. The server dequantizes the data and stores it as an array of the original size. When values are pulled from the server, it returns an array of the original size. The same happens when each server handles a shard of the data.
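   Schematically, the push/pull path could look like the following (pseudocode with hypothetical names push, pull, update, stored and received_codes; the actual changes are in kvstore_dist.h and kvstore_dist_server.h):
   ```python
   # Worker: quantize before pushing, carrying the residual locally
   codes, residual = two_bit_quantize(grad, residual, threshold)
   push(key, codes)                       # sends roughly 1/16 of the original bytes

   # Server: dequantize on receipt and update the full-size stored array
   decompressed = received_codes * threshold
   stored[key] = update(stored[key], decompressed)

   # Pull: workers receive the full-size array, so pulls are unchanged
   weight = pull(key)
   ```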
   
   ### Usage
   I used a dictionary (compression_params) for the arguments to keep the interface uniform when we extend this to other quantization techniques, since each technique may take a different number and type of parameters.
   #### KVstore
   ```
   kv = mx.kv.create('dist_sync')
   kv.set_gradient_compression({'compression':'2bit', 'threshold':0.5})
   ```
   #### Module
   ```
   mod = mx.mod.Module(net, compression_params={'compression':'2bit', 'threshold':0.5})
    ```
   #### Gluon Trainer
   ```
   trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': .1}, 
                           compression_params={'compression':'2bit', 'threshold':0.5})
   ```
   
   ### Results
   1. Compression shows a speedup when the model is large enough for communication to dominate; for smaller models, the quantization/dequantization overhead outweighs the savings. The table below lists the communication cost per batch.
   
   | Model size | Without compression  | With 2bit compression |
   | --- | --- | --- |
   | 50MB | 0.26s | 0.38s |
   | 93MB |  0.38s | 0.45s |
   | 139MB | 0.731s | 0.696s |
   | 230MB | 1.175s | 1s |
   | 462MB | 2.659s | 1.769s |
   
   2. For an MLP with four fully connected layers of size 1500 and one fully connected layer of size 3000:

   | Input dim | Model size | Speedup | Batch size (per GPU) |
   | --- | --- | --- | --- |
   | 1500 | 139MB | 1.17x | 1024 |
   | 270000 | 900MB | 1.7x | 128 |
   
   3. Accuracy 
   ![LSTM on Penntree bank](https://i.imgur.com/bzbxrlF.png)
   ![MNIST on MLP](https://imgur.com/PVzGhoj.png)
   I see some loss of accuracy on some other models, but the paper referenced below mentions that pretraining on a small amount of data for a short time without gradient compression helps convergence. I'm running tests for that.
   
   Reference (note that its compressed representation differs from this PR's): http://nikkostrom.com/publications/interspeech2015/strom_interspeech2015.pdf
   
