Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/01/25 00:31:18 UTC

[GitHub] reminisce opened a new pull request #9552: [REQUEST FOR REVIEW | DO NOT MERGE] Model Quantization with Calibration

URL: https://github.com/apache/incubator-mxnet/pull/9552
 
 
   ## Description ##
   This PR implements model **quantization** by adopting the [TensorFlow approach](https://www.tensorflow.org/performance/quantization), with **calibration** borrowed from Nvidia's [TensorRT](http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf). The focus of this work is on keeping the inference accuracy loss of quantized models (ConvNets for now) under control relative to their corresponding FP32 models. It also provides a framework in MXNet for easily adding high-performance low-bit operators generated with [TVM](https://github.com/dmlc/tvm).
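   
   For a rough intuition of how calibration-based quantization works, here is a minimal NumPy sketch (not the code in this PR; the helper names are made up for illustration): each tensor is clipped to a calibrated threshold and mapped linearly onto the INT8 range, and calibration is what picks a good threshold from sample activations instead of simply using `max(|x|)`.
   
   ```python
   import numpy as np
   
   def quantize_with_threshold(x, threshold):
       """Symmetric linear quantization of a float32 array into int8,
       clipping values to a calibrated threshold (illustrative helper only)."""
       scale = 127.0 / threshold
       q = np.clip(np.round(x * scale), -127, 127).astype(np.int8)
       return q, scale
   
   def dequantize(q, scale):
       """Map int8 values back to an approximate float32 array."""
       return q.astype(np.float32) / scale
   
   # A naive threshold is max(|x|); the TensorRT-style calibration referenced
   # above instead searches for a smaller threshold that minimizes the KL
   # divergence between the original and quantized activation distributions.
   x = np.random.randn(4, 8).astype(np.float32)
   q, scale = quantize_with_threshold(x, threshold=float(np.abs(x).max()))
   x_approx = dequantize(q, scale)
   ```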
   
   This is a joint work of @ZihengJiang and @reminisce.
   - @ZihengJiang implemented the model quantization flow and the quantized operators by calling the [cuDNN APIs](http://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#api-introduction) for the convolution, pooling, and fully-connected operators.
   - @reminisce implemented the calibration flow, refactored the operator implementations to use the NNVM interfaces, wrote unit tests and examples, designed the user-level API, and fixed bugs to make the code mergeable into the MXNet master branch.
   
   ## Details ##
   Please see the following slides for more details on implementation and benchmark results.
   [quantization_github.pptx](https://github.com/apache/incubator-mxnet/files/1662073/quantization_github.pptx)
   
   ## Code Structure ##
   - Backend: `src/operator/quantization/` contains quantized operators, quantization and calibration flow, and quantization util functions.
   - Frontend: `python/mxnet/quantization.py` contains a single user-facing API for generating quantized models from FP32 models (see the usage sketch below).
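   
   As a usage sketch of the frontend: the function name and arguments below are assumptions drawn from this description, not a spec; the actual entry point and signature are defined in `python/mxnet/quantization.py`.
   
   ```python
   import mxnet as mx
   # The entry point below is an assumption based on this description; see
   # python/mxnet/quantization.py in the PR for the actual name and signature.
   from mxnet.quantization import quantize_model
   
   # Load a trained FP32 model; the checkpoint prefix is only a placeholder.
   sym, arg_params, aux_params = mx.model.load_checkpoint('model/resnet-152', 0)
   
   # A small calibration dataset (placeholder record file) used to collect layer
   # output statistics from which the quantization thresholds are computed.
   calib_iter = mx.io.ImageRecordIter(path_imgrec='data/val.rec',
                                      batch_size=32,
                                      data_shape=(3, 224, 224))
   
   qsym, qarg_params, qaux_params = quantize_model(
       sym, arg_params, aux_params,
       ctx=mx.gpu(0),
       calib_mode='entropy',   # assumed option name: KL-divergence based calibration
       calib_data=calib_iter)
   
   # The returned quantized symbol and parameters can then be bound and run for
   # INT8 inference like any other MXNet model.
   ```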
   
   ## Notes ##
   - Since the quantized operators are implemented with cuDNN, the quantized models generated in the examples of this PR can only run inference on Nvidia GPUs that support the [dp4a instruction](https://devblogs.nvidia.com/mixed-precision-programming-cuda-8/). We performed our benchmarks on [AWS P3 instances](https://aws.amazon.com/ec2/instance-types/p3/).
   - The inference speed of the quantized models is about 50% slower than that of the FP32 models. This mainly results from the three transpose operations in the quantized convolution operator that convert data layouts between NCHW and NHWC in order to call [`cudnnConvolutionForward`](http://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#cudnnConvolutionForward) (see the layout sketch after this list). In addition, we have noticed that even without transposing data layouts, INT8 convolution in NHWC is slower than FP32 convolution in NCHW for large feature maps such as `(64, 56, 56)`. In the future, we hope to leverage the strength of [TVM](https://github.com/dmlc/tvm) to generate high-performance INT8 operators to replace the current cuDNN-based implementation of quantized convolution.
   - The unit tests for quantization are placed under `tests/python/quantization` because they need a P3 instance to run. @marcoabreu is working on setting up the testing environment. Once it is done, we will submit the unit tests under that folder to a CI job with a different label from the commonly used one.
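   
   To make the layout overhead concrete, here is a minimal sketch (not the PR's kernel code; shapes are arbitrary and the real operator works on INT8 data) of the transposes that wrap the cuDNN call:
   
   ```python
   import mxnet as mx
   
   # cudnnConvolutionForward's INT8 path expects NHWC tensors, while MXNet models
   # use NCHW, so the quantized convolution transposes tensors around the cuDNN
   # call -- illustrated here with the input, the weights, and the output.
   data_nchw   = mx.nd.zeros((32, 64, 56, 56))   # activations, NCHW
   weight_oihw = mx.nd.zeros((128, 64, 3, 3))    # weights, OIHW
   
   data_nhwc   = mx.nd.transpose(data_nchw, axes=(0, 2, 3, 1))    # 1) NCHW -> NHWC
   weight_ohwi = mx.nd.transpose(weight_oihw, axes=(0, 2, 3, 1))  # 2) OIHW -> OHWI
   # ... the INT8 convolution via cudnnConvolutionForward runs on the NHWC tensors ...
   out_nhwc = mx.nd.zeros((32, 56, 56, 128))                      # stand-in for the conv output
   out_nchw = mx.nd.transpose(out_nhwc, axes=(0, 3, 1, 2))        # 3) NHWC -> NCHW
   ```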
   
   We would like to thank all the following people for their discussions, suggestions, datasets, and guidance on configuring the examples: @mli @piiswrong @zhreshold @astonzhang @szha @eric-haibin-lin @srochel @madjam @bhavinthaker @marcoabreu
   
   We would appreciate everyone's efforts in reviewing this PR.
   
   @cjolivier01 @anirudh2290 @rahul003 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services