Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/12/04 15:23:59 UTC

[GitHub] [incubator-mxnet] ciyongch commented on a change in pull request #19587: [FEATURE] Restore Quantization API to MXNet

ciyongch commented on a change in pull request #19587:
URL: https://github.com/apache/incubator-mxnet/pull/19587#discussion_r536164436



##########
File path: example/quantization/README.md
##########
@@ -0,0 +1,184 @@
+<!--- Licensed to the Apache Software Foundation (ASF) under one -->
+<!--- or more contributor license agreements.  See the NOTICE file -->
+<!--- distributed with this work for additional information -->
+<!--- regarding copyright ownership.  The ASF licenses this file -->
+<!--- to you under the Apache License, Version 2.0 (the -->
+<!--- "License"); you may not use this file except in compliance -->
+<!--- with the License.  You may obtain a copy of the License at -->
+<!--- -->
+<!---   http://www.apache.org/licenses/LICENSE-2.0 -->
+<!--- -->
+<!--- Unless required by applicable law or agreed to in writing, -->
+<!--- software distributed under the License is distributed on an -->
+<!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
+<!--- KIND, either express or implied.  See the License for the -->
+<!--- specific language governing permissions and limitations -->
+<!--- under the License. -->
+
+# Model Quantization with Calibration Examples
+
+This folder contains examples of quantizing an FP32 model to a (U)INT8 model with Intel® MKL-DNN.
+
+<h2 id="0">Contents</h2>
+
+* [1. Model Quantization with Intel® MKL-DNN](#1)
+
+<h2 id="1">Model Quantization with Intel® MKL-DNN</h2>
+
+Intel® MKL-DNN supports quantization with subgraph features on the Intel® CPU platform and can bring performance improvements on the [Intel® Xeon® Scalable Platform](https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-platform.html). To apply the quantization flow to your project directly, please refer to [Optimize custom models with MKL-DNN backend](#TODO(agrygielski)).
+
+```
+usage: python imagenet_gen_qsym_mkldnn.py [-h] [--model MODEL] [--epoch EPOCH]
+                                          [--no-pretrained] [--batch-size BATCH_SIZE]
+                                          [--calib-dataset CALIB_DATASET]
+                                          [--image-shape IMAGE_SHAPE]
+                                          [--data-nthreads DATA_NTHREADS]
+                                          [--num-calib-batches NUM_CALIB_BATCHES]
+                                          [--exclude-first-conv] [--shuffle-dataset]
+                                          [--calib-mode CALIB_MODE]
+                                          [--quantized-dtype {auto,int8,uint8}]
+                                          [--quiet]
+
+Generate a calibrated quantized model from an FP32 model with Intel MKL-DNN support
+
+optional arguments:
+  -h, --help            show this help message and exit
+  --model MODEL         model to be quantized. If no-pretrained is set, the
+                        model must be provided in the `model` directory in the
+                        same path as this Python script, default is `resnet50_v1`
+  --epoch EPOCH         number of epochs, default is `0`
+  --no-pretrained       If enabled, will not download pretrained model from
+                        MXNet or Gluon-CV modelzoo, default is `False`
+  --batch-size BATCH_SIZE
+                        batch size to be used when calibrating model, default is `32`
+  --calib-dataset CALIB_DATASET
+                        path of the calibration dataset, default is `data/val_256_q90.rec`
+  --image-shape IMAGE_SHAPE
+                        number of channels, height and width of input image separated by comma,
+                        default is `3,224,224`
+  --data-nthreads DATA_NTHREADS
+                        number of threads for data loading, default is `0`
+  --num-calib-batches NUM_CALIB_BATCHES
+                        number of batches for calibration, default is `10`
+  --exclude-first-conv  exclude the first conv layer from quantization since
+                        the input data may have negative values, which are not
+                        supported at the moment
+  --shuffle-dataset     shuffle the calibration dataset
+  --calib-mode CALIB_MODE
+                        calibration mode used for generating calibration table
+                        for the quantized symbol; supports 1. none: no
+                        calibration will be used. The thresholds for
+                        quantization will be calculated on the fly. This will
+                        result in inference speed slowdown and loss of
+                        accuracy in general. 2. naive: simply take min and max
+                        values of layer outputs as thresholds for
+                        quantization. In general, the inference accuracy
+                        worsens with more examples used in calibration. It is
+                        recommended to use `entropy` mode as it produces more
+                        accurate inference results. 3. entropy: calculate KL
+                        divergence of the fp32 output and quantized output for
+                        optimal thresholds. This mode is expected to produce
+                        the best inference accuracy of all three kinds of
+                        quantized models if the calibration dataset is
+                        representative enough of the inference dataset.
+                        default is `entropy`
+  --quantized-dtype {auto,int8,uint8}
+                        quantization destination data type for input data,
+                        default is `auto`
+  --quiet               suppress most of the log output
+```
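+
+For instance, the following illustrative invocation (using only the options documented above) quantizes the default `resnet50_v1` model with five calibration batches in `naive` mode:
+
+```
+python imagenet_gen_qsym_mkldnn.py --model=resnet50_v1 --num-calib-batches=5 --calib-mode=naive
+```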
+
+A new benchmark script `launch_inference_mkldnn.sh` has been designed to launch performance benchmarks for float32 or int8 image-classification models with Intel® MKL-DNN.
+```
+usage: bash ./launch_inference_mkldnn.sh -s symbol_file [-b batch_size] [-iter iteration] [-ins instance] [-c cores/instance] [-h]
+
+arguments:
+  -h, --help                show this help message and exit
+  -s, --symbol_file         symbol file for benchmark, required
+  -b, --batch_size          inference batch size
+                            default: 64
+  -iter, --iteration        inference iteration
+                            default: 500
+  -ins, --instance          launch multi-instance inference
+                            default: one instance per socket
+  -c, --core                number of cores per instance
+                            default: all physical cores divided evenly among instances
+
+example: resnet int8 performance benchmark on c5.24xlarge (dual sockets, 24 physical cores per socket).
+
+    bash ./launch_inference_mkldnn.sh -s ./model/resnet50_v1-quantized-5batches-naive-symbol.json
+
+will launch two instances for the throughput benchmark, and each instance will use 24 physical cores.
+```
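+
+As a further illustrative example, a latency-oriented run (batch size 1) with four instances of six cores each could be launched as:
+
+```
+bash ./launch_inference_mkldnn.sh -s ./model/resnet50_v1-quantized-5batches-naive-symbol.json -b 1 -ins 4 -c 6
+```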
+
+
+<h3 id="3">ResNetV1</h3>

Review comment:
       Previously, we listed all the models that support quantization for better reference. Is there any limitation to supporting other models, or any other reason?

##########
File path: src/operator/subgraph/mkldnn/mkldnn_conv.cc
##########
@@ -137,17 +147,22 @@ void SgMKLDNNConvOperator::Forward(const OpContext &ctx,
   auto in_beta = mkldnn_param.with_bn ? (idx++) : 0;
   auto in_mean = mkldnn_param.with_bn ? (idx++) : 0;
   auto in_var = mkldnn_param.with_bn ? (idx++) : 0;
-  auto in_sum = mkldnn_param.with_sum ? (idx++) : 0;
+  auto in_sum = mkldnn_param.with_sum ? (mkldnn_param.dedup_sum ? in_data : idx++) : -1;
   float data_min =
       mkldnn_param.quantized ? inputs[idx++].data().dptr<float>()[0] : 0.0;
   float data_max =
       mkldnn_param.quantized ? inputs[idx++].data().dptr<float>()[0] : 0.0;
-  float sum_min = (mkldnn_param.with_sum && mkldnn_param.quantized)
-                      ? inputs[idx++].data().dptr<float>()[0]
-                      : 0.0;
-  float sum_max = (mkldnn_param.with_sum && mkldnn_param.quantized)
-                      ? inputs[idx++].data().dptr<float>()[0]
-                      : 0.0;
+  float sum_min = 0.0f;

Review comment:
       Seems it's a new feature compared to v1.x; are there any test cases to cover it?

##########
File path: example/quantization/imagenet_gen_qsym_mkldnn.py
##########
@@ -0,0 +1,274 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import argparse
+import logging
+import os
+import re
+import sys
+from inspect import currentframe, getframeinfo
+
+import mxnet as mx
+from mxnet import gluon
+from mxnet.contrib.quantization import quantize_net
+from mxnet.gluon.data import DataLoader
+from mxnet.gluon.data.vision import transforms
+from mxnet.gluon.model_zoo.vision import get_model
+
+sys.path.append('../..')
+from tools.rec2idx import IndexCreator
+
+
+def download_calib_dataset(dataset_url, calib_dataset, logger=None):
+    if logger is not None:
+        logger.info('Downloading calibration dataset from %s to %s' % (dataset_url, calib_dataset))
+    mx.test_utils.download(dataset_url, calib_dataset)
+
+def get_from_gluon(model_name, classes=1000, logger=None):
+    dir_path = os.path.dirname(os.path.realpath(__file__))
+    model_path = os.path.join(dir_path, 'model')
+    if logger is not None:
+        logger.info('Converting model from Gluon-CV ModelZoo %s... into path %s' % (model_name, model_path))
+    net = get_model(name=model_name, classes=classes, pretrained=True)
+    prefix = os.path.join(model_path, model_name)
+    return net, prefix
+
+def regex_find_excluded_symbols(patterns_dict, model_name):
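+    """Return the exclude list of the first pattern in patterns_dict that
+    matches model_name, or None if no pattern matches."""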
+    for key, value in patterns_dict.items():
+        if re.search(key, model_name) is not None:
+            return value
+    return None
+
+def get_exclude_symbols(model_name, exclude_first_conv):
+    # Grouped supported models at the time of commit:
+    # alexnet
+    # densenet121, densenet161
+    # densenet169, densenet201
+    # inceptionv3
+    # mobilenet0.25, mobilenet0.5, mobilenet0.75, mobilenet1.0,
+    # mobilenetv2_0.25, mobilenetv2_0.5, mobilenetv2_0.75, mobilenetv2_1.0
+    # resnet101_v1, resnet152_v1, resnet18_v1, resnet34_v1, resnet50_v1
+    # resnet101_v2, resnet152_v2, resnet18_v2, resnet34_v2, resnet50_v2
+    # squeezenet1.0, squeezenet1.1
+    # vgg11, vgg11_bn, vgg13, vgg13_bn, vgg16, vgg16_bn, vgg19, vgg19_bn
+    exclude_symbol_regex = {
+        'mobilenet[^v]': ['mobilenet_hybridsequential0_flatten0_flatten0', 'mobilenet_hybridsequential0_globalavgpool2d0_fwd'],
+        'mobilenetv2': ['mobilenetv2_hybridsequential1_flatten0_flatten0'],
+        # resnetv2_hybridsequential0_hybridsequential0_bottleneckv20_batchnorm0_fwd is excluded for the sake of accuracy
+        'resnet.*v2': ['resnetv2_hybridsequential0_flatten0_flatten0', 'resnetv2_hybridsequential0_hybridsequential0_bottleneckv20_batchnorm0_fwd'],
+        'squeezenet1': ['squeezenet_hybridsequential1_flatten0_flatten0'],
+    }
+    excluded_sym_names = regex_find_excluded_symbols(exclude_symbol_regex, model_name)
+    if excluded_sym_names is None:
+        excluded_sym_names = []
+    if exclude_first_conv:
+        first_conv_regex = {
+            'alexnet': ['alexnet_hybridsequential0_conv2d0_fwd'],
+            'densenet': ['densenet_hybridsequential0_conv2d0_fwd'],
+            'inceptionv3': ['inception3_hybridsequential0_hybridsequential0_conv2d0_fwd'],
+            'mobilenet[^v]': ['mobilenet_hybridsequential0_conv2d0_fwd'],
+            'mobilenetv2': ['mobilenetv2_hybridsequential0_conv2d0_fwd'],
+            'resnet.*v1': ['resnetv1_hybridsequential0_conv2d0_fwd'],
+            'resnet.*v2': ['resnetv2_hybridsequential0_conv2d0_fwd'],
+            'squeezenet1': ['squeezenet_hybridsequential0_conv2d0_fwd'],
+            'vgg': ['vgg_hybridsequential0_conv2d0_fwd'],
+        }
+        excluded_first_conv_sym_names = regex_find_excluded_symbols(first_conv_regex, model_name)
+        if excluded_first_conv_sym_names is None:
+            raise ValueError('Currently, model %s is not supported in this script' % model_name)
+        excluded_sym_names += excluded_first_conv_sym_names
+    return excluded_sym_names
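+# Illustrative example: get_exclude_symbols('resnet50_v2', exclude_first_conv=True)
+# returns the two 'resnet.*v2' exclusions above plus
+# 'resnetv2_hybridsequential0_conv2d0_fwd'.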
+
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(description='Generate a calibrated quantized model from an FP32 model with Intel MKL-DNN support')
+    parser.add_argument('--model', type=str, default='resnet50_v1',
+                        help='model to be quantized. If no-pretrained is set, the'
+                             ' model must be provided in the `model` directory in the'
+                             ' same path as this Python script')
+    parser.add_argument('--epoch', type=int, default=0,
+                        help='number of epochs, default is 0')
+    parser.add_argument('--no-pretrained', action='store_true', default=False,
+                        help='If enabled, will not download pretrained model from MXNet or Gluon-CV modelzoo.')
+    parser.add_argument('--batch-size', type=int, default=32,
+                        help='batch size to be used when calibrating model')
+    parser.add_argument('--calib-dataset', type=str, default='data/val_256_q90.rec',
+                        help='path of the calibration dataset')
+    parser.add_argument('--image-shape', type=str, default='3,224,224',
+                        help='number of channels, height and width of input image separated by comma')
+    parser.add_argument('--data-nthreads', type=int, default=0,
+                        help='number of threads for data loading')
+    parser.add_argument('--num-calib-batches', type=int, default=10,
+                        help='number of batches for calibration')
+    parser.add_argument('--exclude-first-conv', action='store_true', default=False,
+                        help='exclude the first conv layer from quantization since the'
+                             ' input data may have negative values, which are not supported at the moment')
+    parser.add_argument('--shuffle-dataset', action='store_true',
+                        help='shuffle the calibration dataset')
+    parser.add_argument('--calib-mode', type=str, default='entropy',
+                        help='calibration mode used for generating calibration table for the quantized symbol; supports'
+                             ' 1. none: no calibration will be used. The thresholds for quantization will be calculated'
+                             ' on the fly. This will result in inference speed slowdown and loss of accuracy'
+                             ' in general.'
+                             ' 2. naive: simply take min and max values of layer outputs as thresholds for'
+                             ' quantization. In general, the inference accuracy worsens with more examples used in'
+                             ' calibration. It is recommended to use `entropy` mode as it produces more accurate'
+                             ' inference results.'
+                             ' 3. entropy: calculate KL divergence of the fp32 output and quantized output for optimal'
+                             ' thresholds. This mode is expected to produce the best inference accuracy of all three'
+                             ' kinds of calibration modes if the calibration dataset is representative enough of the'
+                             ' inference dataset.')
+    parser.add_argument('--quantized-dtype', type=str, default='auto',
+                        choices=['auto', 'int8', 'uint8'],
+                        help='quantization destination data type for input data')
+    parser.add_argument('--quiet', action='store_true', default=False,
+                        help='suppress most of the log output')
+    args = parser.parse_args()
+    ctx = mx.cpu(0)
+    logger = None
+
+    if not args.quiet:
+        logging.basicConfig()
+        logger = logging.getLogger('logger')
+        logger.setLevel(logging.INFO)
+
+    if logger:
+        logger.info(args)
+        logger.info('shuffle_dataset=%s' % args.shuffle_dataset)
+        logger.info('calibration mode set to %s' % args.calib_mode)
+
+    calib_mode = args.calib_mode
+
+    # download calibration dataset
+    if calib_mode != 'none':
+        idx_file_name = os.path.splitext(args.calib_dataset)[0] + '.idx'
+        if not os.path.isfile(idx_file_name):
+            download_calib_dataset('http://data.mxnet.io/data/val_256_q90.rec', args.calib_dataset)
+            creator = IndexCreator(args.calib_dataset, idx_file_name)
+            creator.create_index()
+            creator.close()
+
+    # get image shape
+    image_shape = args.image_shape
+    data_shape = [(1,) + tuple(int(i) for i in image_shape.split(','))]
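+    # e.g. the default '3,224,224' becomes [(1, 3, 224, 224)]: a batch axis of 1
+    # is prepended to the channel, height and width dimensions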
+
+    # check if directory for output model exists
+    dir_path = os.path.dirname(os.path.realpath(__file__))
+    dir_path = os.path.join(dir_path, 'model')
+    if not os.path.exists(dir_path):
+        os.mkdir(dir_path)  # no try/except block, as we expect the script
+                            # to stop if directory creation fails
+
+    # download model
+    if not args.no_pretrained:
+        if logger:
+            logger.info('Get pre-trained model from Gluon-CV modelzoo.')
+            logger.info('If you want to use custom model, please set --no-pretrained.')
+        net, prefix = get_from_gluon(model_name=args.model, classes=1000, logger=logger)
+        rgb_mean = '0.485,0.456,0.406'
+        rgb_std = '0.229,0.224,0.225'

Review comment:
       Are these `mean` and `std` the new values for the gluon models? As I remember, `mean` and `std` were ~100+ and ~58 previously.

##########
File path: tests/python/quantization/test_quantization.py
##########
@@ -0,0 +1,1210 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""Some of the tests using CUDNN require a special GPU instruction called dp4a.
+Ref: http://images.nvidia.com/content/pdf/tesla/184457-Tesla-P4-Datasheet-NV-Final-Letter-Web.pdf
+"""
+import os
+import mxnet as mx
+import numpy as np
+from mxnet.gluon.model_zoo import vision
+from mxnet.test_utils import assert_almost_equal, assert_exception, rand_ndarray, rand_shape_nd, same, DummyIter
+from common import xfail_when_nonstandard_decimal_separator
+from mxnet.io import NDArrayIter
+import unittest
+import operator
+
+
+def initialize_block_params(block, initializer):
+    for name, param in block.collect_params('.*gamma|.*running_var|.*moving_var').items():
+        param.initialize(mx.init.Constant(1))
+    for name, param in block.collect_params('.*beta|.*bias|.*moving_mean|.*running_mean').items():
+        param.initialize(mx.init.Constant(0))
+    for name, param in block.collect_params('.*weight').items():
+        param.initialize(initializer)
+
+def collect_block_args_aux(block, sym):
+    arg_params, aux_params = dict(), dict()
+    for k, v in block.collect_params().items():
+        if k in sym.list_arguments():
+            arg_params[k] = v._reduce()
+        elif k in sym.list_auxiliary_states():
+            aux_params[k] = v._reduce()
+    return arg_params, aux_params
+
+def is_test_for_gpu():
+    return mx.current_context().device_type == 'gpu'
+
+
+def is_test_for_mkldnn():
+    return (mx.current_context().device_type == 'cpu'
+            and os.environ.get('ENABLE_MKLDNN_QUANTIZATION_TEST') == '1')
+
+
+def is_test_for_native_cpu():
+    return (mx.current_context().device_type == 'cpu'
+            and os.environ.get('ENABLE_MKLDNN_QUANTIZATION_TEST') is None)
+
+
+def test_quantize_float32_to_int8():
+    shape = rand_shape_nd(4)
+    data = rand_ndarray(shape, 'default', dtype='float32')
+    min_range = mx.nd.min(data)
+    max_range = mx.nd.max(data)
+    qdata, min_val, max_val = mx.nd.contrib.quantize(data, min_range, max_range, out_type='int8')
+    data_np = data.asnumpy()
+    min_range = min_range.asscalar()
+    max_range = max_range.asscalar()
+    real_range = np.maximum(np.abs(min_range), np.abs(max_range))
+    quantized_range = 127.0
+    scale = quantized_range / real_range
+    assert qdata.dtype == np.int8
+    assert min_val.dtype == np.float32
+    assert max_val.dtype == np.float32
+    assert same(min_val.asscalar(), -real_range)
+    assert same(max_val.asscalar(), real_range)
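+    # baseline for symmetric quantization: q = sign(x) * min(|x| * scale + 0.5, 127)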
+    qdata_np = (np.sign(data_np) * np.minimum(np.abs(data_np) * scale + 0.5, quantized_range)).astype(np.int8)
+    assert_almost_equal(qdata.asnumpy(), qdata_np, atol = 1)
+
+
+def test_dequantize_int8_to_float32():
+
+    def get_test_data(real_range, qdata_np):
+        qdata = mx.nd.array(qdata_np, dtype=np.int8)
+        min_range = mx.nd.array([-real_range], dtype=np.float32)
+        max_range = mx.nd.array([real_range], dtype=np.float32)
+        return qdata, min_range, max_range
+
+    def baseline_dequantization(qdata, real_range, qdata_np):
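+        # inverse of symmetric int8 quantization: x = q * real_range / 127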
+        quantized_range = 127.0
+        scale = real_range / quantized_range
+        data_np = qdata_np * scale
+        return data_np
+
+    def test_nd_array_dequantization(qdata, min_range, max_range, expected_result):
+        data = mx.nd.contrib.dequantize(qdata, min_range, max_range, out_type='float32')
+        assert data.dtype == np.float32
+        assert_almost_equal(data.asnumpy(), expected_result, atol = 1)
+
+    def test_symbolic_api_dequantization(qdata, min_range, max_range, expected_result):
+        sym_data = mx.sym.Variable('data')
+        sym_min_range = mx.sym.Variable('min_range')
+        sym_max_range = mx.sym.Variable('max_range')
+        dequant = mx.sym.contrib.dequantize(sym_data, sym_min_range,
+                                            sym_max_range, out_type='float32')
+        out = dequant._bind(ctx=mx.current_context(),
+                           args={'data':qdata, 'min_range':min_range, 'max_range':max_range})
+        data = out.forward()[0]
+        assert data.dtype == np.float32
+        assert_almost_equal(data.asnumpy(), expected_result, atol = 1)
+
+    real_range = 128
+    shape = rand_shape_nd(4)
+    qdata_np = np.random.uniform(low=-127, high=127, size=shape).astype(dtype=np.int8)
+    qdata, min_range, max_range = get_test_data(real_range, qdata_np)
+    expected_result = baseline_dequantization(qdata, real_range, qdata_np)
+    # test nd array implementation.
+    test_nd_array_dequantization(qdata, min_range, max_range, expected_result)
+    # test symbolic api implementation.
+    test_symbolic_api_dequantization(qdata, min_range, max_range, expected_result)
+
+
+def test_requantize_int32_to_int8():
+    def quantized_int32_to_float(qdata, min_range, max_range):
+        assert qdata.dtype == 'int32'
+        quantized_range = np.iinfo('int32').max
+        real_range = np.maximum(np.abs(min_range), np.abs(max_range))
+        scale = float(real_range) / float(quantized_range)
+        return qdata.astype('float32') * scale
+
+    def float_to_quantized_int8(data, min_range, max_range):
+        assert data.dtype == 'float32'
+        real_range = np.maximum(np.abs(min_range), np.abs(max_range))
+        quantized_range = np.iinfo('int8').max
+        scale = float(quantized_range) / float(real_range)
+        return (np.sign(data) * np.minimum(np.abs(data) * scale + 0.5, quantized_range)).astype('int8')
+
+    def requantize(qdata, min_data, max_data, real_range):
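+        # int32 -> int8: dequantize to float, then re-quantize into [-real_range, real_range]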
+        data = quantized_int32_to_float(qdata, min_data, max_data)
+        output = float_to_quantized_int8(data, -real_range, real_range)
+        return output, -real_range, real_range
+
+    def requantize_baseline(qdata, min_data, max_data, min_calib_range=None, max_calib_range=None):
+        if min_calib_range is not None and max_calib_range is not None:
+            real_range = np.maximum(np.abs(min_calib_range), np.abs(max_calib_range))
+            return requantize(qdata, min_data, max_data, real_range)
+        else:
+            min_range = quantized_int32_to_float(np.min(qdata), min_data, max_data)
+            max_range = quantized_int32_to_float(np.max(qdata), min_data, max_data)
+            return requantize(qdata, min_data, max_data, np.maximum(np.abs(min_range), np.abs(max_range)))
+
+    def check_requantize(shape, min_calib_range=None, max_calib_range=None):
+        qdata = mx.nd.random.uniform(low=-1000.0, high=1000.0, shape=shape).astype('int32')
+        min_range = mx.nd.array([-1010.0])
+        max_range = mx.nd.array([1020.0])
+        if min_calib_range is None or max_calib_range is None:
+            qdata_int8, min_output, max_output = mx.nd.contrib.requantize(qdata, min_range, max_range)
+        else:
+            qdata_int8, min_output, max_output = mx.nd.contrib.requantize(qdata, min_range, max_range,
+                                                                          min_calib_range=min_calib_range,
+                                                                          max_calib_range=max_calib_range)
+
+        qdata_int8_np, min_output_np, max_output_np = requantize_baseline(qdata.asnumpy(), min_range.asscalar(),
+                                                                          max_range.asscalar(),
+                                                                          min_calib_range=min_calib_range,
+                                                                          max_calib_range=max_calib_range)
+        assert_almost_equal(qdata_int8.asnumpy(), qdata_int8_np, atol = 1)
+        assert_almost_equal(min_output.asnumpy(), np.array([min_output_np]))
+        assert_almost_equal(max_output.asnumpy(), np.array([max_output_np]))
+
+    def check_requantize_with_symbol(shape, min_calib_range=None, max_calib_range=None):
+        qdata = mx.nd.random.uniform(low=-1000.0, high=1000.0, shape=shape).astype('int32')
+        min_range = mx.nd.array([-1010.0])
+        max_range = mx.nd.array([1020.0])
+        sym_data = mx.sym.Variable('data')
+        sym_min_range = mx.sym.Variable('min_range')
+        sym_max_range = mx.sym.Variable('max_range')
+        if min_calib_range is None or max_calib_range is None:
+            requant = mx.sym.contrib.requantize(sym_data, sym_min_range, sym_max_range)
+            out = requant._bind(ctx=mx.current_context(),
+                               args={'data':qdata, 'min_range':min_range,
+                               'max_range':max_range})
+            qdata_int8, min_output, max_output = out.forward()
+        else:
+            requant = mx.sym.contrib.requantize(sym_data, sym_min_range, sym_max_range,
+                                                min_calib_range=min_calib_range,
+                                                max_calib_range=max_calib_range)
+            out = requant._bind(ctx=mx.current_context(), args={'data':qdata, 'min_range':min_range,
+                               'max_range':max_range})
+            qdata_int8, min_output, max_output = out.forward()
+
+        qdata_int8_np, min_output_np, max_output_np = requantize_baseline(qdata.asnumpy(), min_range.asscalar(),
+                                                                          max_range.asscalar(),
+                                                                          min_calib_range=min_calib_range,
+                                                                          max_calib_range=max_calib_range)
+        assert_almost_equal(qdata_int8.asnumpy(), qdata_int8_np, atol = 1)
+        assert_almost_equal(min_output.asnumpy(), np.array([min_output_np]))
+        assert_almost_equal(max_output.asnumpy(), np.array([max_output_np]))
+
+    # test with symbol API.
+    check_requantize_with_symbol((3, 4, 10, 10))
+    check_requantize_with_symbol((32, 3, 23, 23))
+    check_requantize_with_symbol((3, 4, 10, 10), min_calib_range=-1050.0, max_calib_range=1040.0)
+    check_requantize_with_symbol((32, 3, 23, 23), min_calib_range=-134.349, max_calib_range=523.43)
+    # Test with nd array API
+    check_requantize((3, 4, 10, 10))
+    check_requantize((32, 3, 23, 23))
+    check_requantize((3, 4, 10, 10), min_calib_range=-1050.0, max_calib_range=1040.0)
+    check_requantize((32, 3, 23, 23), min_calib_range=-134.349, max_calib_range=523.43)
+
+
+def test_quantized_conv():
+    def check_quantized_conv(data_shape, kernel, num_filter, pad, stride, dilate, no_bias, qdtype):
+        if is_test_for_native_cpu():
+            print('skipped testing quantized_conv for native cpu since it is not supported yet')
+            return
+        elif is_test_for_mkldnn():
+            # TODO(Xinyu): https://github.com/apache/incubator-mxnet/issues/16830
+            print('skipped testing quantized_conv for mkldnn cpu since it is a flaky case')
+            return
+        elif qdtype == 'uint8' and is_test_for_gpu():
+            print('skipped testing quantized_conv for gpu uint8 since it is not supported yet')
+            return
+        elif is_test_for_gpu() and len(data_shape) != 4:
+            print('skipped testing quantized_conv for gpu 5d layout since it is not supported yet')
+            return
+
+        # run fp32 conv
+        data = mx.sym.Variable(name='data', shape=data_shape, dtype='float32')
+        conv = mx.sym.Convolution(data=data, kernel=kernel, num_filter=num_filter, pad=pad, stride=stride,
+                                  dilate=dilate, no_bias=no_bias, cudnn_off=False, name='conv')
+        arg_shapes, _, _ = conv.infer_shape(data=data_shape)
+        arg_names = conv.list_arguments()
+        conv_exe_fp32 = conv._simple_bind(ctx=mx.current_context(), grad_req='null')
+        if qdtype == 'uint8':
+            data_low = 0.0
+            data_high = 127.0
+        else:
+            data_low = -127.0
+            data_high = 127.0
+        conv_exe_fp32.arg_dict[arg_names[0]][:] = mx.nd.random.uniform(low=data_low, high=data_high,
+                                                                       shape=data_shape).astype('int32')
+        conv_exe_fp32.arg_dict[arg_names[1]][:] = mx.nd.random.uniform(low=-127.0, high=127.0,
+                                                                       shape=arg_shapes[1]).astype('int32')
+        if not no_bias:
+            conv_exe_fp32.arg_dict[arg_names[2]][:] = mx.nd.random.uniform(low=-127.0, high=127.0,
+                                                                           shape=arg_shapes[2]).astype('int32')
+        output = conv_exe_fp32.forward()[0]
+
+        # run quantized conv
+        qdata = mx.sym.Variable(name='qdata', shape=data_shape, dtype=qdtype)
+        qweight = mx.sym.Variable(name='qweight', dtype='int8')
+        min_data = mx.sym.Variable(name='min_data')
+        max_data = mx.sym.Variable(name='max_data')
+        min_weight = mx.sym.Variable(name='min_weight')
+        max_weight = mx.sym.Variable(name='max_weight')
+        quantized_conv = mx.sym.contrib.quantized_conv(data=qdata, weight=qweight, min_data=min_data,
+                                                       max_data=max_data, min_weight=min_weight,
+                                                       max_weight=max_weight, kernel=kernel,
+                                                       num_filter=num_filter, pad=pad, stride=stride,
+                                                       dilate=dilate, no_bias=no_bias)
+        qarg_names = quantized_conv.list_arguments()
+        type_dict = None
+        if not no_bias:
+            type_dict = {qarg_names[2]: 'int8'}
+        conv_exe_int8 = quantized_conv._simple_bind(ctx=mx.current_context(), type_dict=type_dict, grad_req='null')
+        conv_exe_int8.arg_dict[qarg_names[0]][:] = conv_exe_fp32.arg_dict[arg_names[0]].astype(qdtype)
+        conv_exe_int8.arg_dict[qarg_names[1]][:] = conv_exe_fp32.arg_dict[arg_names[1]].astype('int8')
+        quantized_range = 127.0
+        if no_bias:
+            conv_exe_int8.arg_dict[qarg_names[2]][:] = -quantized_range
+            conv_exe_int8.arg_dict[qarg_names[3]][:] = quantized_range
+            conv_exe_int8.arg_dict[qarg_names[4]][:] = -quantized_range
+            conv_exe_int8.arg_dict[qarg_names[5]][:] = quantized_range
+        else:
+            conv_exe_int8.arg_dict[qarg_names[2]][:] = conv_exe_fp32.arg_dict[arg_names[2]].astype('int8')
+            conv_exe_int8.arg_dict[qarg_names[3]][:] = -quantized_range
+            conv_exe_int8.arg_dict[qarg_names[4]][:] = quantized_range
+            conv_exe_int8.arg_dict[qarg_names[5]][:] = -quantized_range
+            conv_exe_int8.arg_dict[qarg_names[6]][:] = quantized_range
+            conv_exe_int8.arg_dict[qarg_names[7]][:] = -quantized_range
+            conv_exe_int8.arg_dict[qarg_names[8]][:] = quantized_range
+        qoutput, min_range, max_range = conv_exe_int8.forward()
+
+        if no_bias:
+            assert_almost_equal(output.asnumpy(), qoutput.asnumpy(), atol = 1)
+        else:
+            # with adding bias, accuracy loss should not be greater than one
+            diff = mx.nd.abs(output - qoutput.astype(output.dtype))
+            cond = mx.nd.lesser(2, diff).sum().asscalar()
+            assert cond == 0
+
+    for qdtype in ['int8', 'uint8']:
+        check_quantized_conv((3, 4, 28, 28), (3, 3), 128, (1, 1), (1, 1), (1, 1), True, qdtype)
+        check_quantized_conv((3, 4, 28, 28), (3, 3), 128, (1, 1), (1, 1), (1, 1), False, qdtype)
+        check_quantized_conv((1, 3, 4, 28, 28), (1, 3, 3), 128, (1, 1, 1), (1, 1, 1), (1, 1, 1), False, qdtype)
+        check_quantized_conv((1, 3, 4, 28, 28), (1, 3, 3), 128, (1, 1, 1), (1, 1, 1), (1, 1, 1), True, qdtype)
+        check_quantized_conv((1, 3, 4, 28, 28), (1, 3, 3), 128, (1, 1, 1), (1, 1, 1), (2, 2, 2), False, qdtype)
+        check_quantized_conv((1, 3, 4, 28, 28), (1, 3, 3), 128, (1, 1, 1), (1, 1, 1), (2, 2, 2), True, qdtype)
+
+
+def test_quantized_elemwise_add():
+    def check_quantized_elemwise_add(data_shape, qtype):
+        if is_test_for_native_cpu():
+            print('skipped testing quantized_elemwise_add for native cpu since it is not supported yet')
+            return
+        elif qtype != 'uint8' and qtype != 'int8':
+            print('skipped testing quantized_elemwise_add for unsupported data type')
+            return
+        elif is_test_for_gpu():
+            print('skipped testing quantized_elemwise_add for gpu since it is not supported yet')
+            return
+
+        dataA = mx.sym.Variable(name='dataA', shape=data_shape, dtype='float32')
+        dataB = mx.sym.Variable(name='dataB', shape=data_shape, dtype='float32')
+        elemwise_add_fp32 = mx.sym.elemwise_add(dataA, dataB)
+        arg_names = elemwise_add_fp32.list_arguments()
+        elemwise_add_fp32_exe = elemwise_add_fp32._simple_bind(ctx=mx.current_context(), grad_req='null')
+        if qtype == 'uint8':
+            data_low = 0.0
+            data_high = 255.0
+        else:
+            data_low = -127.0
+            data_high = 127.0
+
+        dataA_val = mx.nd.random.uniform(low=data_low, high=data_high, shape=data_shape).astype('int32')
+        dataB_val = mx.nd.random.uniform(low=data_low, high=data_high, shape=data_shape).astype('int32')
+        elemwise_add_fp32_exe.arg_dict[arg_names[0]][:] = dataA_val
+
+        elemwise_add_fp32_exe.arg_dict[arg_names[1]][:] = dataB_val
+
+        output = elemwise_add_fp32_exe.forward()[0]
+        print(output)
+        qdataA = mx.sym.Variable(name='qdataA', shape=data_shape, dtype=qtype)
+        qdataB = mx.sym.Variable(name='qdataB', shape=data_shape, dtype=qtype)
+        min_dataA = mx.sym.Variable(name='min_dataA', dtype='float32')
+        max_dataA = mx.sym.Variable(name='max_dataA', dtype='float32')
+        min_dataB = mx.sym.Variable(name='min_dataB', dtype='float32')
+        max_dataB = mx.sym.Variable(name='max_dataB', dtype='float32')
+        quantized_elemwise_add = mx.sym.contrib.quantized_elemwise_add(qdataA, qdataB, min_dataA, max_dataA, min_dataB, max_dataB)
+        elemwise_add_int8_exe = quantized_elemwise_add._simple_bind(ctx=mx.current_context(), grad_req='null')
+        qarg_names = quantized_elemwise_add.list_arguments()
+        elemwise_add_int8_exe.arg_dict[qarg_names[0]][:] = elemwise_add_fp32_exe.arg_dict[arg_names[0]].astype(qtype)
+        elemwise_add_int8_exe.arg_dict[qarg_names[1]][:] = elemwise_add_fp32_exe.arg_dict[arg_names[1]].astype(qtype)
+        quantized_range = 127.0
+        elemwise_add_int8_exe.arg_dict[qarg_names[2]][:] = data_low
+        elemwise_add_int8_exe.arg_dict[qarg_names[3]][:] = data_high
+        elemwise_add_int8_exe.arg_dict[qarg_names[4]][:] = data_low
+        elemwise_add_int8_exe.arg_dict[qarg_names[5]][:] = data_high
+        qoutput, min_range, max_range = elemwise_add_int8_exe.forward()
+        print(qoutput)
+        int8_rslt = qoutput.astype(output.dtype)*max_range/0x7fffffff
+        print(int8_rslt)

Review comment:
       Remove these `print` calls from the test cases?

##########
File path: example/quantization/README.md
##########
@@ -0,0 +1,184 @@
+# Model Quantization with Calibration Examples
+
+This folder contains examples of quantizing an FP32 model to a (U)INT8 model with Intel® MKL-DNN.

Review comment:
       As `MKL-DNN` has already been renamed to `oneDNN`, it would be better to use oneDNN in the documents. What do you think?

##########
File path: tests/python/quantization/test_quantization.py
##########
@@ -0,0 +1,1210 @@
+def test_quantized_elemwise_add():
+    def check_quantized_elemwise_add(data_shape, qtype):
+        if is_test_for_native_cpu():
+            print('skipped testing quantized_elemwise_add for native cpu since it is not supported yet')
+            return
+        elif qtype != 'uint8' and qtype != 'int8':
+            print('skipped testing quantized_elemwise_add for unsupported data type')
+            return
+        elif is_test_for_gpu():
+            print('skipped testing quantized_elemwise_add for gpu since it is not supported yet')
+            return
+
+        dataA = mx.sym.Variable(name='dataA', shape=data_shape, dtype='float32')
+        dataB = mx.sym.Variable(name='dataB', shape=data_shape, dtype='float32')
+        elemwise_add_fp32 = mx.sym.elemwise_add(dataA, dataB)
+        arg_names = elemwise_add_fp32.list_arguments()
+        elemwise_add_fp32_exe = elemwise_add_fp32._simple_bind(ctx=mx.current_context(), grad_req='null')
+        if qtype == 'uint8':
+            data_low = 0.0
+            data_high = 255.0
+        else:
+            data_low = -127.0
+            data_high = 127.0
+
+        dataA_val = mx.nd.random.uniform(low=data_low, high=data_high, shape=data_shape).astype('int32')
+        dataB_val = mx.nd.random.uniform(low=data_low, high=data_high, shape=data_shape).astype('int32')
+        elemwise_add_fp32_exe.arg_dict[arg_names[0]][:] = dataA_val
+
+        elemwise_add_fp32_exe.arg_dict[arg_names[1]][:] = dataB_val
+
+        output = elemwise_add_fp32_exe.forward()[0]
+        print(output)
+        qdataA = mx.sym.Variable(name='qdataA', shape=data_shape, dtype=qtype)
+        qdataB = mx.sym.Variable(name='qdataB', shape=data_shape, dtype=qtype)
+        min_dataA = mx.sym.Variable(name='min_dataA', dtype='float32')
+        max_dataA = mx.sym.Variable(name='max_dataA', dtype='float32')
+        min_dataB = mx.sym.Variable(name='min_dataB', dtype='float32')
+        max_dataB = mx.sym.Variable(name='max_dataB', dtype='float32')
+        quantized_elemwise_add = mx.sym.contrib.quantized_elemwise_add(qdataA, qdataB, min_dataA, max_dataA, min_dataB, max_dataB)
+        elemwise_add_int8_exe = quantized_elemwise_add._simple_bind(ctx=mx.current_context(), grad_req='null')
+        qarg_names = quantized_elemwise_add.list_arguments()
+        elemwise_add_int8_exe.arg_dict[qarg_names[0]][:] = elemwise_add_fp32_exe.arg_dict[arg_names[0]].astype(qtype)
+        elemwise_add_int8_exe.arg_dict[qarg_names[1]][:] = elemwise_add_fp32_exe.arg_dict[arg_names[1]].astype(qtype)
+        quantized_range = 127.0
+        elemwise_add_int8_exe.arg_dict[qarg_names[2]][:] = data_low
+        elemwise_add_int8_exe.arg_dict[qarg_names[3]][:] = data_high
+        elemwise_add_int8_exe.arg_dict[qarg_names[4]][:] = data_low
+        elemwise_add_int8_exe.arg_dict[qarg_names[5]][:] = data_high
+        qoutput, min_range, max_range = elemwise_add_int8_exe.forward()
+        print(qoutput)
+        int8_rslt = qoutput.astype(output.dtype)*max_range/0x7fffffff
+        print(int8_rslt)
+        diff = mx.nd.abs(output - int8_rslt)
+        cond = mx.nd.lesser(2, diff).sum().asscalar()
+        assert cond == 0
+
+    for qtype in ['int8', 'uint8']:
+        check_quantized_elemwise_add((4, 6), qtype)
+        # check_quantized_elemwise_add((13, 74, 52), qtype)

Review comment:
       Any reason for commenting out these cases?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org