Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/06/25 05:01:08 UTC

[GitHub] reminisce commented on a change in pull request #11325: Added TensorRT runtime integration

reminisce commented on a change in pull request #11325: Added TensorRT runtime integration
URL: https://github.com/apache/incubator-mxnet/pull/11325#discussion_r197672107
 
 

 ##########
 File path: docs/api/python/contrib/tensorrt.md
 ##########
 @@ -0,0 +1,117 @@
+# MxNet-TensorRT Runtime Integration
+## What is this?
+
+This document describes how to use the [MxNet](http://mxnet.incubator.apache.org/)-[TensorRT](https://developer.nvidia.com/tensorrt) runtime integration to accelerate model inference.
+
+## Why is TensorRT integration useful? 
+
+TensorRT can greatly speed up inference of deep learning models. One experiment on a Titan V (V100) GPU shows that with MxNet 1.2, we can get an approximately 3x speed-up when running inference of the ResNet-50 model on the CIFAR-10 dataset in single precision (fp32). As batch sizes and image sizes go up (for CNN inference), the benefit may be smaller, but in general TensorRT helps especially with:
+- many bandwidth-bound layers (e.g. pointwise operations) that benefit from GPU kernel fusion
+- inference use cases with tight latency requirements, where the client application can't wait for large batches to be queued up
+- embedded systems, where memory constraints are tighter than on servers
+- inference in reduced precision, especially integer (e.g. int8) inference
+
+In the past, the main hindrance for users wishing to benefit from TensorRT was that the model needed to be exported from the framework first. Once the model was exported through some means (an NNVM to TensorRT graph rewrite, via ONNX, etc.), one then had to write a TensorRT client application to feed the data into the TensorRT engine. Since at that point the model was independent of the original framework, and since TensorRT could only compute the neural network layers while the user had to bring their own data pipeline, this increased the burden on the user and reduced the likelihood of reproducibility (e.g. different frameworks may have slightly different data pipelines, or different flexibility in data pipeline operation ordering). Moreover, since frameworks typically support more operators than TensorRT, one might have to resort to TensorRT plugins for operations that aren't already available via the TensorRT graph API.
+
+The current experimental runtime integration of TensorRT with MxNet resolves the above concerns by ensuring that:
+- the graph is still executed by MxNet
+- the MxNet data pipeline is preserved
+- the TensorRT runtime integration logic partitions the graph into subgraphs that are either TensorRT-compatible or incompatible
+- the graph partitioner collects the TensorRT-compatible subgraphs, hands them over to TensorRT, and substitutes each such subgraph with a TensorRT library call, represented as a TensorRT node in NNVM
+- if a node is not TensorRT-compatible, it won't be extracted and substituted with a TensorRT call, and will still execute within MxNet
+
+The above points strike a compromise between the flexibility of MxNet and fast inference in TensorRT, without burdening the user with learning the TensorRT APIs or writing their own client application and data pipeline.
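+
+As a minimal sketch, assuming the experimental integration is toggled through the `MXNET_USE_TENSORRT` environment variable (the exact switch may differ in your build), enabling it from Python looks roughly like this:
+
+```python
+import os
+
+# Assumption: the experimental TensorRT integration is switched on via an
+# environment variable; with it unset, the same script runs in plain MxNet.
+os.environ['MXNET_USE_TENSORRT'] = '1'
+
+import mxnet as mx
+
+# Build or load a symbol and bind it as usual; TensorRT-compatible subgraphs
+# are extracted and handed to TensorRT automatically at bind time.
+```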
+
+## How do I build MxNet with TensorRT integration?
+
+Building MxNet together with TensorRT is somewhat complex. The recipe will hopefully be simplified in the near future, but for now it's easiest to build a Docker container with an Ubuntu 16.04 base, using the Dockerfile found under the `ci` subdirectory of the MxNet repository. You can build the container as follows (from the root of the MxNet repository):
+
+```no-highlight
+docker build -f ci/docker/Dockerfile.build.ubuntu_gpu_tensorrt -t mxnet_with_tensorrt .
+```
+
+Next, we can run this container as follows (don't forget to install [nvidia-docker](https://github.com/NVIDIA/nvidia-docker)):
+
+```no-highlight
+nvidia-docker run -ti --rm mxnet_with_tensorrt
+```
+
+After starting the container, you will find yourself in the /opt/mxnet directory by default.
+
+## Running a "hello, world" model / unit test
+
+You can then run the LeNet-5 unit test, which trains LeNet-5 on MNIST, runs inference both in plain MxNet and through the MxNet-TensorRT runtime integration, and compares the results. The test can be run as follows:
+
+```no-highlight
+python tests/python/tensorrt/test_tensorrt_lenet5.py
+```
+
+You should get a result similar to the following:
+
+```no-highlight
+Running inference in MxNet
+[03:31:18] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
+Running inference in MxNet-TensorRT
+[03:31:18] src/operator/contrib/nnvm_to_onnx.cc:152: ONNX graph construction complete.
+Building TensorRT engine, FP16 available:1
+    Max batch size:     1024
+    Max workspace size: 1024 MiB
+[03:31:18] src/operator/contrib/tensorrt.cc:85: TensorRT engine instantiated!!!
+MxNet accuracy: 98.680000
+MxNet-TensorRT accuracy: 98.680000
+```
+
+## Running a more complex model
+
+To show that the runtime integration handles more complex models such as ResNet-50 (which includes batch normalization as well as skip connections), the relevant script is included in the `example/image_classification/tensorrt` directory.
+
+## Building your own models
+
+When building your own models, feel free to use the above ResNet-50 model as an example. Here, we highlight a small number of issues that need to be taken into account.
+
+1. When loading a pre-trained model, the inference will be handled using the Symbol API, rather than the Module API.
+2. In order to provide the weights to the MxNet (NNVM) to TensorRT graph converter before the symbol is fully bound (before the memory is allocated, etc.), the `arg_params` and `aux_params` need to be provided to the symbol's `simple_bind` method. The weights and other values (e.g. moments learned from data by batch normalization, provided via `aux_params`) will be provided via the `shared_buffer` argument to `simple_bind` as follows:
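+
+The sketch below illustrates the idea; the checkpoint prefix, shapes and context are placeholders rather than the exact code from the example script:
+
+```python
+import mxnet as mx
+
+batch_size = 1  # placeholder batch size
+
+# Load a pre-trained checkpoint (prefix and epoch are placeholders).
+sym, arg_params, aux_params = mx.model.load_checkpoint('resnet-50', 0)
+
+# Merge the trained weights (arg_params) and the auxiliary states, e.g. the
+# moments learned by batch normalization (aux_params), into one dictionary.
+all_params = dict(arg_params)
+all_params.update(aux_params)
+
+# Bind the symbol for inference. Passing the merged parameters through
+# shared_buffer makes them available to the NNVM to TensorRT graph converter
+# before executor memory is allocated.
+executor = sym.simple_bind(ctx=mx.gpu(0),
+                           data=(batch_size, 3, 224, 224),
+                           softmax_label=(batch_size,),
+                           grad_req='null',
+                           shared_buffer=all_params)
+
+# Inference then proceeds as usual, e.g.:
+# out = executor.forward(is_train=False, data=some_batch)[0]
+```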
 
 Review comment:
   moments -> momenta

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services