Posted to github@beam.apache.org by "rszper (via GitHub)" <gi...@apache.org> on 2023/01/31 18:38:29 UTC

[GitHub] [beam] rszper commented on a diff in pull request #25226: Add TensorRT runinference example for Text Classification

rszper commented on code in PR #25226:
URL: https://github.com/apache/beam/pull/25226#discussion_r1092283439


##########
website/www/site/content/en/documentation/ml/overview.md:
##########
@@ -91,4 +91,5 @@ You can find examples of end-to-end AI/ML pipelines for several use cases:
 * [Online Clustering in Beam](/documentation/ml/online-clustering): Demonstrates how to set up a real-time clustering pipeline that can read text from Pub/Sub, convert the text into an embedding using a transformer-based language model with the RunInference API, and cluster the text using BIRCH with stateful processing.
 * [Anomaly Detection in Beam](/documentation/ml/anomaly-detection): Demonstrates how to set up an anomaly detection pipeline that reads text from Pub/Sub in real time and then detects anomalies using a trained HDBSCAN clustering model with the RunInference API.
 * [Large Language Model Inference in Beam](/documentation/ml/large-language-modeling): Demonstrates a pipeline that uses RunInference to perform translation with the T5 language model which contains 11 billion parameters.
-* [Per Entity Training in Beam](/documentation/ml/per-entity-training): Demonstrates a pipeline that trains a Decision Tree Classifier per education level for predicting if the salary of a person is >= 50k.
\ No newline at end of file
+* [Per Entity Training in Beam](/documentation/ml/per-entity-training): Demonstrates a pipeline that trains a Decision Tree Classifier per education level for predicting if the salary of a person is >= 50k.
+* [TensorRT Text Classification Inference](/documentation/ml/tensorrt-runinference): Demonstrates a pipeline to utilize TensorRT with the RunInference using a BERT-based text classification model.

Review Comment:
   ```suggestion
   * [TensorRT Text Classification Inference](/documentation/ml/tensorrt-runinference): Demonstrates a pipeline that uses TensorRT with the RunInference transform and a BERT-based text classification model.
   ```



##########
sdks/python/apache_beam/examples/inference/tensorrt_text_classification.py:
##########
@@ -0,0 +1,124 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""A pipeline to demonstrate usage of TensorRT with RunInference
+for a text classification model. This pipeline reads in memory data,

Review Comment:
   ```suggestion
   for a text classification model. This pipeline reads in-memory data,
   ```



##########
sdks/python/apache_beam/examples/inference/tensorrt_text_classification.py:
##########
@@ -0,0 +1,124 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""A pipeline to demonstrate usage of TensorRT with RunInference
+for a text classification model. This pipeline reads in memory data,
+does some preprocessing and then uses RunInference for getting prediction
+from the text classification TensorRT engine. Afterwards, it post process

Review Comment:
   ```suggestion
   from the text classification TensorRT engine. Next, it postprocesses
   ```



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference
+- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that facilitates high-performance machine learning inference. It is designed to work with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It focuses specifically on optimizing and running a trained neural network for inference efficiently on NVIDIA GPUs. TensorRT can maximize inference throughput with multiple optimizations while preserving model accuracy including model quantization, layer and tensor fusions, kernel auto-tuning, multi-stream executions, and efficient tensor memory usage.

Review Comment:
   ```suggestion
   - [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that facilitates high-performance machine learning inference. It is designed to work with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It focuses specifically on optimizing and running a trained neural network to efficiently run inference on NVIDIA GPUs. TensorRT can maximize inference throughput with multiple optimizations, such as model quantization, layer and tensor fusions, kernel auto-tuning, multi-stream executions, and efficient tensor memory usage, while preserving model accuracy.
   ```



##########
sdks/python/apache_beam/examples/inference/tensorrt_text_classification.py:
##########
@@ -0,0 +1,124 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""A pipeline to demonstrate usage of TensorRT with RunInference
+for a text classification model. This pipeline reads in memory data,
+does some preprocessing and then uses RunInference for getting prediction
+from the text classification TensorRT engine. Afterwards, it post process
+the RunInference outputs to print the input and the predicted class label.
+It also prints different metrics provided by RunInference.
+"""
+
+import argparse
+import logging
+
+import numpy as np
+
+import apache_beam as beam
+from apache_beam.ml.inference.base import RunInference
+from apache_beam.ml.inference.tensorrt_inference import TensorRTEngineHandlerNumPy
+from apache_beam.options.pipeline_options import PipelineOptions
+from apache_beam.options.pipeline_options import SetupOptions
+from transformers import AutoTokenizer
+
+
+class Preprocess(beam.DoFn):
+  """Processes the input sentence to tokenize them.

Review Comment:
   ```suggestion
     """Processes the input sentences to tokenize them.
   ```



##########
sdks/python/apache_beam/examples/inference/tensorrt_text_classification.py:
##########
@@ -0,0 +1,124 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""A pipeline to demonstrate usage of TensorRT with RunInference
+for a text classification model. This pipeline reads in memory data,
+does some preprocessing and then uses RunInference for getting prediction
+from the text classification TensorRT engine. Afterwards, it post process
+the RunInference outputs to print the input and the predicted class label.
+It also prints different metrics provided by RunInference.

Review Comment:
   ```suggestion
   It also prints metrics provided by RunInference.
   ```
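   Not part of the suggestion, but as a side note: the metrics-printing code isn't visible in this hunk. A minimal sketch, assuming only the standard Beam metrics API (the exact metric names and namespaces that RunInference publishes aren't shown here), of how a pipeline can query and print these metrics after the run finishes:

   ```
   import apache_beam as beam
   from apache_beam.metrics.metric import MetricsFilter

   # Run the pipeline explicitly (instead of a `with` block) so the
   # PipelineResult stays available for querying metrics afterwards.
   pipeline = beam.Pipeline()
   _ = pipeline | "CreateInputs" >> beam.Create(["example sentence"])
   # ... Preprocess | RunInference | Postprocess would be chained here ...

   result = pipeline.run()
   result.wait_until_finish()

   # An empty MetricsFilter matches everything the runner collected;
   # RunInference publishes inference counters and latency distributions
   # alongside any user-defined metrics.
   metrics = result.metrics().query(MetricsFilter())
   for counter in metrics["counters"]:
     print(counter)
   for distribution in metrics["distributions"]:
     print(distribution)
   ```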



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference
+- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that facilitates high-performance machine learning inference. It is designed to work with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It focuses specifically on optimizing and running a trained neural network for inference efficiently on NVIDIA GPUs. TensorRT can maximize inference throughput with multiple optimizations while preserving model accuracy including model quantization, layer and tensor fusions, kernel auto-tuning, multi-stream executions, and efficient tensor memory usage.
+
+- In Apache Beam 2.43.0, Beam introduced the `TensorRTEngineHandler`, which lets you deploy a TensorRT engine in a Beam pipeline. The RunInference transform simplifies the ML pipeline creation process by allowing developers to use Sklearn, PyTorch, TensorFlow and now TensorRT models in production pipelines without needing lots of boilerplate code.
+
+Below, you can find an example that demonstrates how to utilize TensorRT with the RunInference API using a BERT-based text classification model in a Beam pipeline.
+
+# Build a TensorRT engine for inference
+To use TensorRT with Apache Beam, you need a converted TensorRT engine file from a trained model. We take a trained BERT based text classification model that does sentiment analysis, i.e. classifies any text into two classes: positive or negative. The trained model is easily available [here](https://huggingface.co/textattack/bert-base-uncased-SST-2). In order to convert the PyTorch Model to TensorRT engine, you need to first convert the model to ONNX and then from ONNX to TensorRT.

Review Comment:
   ```suggestion
   To use TensorRT with Apache Beam, you need a converted TensorRT engine file from a trained model. We take a trained BERT-based text classification model that does sentiment analysis, that is, it classifies any text into two classes: positive or negative. The trained model is available [from Hugging Face](https://huggingface.co/textattack/bert-base-uncased-SST-2). To convert the PyTorch model to a TensorRT engine, you first need to convert the model to ONNX and then from ONNX to TensorRT.
   ```



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference
+- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that facilitates high-performance machine learning inference. It is designed to work with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It focuses specifically on optimizing and running a trained neural network for inference efficiently on NVIDIA GPUs. TensorRT can maximize inference throughput with multiple optimizations while preserving model accuracy including model quantization, layer and tensor fusions, kernel auto-tuning, multi-stream executions, and efficient tensor memory usage.
+
+- In Apache Beam 2.43.0, Beam introduced the `TensorRTEngineHandler`, which lets you deploy a TensorRT engine in a Beam pipeline. The RunInference transform simplifies the ML pipeline creation process by allowing developers to use Sklearn, PyTorch, TensorFlow and now TensorRT models in production pipelines without needing lots of boilerplate code.
+
+Below, you can find an example that demonstrates how to utilize TensorRT with the RunInference API using a BERT-based text classification model in a Beam pipeline.
+
+# Build a TensorRT engine for inference
+To use TensorRT with Apache Beam, you need a converted TensorRT engine file from a trained model. We take a trained BERT based text classification model that does sentiment analysis, i.e. classifies any text into two classes: positive or negative. The trained model is easily available [here](https://huggingface.co/textattack/bert-base-uncased-SST-2). In order to convert the PyTorch Model to TensorRT engine, you need to first convert the model to ONNX and then from ONNX to TensorRT.
+
+### Conversion to ONNX
+
+Using HuggingFace `transformers` libray, one can easily convert a PyTorch model to ONNX. A detailed blogpost can be found [here](https://huggingface.co/blog/convert-transformers-to-onnx) which also mentions the required packages to install. The code that we used for the conversion can be found below.
+
+```
+from pathlib import Path
+import transformers
+from transformers.onnx import FeaturesManager
+from transformers import AutoConfig, AutoTokenizer, AutoModelForMaskedLM, AutoModelForSequenceClassification
+
+
+# load model and tokenizer
+model_id = "textattack/bert-base-uncased-SST-2"
+feature = "sequence-classification"
+model = AutoModelForSequenceClassification.from_pretrained(model_id)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+# load config
+model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(model, feature=feature)
+onnx_config = model_onnx_config(model.config)
+
+# export
+onnx_inputs, onnx_outputs = transformers.onnx.export(
+        preprocessor=tokenizer,
+        model=model,
+        config=onnx_config,
+        opset=12,
+        output=Path("bert-sst2-model.onnx")
+)
+```
+
+### From ONNX to TensorRT engine
+
+In order to convert an ONNX model to a TensorRT engine you can use the following command from `CLI`:

Review Comment:
   ```suggestion
   To convert an ONNX model to a TensorRT engine, use the following command from the `CLI`:
   ```
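   Not part of the suggestion: as a concrete illustration using the ONNX file name from the export snippet above, the invocation could look like `trtexec --onnx=bert-sst2-model.onnx --saveEngine=bert_sst2_model.trt --useCudaGraph --verbose` (the engine file name here is a placeholder).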



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference
+- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that facilitates high-performance machine learning inference. It is designed to work with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It focuses specifically on optimizing and running a trained neural network for inference efficiently on NVIDIA GPUs. TensorRT can maximize inference throughput with multiple optimizations while preserving model accuracy including model quantization, layer and tensor fusions, kernel auto-tuning, multi-stream executions, and efficient tensor memory usage.
+
+- In Apache Beam 2.43.0, Beam introduced the `TensorRTEngineHandler`, which lets you deploy a TensorRT engine in a Beam pipeline. The RunInference transform simplifies the ML pipeline creation process by allowing developers to use Sklearn, PyTorch, TensorFlow and now TensorRT models in production pipelines without needing lots of boilerplate code.
+
+Below, you can find an example that demonstrates how to utilize TensorRT with the RunInference API using a BERT-based text classification model in a Beam pipeline.
+
+# Build a TensorRT engine for inference
+To use TensorRT with Apache Beam, you need a converted TensorRT engine file from a trained model. We take a trained BERT based text classification model that does sentiment analysis, i.e. classifies any text into two classes: positive or negative. The trained model is easily available [here](https://huggingface.co/textattack/bert-base-uncased-SST-2). In order to convert the PyTorch Model to TensorRT engine, you need to first convert the model to ONNX and then from ONNX to TensorRT.
+
+### Conversion to ONNX
+
+Using HuggingFace `transformers` libray, one can easily convert a PyTorch model to ONNX. A detailed blogpost can be found [here](https://huggingface.co/blog/convert-transformers-to-onnx) which also mentions the required packages to install. The code that we used for the conversion can be found below.
+
+```
+from pathlib import Path
+import transformers
+from transformers.onnx import FeaturesManager
+from transformers import AutoConfig, AutoTokenizer, AutoModelForMaskedLM, AutoModelForSequenceClassification
+
+
+# load model and tokenizer
+model_id = "textattack/bert-base-uncased-SST-2"
+feature = "sequence-classification"
+model = AutoModelForSequenceClassification.from_pretrained(model_id)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+# load config
+model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(model, feature=feature)
+onnx_config = model_onnx_config(model.config)
+
+# export
+onnx_inputs, onnx_outputs = transformers.onnx.export(
+        preprocessor=tokenizer,
+        model=model,
+        config=onnx_config,
+        opset=12,
+        output=Path("bert-sst2-model.onnx")
+)
+```
+
+### From ONNX to TensorRT engine
+
+In order to convert an ONNX model to a TensorRT engine you can use the following command from `CLI`:
+```
+trtexec --onnx=<path to onnx model> --saveEngine=<path to save TensorRT engine> --useCudaGraph --verbose
+```
+
+For using `trtexec`, you can follow this [blogpost](https://developer.nvidia.com/blog/simplifying-and-accelerating-machine-learning-predictions-in-apache-beam-with-nvidia-tensorrt/) which builds a docker image built from a DockerFile. The dockerFile that we used is similar to it and can be found below:

Review Comment:
   ```suggestion
   To use `trtexec`, follow the steps in the blog post [Simplifying and Accelerating Machine Learning Predictions in Apache Beam with NVIDIA TensorRT](https://developer.nvidia.com/blog/simplifying-and-accelerating-machine-learning-predictions-in-apache-beam-with-nvidia-tensorrt/). The post explains how to build a Docker image from a Dockerfile. We use the following Dockerfile, which is similar to the one used in the blog post:
   ```



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference

Review Comment:
   ```suggestion
   # Use TensorRT with RunInference
   ```



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference
+- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that facilitates high-performance machine learning inference. It is designed to work with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It focuses specifically on optimizing and running a trained neural network for inference efficiently on NVIDIA GPUs. TensorRT can maximize inference throughput with multiple optimizations while preserving model accuracy including model quantization, layer and tensor fusions, kernel auto-tuning, multi-stream executions, and efficient tensor memory usage.
+
+- In Apache Beam 2.43.0, Beam introduced the `TensorRTEngineHandler`, which lets you deploy a TensorRT engine in a Beam pipeline. The RunInference transform simplifies the ML pipeline creation process by allowing developers to use Sklearn, PyTorch, TensorFlow and now TensorRT models in production pipelines without needing lots of boilerplate code.
+
+Below, you can find an example that demonstrates how to utilize TensorRT with the RunInference API using a BERT-based text classification model in a Beam pipeline.
+
+# Build a TensorRT engine for inference
+To use TensorRT with Apache Beam, you need a converted TensorRT engine file from a trained model. We take a trained BERT based text classification model that does sentiment analysis, i.e. classifies any text into two classes: positive or negative. The trained model is easily available [here](https://huggingface.co/textattack/bert-base-uncased-SST-2). In order to convert the PyTorch Model to TensorRT engine, you need to first convert the model to ONNX and then from ONNX to TensorRT.
+
+### Conversion to ONNX
+
+Using HuggingFace `transformers` libray, one can easily convert a PyTorch model to ONNX. A detailed blogpost can be found [here](https://huggingface.co/blog/convert-transformers-to-onnx) which also mentions the required packages to install. The code that we used for the conversion can be found below.
+
+```
+from pathlib import Path
+import transformers
+from transformers.onnx import FeaturesManager
+from transformers import AutoConfig, AutoTokenizer, AutoModelForMaskedLM, AutoModelForSequenceClassification
+
+
+# load model and tokenizer
+model_id = "textattack/bert-base-uncased-SST-2"
+feature = "sequence-classification"
+model = AutoModelForSequenceClassification.from_pretrained(model_id)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+# load config
+model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(model, feature=feature)
+onnx_config = model_onnx_config(model.config)
+
+# export
+onnx_inputs, onnx_outputs = transformers.onnx.export(
+        preprocessor=tokenizer,
+        model=model,
+        config=onnx_config,
+        opset=12,
+        output=Path("bert-sst2-model.onnx")
+)
+```
+
+### From ONNX to TensorRT engine
+
+In order to convert an ONNX model to a TensorRT engine you can use the following command from `CLI`:
+```
+trtexec --onnx=<path to onnx model> --saveEngine=<path to save TensorRT engine> --useCudaGraph --verbose
+```
+
+For using `trtexec`, you can follow this [blogpost](https://developer.nvidia.com/blog/simplifying-and-accelerating-machine-learning-predictions-in-apache-beam-with-nvidia-tensorrt/) which builds a docker image built from a DockerFile. The dockerFile that we used is similar to it and can be found below:
+
+```
+ARG BUILD_IMAGE=nvcr.io/nvidia/tensorrt:22.05-py3
+
+FROM ${BUILD_IMAGE}
+
+ENV PATH="/usr/src/tensorrt/bin:${PATH}"
+
+WORKDIR /workspace
+
+RUN apt-get update && \
+    apt-get install -y software-properties-common && \
+    add-apt-repository universe && \
+    apt-get update && \
+    apt-get install -y python3.8-venv
+
+RUN pip install --no-cache-dir apache-beam[gcp]==2.43.0
+COPY --from=apache/beam_python3.8_sdk:2.43.0 /opt/apache/beam /opt/apache/beam
+
+RUN pip install --upgrade pip \
+    && pip install torch==1.13.1 \
+    && pip install torchvision>=0.8.2 \
+    && pip install pillow>=8.0.0 \
+    && pip install transformers>=4.18.0 \
+    && pip install cuda-python
+
+ENTRYPOINT [ "/opt/apache/beam/boot" ]
+```
+The blogpost also contains the instructions on how to test the TensorRT engine locally.
+
+
+## Running TensorRT Engine with RunInference in a Beam Pipeline

Review Comment:
   ```suggestion
   ## Run TensorRT engine with RunInference in a Beam pipeline
   ```



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference
+- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that facilitates high-performance machine learning inference. It is designed to work with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It focuses specifically on optimizing and running a trained neural network for inference efficiently on NVIDIA GPUs. TensorRT can maximize inference throughput with multiple optimizations while preserving model accuracy including model quantization, layer and tensor fusions, kernel auto-tuning, multi-stream executions, and efficient tensor memory usage.
+
+- In Apache Beam 2.43.0, Beam introduced the `TensorRTEngineHandler`, which lets you deploy a TensorRT engine in a Beam pipeline. The RunInference transform simplifies the ML pipeline creation process by allowing developers to use Sklearn, PyTorch, TensorFlow and now TensorRT models in production pipelines without needing lots of boilerplate code.
+
+Below, you can find an example that demonstrates how to utilize TensorRT with the RunInference API using a BERT-based text classification model in a Beam pipeline.
+
+# Build a TensorRT engine for inference
+To use TensorRT with Apache Beam, you need a converted TensorRT engine file from a trained model. We take a trained BERT based text classification model that does sentiment analysis, i.e. classifies any text into two classes: positive or negative. The trained model is easily available [here](https://huggingface.co/textattack/bert-base-uncased-SST-2). In order to convert the PyTorch Model to TensorRT engine, you need to first convert the model to ONNX and then from ONNX to TensorRT.
+
+### Conversion to ONNX
+
+Using HuggingFace `transformers` libray, one can easily convert a PyTorch model to ONNX. A detailed blogpost can be found [here](https://huggingface.co/blog/convert-transformers-to-onnx) which also mentions the required packages to install. The code that we used for the conversion can be found below.
+
+```
+from pathlib import Path
+import transformers
+from transformers.onnx import FeaturesManager
+from transformers import AutoConfig, AutoTokenizer, AutoModelForMaskedLM, AutoModelForSequenceClassification
+
+
+# load model and tokenizer
+model_id = "textattack/bert-base-uncased-SST-2"
+feature = "sequence-classification"
+model = AutoModelForSequenceClassification.from_pretrained(model_id)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+# load config
+model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(model, feature=feature)
+onnx_config = model_onnx_config(model.config)
+
+# export
+onnx_inputs, onnx_outputs = transformers.onnx.export(
+        preprocessor=tokenizer,
+        model=model,
+        config=onnx_config,
+        opset=12,
+        output=Path("bert-sst2-model.onnx")
+)
+```
+
+### From ONNX to TensorRT engine
+
+In order to convert an ONNX model to a TensorRT engine you can use the following command from `CLI`:
+```
+trtexec --onnx=<path to onnx model> --saveEngine=<path to save TensorRT engine> --useCudaGraph --verbose
+```
+
+For using `trtexec`, you can follow this [blogpost](https://developer.nvidia.com/blog/simplifying-and-accelerating-machine-learning-predictions-in-apache-beam-with-nvidia-tensorrt/) which builds a docker image built from a DockerFile. The dockerFile that we used is similar to it and can be found below:
+
+```
+ARG BUILD_IMAGE=nvcr.io/nvidia/tensorrt:22.05-py3
+
+FROM ${BUILD_IMAGE}
+
+ENV PATH="/usr/src/tensorrt/bin:${PATH}"
+
+WORKDIR /workspace
+
+RUN apt-get update && \
+    apt-get install -y software-properties-common && \
+    add-apt-repository universe && \
+    apt-get update && \
+    apt-get install -y python3.8-venv
+
+RUN pip install --no-cache-dir apache-beam[gcp]==2.43.0
+COPY --from=apache/beam_python3.8_sdk:2.43.0 /opt/apache/beam /opt/apache/beam
+
+RUN pip install --upgrade pip \
+    && pip install torch==1.13.1 \
+    && pip install torchvision>=0.8.2 \
+    && pip install pillow>=8.0.0 \
+    && pip install transformers>=4.18.0 \
+    && pip install cuda-python
+
+ENTRYPOINT [ "/opt/apache/beam/boot" ]
+```
+The blogpost also contains the instructions on how to test the TensorRT engine locally.
+
+
+## Running TensorRT Engine with RunInference in a Beam Pipeline
+
+Now that you have the TensorRT engine, you can use TensorRT engine with RunInferece in a Beam pipeline that can be run both locally and on GCP.

Review Comment:
   ```suggestion
   Now that you have the TensorRT engine, you can use it with RunInference in a Beam pipeline that can run both locally and on Google Cloud.
   ```
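   Not part of the suggestion, but to make "locally and on Google Cloud" concrete, here is a hedged sketch of the options side. None of these values come from this PR; the project, region, bucket, container image, and accelerator settings are placeholders:

   ```
   from apache_beam.options.pipeline_options import PipelineOptions

   # Local execution with the direct runner.
   local_options = PipelineOptions(["--runner=DirectRunner"])

   # Dataflow execution using the custom TensorRT container built earlier.
   # All resource names below are placeholders.
   dataflow_options = PipelineOptions([
       "--runner=DataflowRunner",
       "--project=my-gcp-project",
       "--region=us-central1",
       "--temp_location=gs://my-bucket/tmp",
       "--sdk_container_image=us.gcr.io/my-gcp-project/tensorrt-beam:latest",
       "--experiments=use_runner_v2",
       "--dataflow_service_options="
       "worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver",
   ])
   ```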



##########
sdks/python/apache_beam/examples/inference/tensorrt_text_classification.py:
##########
@@ -0,0 +1,124 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""A pipeline to demonstrate usage of TensorRT with RunInference
+for a text classification model. This pipeline reads in memory data,
+does some preprocessing and then uses RunInference for getting prediction

Review Comment:
   ```suggestion
   preprocesses the data, and then uses RunInference to generate predictions
   ```



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference
+- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that facilitates high-performance machine learning inference. It is designed to work with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It focuses specifically on optimizing and running a trained neural network for inference efficiently on NVIDIA GPUs. TensorRT can maximize inference throughput with multiple optimizations while preserving model accuracy including model quantization, layer and tensor fusions, kernel auto-tuning, multi-stream executions, and efficient tensor memory usage.
+
+- In Apache Beam 2.43.0, Beam introduced the `TensorRTEngineHandler`, which lets you deploy a TensorRT engine in a Beam pipeline. The RunInference transform simplifies the ML pipeline creation process by allowing developers to use Sklearn, PyTorch, TensorFlow and now TensorRT models in production pipelines without needing lots of boilerplate code.
+
+Below, you can find an example that demonstrates how to utilize TensorRT with the RunInference API using a BERT-based text classification model in a Beam pipeline.
+
+# Build a TensorRT engine for inference
+To use TensorRT with Apache Beam, you need a converted TensorRT engine file from a trained model. We take a trained BERT based text classification model that does sentiment analysis, i.e. classifies any text into two classes: positive or negative. The trained model is easily available [here](https://huggingface.co/textattack/bert-base-uncased-SST-2). In order to convert the PyTorch Model to TensorRT engine, you need to first convert the model to ONNX and then from ONNX to TensorRT.
+
+### Conversion to ONNX
+
+Using HuggingFace `transformers` libray, one can easily convert a PyTorch model to ONNX. A detailed blogpost can be found [here](https://huggingface.co/blog/convert-transformers-to-onnx) which also mentions the required packages to install. The code that we used for the conversion can be found below.
+
+```
+from pathlib import Path
+import transformers
+from transformers.onnx import FeaturesManager
+from transformers import AutoConfig, AutoTokenizer, AutoModelForMaskedLM, AutoModelForSequenceClassification
+
+
+# load model and tokenizer
+model_id = "textattack/bert-base-uncased-SST-2"
+feature = "sequence-classification"
+model = AutoModelForSequenceClassification.from_pretrained(model_id)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+# load config
+model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(model, feature=feature)
+onnx_config = model_onnx_config(model.config)
+
+# export
+onnx_inputs, onnx_outputs = transformers.onnx.export(
+        preprocessor=tokenizer,
+        model=model,
+        config=onnx_config,
+        opset=12,
+        output=Path("bert-sst2-model.onnx")
+)
+```
+
+### From ONNX to TensorRT engine
+
+In order to convert an ONNX model to a TensorRT engine you can use the following command from `CLI`:
+```
+trtexec --onnx=<path to onnx model> --saveEngine=<path to save TensorRT engine> --useCudaGraph --verbose
+```
+
+For using `trtexec`, you can follow this [blogpost](https://developer.nvidia.com/blog/simplifying-and-accelerating-machine-learning-predictions-in-apache-beam-with-nvidia-tensorrt/) which builds a docker image built from a DockerFile. The dockerFile that we used is similar to it and can be found below:
+
+```
+ARG BUILD_IMAGE=nvcr.io/nvidia/tensorrt:22.05-py3
+
+FROM ${BUILD_IMAGE}
+
+ENV PATH="/usr/src/tensorrt/bin:${PATH}"
+
+WORKDIR /workspace
+
+RUN apt-get update && \
+    apt-get install -y software-properties-common && \
+    add-apt-repository universe && \
+    apt-get update && \
+    apt-get install -y python3.8-venv
+
+RUN pip install --no-cache-dir apache-beam[gcp]==2.43.0
+COPY --from=apache/beam_python3.8_sdk:2.43.0 /opt/apache/beam /opt/apache/beam
+
+RUN pip install --upgrade pip \
+    && pip install torch==1.13.1 \
+    && pip install torchvision>=0.8.2 \
+    && pip install pillow>=8.0.0 \
+    && pip install transformers>=4.18.0 \
+    && pip install cuda-python
+
+ENTRYPOINT [ "/opt/apache/beam/boot" ]
+```
+The blogpost also contains the instructions on how to test the TensorRT engine locally.
+
+
+## Running TensorRT Engine with RunInference in a Beam Pipeline
+
+Now that you have the TensorRT engine, you can use TensorRT engine with RunInferece in a Beam pipeline that can be run both locally and on GCP.
+
+The following code example is a part of the pipeline, where you use `TensorRTEngineHandlerNumPy` to load the TensorRT engine and set other inference parameters.
+
+```
+  model_handler = TensorRTEngineHandlerNumPy(
+      min_batch_size=1,
+      max_batch_size=1,
+      engine_path=known_args.trt_model_path,
+  )
+
+  task_sentences = [
+      "Hello, my dog is cute",
+      "I hate you",
+      "Shubham Krishna is a good coder",
+  ] * 4000
+
+  tokenizer = AutoTokenizer.from_pretrained(known_args.model_id)
+
+  with beam.Pipeline(options=pipeline_options) as pipeline:
+    _ = (
+        pipeline
+        | "CreateInputs" >> beam.Create(task_sentences)
+        | "Preprocess" >> beam.ParDo(Preprocess(tokenizer=tokenizer))
+        | "RunInference" >> RunInference(model_handler=model_handler)
+        | "PostProcess" >> beam.ParDo(Postprocess(tokenizer=tokenizer)))
+```
+
+The full code can be found [here]().

Review Comment:
   Link text (currently [here]) should be the title of the page being linked to. If it's code on GitHub, use [on GitHub].



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference
+- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that facilitates high-performance machine learning inference. It is designed to work with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It focuses specifically on optimizing and running a trained neural network for inference efficiently on NVIDIA GPUs. TensorRT can maximize inference throughput with multiple optimizations while preserving model accuracy including model quantization, layer and tensor fusions, kernel auto-tuning, multi-stream executions, and efficient tensor memory usage.
+
+- In Apache Beam 2.43.0, Beam introduced the `TensorRTEngineHandler`, which lets you deploy a TensorRT engine in a Beam pipeline. The RunInference transform simplifies the ML pipeline creation process by allowing developers to use Sklearn, PyTorch, TensorFlow and now TensorRT models in production pipelines without needing lots of boilerplate code.
+
+Below, you can find an example that demonstrates how to utilize TensorRT with the RunInference API using a BERT-based text classification model in a Beam pipeline.

Review Comment:
   ```suggestion
   The following example demonstrates how to use TensorRT with the RunInference API and a BERT-based text classification model in a Beam pipeline.
   ```



##########
sdks/python/apache_beam/examples/inference/tensorrt_text_classification.py:
##########
@@ -0,0 +1,124 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""A pipeline to demonstrate usage of TensorRT with RunInference
+for a text classification model. This pipeline reads in memory data,
+does some preprocessing and then uses RunInference for getting prediction
+from the text classification TensorRT engine. Afterwards, it post process
+the RunInference outputs to print the input and the predicted class label.
+It also prints different metrics provided by RunInference.
+"""
+
+import argparse
+import logging
+
+import numpy as np
+
+import apache_beam as beam
+from apache_beam.ml.inference.base import RunInference
+from apache_beam.ml.inference.tensorrt_inference import TensorRTEngineHandlerNumPy
+from apache_beam.options.pipeline_options import PipelineOptions
+from apache_beam.options.pipeline_options import SetupOptions
+from transformers import AutoTokenizer
+
+
+class Preprocess(beam.DoFn):
+  """Processes the input sentence to tokenize them.
+
+  The input sentences are tokenized as the

Review Comment:
   ```suggestion
     The input sentences are tokenized because the
   ```
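   Not part of the suggestion: the body of `Preprocess` is truncated in this hunk. A minimal sketch of what a tokenizing DoFn like this could look like, consistent with the pipeline shown later in this thread; the padding strategy, sequence length, and dtype here are assumptions, not necessarily what this PR uses:

   ```
   import numpy as np
   import apache_beam as beam


   class Preprocess(beam.DoFn):
     """Tokenizes each sentence into the NumPy array the TensorRT engine expects."""
     def __init__(self, tokenizer):
       self._tokenizer = tokenizer

     def process(self, element: str):
       # Assumed settings: fixed-length padding so every element matches the
       # shape of the engine's input binding; int32 dtype is also an assumption.
       encoded = self._tokenizer(
           element,
           padding="max_length",
           truncation=True,
           max_length=128,
           return_tensors="np")
       yield np.squeeze(encoded["input_ids"]).astype(np.int32)
   ```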



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference
+- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that facilitates high-performance machine learning inference. It is designed to work with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It focuses specifically on optimizing and running a trained neural network for inference efficiently on NVIDIA GPUs. TensorRT can maximize inference throughput with multiple optimizations while preserving model accuracy including model quantization, layer and tensor fusions, kernel auto-tuning, multi-stream executions, and efficient tensor memory usage.
+
+- In Apache Beam 2.43.0, Beam introduced the `TensorRTEngineHandler`, which lets you deploy a TensorRT engine in a Beam pipeline. The RunInference transform simplifies the ML pipeline creation process by allowing developers to use Sklearn, PyTorch, TensorFlow and now TensorRT models in production pipelines without needing lots of boilerplate code.
+
+Below, you can find an example that demonstrates how to utilize TensorRT with the RunInference API using a BERT-based text classification model in a Beam pipeline.
+
+# Build a TensorRT engine for inference
+To use TensorRT with Apache Beam, you need a converted TensorRT engine file from a trained model. We take a trained BERT based text classification model that does sentiment analysis, i.e. classifies any text into two classes: positive or negative. The trained model is easily available [here](https://huggingface.co/textattack/bert-base-uncased-SST-2). In order to convert the PyTorch Model to TensorRT engine, you need to first convert the model to ONNX and then from ONNX to TensorRT.
+
+### Conversion to ONNX
+
+Using HuggingFace `transformers` libray, one can easily convert a PyTorch model to ONNX. A detailed blogpost can be found [here](https://huggingface.co/blog/convert-transformers-to-onnx) which also mentions the required packages to install. The code that we used for the conversion can be found below.

Review Comment:
   ```suggestion
   You can use the Hugging Face `transformers` library to convert a PyTorch model to ONNX. For details, see the blog post [Convert Transformers to ONNX with Hugging Face Optimum](https://huggingface.co/blog/convert-transformers-to-onnx), which also lists the packages you need to install. The following code is used for the conversion.
   ```



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference
+- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that facilitates high-performance machine learning inference. It is designed to work with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It focuses specifically on optimizing and running a trained neural network for inference efficiently on NVIDIA GPUs. TensorRT can maximize inference throughput with multiple optimizations while preserving model accuracy including model quantization, layer and tensor fusions, kernel auto-tuning, multi-stream executions, and efficient tensor memory usage.
+
+- In Apache Beam 2.43.0, Beam introduced the `TensorRTEngineHandler`, which lets you deploy a TensorRT engine in a Beam pipeline. The RunInference transform simplifies the ML pipeline creation process by allowing developers to use Sklearn, PyTorch, TensorFlow and now TensorRT models in production pipelines without needing lots of boilerplate code.
+
+Below, you can find an example that demonstrates how to utilize TensorRT with the RunInference API using a BERT-based text classification model in a Beam pipeline.
+
+# Build a TensorRT engine for inference
+To use TensorRT with Apache Beam, you need a converted TensorRT engine file from a trained model. We take a trained BERT based text classification model that does sentiment analysis, i.e. classifies any text into two classes: positive or negative. The trained model is easily available [here](https://huggingface.co/textattack/bert-base-uncased-SST-2). In order to convert the PyTorch Model to TensorRT engine, you need to first convert the model to ONNX and then from ONNX to TensorRT.
+
+### Conversion to ONNX
+
+Using HuggingFace `transformers` libray, one can easily convert a PyTorch model to ONNX. A detailed blogpost can be found [here](https://huggingface.co/blog/convert-transformers-to-onnx) which also mentions the required packages to install. The code that we used for the conversion can be found below.
+
+```
+from pathlib import Path
+import transformers
+from transformers.onnx import FeaturesManager
+from transformers import AutoConfig, AutoTokenizer, AutoModelForMaskedLM, AutoModelForSequenceClassification
+
+
+# load model and tokenizer
+model_id = "textattack/bert-base-uncased-SST-2"
+feature = "sequence-classification"
+model = AutoModelForSequenceClassification.from_pretrained(model_id)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+# load config
+model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(model, feature=feature)
+onnx_config = model_onnx_config(model.config)
+
+# export
+onnx_inputs, onnx_outputs = transformers.onnx.export(
+        preprocessor=tokenizer,
+        model=model,
+        config=onnx_config,
+        opset=12,
+        output=Path("bert-sst2-model.onnx")
+)
+```
+
+### From ONNX to TensorRT engine
+
+In order to convert an ONNX model to a TensorRT engine you can use the following command from `CLI`:
+```
+trtexec --onnx=<path to onnx model> --saveEngine=<path to save TensorRT engine> --useCudaGraph --verbose
+```
+
+For using `trtexec`, you can follow this [blogpost](https://developer.nvidia.com/blog/simplifying-and-accelerating-machine-learning-predictions-in-apache-beam-with-nvidia-tensorrt/) which builds a docker image built from a DockerFile. The dockerFile that we used is similar to it and can be found below:
+
+```
+ARG BUILD_IMAGE=nvcr.io/nvidia/tensorrt:22.05-py3
+
+FROM ${BUILD_IMAGE}
+
+ENV PATH="/usr/src/tensorrt/bin:${PATH}"
+
+WORKDIR /workspace
+
+RUN apt-get update && \
+    apt-get install -y software-properties-common && \
+    add-apt-repository universe && \
+    apt-get update && \
+    apt-get install -y python3.8-venv
+
+RUN pip install --no-cache-dir apache-beam[gcp]==2.43.0
+COPY --from=apache/beam_python3.8_sdk:2.43.0 /opt/apache/beam /opt/apache/beam
+
+RUN pip install --upgrade pip \
+    && pip install torch==1.13.1 \
+    && pip install torchvision>=0.8.2 \
+    && pip install pillow>=8.0.0 \
+    && pip install transformers>=4.18.0 \
+    && pip install cuda-python
+
+ENTRYPOINT [ "/opt/apache/beam/boot" ]
+```
+The blogpost also contains the instructions on how to test the TensorRT engine locally.

Review Comment:
   ```suggestion
   The blog post also contains instructions explaining how to test the TensorRT engine locally.
   ```



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference
+- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that facilitates high-performance machine learning inference. It is designed to work with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It focuses specifically on optimizing and running a trained neural network for inference efficiently on NVIDIA GPUs. TensorRT can maximize inference throughput with multiple optimizations while preserving model accuracy including model quantization, layer and tensor fusions, kernel auto-tuning, multi-stream executions, and efficient tensor memory usage.
+
+- In Apache Beam 2.43.0, Beam introduced the `TensorRTEngineHandler`, which lets you deploy a TensorRT engine in a Beam pipeline. The RunInference transform simplifies the ML pipeline creation process by allowing developers to use Sklearn, PyTorch, TensorFlow and now TensorRT models in production pipelines without needing lots of boilerplate code.
+
+Below, you can find an example that demonstrates how to utilize TensorRT with the RunInference API using a BERT-based text classification model in a Beam pipeline.
+
+# Build a TensorRT engine for inference

Review Comment:
   ```suggestion
   ## Build a TensorRT engine for inference
   ```



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference
+- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that facilitates high-performance machine learning inference. It is designed to work with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It focuses specifically on optimizing and running a trained neural network for inference efficiently on NVIDIA GPUs. TensorRT can maximize inference throughput with multiple optimizations while preserving model accuracy including model quantization, layer and tensor fusions, kernel auto-tuning, multi-stream executions, and efficient tensor memory usage.
+
+- In Apache Beam 2.43.0, Beam introduced the `TensorRTEngineHandler`, which lets you deploy a TensorRT engine in a Beam pipeline. The RunInference transform simplifies the ML pipeline creation process by allowing developers to use Sklearn, PyTorch, TensorFlow and now TensorRT models in production pipelines without needing lots of boilerplate code.
+
+Below, you can find an example that demonstrates how to utilize TensorRT with the RunInference API using a BERT-based text classification model in a Beam pipeline.
+
+# Build a TensorRT engine for inference
+To use TensorRT with Apache Beam, you need a converted TensorRT engine file from a trained model. We take a trained BERT based text classification model that does sentiment analysis, i.e. classifies any text into two classes: positive or negative. The trained model is easily available [here](https://huggingface.co/textattack/bert-base-uncased-SST-2). In order to convert the PyTorch Model to TensorRT engine, you need to first convert the model to ONNX and then from ONNX to TensorRT.
+
+### Conversion to ONNX
+
+Using HuggingFace `transformers` libray, one can easily convert a PyTorch model to ONNX. A detailed blogpost can be found [here](https://huggingface.co/blog/convert-transformers-to-onnx) which also mentions the required packages to install. The code that we used for the conversion can be found below.
+
+```
+from pathlib import Path
+import transformers
+from transformers.onnx import FeaturesManager
+from transformers import AutoConfig, AutoTokenizer, AutoModelForMaskedLM, AutoModelForSequenceClassification
+
+
+# load model and tokenizer
+model_id = "textattack/bert-base-uncased-SST-2"
+feature = "sequence-classification"
+model = AutoModelForSequenceClassification.from_pretrained(model_id)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+# load config
+model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(model, feature=feature)
+onnx_config = model_onnx_config(model.config)
+
+# export
+onnx_inputs, onnx_outputs = transformers.onnx.export(
+        preprocessor=tokenizer,
+        model=model,
+        config=onnx_config,
+        opset=12,
+        output=Path("bert-sst2-model.onnx")
+)
+```
+
+### From ONNX to TensorRT engine
+
+In order to convert an ONNX model to a TensorRT engine you can use the following command from `CLI`:
+```
+trtexec --onnx=<path to onnx model> --saveEngine=<path to save TensorRT engine> --useCudaGraph --verbose
+```
+
+For using `trtexec`, you can follow this [blogpost](https://developer.nvidia.com/blog/simplifying-and-accelerating-machine-learning-predictions-in-apache-beam-with-nvidia-tensorrt/) which builds a docker image built from a DockerFile. The dockerFile that we used is similar to it and can be found below:
+
+```
+ARG BUILD_IMAGE=nvcr.io/nvidia/tensorrt:22.05-py3
+
+FROM ${BUILD_IMAGE}
+
+ENV PATH="/usr/src/tensorrt/bin:${PATH}"
+
+WORKDIR /workspace
+
+RUN apt-get update && \
+    apt-get install -y software-properties-common && \
+    add-apt-repository universe && \
+    apt-get update && \
+    apt-get install -y python3.8-venv
+
+RUN pip install --no-cache-dir apache-beam[gcp]==2.43.0
+COPY --from=apache/beam_python3.8_sdk:2.43.0 /opt/apache/beam /opt/apache/beam
+
+RUN pip install --upgrade pip \
+    && pip install torch==1.13.1 \
+    && pip install torchvision>=0.8.2 \
+    && pip install pillow>=8.0.0 \
+    && pip install transformers>=4.18.0 \
+    && pip install cuda-python
+
+ENTRYPOINT [ "/opt/apache/beam/boot" ]
+```
+The blogpost also contains the instructions on how to test the TensorRT engine locally.
+
+
+## Running TensorRT Engine with RunInference in a Beam Pipeline
+
+Now that you have the TensorRT engine, you can use TensorRT engine with RunInferece in a Beam pipeline that can be run both locally and on GCP.
+
+The following code example is a part of the pipeline, where you use `TensorRTEngineHandlerNumPy` to load the TensorRT engine and set other inference parameters.

Review Comment:
   ```suggestion
   The following code example is part of the pipeline. You use `TensorRTEngineHandlerNumPy` to load the TensorRT engine and to set other inference parameters.
   ```
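   Also not part of the suggestion: the `Postprocess` DoFn isn't quoted in these hunks. A rough sketch of what it might do, given that it receives the tokenizer; the label names and the exact layout of `PredictionResult.inference` for this engine are assumptions:

   ```
   import numpy as np
   import apache_beam as beam
   from apache_beam.ml.inference.base import PredictionResult


   class Postprocess(beam.DoFn):
     """Decodes the tokenized input and prints it with the predicted label."""
     def __init__(self, tokenizer):
       self._tokenizer = tokenizer

     def process(self, element: PredictionResult):
       # `example` holds the token ids produced by Preprocess; decode them
       # back to text so the printed output is human readable.
       text = self._tokenizer.decode(element.example, skip_special_tokens=True)
       # Assumed output layout: a logits vector where index 1 is the
       # positive class and index 0 is the negative class.
       logits = np.asarray(element.inference).reshape(-1)
       label = "POSITIVE" if int(np.argmax(logits)) == 1 else "NEGATIVE"
       print(f"{text} -> {label}")
   ```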



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org