Posted to events@mxnet.apache.org by apachemxnetday <ap...@nvidia.com> on 2020/11/23 16:43:23 UTC

FW: Proposal/abstract "Serving 1 Million BERT inference requests for 20 cents"


From: Moises Hernandez <mo...@nvidia.com>
Date: Monday, November 16, 2020 at 7:51 AM
To: apachemxnetday <ap...@nvidia.com>
Subject: Proposal/abstract "Serving 1 Million BERT inference requests for 20 cents"

"Serving 1 Million BERT inference requests for 20 cents"

Attention-based models like BERT have revolutionized natural language processing (NLP) thanks to their ability to outperform traditional models on language tasks, as shown by their high scores on various NLP benchmarks. However, even smaller BERT models have more than 100 million parameters, making it difficult to achieve near-real-time inference speeds on generic compute hardware. GPUs generally outperform CPUs for BERT inference, but they also tend to cost more than CPU instances on AWS. The newer Tensor Core GPUs have proven to be more cost-effective and efficient for running inference workloads.

In this talk, we will present a solution for performing inference on the popular BERT model in less than 4 ms using NVIDIA T4 GPUs on AWS EC2 G4 instances. We will cover specific optimizations to the model layers, such as softmax, bias-term addition, Gaussian Error Linear Units (GELU), and multi-head attention, that significantly accelerate BERT inference performance. Our solution is built to improve the performance of NLP tasks like question answering and of classification tasks like sentiment analysis and domain classification. All of this work has been implemented as part of the Apache MXNet and GluonNLP frameworks and is available in the latest MXNet release. Lastly, we will cover how a user can leverage Amazon SageMaker to deploy the optimized BERT model and serve one million BERT inference requests for less than 20 cents on AWS.
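To make the setup concrete, here is a minimal sketch of loading a pretrained BERT-base model with the GluonNLP 0.x API and timing a single forward pass on a GPU. The model variant, dataset name, sequence length, and sample sentence are illustrative defaults, not the exact configuration benchmarked in the talk:

    import time
    import mxnet as mx
    import gluonnlp as nlp

    ctx = mx.gpu(0)  # e.g. the NVIDIA T4 on an EC2 G4 instance

    # Load a pretrained 12-layer BERT-base model and its vocabulary.
    bert, vocab = nlp.model.get_model(
        'bert_12_768_12',
        dataset_name='book_corpus_wiki_en_uncased',
        pretrained=True,
        ctx=ctx,
        use_pooler=True,
        use_decoder=False,
        use_classifier=False)
    bert.hybridize(static_alloc=True)

    tokenizer = nlp.data.BERTTokenizer(vocab, lower=True)
    transform = nlp.data.BERTSentenceTransform(
        tokenizer, max_seq_length=128, pair=False)

    token_ids, valid_len, segment_ids = transform(('is it going to rain today',))
    words = mx.nd.array([token_ids], ctx=ctx)
    segments = mx.nd.array([segment_ids], ctx=ctx)
    valid_length = mx.nd.array([valid_len], ctx=ctx)

    # One warm-up pass, then a timed forward pass.
    bert(words, segments, valid_length)
    mx.nd.waitall()
    tic = time.time()
    seq_encoding, pooled = bert(words, segments, valid_length)
    mx.nd.waitall()
    print('latency: %.2f ms' % ((time.time() - tic) * 1000))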
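As an example of the layer-level work, one common GELU optimization replaces the exact erf-based activation with a tanh approximation that fused kernels can evaluate cheaply. The sketch below shows only the underlying math; the fused CUDA kernels described in the talk are not reproduced here, and the talk's actual fusion strategy may differ:

    import math
    import mxnet as mx

    def gelu_erf(x):
        # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
        return 0.5 * x * (1.0 + mx.nd.erf(x / math.sqrt(2.0)))

    def gelu_tanh(x):
        # Tanh approximation commonly used in fast inference kernels.
        return 0.5 * x * (1.0 + mx.nd.tanh(
            math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

    x = mx.nd.random.normal(shape=(2, 8))
    # The two forms agree closely; the approximation avoids erf entirely.
    print(mx.nd.max(mx.nd.abs(gelu_erf(x) - gelu_tanh(x))))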
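And a rough sketch of the final deployment step with the SageMaker Python SDK; the S3 path, IAM role, and entry-point script are hypothetical placeholders rather than artifacts from the talk:

    from sagemaker.mxnet import MXNetModel

    # Hypothetical artifact location and serving script.
    model = MXNetModel(
        model_data='s3://my-bucket/bert-optimized/model.tar.gz',
        role='arn:aws:iam::123456789012:role/SageMakerRole',
        entry_point='inference.py',
        framework_version='1.7.0',
        py_version='py3')

    # ml.g4dn.xlarge hosts a single NVIDIA T4 GPU.
    predictor = model.deploy(
        initial_instance_count=1,
        instance_type='ml.g4dn.xlarge')

    result = predictor.predict({
        'question': 'When was MXNet released?',
        'context': 'Apache MXNet is an open-source deep learning framework...'})

As a back-of-envelope check on the title claim (assuming on-demand g4dn.xlarge pricing of roughly $0.53 per hour at the time of the talk): 20 cents buys about 23 minutes of T4 time, so serving one million requests in that window requires sustained throughput on the order of 700 requests per second, which a sub-4 ms batched BERT forward pass puts within reach.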