Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2021/09/24 12:51:39 UTC

[GitHub] [tvm] masahi edited a comment on issue #8294: [AMP] CUDA support for mixed precision pass

masahi edited a comment on issue #8294:
URL: https://github.com/apache/tvm/issues/8294#issuecomment-926599422


   I finally finished collecting data on FP16 performance using Tensor Cores. Since the NHWC conv2d tensorcore schedule requires the batch size to be a multiple of at least 8, all batch sizes are 8. The speedup over FP32 (Ansor), which is a strong baseline, is mixed. I expected better performance from tensorcore, but I guess our tensorcore schedules still have room for improvement (I also hit a lot of errors due to invalid schedules when tuning them).
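
   For reference, a minimal sketch of the pass pipeline behind the FP16 numbers; `mod` stands in for a Relay module imported from one of the frontends, and the exact pass ordering here is an assumption:

   ```python
   import tvm
   from tvm import relay

   # `mod` is assumed to come from a frontend importer,
   # e.g. relay.frontend.from_pytorch(...) or from_onnx(...).
   seq = tvm.transform.Sequential([
       relay.transform.InferType(),
       # Convert conv2d to NHWC so the tensorcore schedules apply.
       relay.transform.ConvertLayout({"nn.conv2d": ["NHWC", "default"]}),
       # The AMP pass: casts ops considered FP16-safe to float16,
       # leaving numerically sensitive ops in FP32.
       relay.transform.ToMixedPrecision("float16"),
   ])
   with tvm.transform.PassContext(opt_level=3):
       mod = seq(mod)
   ```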
   
   In most cases, we are much slower than TensorRT (though I'm not sure whether the TensorRT `deeplabv3` number is off).
   
   All numbers are in milliseconds, measured on an RTX 3070 (a measurement sketch follows the table). All models are in the NHWC layout.
   
   Model name | Input size | FP32 ms (auto-scheduler, no tensorcore) | FP16 ms (AutoTVM, tensorcore) | FP16 ms (TensorRT)
   -- | -- | -- | -- | --
   resnet50 | (8, 3, 224, 224) | 8.61 | 4.14 | 2.53
   efficientnet_v2 | (8, 3, 224, 224) | 21.6 | 13.2 | 5.25
   YOLOv5l | (8, 3, 512, 512) | 32.4 | 13.22 | NA
   DETR-R50 | (8, 3, 800, 750) | 108.3 | 80.5 | NA
   deeplabv3_mobilenet_v3_large | (8, 3, 512, 512) | 22.6 | 15.9 | 19.2
   bert_large | (8, 128) | 109.9 | 24.2 | 14.0
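
   The measurement sketch mentioned above, using TVM's time evaluator; `lib` stands in for a module compiled with `relay.build`, and the input name `"input"` is a placeholder:

   ```python
   import numpy as np
   import tvm
   from tvm.contrib import graph_executor

   # `lib` is assumed to be the output of relay.build(mod, target="cuda", ...).
   dev = tvm.cuda(0)
   module = graph_executor.GraphModule(lib["default"](dev))
   module.set_input("input", np.random.randn(8, 3, 224, 224).astype("float32"))

   # Average runtime over repeated runs; results are in seconds,
   # so multiply by 1000 to report milliseconds.
   ftimer = module.module.time_evaluator("run", dev, number=10, repeat=3)
   print("mean: %.2f ms" % (np.mean(ftimer().results) * 1000))
   ```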
   

