Posted to commits@tvm.apache.org by "MingkangW (via GitHub)" <gi...@apache.org> on 2024/02/26 05:35:22 UTC

[I] [Bug] int8 slower than float32 on GPU [tvm]

MingkangW opened a new issue, #16643:
URL: https://github.com/apache/tvm/issues/16643

   ### Expected behavior
   
   The QNN paper reports that TVM can accelerate INT8 models on GPU, so an INT8 convolution should be at least as fast as its float32 counterpart.
   
   ### Actual behavior
   
   My tests did not reproduce this: the INT8 convolution runs far slower than the float32 one.
   
   ### Environment
   
   TVM version: Unity branch
   GPU: NVIDIA RTX 3090 (sm_86)
   OS: Ubuntu
   
   ### Steps to reproduce
   
   I am using the following script to benchmark the same conv2d workload in INT8 and in float32.
   ```
   import tvm
   from tvm import relay
   import numpy as np
   from tvm.contrib import graph_executor
   ## int8: 1x3x2048x2048 input, 64 3x3 filters, accumulate in int32
   data = relay.var("data", shape=(1, 3, 2048, 2048), dtype="int8")
   weight = relay.const((np.random.uniform(size=(64, 3, 3, 3)) * 127).astype("int8"), dtype="int8")
   conv = relay.nn.conv2d(data, weight, channels=64, kernel_size=(3, 3), out_dtype="int32")
   mod = tvm.IRModule.from_expr(relay.Function([data], conv))
   mod = relay.transform.InferType()(mod)
   print(mod)

   target = tvm.target.cuda(arch="sm_86")
   lib = relay.build(mod, target=target)
   dev = tvm.cuda()
   runtime = graph_executor.GraphModule(lib["default"](dev))
   res = runtime.benchmark(dev)
   print(res.mean)

   # float32: identical shapes and workload, accumulate in float32
   data = relay.var("data", shape=(1, 3, 2048, 2048), dtype="float32")
   weight = relay.const(np.random.uniform(size=(64, 3, 3, 3)) * 127, dtype="float32")
   conv = relay.nn.conv2d(data, weight, channels=64, kernel_size=(3, 3), out_dtype="float32")
   mod = tvm.IRModule.from_expr(relay.Function([data], conv))
   mod = relay.transform.InferType()(mod)
   print(mod)

   target = tvm.target.cuda(arch="sm_86")
   lib = relay.build(mod, target=target)
   dev = tvm.cuda()
   runtime = graph_executor.GraphModule(lib["default"](dev))
   res = runtime.benchmark(dev)
   print(res.mean)
   ```
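
   As a side note, `benchmark` times whatever happens to be in the uninitialized input buffer. A minimal sketch for feeding concrete data and reading the result back (reusing `runtime` and `dev` from the INT8 half above; the random input is just an example), in case the numerical output should be checked as well as the timing:

   ```
   # Feed a concrete int8 input instead of timing uninitialized memory.
   x = np.random.randint(-128, 128, size=(1, 3, 2048, 2048)).astype("int8")
   runtime.set_input("data", tvm.nd.array(x, dev))
   runtime.run()
   out = runtime.get_output(0).numpy()  # shape (1, 64, 2046, 2046), dtype int32
   ```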
   
   Running the script prints:
   
   ```
   def @main(%data: Tensor[(1, 3, 2048, 2048), int8] /* ty=Tensor[(1, 3, 2048, 2048), int8] */) -> Tensor[(1, 64, 2046, 2046), int32] {
     nn.conv2d(%data, meta[relay.Constant][0] /* ty=Tensor[(64, 3, 3, 3), int8] */, padding=[0, 0, 0, 0], channels=64, kernel_size=[3, 3], out_dtype="int32") /* ty=Tensor[(1, 64, 2046, 2046), int32] */
   }
   
   
   One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
   0.23173750480000002
   def @main(%data: Tensor[(1, 3, 2048, 2048), float32] /* ty=Tensor[(1, 3, 2048, 2048), float32] */) -> Tensor[(1, 64, 2046, 2046), float32] {
     nn.conv2d(%data, meta[relay.Constant][0] /* ty=Tensor[(64, 3, 3, 3), float32] */, padding=[0, 0, 0, 0], channels=64, kernel_size=[3, 3], out_dtype="float32") /* ty=Tensor[(1, 64, 2046, 2046), float32] */
   }
   
   
   0.00252002296
   ```
   
   For INT8 the convolution takes about 0.23 s per run, while for float32 it takes about 0.0025 s, so INT8 is roughly 90x slower. Note that the "operators have not been tuned" warning is printed only for the INT8 build.
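
   Since the warning appears only for the INT8 module, the INT8 conv2d has likely fallen back to an untuned default schedule. A minimal AutoTVM sketch for tuning it before re-benchmarking, assuming the `mod` and `target` from the INT8 half of the script above (the log filename is just a placeholder):

   ```
   from tvm import autotvm

   log_file = "conv2d_int8_cuda.log"  # placeholder log path

   # Extract the tunable conv2d task from the int8 module built above.
   tasks = autotvm.task.extract_from_program(mod["main"], target=target, params={})

   measure_option = autotvm.measure_option(
       builder=autotvm.LocalBuilder(),
       runner=autotvm.LocalRunner(number=10, repeat=3, min_repeat_ms=100, timeout=10),
   )

   for task in tasks:
       tuner = autotvm.tuner.XGBTuner(task)
       tuner.tune(
           n_trial=min(256, len(task.config_space)),
           measure_option=measure_option,
           callbacks=[autotvm.callback.log_to_file(log_file)],
       )

   # Rebuild with the best records found during tuning, then re-benchmark.
   with autotvm.apply_history_best(log_file):
       lib = relay.build(mod, target=target)
   ```

   If the gap persists after tuning, it would be worth checking whether an INT8-specific CUDA schedule is actually being selected for this 3-input-channel workload.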

