Posted to discuss-archive@tvm.apache.org by nolan via TVM Discuss <no...@discuss.tvm.ai> on 2020/08/31 02:43:50 UTC
[TVM Discuss] [Questions] Performance of same op and workload in different model varies differently
I compared two similar BERT models running on CPU with TVM: one converted from PyTorch, the other from MXNet. Because of the large performance gap, I did some profiling. The results show that the run time of the same op (matmul) with the same workload varies widely between the two models.
ENV:
1. TVM: built with MKL.
2. Intel CPU
3. OpenMP: `KMP_AFFINITY=compact,1,0 OMP_NUM_THREADS=24`
Model inference time:
# mxnet model
TVM Mean inference time: 5.53 ms
# pytorch model
TVM Mean inference time: 23.05 ms
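For reference, these timings were collected roughly as follows; this is a minimal sketch assuming `graph`, `lib` and `params` come from `relay.build()` of each model, and `data`/`input_data` are placeholder names:

    import tvm
    from tvm.contrib import graph_runtime

    # Assumption: graph/lib/params were produced by relay.build() for the
    # converted BERT model; everything runs on a single CPU context.
    ctx = tvm.cpu(0)
    module = graph_runtime.create(graph, lib, ctx)
    module.set_input(**params)
    module.set_input("data", input_data)  # placeholder input name/tensor

    # Average over repeated runs to get a stable mean latency.
    ftimer = module.module.time_evaluator("run", ctx, number=10, repeat=3)
    print("TVM Mean inference time: %.2f ms" % (ftimer().mean * 1000))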
Profiling result:
# MXNet model
Node Name              Ops                   Time(us)  Time(%)  Shape      Inputs  Outputs
---------------------  --------------------  --------  -------  ---------  ------  -------
fused_nn_dense_add_15  fused_nn_dense_add_1  308.926   5.58     (32, 768)  3       1
fused_nn_dense_add_11  fused_nn_dense_add_1  307.277   5.551    (32, 768)  3       1
# PyTorch Model
Node Name              Ops                   Time(us)  Time(%)  Shape      Inputs  Outputs
---------------------  --------------------  --------  -------  ---------  ------  -------
fused_nn_dense_add_3   fused_nn_dense_add_3  1783.75   7.631    (32, 768)  3       1
fused_nn_dense_add_31  fused_nn_dense_add_3  1593.08   6.815    (32, 768)  3       1
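A per-node breakdown like the one above can be obtained with TVM's debug runtime, which runs the graph node by node and prints exactly this kind of table; a minimal sketch, reusing `graph`/`lib`/`params` from above:

    from tvm.contrib.debugger import debug_runtime

    # The debug runtime dumps per-op execution times
    # (Node Name / Ops / Time(us) / Time(%) / Shape / Inputs / Outputs).
    m = debug_runtime.create(graph, lib, ctx)
    m.set_input(**params)
    m.set_input("data", input_data)  # placeholder input name/tensor
    m.run()  # executes the graph and prints the profiling table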
IR code (identical for the PyTorch and MXNet models):
attr [0] "compute_scope" = "fused_nn_dense_add_3_compute_";
attr [C: handle] "storage_scope" = "global";
allocate(C, float32, [24576]) {
  attr [0] "extern_scope" = 0;
  @tir.tvm_call_packed("tvm.contrib.cblas.matmul", @tir.tvm_stack_make_array(placeholder, @tir.tvm_stack_make_shape(32, 3072, dtype=handle), 0, 2, 0f32, 0, dtype=handle), @tir.tvm_stack_make_array(placeholder_1, @tir.tvm_stack_make_shape(768, 3072, dtype=handle), 0, 2, 0f32, 0, dtype=handle), @tir.tvm_stack_make_array(C, @tir.tvm_stack_make_shape(32, 768, dtype=handle), 0, 2, 0f32, 0, dtype=handle), False, True, dtype=int32)
  for (ax0: int32, 0, 32) "parallel" {
    for (ax1: int32, 0, 768) {
      T_add[((ax0*768) + ax1)] = ((float32*)C[((ax0*768) + ax1)] + (float32*)placeholder_2[ax1])
    }
  }
}
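The `tvm.contrib.cblas.matmul` packed call shows that `nn.dense` is offloaded to MKL's BLAS, and only the small bias-add loop runs on TVM's own parallel runtime. For context, a minimal sketch of a build that produces this lowering, assuming a `cblas`-enabled target (`mod`/`params` from the frontend converters; the `-mcpu` value is a placeholder):

    import tvm
    from tvm import relay

    # Assumption: mod/params come from relay.frontend.from_pytorch() or
    # relay.frontend.from_mxnet(). "-libs=cblas" routes nn.dense to the
    # BLAS (here MKL) implementation seen in the IR above.
    target = "llvm -mcpu=skylake-avx512 -libs=cblas"
    with tvm.transform.PassContext(opt_level=3):
        graph, lib, params = relay.build(mod, target=target, params=params)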
However, when setting `OMP_NUM_THREADS=1`, the inference times of the two models are the same, so this seems to be a problem with multiple threads.
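To make the single-thread comparison reproducible, the thread settings have to be pinned before any of the frameworks initializes its OpenMP runtime; a sketch, assuming everything runs in one Python script:

    import os

    # Must be set before importing tvm/torch/mxnet: once an OpenMP
    # runtime has initialized, changing these variables has no effect.
    os.environ["OMP_NUM_THREADS"] = "1"
    os.environ["KMP_AFFINITY"] = "compact,1,0"

    import tvm  # imported only after the environment is pinned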
What may cause the difference?
Refer to: https://github.com/apache/incubator-tvm/issues/6354
---
Posted by Chenfan via TVM Discuss <no...@discuss.tvm.ai>.
> However, when setting `OMP_NUM_THREADS=1`, the inference times of the two models are the same, so this seems to be a problem with multiple threads.
Could there be any thread-related limitation in your PyTorch script?
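One quick way to check is to print the intra-op thread count that PyTorch picked up when the model was traced; a purely diagnostic sketch, not part of your scripts:

    import torch

    # PyTorch ships its own OpenMP, so importing it (e.g. to trace the
    # model) may initialize a threading runtime before TVM/MKL does.
    print("torch intra-op threads:", torch.get_num_threads())

    # A call like this anywhere in the conversion script would also cap
    # the threads available to a shared OpenMP runtime:
    # torch.set_num_threads(1)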
---
Posted by nolan via TVM Discuss <no...@discuss.tvm.ai>.
There are no thread-related ops in the model. Besides, running with multiple threads is still faster than running with a single thread.