Posted to discuss-archive@tvm.apache.org by kindlehe via TVM Discuss <no...@discuss.tvm.ai> on 2020/04/07 16:30:40 UTC

[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu


[ [Torch, QNN] Add support for quantized models via QNN #4977](https://github.com/apache/incubator-tvm/pull/4977) gives the performance of quantized Torch models and of the converted TVM quantized models, but does not give a speed comparison between the two.
Where could I find more speed comparisons of these two kinds of quantization:
1. converting quantized models from Torch to Relay via QNN (as #4977 describes)
2. TVM int8 quantization, and TVM int8 quantization + AutoTVM (as [int8 is slower](https://discuss.tvm.ai/t/the-inference-time-is-longer-after-int8-quantization/3628/3) describes)

I am not sure why int8 is slower than float32, and I find many people reporting the same slowdown. Yet it is very hard to find an official tutorial on how to do quantization correctly for PyTorch or TF, official speed results for quantization, or reports from ordinary users who managed to convert PyTorch/TF models without losing accuracy and got the speedup they wanted.

Thanks for your kind help!





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/1) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by Animesh Jain via TVM Discuss <no...@discuss.tvm.ai>.

> [[topi] add ARM v8.2 udot (uint8) support #3978](https://github.com/apache/incubator-tvm/pull/3978)

This works if you have a machine/device with ARM v8.2 and the DOT instruction. Rasp3b and 4b don't have it.





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/46) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by kindlehe via TVM Discuss <no...@discuss.tvm.ai>.

[quote="anijain2305, post:36, topic:6256, full:true"]
Yes, that seems plausible. Please note that one might also make FP32 schedule better by working on low-level optimizations :) So, it is relative.
[/quote]


Can I define a new schedule to optimize performance and reach the same speed as QNNPACK?
If so, how can I do it? Are there any docs or code to use as a reference?





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/37) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by Animesh Jain via TVM Discuss <no...@discuss.tvm.ai>.

Yes, that seems plausible. Please note that one might also make FP32 schedule better by working on low-level optimizations :) So, it is relative.





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/36) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by kindlehe via TVM Discuss <no...@discuss.tvm.ai>.

@anijain2305 @masahi  
[ [topi] add ARM v8.2 udot (uint8) support #3978](https://github.com/apache/incubator-tvm/pull/3978)

As this PR says, the ARM platform supports udot (uint8). Can I conclude that ARM can achieve an int8 speedup thanks to the udot (uint8) support, and if so, what is the right way to enable it?





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/45) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by kindlehe via TVM Discuss <no...@discuss.tvm.ai>.

[quote="janimesh, post:5, topic:3920"]
MobileNet models have slowdown because they use Depthwise convolution that has not been configured to use VNNI instructions.
[/quote]

This might be the reason why TVM is slower than QNNPACK; see [this post](https://discuss.tvm.ai/t/quantization-story/3920/5?u=kindlehe).





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/38) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by kindlehe via TVM Discuss <no...@discuss.tvm.ai>.

@masahi  I set `os.environ["TVM_NUM_THREADS"] = str(2)`, but it does not help the speed.

I also watched the CPU usage of `tvm_model.module.time_evaluator` and `pt_model(inp)` with the `top` command; the CPU usage stays at or below 100%, which probably means that both TVM and Torch use only one thread for inference.
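For reference, here is a minimal sketch of the measurement described above; `tvm_model`, `pt_model`, `inp`, and `ctx` are assumed to already exist (the graph runtime module, the quantized Torch model, the input tensor, and the TVM context), so treat it as an outline rather than the exact benchmarking code.

```python
# Hedged sketch of the timing setup described above.
import os
os.environ["TVM_NUM_THREADS"] = "2"  # must be set before TVM spawns its thread pool

import time
import numpy as np
import torch

def benchmark(tvm_model, pt_model, inp, ctx, n_repeat=30):
    # TVM side: use the built-in time_evaluator on the graph runtime module.
    ftimer = tvm_model.module.time_evaluator("run", ctx, number=10, repeat=3)
    tvm_ms = np.mean(ftimer().results) * 1e3

    # Torch side: simple wall-clock loop around the quantized model.
    with torch.no_grad():
        start = time.time()
        for _ in range(n_repeat):
            pt_model(inp)
    torch_ms = (time.time() - start) / n_repeat * 1e3
    return tvm_ms, torch_ms
```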

Here is the speed comparison with `os.environ["TVM_NUM_THREADS"] = str(2)`:
```shell
Model name: resnet18, per channel quantization
TVM elapsed ms:126.58331409
Torch elapsed ms:34.38628673553467

Model name: resnet50, per channel quantization
TVM elapsed ms:292.58252946
Torch elapsed ms:77.93493032455444

Model name: mobilenet_v2, per channel quantization
TVM elapsed ms:24.695743800000006
Torch elapsed ms:11.568100452423096

Model name: mobilenet_v3 small, per channel quantization
TVM elapsed ms:7.13273288
Torch elapsed ms:9.331259727478027

Model name: mobilenet_v2_pretrained small, per channel quantization
TVM elapsed ms:19.51776834
Torch elapsed ms:13.21192979812622
```





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/20) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by kindlehe via TVM Discuss <no...@discuss.tvm.ai>.

Thanks very much!

I will check TVM_NUM_THREADS tomorrow morning.

Have you ever compared TVM's FP32 and INT8 speed on an Android ARM CPU? Do you think TVM@INT8 will be faster than TVM@FP32 on an Android device?





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/15) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by masahi via TVM Discuss <no...@discuss.tvm.ai>.

Hmm, I don't know why TVM is faster on mobilenet v3. Maybe it is because this is a newer model that the Torch team hasn't optimized for yet. But please make sure you are setting the `TVM_NUM_THREADS` env var correctly (it should be the number of physical cores).

The numbers seem consistent with what I've seen in my testing. I highly encourage you to get access to a machine with AVX512, preferably with VNNI support (Cascade Lake). There TVM completely blows away Torch perf and you will feel good and get excited :)





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/14) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by kindlehe via TVM Discuss <no...@discuss.tvm.ai>.

@masahi 
I wonder why PyTorch can run so fast.
Is it because PyTorch uses int8 on the same MacBook Pro, or is it some other speed-up technique?





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/24) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by masahi via TVM Discuss <no...@discuss.tvm.ai>.

Yes, the int16 thing is intended. See https://github.com/apache/incubator-tvm/pull/4307. @anijain2305 can give more details.

Int8 is only enabled for AVX512.





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/23) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by kindlehe via TVM Discuss <no...@discuss.tvm.ai>.

Yes, thanks again for your reply.

I just verified [tutorial_eager.py](https://github.com/Edgecortix-Inc/pytorch_quantization/blob/master/tutorial_eager.py) with torch-nightly (v1.6) on a MacBook Pro and got the 2-4x speed-up that the [static_quantization_tutorial](https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html#quantization-aware-training) reports. However, the top-1 accuracy drops by about 4 points for per_channel_quantized_model, so I will try QAT in Torch to get extra accuracy, following your advice.

1. **Deploy a Quantized Model on CUDA** runs on CUDA; does it also support inference on a CPU such as a MacBook Pro or Android?
2. Which approach do you recommend for high-accuracy, high-speed-up quantization: **TVM converting a quantized Torch model**, or **TVM's own quantization, independent of frameworks (e.g. TF, PyTorch, and so on)**?
3. Have you ever evaluated the accuracy and speed-up of the two approaches?
4. Which might be better or more reasonable for TVM's future development, from the perspective of TVM framework design?

Please excuse my many questions; I hope to discuss TVM with you smart and kind-hearted folks.





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/10) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by kindlehe via TVM Discuss <no...@discuss.tvm.ai>.

Yes, thanks again for your reply.





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/9) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by masahi via TVM Discuss <no...@discuss.tvm.ai>.

1. I don't have experience using QAT in Torch. I think post-training quantization is easier to work with; in any case, post-training quantization should be the first thing you try (a minimal sketch of that flow follows below). If you need extra accuracy, QAT may help.

2. Yes, see https://docs.tvm.ai/tutorials/frontend/deploy_quantized.html#sphx-glr-tutorials-frontend-deploy-quantized-py. This is the other quantization support TVM has: since TVM itself does the quantization, it doesn't matter which framework the models come from.
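For reference, a minimal sketch of that post-training static quantization flow in PyTorch (following the static_quantization_tutorial linked earlier in the thread); the calibration data here is just a random tensor, so treat it as an outline rather than a recipe for good accuracy.

```python
# Hedged sketch of eager-mode post-training static quantization in PyTorch.
import torch
from torchvision.models.quantization import resnet18

model = resnet18(pretrained=True, quantize=False).eval()
model.fuse_model()  # fuse conv+bn+relu so quantized kernels can be used

model.qconfig = torch.quantization.get_default_qconfig("fbgemm")  # x86; use "qnnpack" on ARM
torch.quantization.prepare(model, inplace=True)

# Calibration: run a few representative batches through the prepared model.
example_input = torch.rand(1, 3, 224, 224)
with torch.no_grad():
    model(example_input)

torch.quantization.convert(model, inplace=True)

# The converted int8 model can then be traced and handed to TVM's PyTorch frontend.
script_module = torch.jit.trace(model, example_input).eval()
```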





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/8) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by Animesh Jain via TVM Discuss <no...@discuss.tvm.ai>.

You are correct. I forgot about the PyTorch frontend for quantization. This is true for MXNet as well.

We can also make a tutorial for all frameworks. You can take care of PyTorch; I can take care of MXNet (similar to PyTorch) and TFLite (easy). It can be just one tutorial with different sections, one for each framework.

I agree this will be very helpful.





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/6) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by kindlehe via TVM Discuss <no...@discuss.tvm.ai>.

[quote="anijain2305, post:34, topic:6256, full:true"]
Yeah, the work by AliOS is not available yet. They worked a lot on very low-level optimizations. Over time, this work will hopefully be upstreamed. For now, on master, QNNPACK is faster.
[/quote]

You also said **For rasp3 and rasp4, we saw 1.3x - 1.5x performance speedup going from FP32 to Int8.** If your conclusion that **qnnpack-int8 is faster than tvm-int8 @rasp4** is right, then could TVM get **more than** a 1.3x - 1.5x speedup (maybe 1.5x - 1.8x, or more, just a guess) going from FP32 to Int8 once tvm-int8 becomes as fast as qnnpack-int8 @rasp4?





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/35) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by Animesh Jain via TVM Discuss <no...@discuss.tvm.ai>.

Yeah, the work by AliOS is not available yet. They worked a lot on very low-level optimizations. Over time, this work will hopefully be upstreamed. For now, on master, QNNPACK is faster.





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/34) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by kindlehe via TVM Discuss <no...@discuss.tvm.ai>.

Thanks very much for your detailed reply!

I will follow your suggestion and try these scripts later.

However, I still have some questions:
1. Should I use `Post-training static quantization` or `Quantization-aware training` for my own model, as the [static_quantization_tutorial](https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html) describes, before I apply `[Torch, QNN] Add support for quantized models via QNN #4977` or use the scripts you offered above?
2. Does TVM itself support converting an FP32 PyTorch model to an int8 model? If so, how do I do it while keeping accuracy and speed?

It is nice to see you all offering so much help to users!
Thanks sincerely; I look forward to your updated resources on TVM quantization.





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/7) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by kindlehe via TVM Discuss <no...@discuss.tvm.ai>.

This problem was solved by rebuilding TVM correctly.





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/12) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by kindlehe via TVM Discuss <no...@discuss.tvm.ai>.

I revised the input name in [imagenet_test.py](https://github.com/Edgecortix-Inc/pytorch_quantization/blob/master/tvm_qnn_evaluation/imagenet_test.py) as follows:
![image|690x218](upload://zZ5U3HUpfJvYzsQhgEyqfR3s6PI.png)  

But I get the following error when executing the resnet18 model:
![image|690x335](upload://dJFkXhPRYoLmFgC0XIWxVYnlIHB.png) 
![image|690x350](upload://b8eaqlFpBBunlOtT0Hmkt0jUirW.png) 
Here is my tvm build version:

ea0638886 (HEAD -> master, origin/master, origin/HEAD) [BUGFIX][IR] Fix String SEqual (#5275)





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/11) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by kindlehe via TVM Discuss <no...@discuss.tvm.ai>.

@masahi @anijain2305  
I am not very sure whether INT8 is used in `perf_bench`, because I see this log:
```
autotvm:Cannot find config for target=llvm -mcpu=core-avx2, workload=('dense_nopack.x86', ('TENSOR', (1, 1280), 'int16'), ('TENSOR', (1000, 1280), 'int16'), None, 'int32'). A fallback configuration is used, which may bring great performance regression.
```
I suspect that TVM being slower than Torch on one core is caused by INT8 not actually being used.

One possibility is that TVM converts the INT8 weights to int16, as the above log suggests, and runs inference in INT16 instead of INT8, while Torch runs inference in INT8. In other words, we would be comparing the speed of tvm-int16 against torch-int8. This is just an assumption, but I don't know how to check whether INT8 is really used during TVM inference.

If so, this might be a bug or something else wrong in TVM (not sure yet, just a guess :grinning:).





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/21) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by masahi via TVM Discuss <no...@discuss.tvm.ai>.

Yes, without HW support for int8, you shouldn't expect int8 to be any faster than fp32. For avx2, Torch is much faster than TVM for int8. For avx512, where int8 does make a difference, TVM is much faster. 

I have a script, https://github.com/Edgecortix-Inc/pytorch_quantization/tree/master/tvm_qnn_evaluation, which can also be used for perf benchmarking. Set this flag to True:
https://github.com/Edgecortix-Inc/pytorch_quantization/blob/master/tvm_qnn_evaluation/imagenet_test.py#L82 and pick your target here: https://github.com/Edgecortix-Inc/pytorch_quantization/blob/master/tvm_qnn_evaluation/test_util.py#L63

* For Skylake with AVX512 support, the target should be "llvm -mcpu=skylake-avx512"
* For Cascade Lake, "llvm -mcpu=cascadelake" (a compile sketch follows below)
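As an illustration only, here is roughly what picking one of those targets looks like when compiling the converted model yourself; `mod` and `params` are assumed to come from `relay.frontend.from_pytorch`, and the exact build API may differ slightly between TVM versions.

```python
# Hedged sketch: compile the imported Relay module for an AVX512/VNNI CPU.
import tvm
from tvm import relay

# "llvm -mcpu=skylake-avx512" for Skylake with AVX-512,
# "llvm -mcpu=cascadelake"    for Cascade Lake (AVX-512 + VNNI).
target = "llvm -mcpu=cascadelake"

# `mod` and `params` come from relay.frontend.from_pytorch (see the script above).
# (On older TVM checkouts, relay.build_config(opt_level=3) plays the same role.)
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)
```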

Maybe @anijain2305 can give more comments.





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/2) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by kindlehe via TVM Discuss <no...@discuss.tvm.ai>.

The speed was tested on 2 cores for TVM and 1 core for Torch,
so tvm@mobilenet-v3 is faster than torch@mobilenet-v3.





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/22) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by kindlehe via TVM Discuss <no...@discuss.tvm.ai>.

Here is the speed comparison between the quantized PyTorch models and the converted TVM models on a MacBook Pro.

I have no idea why TVM is faster than Torch for mobilenet-v3 but slower for resnet-18, resnet-50, and mobilenet-v2.

![image|690x396](upload://2ZCtF54A2wBVxKC0KDZZ23jyriT.png)





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/13) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by kindlehe via TVM Discuss <no...@discuss.tvm.ai>.

[quote="anijain2305, post:31, topic:6256, full:true"]
Yes, thats the selling point of TVM.

TVM community works together on these TVM schedules. As we get more people interested in quantization, we can add more TVM schedules, for e.g., avx2 machine you are talking about. We dont want to fully rely on FBGEMM or QNNPACK, because it might cause conflicts/prevents some other optimizations that TVM is good at.
[/quote]


I agree with the design principle of relying on TVM schedule optimization for quantization rather than on FBGEMM.

In a word: the more you depend on, the more problems you get.





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/32) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by Animesh Jain via TVM Discuss <no...@discuss.tvm.ai>.

Yes, that's the selling point of TVM.

The TVM community works together on these TVM schedules. As more people get interested in quantization, we can add more TVM schedules, e.g., for the avx2 machine you are talking about. We don't want to rely entirely on FBGEMM or QNNPACK, because that might conflict with or prevent some other optimizations that TVM is good at.





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/31) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by kindlehe via TVM Discuss <no...@discuss.tvm.ai>.

[quote="anijain2305, post:27, topic:6256, full:true"]
For rasp3 and rasp4, we saw 1.3x - 1.5x performance speedup going from FP32 to Int8.

The link comparing QNNPACK and TVM is not upstream'd yet. If I understand correctly, it will be sometime before the authors of that work will be able to make it to upstream. There are some differences in underlying design as well, which might cause some delays in getting to that performance.

Regarding int16, we observed that LLVM can generate good enough code with int16 instead of int8 for rasp3/4. So we uplift the datatype to int16 (the exceptions are Intel Cascade Lake and Nvidia devices). When we write a better schedule with int8 datatypes, we can remove the upcasting.
[/quote]

@anijain2305 
Hi, I tested the speed of fp32 and int8 for squeezenet on an Android device (arm64-v8a).
Here is my config:
```
target = 'llvm -device=arm_cpu -target=aarch64-linux-android -mattr=+v8.2a,+dotprod'
```
First, I load the fp32 model with `mod, params = relay.frontend.from_onnx(onnx_model, input_shapes)`. Then I convert the fp32 model to int8 with TVM's own `relay.quantize`:
```
with relay.quantize.qconfig(calibrate_mode='global_scale', global_scale=8.0):
    mod = relay.quantize.quantize(mod, params)
```
Here we only focus on the int8 speed-up relative to fp32, not on accuracy.
```
# Log for the quantized (int8) model:
WARNING:autotvm:Cannot find config for target=llvm -device=arm_cpu -target=arm64-linux-android -mattr=+v8.2a,+dotprod, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 512, 9, 9), 'int8'), ('TENSOR', (1000, 512, 1, 1), 'int8'), (1, 1), (0, 0, 0, 0), (1, 1), 'int32'). A fallback configuration is used, which may bring great performance regression.
Mean inference time (std dev): 24.38 ms (3.62 ms)

# Log for the FP32 model:
Cannot find config for target=llvm -device=arm_cpu -target=arm64-linux-android -mattr=+v8.2a,+dotprod, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 512, 9, 9), 'float32'), ('TENSOR', (1000, 512, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Mean inference time (std dev): 17.50 ms (2.44 ms)
```
You said there is a 1.3x - 1.5x performance speedup going from FP32 to Int8 for rasp3 and rasp4.
Could you please give some advice on:
1) how to eliminate the "Cannot find config for" warning;
2) how to achieve the 1.3x - 1.5x performance speedup; could you share your scripts for easy testing?

Thanks very much!





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/42) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by kindlehe via TVM Discuss <no...@discuss.tvm.ai>.

[quote="anijain2305, post:27, topic:6256, full:true"]
For rasp3 and rasp4, we saw 1.3x - 1.5x performance speedup going from FP32 to Int8.

The link comparing QNNPACK and TVM is not upstream'd yet. If I understand correctly, it will be sometime before the authors of that work will be able to make it to upstream. There are some differences in underlying design as well, which might cause some delays in getting to that performance.

Regarding int16, we observed that LLVM can generate good enough code with int16 instead of int8 for rasp3/4. So we uplift the datatype to int16 (the exceptions are Intel Cascade Lake and Nvidia devices). When we write a better schedule with int8 datatypes, we can remove the upcasting.
[/quote]

Thanks for the speedup report; TVM is really an excellent framework!
I hope to see more speedup comparisons between TVM, QNNPACK, and other DL frameworks on different CPUs (ARM, Intel, and so on). Such speedup data helps users estimate the compute ceiling on a given device, which significantly reduces the risk of diving into the wrong direction.





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/30) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by kindlehe via TVM Discuss <no...@discuss.tvm.ai>.

[quote="kindlehe, post:19, topic:6256, full:true"]
@anijain2305 
How much speedup does FP32 compared INT8 at rasp4?1.5×?

I saw some speedup conclusion [here](https://github.com/tvmai/meetup-slides/tree/master/tvm-meetup-shanghai-Nov-16-2019) saying that tvm is about 1.3×(=2.08/1.60)at mobilenet-v2@rasp 3b+AARCH64 than QNNPACK.

They reported apparent speedup for both mobilenet-v1 and mobilene-v2:
![image|690x431](upload://oqRljzqKWe45ll979kPI6Z8PeOE.jpeg) 

However,you say qnnpack-int8 is better than tvm-int8 @rasp4,which conclusion is more reliable?

If qnnpack is better,than why tvm develop int8 of its own instead of using qnnpack?
[/quote]

[quote="anijain2305, post:27, topic:6256, full:true"]
For rasp3 and rasp4, we saw 1.3x - 1.5x performance speedup going from FP32 to Int8.

The link comparing QNNPACK and TVM is not upstream'd yet. If I understand correctly, it will be sometime before the authors of that work will be able to make it to upstream. There are some differences in underlying design as well, which might cause some delays in getting to that performance.

Regarding int16, we observed that LLVM can generate good enough code with int16 instead of int8 for rasp3/4. So we uplift the datatype to int16 (the exceptions are Intel Cascade Lake and Nvidia devices). When we write a better schedule with int8 datatypes, we can remove the upcasting.
[/quote]

AliOS reported tvm-int8 as about 1.3x faster than QNNPACK on rasp 3b + AARCH64, but you said qnnpack-int8 is faster than tvm-int8 on rasp4. What do you think about the mismatch?

You also said **For rasp3 and rasp4, we saw 1.3x - 1.5x performance speedup going from FP32 to Int8.** If your conclusion that **qnnpack-int8 is faster than tvm-int8 @rasp4** is right, then could TVM get **more than** a 1.3x - 1.5x speedup (maybe 1.5x - 1.8x, or more, just a guess) going from FP32 to Int8 once tvm-int8 becomes as fast as qnnpack-int8 @rasp4?





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/33) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by Animesh Jain via TVM Discuss <no...@discuss.tvm.ai>.

For rasp3 and rasp4, we saw 1.3x - 1.5x performance speedup going from FP32 to Int8.

The link comparing QNNPACK and TVM is not upstream'd yet. If I understand correctly, it will be some time before the authors of that work are able to upstream it. There are some differences in the underlying design as well, which might cause some delays in getting to that performance.

Regarding int16, we observed that LLVM can generate good enough code with int16 instead of int8 for rasp3/4. So we uplift the datatype to int16 (the exceptions are Intel Cascade Lake and Nvidia devices). When we write a better schedule with int8 datatypes, we can remove the upcasting.





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/27) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by kindlehe via TVM Discuss <no...@discuss.tvm.ai>.

How much speedup does INT8 give compared to FP32 on rasp4? 1.5x?

I saw a speedup conclusion [here](https://github.com/tvmai/meetup-slides/tree/master/tvm-meetup-shanghai-Nov-16-2019) saying that TVM is about 1.3x (= 2.08/1.60) faster than QNNPACK for mobilenet-v2 on rasp 3b + AARCH64.

They reported a clear speedup for both mobilenet-v1 and mobilenet-v2:
![image|690x431](upload://oqRljzqKWe45ll979kPI6Z8PeOE.jpeg) 

However, you say qnnpack-int8 is better than tvm-int8 on rasp4; which conclusion is more reliable?

If QNNPACK is better, then why does TVM develop its own int8 schedules instead of using QNNPACK?





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/19) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by Animesh Jain via TVM Discuss <no...@discuss.tvm.ai>.

@kindlehe TVM might not be optimized for target 'llvm -mcpu=core-avx2'. I would suggest running it on CascadeLake. You would see major benefit.

For rasp4, if you are comparing FP32 vs Int8, yes I have seen performance improvements. However, if you compare PyTorch (backed by QNNPACK) int8 vs TVM int8, PyTorch does pretty well, especially for Mobilenet.





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/18) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by kindlehe via TVM Discuss <no...@discuss.tvm.ai>.

@anijain2305

Thanks a lot.

I thought TVM's Relay quantization was the same as a TVM model converted from a pre-quantized model.

I also tested a tvm-int8 model converted from a PyTorch QAT model; the speed is the same as the tvm-relay-quantize-int8 model.

I really have no idea how to get the 1.3x - 1.5x speedup, whether with a pre-quantized int8 model or a tvm-relay-quantize int8 model. I am eager for your kind help to reproduce the speedup on an Android ARM device.

Nice to see your effort on the quantization tutorial!

I also recommend adding more tutorials on how to get the desired int8 speedup over fp32 on the supported device platforms.

You know, many TVM users are not as experienced as the TVM authors; they may want to see more tutorials or reports on why to choose TVM quantization instead of another DL framework's quantization.

Also, more test cases for cpu-int8 would be a great help, e.g. which CPU devices are supported for int8 quantization and how to set the proper target for different devices (I have seen many users ask about TARGET usage, but it is still unclear which TARGET setting achieves the best performance for a given device).

Thanks again to all of TVM's authors and contributors.





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/44) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by Animesh Jain via TVM Discuss <no...@discuss.tvm.ai>.

I have mostly worked on pre-quantized models, so I can't comment on the performance of Relay-quantized models on ARM. There might be a few missing pieces there.

I am planning to write a tutorial by next week on how to read pre-quantized models from TFLite. You can also try @masahi's tutorial on PyTorch pre-quantized model import if you are blocked by this - https://github.com/apache/incubator-tvm/pull/5321

For eliminating "Cannot find config", you will have to tune the model on the device (a rough outline follows below) - https://tvm.apache.org/docs/tutorials/autotvm/tune_relay_arm.html#sphx-glr-tutorials-autotvm-tune-relay-arm-py
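For readers hitting the same warning, below is a rough sketch of what that tuning step looks like, condensed from the tune_relay_arm tutorial linked above; the device key, tracker address, trial count, and log file name are placeholders, and `mod`/`params` are assumed to come from the frontend import.

```python
# Hedged sketch of AutoTVM tuning to remove the "Cannot find config" fallback.
from tvm import autotvm
from tvm.autotvm.tuner import XGBTuner

target = "llvm -device=arm_cpu -target=aarch64-linux-android -mattr=+v8.2a,+dotprod"
log_file = "quantized_net.log"  # placeholder name

# Extract tunable tasks (conv2d, dense, ...) from the imported module.
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(timeout=10),
    # "android" is a placeholder RPC key registered with a running rpc_tracker.
    runner=autotvm.RPCRunner("android", host="0.0.0.0", port=9190,
                             number=5, timeout=10),
)

for task in tasks:
    tuner = XGBTuner(task, loss_type="rank")
    tuner.tune(n_trial=min(1000, len(task.config_space)),
               measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file(log_file)])

# Then compile with the tuned configs applied:
# with autotvm.apply_history_best(log_file):
#     lib = relay.build(mod, target=target, params=params)
```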





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/43) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by Animesh Jain via TVM Discuss <no...@discuss.tvm.ai>.

It is very difficult to estimate. Different people code at different paces.

I can share my experience, but I am not sure you should treat it too seriously. My first task in TVM was to use Intel VNNI instructions in the conv2d schedule, and that took me around a month. I am not sure how involved QNNPACK is. But the tutorials are better now, and we are always happy to help :)





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/41) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by masahi via TVM Discuss <no...@discuss.tvm.ai>.

[quote="anijain2305, post:4, topic:6256"]
do not see tutorial to be very different from FP32 compilation
[/quote]

Yes, for TFLite, where you can just download a pre-quantized model from their zoo, I don't think it would be different from fp32. For PyTorch it is a bit more complicated :) All the bits are already there in tests/python/frontend/pytorch/qnn_test.py, but adding a tutorial is a good idea.

[quote="anijain2305, post:4, topic:6256"]
Do you want to first have a PyTorch FP32 tutorial and then maybe we can build on top of that as part 2?
[/quote]

We already have the Torch fp32 tutorial at https://github.com/apache/incubator-tvm/blob/master/tutorials/frontend/from_pytorch.py. I'll send a tutorial for quantized models.
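Until that tutorial lands, a minimal sketch of the pre-quantized PyTorch import path (roughly what qnn_test.py does) might look like the following; the quantized `model` is assumed to come from PyTorch's post-training quantization or QAT, and the input name is arbitrary.

```python
# Hedged sketch: import an already-quantized PyTorch model through the QNN frontend.
import torch
from tvm import relay

inp = torch.rand(1, 3, 224, 224)
script_module = torch.jit.trace(model, inp).eval()  # `model` is the quantized Torch model

input_name = "input"  # arbitrary; must match the name used when setting inputs later
mod, params = relay.frontend.from_pytorch(script_module, [(input_name, (1, 3, 224, 224))])
```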





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/5) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by Animesh Jain via TVM Discuss <no...@discuss.tvm.ai>.

Thanks @kindlehe @masahi

Masa explained it correctly. For a long time, processors had higher FP32 throughput than Int8 throughput. So it is not fair to assume that quantization will give you performance benefits on every machine. Check Intel VNNI, Nvidia DP4A and tensor cores, and the ARM DOT instructions: these are the hardware vendors' efforts to make int8 throughput better than FP32 throughput.

I do not see the tutorial being very different from FP32 compilation. But I do see value in writing one just to get people started, and I can take care of that. @masahi Do you want to first have a PyTorch FP32 tutorial, and then maybe we can build on top of that as part 2?





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/4) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by masahi via TVM Discuss <no...@discuss.tvm.ai>.

No, but I think @anijain2305 has done such comparison on rasp4.





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/16) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by kindlehe via TVM Discuss <no...@discuss.tvm.ai>.

[quote="anijain2305, post:39, topic:6256, full:true"]
QNNPACK is for ARM, whereas VNNI instructions are for Intel. So, not exactly that reason. But, the underlying statement might still be the case, that we dont have good TVM schedules. 

Regarding schedules to get same speedup as QNNPACK, we can write assembly implementation in TVM schedule and match QNNPACK implementation. There will be some issues like QNNPACK quantized conv2d might be different than Relay conv2d in terms of compute etc. But, yes, it should be possible to get as good as QNNPACK (and better).

For TVM Schedules, I would suggest going through tutorials here - https://tvm.apache.org/docs/tutorials/index.html
[/quote]

@anijain2305
 
Do you think it is difficult for someone with no prior experience with TVM schedules?

How many days would it take to get the same speedup as QNNPACK if I keep diving into TVM schedules, based on your experience and estimation? Half a month, or even longer?





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/40) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by Animesh Jain via TVM Discuss <no...@discuss.tvm.ai>.

QNNPACK is for ARM, whereas the VNNI instructions are for Intel. So it is not exactly that reason. But the underlying statement might still hold: we don't have good enough TVM schedules yet.

Regarding schedules that reach the same speedup as QNNPACK, we can write an assembly implementation inside a TVM schedule and match the QNNPACK implementation. There will be some issues, e.g., QNNPACK's quantized conv2d might differ from Relay's conv2d in terms of compute, etc. But yes, it should be possible to get as good as QNNPACK (and better).

For TVM Schedules, I would suggest going through tutorials here - https://tvm.apache.org/docs/tutorials/index.html
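For readers new to the term, here is a toy sketch of what a TVM schedule is: you declare a computation once, then describe how to execute it (tiling, reordering, vectorization); real int8 conv2d schedules go further and use `tensorize()` to drop down to dot-product intrinsics or hand-written assembly. The matmul below is purely illustrative.

```python
# Hedged toy example of a TVM compute declaration plus a schedule.
import tvm
from tvm import te

N = 1024
A = te.placeholder((N, N), dtype="int8", name="A")
B = te.placeholder((N, N), dtype="int8", name="B")
k = te.reduce_axis((0, N), name="k")

# int8 x int8 -> int32 accumulation, the same pattern quantized conv2d uses.
C = te.compute(
    (N, N),
    lambda i, j: te.sum(A[i, k].astype("int32") * B[k, j].astype("int32"), axis=k),
    name="C",
)

s = te.create_schedule(C.op)
i, j = s[C].op.axis
ko = s[C].op.reduce_axis[0]
io, ii = s[C].split(i, factor=16)
jo, ji = s[C].split(j, factor=16)
s[C].reorder(io, jo, ii, ko, ji)  # tile the output and keep the reduction inner
s[C].vectorize(ji)                # vectorize the innermost output axis

func = tvm.build(s, [A, B, C], target="llvm")
```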





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/39) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by kindlehe via TVM Discuss <no...@discuss.tvm.ai>.

[quote="masahi, post:28, topic:6256, full:true"]
[quote="kindlehe, post:26, topic:6256"]
Will tvm consider integrating FBGEMM to get the same heavy lifting in the future as pytorch has done to support the same high speedup in avx2 device?
[/quote]

No. We should rather improve our avx2 schedule to match FBGEMM performance.
[/quote]


Sounds great! When do you think the TVM avx2 schedule will match FBGEMM performance?





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/29) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by masahi via TVM Discuss <no...@discuss.tvm.ai>.

[quote="kindlehe, post:26, topic:6256"]
Will tvm consider integrating FBGEMM to get the same heavy lifting in the future as pytorch has done to support the same high speedup in avx2 device?
[/quote]

No. We should rather improve our avx2 schedule to match FBGEMM performance.





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/28) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by kindlehe via TVM Discuss <no...@discuss.tvm.ai>.

[quote="masahi, post:25, topic:6256"]
https://github.com/pytorch/FBGEMM
[/quote]

Will TVM consider integrating FBGEMM in the future to get the same heavy lifting PyTorch relies on, and thereby the same high speedup on avx2 devices?





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/26) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by masahi via TVM Discuss <no...@discuss.tvm.ai>.

Yes, it is incredible. Quantized Torch uses FBGEMM (https://github.com/pytorch/FBGEMM) to do the heavy lifting; they JIT-generate asm. I have no idea how their quantized convolution is implemented. You can take a look at their code.





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/25) to respond.


[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

Posted by masahi via TVM Discuss <no...@discuss.tvm.ai>.

[quote="kindlehe, post:1, topic:6256"]
but it is very hard to find official tutorial about how to do quantization for pytorch or tf correctiy
[/quote]

Yes, this is a good point. @anijain2305 do we have a plan to send a tutorial for how to convert from pre-quantized models?





---
[Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/3) to respond.
