Posted to discuss-archive@tvm.apache.org by Le Xu via Apache TVM Discuss <no...@discuss.tvm.ai> on 2020/10/05 05:34:49 UTC

[Apache TVM Discuss] [Questions] Matrix multiplication example for Cuda


Hi! I have been studying how TVM works, and I tried out this tutorial example from the website: https://github.com/apache/incubator-tvm/blob/master/tutorials/autotvm/tune_simple_template.py. Running the example with the cuda (or OpenCL) target produces errors like:

> Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('tutorial/matmul', 512, 512, 512, 'float32'). A fallback configuration is used, which may bring great performance regression.
> Traceback (most recent call last):
>   File "tune_simple_template.py", line 321, in <module>
>     func = tvm.build(s, arg_bufs)
>   File "/root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/driver/build_module.py", line 413, in build
>     mod_host, mdev = _build_for_device(input_mod, tar, target_host)
>   File "/root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/driver/build_module.py", line 255, in _build_for_device
>     mod_mixed = tvm.transform.Sequential(opt_mixed)(mod_mixed)
>   File "/root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/ir/transform.py", line 127, in __call__
>     return _ffi_transform_api.RunPass(self, mod)
>   File "tvm/_ffi/_cython/./packed_func.pxi", line 321, in tvm._ffi._cy3.core.PackedFuncBase.__call__
>   File "tvm/_ffi/_cython/./packed_func.pxi", line 256, in tvm._ffi._cy3.core.FuncCall
>   File "tvm/_ffi/_cython/./packed_func.pxi", line 245, in tvm._ffi._cy3.core.FuncCall3
>   File "tvm/_ffi/_cython/./base.pxi", line 160, in tvm._ffi._cy3.core.CALL
> tvm._ffi.base.TVMError: Traceback (most recent call last):
>   [bt] (5) /root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(TVMFuncCall+0x65) [0x7f0f613a6035]
>   [bt] (4) /root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(+0x6d4af6) [0x7f0f6097caf6]
>   [bt] (3) /root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x2c8) [0x7f0f6097b8f8]
>   [bt] (2) /root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(tvm::transform::ModulePassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x12f) [0x7f0f6097c5af]
>   [bt] (1) /root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(+0x8c352d) [0x7f0f60b6b52d]
>   [bt] (0) /root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(+0x8c00a2) [0x7f0f60b680a2]
>   Did you forget to bind?
>     Variable `B` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
>     Variable `A` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
>     Variable `C` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
>     Variable `C` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
>     Variable `C` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
>   File "/local/incubator-tvm/src/tir/analysis/verify_memory.cc", line 202
> RuntimeError: Memory verification failed with the following errors:
> PrimFunc([A, B, C]) attrs={"global_symbol": "default_function", "tir.noalias": (bool)1, "target": cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32} {
>   for (i.outer, 0, 512) {
>     for (j.outer, 0, 512) {
>       C[((i.outer*512) + j.outer)] = 0f
>       for (k, 0, 512) {
>         C[((i.outer*512) + j.outer)] = (C[((i.outer*512) + j.outer)] + (A[((i.outer*512) + k)]*B[((k*512) + j.outer)]))
>       }
>     }
>   }
> }


Is there any quick fix I can apply to demonstrate GEMM optimization on GPUs? Any pointers are appreciated!





---
[Visit Topic](https://discuss.tvm.apache.org/t/matrix-multiplication-example-for-cuda/8078/1) to respond.


[Apache TVM Discuss] [Questions] Matrix multiplication example for Cuda

Posted by Tristan Konolige via Apache TVM Discuss <no...@discuss.tvm.ai>.

Kernels running on the GPU require all memory accesses to be within a thread or a block. The file you are looking at does not do any thread binding. I suggest looking at this tutorial: https://tvm.apache.org/docs/tutorials/optimize/opt_conv_cuda.html
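
As a quick fix, binding the output axes of the matmul to CUDA blocks and threads is usually enough to make the schedule compile. Here is a minimal sketch using the TE API; the split factors and variable names are illustrative, not taken from the tutorial's template:

```python
import tvm
from tvm import te

# Illustrative GEMM definition matching the workload in the error message.
N = 512
A = te.placeholder((N, N), name="A")
B = te.placeholder((N, N), name="B")
k = te.reduce_axis((0, N), name="k")
C = te.compute((N, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

s = te.create_schedule(C.op)
# Split the two output axes and bind them to CUDA blocks/threads so that
# every access to A, B, and C happens inside a thread environment.
bx, tx = s[C].split(C.op.axis[0], factor=16)
by, ty = s[C].split(C.op.axis[1], factor=16)
s[C].bind(bx, te.thread_axis("blockIdx.x"))
s[C].bind(by, te.thread_axis("blockIdx.y"))
s[C].bind(tx, te.thread_axis("threadIdx.x"))
s[C].bind(ty, te.thread_axis("threadIdx.y"))

func = tvm.build(s, [A, B, C], target="cuda")
print(func.imported_modules[0].get_source())  # inspect the generated CUDA kernel
```

With every load and store of A, B, and C nested under blockIdx/threadIdx, the VerifyMemory pass should no longer report host-side accesses. This is only a starting point; the conv tutorial above shows how to add shared-memory caching and tiling for real performance.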





---
[Visit Topic](https://discuss.tvm.apache.org/t/matrix-multiplication-example-for-cuda/8078/2) to respond.
