Posted to discuss-archive@tvm.apache.org by msh via Apache TVM Discuss <no...@discuss.tvm.ai> on 2021/08/23 20:45:20 UTC
[Apache TVM Discuss] [Questions] [AUTOSCHEDULER][CUDA] Poor performance for subtraction kernel
I am getting poor performance from the auto-scheduler (the schedule it finds is inefficient) for a simple kernel that subtracts consecutive slices along the first axis:
```
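from tvm import te

def subtract(N, H, W):  # function name is illustrative; `in` is a Python keyword, so the tensors are renamed
    image_in = te.placeholder((N, H, W), name="image_in", dtype="float32")
    subtracted = te.compute((N - 1, H, W), lambda n, y, x: image_in[n + 1, y, x] - image_in[n, y, x], name="subtracted")
    return [image_in, subtracted]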
```
Example TIR of the schedule found for one particular input size (N=6, H=2160, W=3840):
```
primfn(image_in_1: handle, subtracted_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {subtracted: Buffer(subtracted_2: Pointer(float32), float32, [5, 2160, 3840], []),
image_in: Buffer(image_in_2: Pointer(float32), float32, [6, 2160, 3840], [])}
buffer_map = {image_in_1: image_in, subtracted_1: subtracted} {
attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 648000;
attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
subtracted_2[((blockIdx.x*64) + threadIdx.x)] = ((float32*)image_in_2[(((blockIdx.x*64) + threadIdx.x) + 8294400)] - (float32*)image_in_2[((blockIdx.x*64) + threadIdx.x)])
}
```
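The constants in the TIR above follow directly from the tensor shapes. Here is a minimal sketch in plain Python (no TVM; the helper names are made up for illustration) that checks the arithmetic and confirms, on a tiny shape, that the flattened indexing matches the 3-D definition:

```python
# Shapes and thread extent taken from the TIR dump above.
N, H, W = 6, 2160, 3840
THREADS_PER_BLOCK = 64  # threadIdx.x extent

plane = H * W                    # elements in one (H, W) slice
total_out = (N - 1) * plane      # elements in the output tensor
num_blocks = total_out // THREADS_PER_BLOCK


def subtract_3d(data, n, h, w):
    """Reference: out[i, y, x] = in[i + 1, y, x] - in[i, y, x], data stored as a flat list."""
    def idx(i, y, x):
        return (i * h + y) * w + x
    return [data[idx(i + 1, y, x)] - data[idx(i, y, x)]
            for i in range(n - 1) for y in range(h) for x in range(w)]


def subtract_flat(data, n, h, w):
    """Flattened form, as in the generated kernel: out[i] = in[i + h*w] - in[i]."""
    offset = h * w
    return [data[i + offset] - data[i] for i in range((n - 1) * h * w)]


small = [float((7 * i) % 13) for i in range(3 * 4 * 5)]  # arbitrary test data
flat_matches_3d = subtract_flat(small, 3, 4, 5) == subtract_3d(small, 3, 4, 5)

print(plane, num_blocks, total_out % THREADS_PER_BLOCK, flat_matches_3d)
```

So the offset 8294400 is one full H x W slice, and the 648000 blocks of 64 threads cover the output exactly, with each thread computing a single output element.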
The CUDA kernel generated for this schedule runs in 1250 us on my GPU, while a simple handwritten kernel runs in 950 us. Is that expected, or can the tuning options or the way the operator is expressed be changed to get better performance?
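Each output element needs two loads, one subtraction and one store, so the kernel is likely limited by memory bandwidth rather than arithmetic. One way to put the two timings in context is to convert them to effective bandwidth and compare against the GPU's peak. This is a rough sketch, assuming float32 data and two traffic models that are assumptions rather than measurements: every load goes to DRAM, or each input element is fetched from DRAM only once.

```python
BYTES_PER_ELEM = 4          # float32, as in the TIR dump
N, H, W = 6, 2160, 3840     # input shape from the TIR dump


def effective_bandwidth_gbs(time_us, reads_per_output):
    """Effective bandwidth in GB/s for a given kernel time in microseconds.

    reads_per_output = 2 counts every load (no cache reuse);
    reads_per_output = N / (N - 1) models each input element being read once.
    One store per output element is always counted.
    """
    outputs = (N - 1) * H * W
    bytes_moved = outputs * (reads_per_output + 1) * BYTES_PER_ELEM
    return bytes_moved / (time_us * 1e-6) / 1e9


for label, time_us in (("auto-scheduler", 1250), ("handwritten", 950)):
    min_traffic = effective_bandwidth_gbs(time_us, N / (N - 1))
    max_traffic = effective_bandwidth_gbs(time_us, 2)
    print(f"{label}: {min_traffic:.0f} to {max_traffic:.0f} GB/s")
```

If the handwritten kernel's figure is close to the card's peak bandwidth and the auto-scheduled one is not, the gap is more likely in how memory accesses are issued (for example, elements per thread or vectorized loads) than in the computation itself.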
TVM commit: 10fca9c
---
[Visit Topic](https://discuss.tvm.apache.org/t/autoscheduler-cuda-poor-perfomance-for-subtraction-kernel/10869/1) to respond.