You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2022/12/28 09:56:52 UTC
[GitHub] [tvm] ekalda opened a new pull request, #13669: [TOPI][bugfix] Fix a bug in arm_cpu int8 dotprod schedule and modernize tests
ekalda opened a new pull request, #13669:
URL: https://github.com/apache/tvm/pull/13669
topi.arm_cpu.schedule_conv2d_NHWC_quantized_native was failing compilation in case the input channels divided by 4 was less than 4.
This was because we were splitting this axis by a factor of 4 to create appropriate loop nest for tensorize, but then tensorize was assuming that the outer axis bound was divisible by 4.
If the outer bound was less than 4, compilation failed, if it was greater than 4 but not divisible by 4, we were occasionally accessing data outside of tensor, which luckily was padded due to alignment (I think).
So here we make sure that we explicitly pad the input axis such that the outer loop will always be divisible by 4.
There are also some refactors to test_topi_conv2d_int8.py:
- decouple the tests using pytest.parametrize
- extend the NHWC int8 schedules test to test against arm targets and various schedules. When these schedules were initialy added, we didn't have Arm CI, so only compilation was tested, now we can also run the workloads on Arm targets.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [tvm] Mousius commented on pull request #13669: [TOPI][bugfix] Fix a bug in arm_cpu int8 dotprod schedule and modernize tests
Posted by GitBox <gi...@apache.org>.
Mousius commented on PR #13669:
URL: https://github.com/apache/tvm/pull/13669#issuecomment-1367515561
LGTM @ekalda, thanks for making great strides improving the tests here 😸 I'll leave it open a little longer but otherwise I think this is good to go
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [tvm] ekalda commented on a diff in pull request #13669: [TOPI][bugfix] Fix a bug in arm_cpu int8 dotprod schedule and modernize tests
Posted by GitBox <gi...@apache.org>.
ekalda commented on code in PR #13669:
URL: https://github.com/apache/tvm/pull/13669#discussion_r1058903606
##########
python/tvm/topi/nn/conv2d.py:
##########
@@ -606,8 +606,8 @@ def conv2d_gemm_weight_transform(kernel, tile_rows, tile_cols):
if N % tile_rows != 0:
pad_N = tile_rows - (N % tile_rows)
- if K % tile_cols != 0:
- pad_K = tile_cols - (K % tile_cols)
+ if K % (tile_cols * 4) != 0:
+ pad_K = (tile_cols * 4) - (K % (tile_cols * 4))
Review Comment:
Done
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [tvm] ekalda commented on a diff in pull request #13669: [TOPI][bugfix] Fix a bug in arm_cpu int8 dotprod schedule and modernize tests
Posted by GitBox <gi...@apache.org>.
ekalda commented on code in PR #13669:
URL: https://github.com/apache/tvm/pull/13669#discussion_r1058903893
##########
tests/python/topi/python/test_topi_conv2d_int8.py:
##########
@@ -35,261 +35,138 @@
import platform
-def compile_conv2d_NHWC_gemm_int8_arm(
- batch,
- in_channel,
- in_size,
- num_filter,
- kernel,
- stride,
- padding,
- dilation=1,
- add_bias=False,
- add_relu=False,
-):
- pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
- padding_sum = pad_top + pad_left + pad_bottom + pad_right
- print(
- "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
- )
-
- in_height = in_width = in_size
- A = te.placeholder((batch, in_height, in_width, in_channel), name="A", dtype="int8")
- W = te.placeholder((kernel, kernel, in_channel, num_filter), name="W", dtype="int8")
- bias = te.placeholder((num_filter,), name="bias", dtype="int8")
- dtype = "int32"
- devices = [
- (
- "llvm --device arm_cpu --mtriple aarch64-linux-gnu",
- topi.arm_cpu.compute_conv2d_NHWC_quantized_interleaved,
- topi.arm_cpu.schedule_conv2d_NHWC_quantized_interleaved,
- ),
- (
- "llvm --device arm_cpu --mtriple aarch64-linux-gnu -mattr=+v8.2a,+dotprod",
- topi.arm_cpu.compute_conv2d_NHWC_quantized_interleaved,
- topi.arm_cpu.schedule_conv2d_NHWC_quantized_interleaved,
- ),
- (
- "llvm --device arm_cpu --mtriple aarch64-linux-gnu -mattr=+v8.2a,+dotprod",
- topi.arm_cpu.compute_conv2d_NHWC_quantized_native,
- topi.arm_cpu.schedule_conv2d_NHWC_quantized_native,
- ),
- # TODO(giuseros) Need LLVM-11 in order to compile with +i8mm extension
- # (
- # "llvm --device arm_cpu --mtriple aarch64-linux-gnu -mattr=+v8.2a,+i8mm",
- # topi.arm_cpu.compute_conv2d_NHWC_quantized_interleaved,
- # topi.arm_cpu.schedule_conv2d_NHWC_quantized_interleaved,
- # ),
- ]
-
- for device_tuple in devices:
- target = device_tuple[0]
- compute = device_tuple[1]
- schedule = device_tuple[2]
-
- dev = tvm.device(target, 0)
- if not tvm.testing.device_enabled(target):
- print("Skip because %s is not enabled" % target)
- return
- print("Compiling on arm AArch64 target: %s" % target)
- with tvm.target.Target(target) as tvm_target:
- assert tvm_target.features.is_aarch64, "AArch64 target not recognized"
+devices = [
+ (
+ "llvm",
+ topi.arm_cpu.compute_conv2d_NHWC_quantized_interleaved,
+ topi.arm_cpu.schedule_conv2d_NHWC_quantized_interleaved,
+ ),
+ (
+ "llvm --device arm_cpu --mtriple aarch64-linux-gnu",
+ topi.arm_cpu.compute_conv2d_NHWC_quantized_interleaved,
+ topi.arm_cpu.schedule_conv2d_NHWC_quantized_interleaved,
+ ),
+ (
+ "llvm --device arm_cpu --mtriple aarch64-linux-gnu -mattr=+v8.2a,+dotprod",
+ topi.arm_cpu.compute_conv2d_NHWC_quantized_interleaved,
+ topi.arm_cpu.schedule_conv2d_NHWC_quantized_interleaved,
+ ),
+ (
+ "llvm --device arm_cpu --mtriple aarch64-linux-gnu -mattr=+v8.2a,+dotprod",
+ topi.arm_cpu.compute_conv2d_NHWC_quantized_native,
+ topi.arm_cpu.schedule_conv2d_NHWC_quantized_native,
+ ),
+ # TODO(giuseros) We need LLVM-11 in order to compile with +i8mm extension
+ # (
+ # "llvm --device arm_cpu --mtriple aarch64-linux-gnu -mattr=+v8.2a,+i8mm",
+ # topi.arm_cpu.compute_conv2d_NHWC_quantized_interleaved,
+ # topi.arm_cpu.schedule_conv2d_NHWC_quantized_interleaved,
+ # ),
+]
+
+
+@tvm.testing.requires_llvm
+@pytest.mark.parametrize("device", devices)
+@pytest.mark.parametrize(
+ "params",
+ [
+ # Subset of inception v3 expanded (dilation > 1, batch > 1, 'VALID' padding)
+ (1, 3, 299, 32, 3, 2, "SAME", 1, False, False),
+ (1, 32, 149, 32, 3, 1, "SAME", 2, False, False),
+ (4, 32, 147, 64, 3, 1, "SAME", 1, False, False),
+ (1, 64, 73, 80, 1, 1, "SAME", 1, False, False),
+ (1, 80, 73, 192, 3, 1, "SAME", 1, False, False),
+ (1, 192, 35, 48, 1, 1, "SAME", 1, False, False),
+ (1, 192, 35, 64, 1, 1, "VALID", 1, False, False),
+ (1, 192, 35, 32, 1, 1, "SAME", 1, False, False),
+ (1, 48, 35, 64, 5, 1, "SAME", 1, False, False),
+ (1, 96, 35, 96, 3, 1, "SAME", 1, False, False),
+ (1, 256, 35, 48, 1, 1, "SAME", 1, False, False),
+ (1, 256, 35, 64, 1, 1, "SAME", 1, False, False),
+ (1, 288, 35, 64, 1, 1, "SAME", 1, False, False),
+ (1, 288, 35, 48, 1, 1, "SAME", 1, False, False),
+ (1, 96, 35, 96, 3, 2, "SAME", 1, False, False),
+ (1, 128, 17, 192, 7, 1, "SAME", 2, False, False),
+ (1, 160, 17, 160, 7, 1, "SAME", 1, False, False),
+ (1, 160, 17, 192, 1, 1, "VALID", 1, False, False),
+ (1, 192, 17, 192, 1, 1, "SAME", 1, False, False),
+ (1, 768, 5, 128, 1, 1, "SAME", 1, False, False),
+ (1, 192, 17, 320, 3, 2, "SAME", 1, False, False),
+ (1, 192, 17, 192, 3, 2, "SAME", 1, False, False),
+ (1, 1280, 8, 192, 1, 1, "SAME", 1, False, False),
+ (1, 1280, 8, 384, 1, 1, "SAME", 1, False, False),
+ (1, 1280, 8, 320, 1, 1, "SAME", 1, False, False),
+ (1, 1280, 8, 448, 1, 1, "SAME", 1, False, False),
+ (1, 384, 8, 384, 1, 1, "SAME", 1, False, False),
+ (1, 384, 8, 384, 3, 1, "SAME", 1, False, False),
+ (1, 448, 8, 384, 3, 1, "VALID", 1, False, False),
+ (1, 2048, 8, 320, 1, 1, "SAME", 1, False, False),
+ (1, 2048, 8, 448, 1, 1, "SAME", 1, True, True),
+ (1, 2048, 8, 192, 1, 1, "SAME", 1, True, False),
+ # A trouble case for native schedule
+ (1, 8, 1, 24, 1, 1, "SAME", 1, False, False),
+ ],
+)
+def test_conv2d_NHWC_gemm_int8(params, device):
- C = compute(A, W, (stride, stride), padding, (dilation, dilation), dtype)
- if add_bias:
- C = topi.add(C, bias)
- if add_relu:
- C = topi.nn.relu(C)
- s = schedule([C])
+ with Int8Fallback():
+ target, compute, schedule = device
- if add_bias:
- tvm.build(
- s,
- [A, W, bias, C],
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
- func = tvm.build(
- s,
- [A, W, bias, C],
- target,
- name="relu_%dnnn_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
- else:
- func = tvm.build(
- s,
- [A, W, C],
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
+ (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding,
+ dilation,
+ add_bias,
+ add_relu,
+ ) = params
+
+ # TODO(ekalda): These combinations hang during compilation
+ failing_cases = [
+ (devices[1], (1, 128, 17, 192, 7, 1, "SAME", 2, False, False)),
+ (devices[1], (1, 160, 17, 160, 7, 1, "SAME", 1, False, False)),
+ (
+ devices[1],
+ (1, 448, 8, 384, 3, 1, "VALID", 1, False, False),
+ ), # this one passes but is just incredibly slow
+ ]
+ if (device, params) in failing_cases:
+ return
Review Comment:
I added `pytest.skip` for these cases so that they will show up as skipped tests in pytest log
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [tvm] ekalda commented on a diff in pull request #13669: [TOPI][bugfix] Fix a bug in arm_cpu int8 dotprod schedule and modernize tests
Posted by GitBox <gi...@apache.org>.
ekalda commented on code in PR #13669:
URL: https://github.com/apache/tvm/pull/13669#discussion_r1058904002
##########
tests/python/topi/python/test_topi_conv2d_int8.py:
##########
@@ -298,378 +175,462 @@ def get_ref_data():
a_np, w_np, b_np, c_np = get_ref_data()
- with tvm.target.Target(target):
- C = compute(
- A,
- W,
- (stride, stride),
- padding,
- (dilation, dilation),
- "NCHW",
- "NCHW",
- out_dtype,
- )
+ dev = tvm.device(target, 0)
+ with tvm.target.Target(target) as tvm_target:
+ C = compute(A, W, (stride, stride), padding, (dilation, dilation), dtype)
if add_bias:
C = topi.add(C, bias)
if add_relu:
C = topi.nn.relu(C)
s = schedule([C])
- a = tvm.nd.array(a_np.astype(dtype), dev)
- w = tvm.nd.array(w_np.astype(dtype), dev)
- b = tvm.nd.array(b_np.astype(out_dtype), dev)
- c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
-
- if add_bias:
- compile_args = [A, W, bias, C]
- run_args = [a, w, b, c]
- else:
- compile_args = [A, W, C]
- run_args = [a, w, c]
-
- func = tvm.build(
- s,
- compile_args,
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
+ a = tvm.nd.array(a_np, dev)
+ w = tvm.nd.array(w_np, dev)
+ b = tvm.nd.array(b_np, dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- if build_only:
- return
+ build_inputs = [A, W, bias, C] if add_bias else [A, W, C]
+ inference_inputs = (a, w, b, c) if add_bias else (a, w, c)
+
+ func = tvm.build(
+ s,
+ build_inputs,
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
- print("Running on target: %s" % target)
+ build_only = tvm_target.features.is_aarch64 and (platform.machine() != "aarch64")
- func(*run_args)
+ if not build_only:
+ print("Running on target: %s" % target)
+ func(*inference_inputs)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- targets = [
- (
- "cuda",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- # Disable on CI since it does not support spirv int8 dot product
- # (
- # "vulkan -from_device=0",
- # lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- # topi.cuda.schedule_conv2d_NCHWc_int8,
- # 4,
- # False,
- # ),
- ]
-
- build_only_aarch64 = platform.machine() != "aarch64"
-
- targets.append(
+@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
+@pytest.mark.parametrize(
+ "params",
+ [
+ # ResNet18 workloads where channels in / out are multiple of oc_block_factor
+ (1, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 56, 64, 1, 1, 0, 1, False, False),
+ (1, 64, 56, 128, 3, 2, 1, 1, False, False),
+ (1, 64, 56, 128, 1, 2, 0, 1, False, False),
+ (1, 128, 28, 128, 3, 1, 1, 1, False, False),
+ (1, 128, 28, 256, 3, 2, 1, 1, False, False),
+ (1, 128, 28, 256, 1, 2, 0, 1, False, False),
+ (1, 256, 14, 256, 3, 1, 1, 1, False, False),
+ (1, 256, 14, 512, 3, 2, 1, 1, False, False),
+ (1, 256, 14, 512, 1, 2, 0, 1, False, False),
+ (1, 512, 7, 512, 3, 1, 1, 1, False, False),
+ # bias, relu
+ (1, 64, 56, 64, 3, 1, 1, 1, False, True),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, False),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, True),
+ # dilation = 2
+ (1, 64, 56, 64, 3, 1, 1, 2, False, False),
+ # batch size
+ (4, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (9, 64, 56, 64, 3, 1, 1, 1, False, False),
+ # weird workloads
+ (4, 4, 4, 8, 4, 4, 4, 1, False, False),
+ # inception v3 workloads where channels in / out are multiple of oc_block_factor
+ (1, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (1, 32, 147, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 73, 80, 1, 1, 0, 1, False, False),
+ (1, 80, 73, 192, 3, 1, 0, 1, False, False),
+ (1, 192, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 192, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 48, 35, 64, 5, 1, 2, 1, False, False),
+ (1, 64, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 96, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 192, 35, 32, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 384, 3, 2, 0, 1, False, False),
+ (1, 96, 35, 96, 3, 2, 0, 1, False, False),
+ (1, 768, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 128, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 160, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 192, 17, 320, 3, 2, 0, 1, False, False),
+ (1, 192, 17, 192, 3, 2, 0, 1, False, False),
+ (1, 1280, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 1280, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 448, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 1024, 19, 88, 3, 1, 1, 1, False, False),
+ # batch > 1
+ (7, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (8, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (32, 32, 149, 32, 3, 1, 0, 1, False, False),
+ # Asymmetric padding
+ (1, 32, 35, 64, 7, 2, (0, 0, 1, 1), 1, False, False),
+ (1, 64, 8, 128, 3, 1, (3, 3, 2, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, (1, 2, 2, 1), 1, False, False),
+ (1, 64, 17, 192, 1, 1, (1, 2), 1, False, False),
+ (1, 64, 8, 64, 3, 1, (3, 1), 1, False, False),
+ (1, 128, 8, 384, 3, 1, (0, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, "VALID", 1, False, False),
+ (1, 392, 8, 64, 3, 1, "VALID", 1, False, False),
+ (1, 512, 19, 64, 1, 1, "SAME", 1, False, False),
+ (1, 64, 16, 32, 2, 1, "SAME", 1, False, False),
+ (1, 64, 8, 64, 3, 1, (1, 2, 2, 1), 1, False, True),
+ (1, 64, 8, 64, 5, 2, (1, 3), 1, True, False),
+ (1, 64, 56, 64, 3, 1, "VALID", 1, True, True),
+ (1, 64, 56, 64, 24, 1, "SAME", 1, True, True),
+ ],
+)
+def test_conv2d_NCHWc_int8(in_dtype, params):
+ with Int8Fallback():
(
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon,+v8.2a,+dotprod",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding,
+ dilation,
+ add_bias,
+ add_relu,
+ ) = params
+ pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
+ padding_sum = pad_top + pad_left + pad_bottom + pad_right
+ print(
+ "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
+ % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
)
- )
-
- if in_dtype == "int8":
- targets += [
- (
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
- ),
- (
- "rocm -mattr=+dotprod",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- ]
-
- for target, compute, schedule, oc_block_factor, build_only in targets:
- check_target(target, compute, schedule, oc_block_factor, build_only)
-
-
-def verify_conv2d_nchw_int8(
- in_dtype,
- batch,
- in_channel,
- in_size,
- num_filter,
- kernel,
- stride,
- padding,
- dilation=1,
- add_bias=False,
- add_relu=False,
-):
- pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
- padding_sum = pad_top + pad_left + pad_bottom + pad_right
- print(
- "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
- )
-
- in_height = in_width = in_size
-
- A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
- W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
- bias = te.placeholder((num_filter, 1, 1), name="bias", dtype=in_dtype)
-
- a_shape = get_const_tuple(A.shape)
- w_shape = get_const_tuple(W.shape)
- bias_shape = get_const_tuple(bias.shape)
- dtype = A.dtype
-
- @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
- def get_ref_data():
- a_np = np.random.randint(low=-128, high=127, size=a_shape).astype(dtype)
- w_np = np.random.randint(low=-128, high=128, size=w_shape).astype(dtype)
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
- c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(dtype)
-
- if add_bias:
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- c_np += b_np
- if add_relu:
- c_np = np.maximum(c_np, 0)
-
- return a_np, w_np, b_np, c_np
-
- a_np, w_np, b_np, c_np = get_ref_data()
-
- def verify_workload_padding():
- _, _, out_height, out_width = get_const_tuple(c_np.shape)
- wkl = _get_workload(A, W, (stride, stride), padding, dilation, dtype)
-
- # for testing functionality,
- # we choose arbitrary int32_lanes and num_int8_elements can divide the channel,
- # regardless of the performance.
- int32_lanes, num_int8_elements = num_filter, in_channel
- # check if tile_ow candidates are the factors of the right output weight.
- cfg = autotvm.get_config()
- fallback_schedule_cpu_common_int8(cfg, wkl, int32_lanes, num_int8_elements)
- ow_tile = np.prod(cfg["tile_ow"].size)
-
- tvm.testing.assert_allclose(ow_tile, out_width)
+ in_height = in_width = in_size
+
+ A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
+ W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
+
+ a_shape = get_const_tuple(A.shape)
+ w_shape = get_const_tuple(W.shape)
+ dtype = A.dtype
+ out_dtype = "int32" if in_dtype == "int8" else "uint32"
+ lo = -128 if in_dtype == "int8" else 0
+ hi = 127 if in_dtype == "int8" else 255
+
+ def check_target(target, compute, schedule, oc_block_factor, build_only):
+ dev = tvm.device(target, 0)
+ if not tvm.testing.device_enabled(target):
+ print("Skip because %s is not enabled" % target)
+ return
Review Comment:
Done
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [tvm] Mousius commented on a diff in pull request #13669: [TOPI][bugfix] Fix a bug in arm_cpu int8 dotprod schedule and modernize tests
Posted by GitBox <gi...@apache.org>.
Mousius commented on code in PR #13669:
URL: https://github.com/apache/tvm/pull/13669#discussion_r1058354955
##########
python/tvm/topi/nn/conv2d.py:
##########
@@ -606,8 +606,8 @@ def conv2d_gemm_weight_transform(kernel, tile_rows, tile_cols):
if N % tile_rows != 0:
pad_N = tile_rows - (N % tile_rows)
- if K % tile_cols != 0:
- pad_K = tile_cols - (K % tile_cols)
+ if K % (tile_cols * 4) != 0:
+ pad_K = (tile_cols * 4) - (K % (tile_cols * 4))
Review Comment:
We could maybe spell this out in code to make it clearer for people coming to it later 😸
I think this is roughly what's happening here?
```suggestion
tile_size = 4
untiled_cols = tile_cols * tile_size
misaligned_K = K % untiled_cols
if misaligned_K != 0:
pad_K = untiled_cols - misaligned_K
```
##########
tests/python/topi/python/test_topi_conv2d_int8.py:
##########
@@ -35,261 +35,138 @@
import platform
-def compile_conv2d_NHWC_gemm_int8_arm(
- batch,
- in_channel,
- in_size,
- num_filter,
- kernel,
- stride,
- padding,
- dilation=1,
- add_bias=False,
- add_relu=False,
-):
- pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
- padding_sum = pad_top + pad_left + pad_bottom + pad_right
- print(
- "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
- )
-
- in_height = in_width = in_size
- A = te.placeholder((batch, in_height, in_width, in_channel), name="A", dtype="int8")
- W = te.placeholder((kernel, kernel, in_channel, num_filter), name="W", dtype="int8")
- bias = te.placeholder((num_filter,), name="bias", dtype="int8")
- dtype = "int32"
- devices = [
- (
- "llvm --device arm_cpu --mtriple aarch64-linux-gnu",
- topi.arm_cpu.compute_conv2d_NHWC_quantized_interleaved,
- topi.arm_cpu.schedule_conv2d_NHWC_quantized_interleaved,
- ),
- (
- "llvm --device arm_cpu --mtriple aarch64-linux-gnu -mattr=+v8.2a,+dotprod",
- topi.arm_cpu.compute_conv2d_NHWC_quantized_interleaved,
- topi.arm_cpu.schedule_conv2d_NHWC_quantized_interleaved,
- ),
- (
- "llvm --device arm_cpu --mtriple aarch64-linux-gnu -mattr=+v8.2a,+dotprod",
- topi.arm_cpu.compute_conv2d_NHWC_quantized_native,
- topi.arm_cpu.schedule_conv2d_NHWC_quantized_native,
- ),
- # TODO(giuseros) Need LLVM-11 in order to compile with +i8mm extension
- # (
- # "llvm --device arm_cpu --mtriple aarch64-linux-gnu -mattr=+v8.2a,+i8mm",
- # topi.arm_cpu.compute_conv2d_NHWC_quantized_interleaved,
- # topi.arm_cpu.schedule_conv2d_NHWC_quantized_interleaved,
- # ),
- ]
-
- for device_tuple in devices:
- target = device_tuple[0]
- compute = device_tuple[1]
- schedule = device_tuple[2]
-
- dev = tvm.device(target, 0)
- if not tvm.testing.device_enabled(target):
- print("Skip because %s is not enabled" % target)
- return
- print("Compiling on arm AArch64 target: %s" % target)
- with tvm.target.Target(target) as tvm_target:
- assert tvm_target.features.is_aarch64, "AArch64 target not recognized"
+devices = [
+ (
+ "llvm",
+ topi.arm_cpu.compute_conv2d_NHWC_quantized_interleaved,
+ topi.arm_cpu.schedule_conv2d_NHWC_quantized_interleaved,
+ ),
+ (
+ "llvm --device arm_cpu --mtriple aarch64-linux-gnu",
+ topi.arm_cpu.compute_conv2d_NHWC_quantized_interleaved,
+ topi.arm_cpu.schedule_conv2d_NHWC_quantized_interleaved,
+ ),
+ (
+ "llvm --device arm_cpu --mtriple aarch64-linux-gnu -mattr=+v8.2a,+dotprod",
+ topi.arm_cpu.compute_conv2d_NHWC_quantized_interleaved,
+ topi.arm_cpu.schedule_conv2d_NHWC_quantized_interleaved,
+ ),
+ (
+ "llvm --device arm_cpu --mtriple aarch64-linux-gnu -mattr=+v8.2a,+dotprod",
+ topi.arm_cpu.compute_conv2d_NHWC_quantized_native,
+ topi.arm_cpu.schedule_conv2d_NHWC_quantized_native,
+ ),
+ # TODO(giuseros) We need LLVM-11 in order to compile with +i8mm extension
+ # (
+ # "llvm --device arm_cpu --mtriple aarch64-linux-gnu -mattr=+v8.2a,+i8mm",
+ # topi.arm_cpu.compute_conv2d_NHWC_quantized_interleaved,
+ # topi.arm_cpu.schedule_conv2d_NHWC_quantized_interleaved,
+ # ),
+]
+
+
+@tvm.testing.requires_llvm
+@pytest.mark.parametrize("device", devices)
+@pytest.mark.parametrize(
+ "params",
+ [
+ # Subset of inception v3 expanded (dilation > 1, batch > 1, 'VALID' padding)
+ (1, 3, 299, 32, 3, 2, "SAME", 1, False, False),
+ (1, 32, 149, 32, 3, 1, "SAME", 2, False, False),
+ (4, 32, 147, 64, 3, 1, "SAME", 1, False, False),
+ (1, 64, 73, 80, 1, 1, "SAME", 1, False, False),
+ (1, 80, 73, 192, 3, 1, "SAME", 1, False, False),
+ (1, 192, 35, 48, 1, 1, "SAME", 1, False, False),
+ (1, 192, 35, 64, 1, 1, "VALID", 1, False, False),
+ (1, 192, 35, 32, 1, 1, "SAME", 1, False, False),
+ (1, 48, 35, 64, 5, 1, "SAME", 1, False, False),
+ (1, 96, 35, 96, 3, 1, "SAME", 1, False, False),
+ (1, 256, 35, 48, 1, 1, "SAME", 1, False, False),
+ (1, 256, 35, 64, 1, 1, "SAME", 1, False, False),
+ (1, 288, 35, 64, 1, 1, "SAME", 1, False, False),
+ (1, 288, 35, 48, 1, 1, "SAME", 1, False, False),
+ (1, 96, 35, 96, 3, 2, "SAME", 1, False, False),
+ (1, 128, 17, 192, 7, 1, "SAME", 2, False, False),
+ (1, 160, 17, 160, 7, 1, "SAME", 1, False, False),
+ (1, 160, 17, 192, 1, 1, "VALID", 1, False, False),
+ (1, 192, 17, 192, 1, 1, "SAME", 1, False, False),
+ (1, 768, 5, 128, 1, 1, "SAME", 1, False, False),
+ (1, 192, 17, 320, 3, 2, "SAME", 1, False, False),
+ (1, 192, 17, 192, 3, 2, "SAME", 1, False, False),
+ (1, 1280, 8, 192, 1, 1, "SAME", 1, False, False),
+ (1, 1280, 8, 384, 1, 1, "SAME", 1, False, False),
+ (1, 1280, 8, 320, 1, 1, "SAME", 1, False, False),
+ (1, 1280, 8, 448, 1, 1, "SAME", 1, False, False),
+ (1, 384, 8, 384, 1, 1, "SAME", 1, False, False),
+ (1, 384, 8, 384, 3, 1, "SAME", 1, False, False),
+ (1, 448, 8, 384, 3, 1, "VALID", 1, False, False),
+ (1, 2048, 8, 320, 1, 1, "SAME", 1, False, False),
+ (1, 2048, 8, 448, 1, 1, "SAME", 1, True, True),
+ (1, 2048, 8, 192, 1, 1, "SAME", 1, True, False),
+ # A trouble case for native schedule
+ (1, 8, 1, 24, 1, 1, "SAME", 1, False, False),
+ ],
+)
+def test_conv2d_NHWC_gemm_int8(params, device):
- C = compute(A, W, (stride, stride), padding, (dilation, dilation), dtype)
- if add_bias:
- C = topi.add(C, bias)
- if add_relu:
- C = topi.nn.relu(C)
- s = schedule([C])
+ with Int8Fallback():
+ target, compute, schedule = device
- if add_bias:
- tvm.build(
- s,
- [A, W, bias, C],
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
- func = tvm.build(
- s,
- [A, W, bias, C],
- target,
- name="relu_%dnnn_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
- else:
- func = tvm.build(
- s,
- [A, W, C],
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
+ (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding,
+ dilation,
+ add_bias,
+ add_relu,
+ ) = params
+
+ # TODO(ekalda): These combinations hang during compilation
+ failing_cases = [
+ (devices[1], (1, 128, 17, 192, 7, 1, "SAME", 2, False, False)),
+ (devices[1], (1, 160, 17, 160, 7, 1, "SAME", 1, False, False)),
+ (
+ devices[1],
+ (1, 448, 8, 384, 3, 1, "VALID", 1, False, False),
+ ), # this one passes but is just incredibly slow
+ ]
+ if (device, params) in failing_cases:
+ return
Review Comment:
This will make it look like the test passed rather than is skipped, we should add these as parameters and expect a failure or mark it as a slow test so that future generations can see it from the pytest output:
```
[
... other cases ...,
pytest.param(devices[1], (1, 128, 17, 192, 7, 1, "SAME", 2, False, False), mark=pytest.mark.xfail),
pytest.param(devices[1], (1, 160, 17, 160, 7, 1, "SAME", 1, False, False), mark=pytest.mark.xfail),
pytest.param(devices[1], (1, 448, 8, 384, 3, 1, "VALID", 1, False, False), mark=tvm.testing.slow),
]
```
##########
tests/python/topi/python/test_topi_conv2d_int8.py:
##########
@@ -298,378 +175,462 @@ def get_ref_data():
a_np, w_np, b_np, c_np = get_ref_data()
- with tvm.target.Target(target):
- C = compute(
- A,
- W,
- (stride, stride),
- padding,
- (dilation, dilation),
- "NCHW",
- "NCHW",
- out_dtype,
- )
+ dev = tvm.device(target, 0)
+ with tvm.target.Target(target) as tvm_target:
+ C = compute(A, W, (stride, stride), padding, (dilation, dilation), dtype)
if add_bias:
C = topi.add(C, bias)
if add_relu:
C = topi.nn.relu(C)
s = schedule([C])
- a = tvm.nd.array(a_np.astype(dtype), dev)
- w = tvm.nd.array(w_np.astype(dtype), dev)
- b = tvm.nd.array(b_np.astype(out_dtype), dev)
- c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
-
- if add_bias:
- compile_args = [A, W, bias, C]
- run_args = [a, w, b, c]
- else:
- compile_args = [A, W, C]
- run_args = [a, w, c]
-
- func = tvm.build(
- s,
- compile_args,
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
+ a = tvm.nd.array(a_np, dev)
+ w = tvm.nd.array(w_np, dev)
+ b = tvm.nd.array(b_np, dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- if build_only:
- return
+ build_inputs = [A, W, bias, C] if add_bias else [A, W, C]
+ inference_inputs = (a, w, b, c) if add_bias else (a, w, c)
+
+ func = tvm.build(
+ s,
+ build_inputs,
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
- print("Running on target: %s" % target)
+ build_only = tvm_target.features.is_aarch64 and (platform.machine() != "aarch64")
- func(*run_args)
+ if not build_only:
+ print("Running on target: %s" % target)
+ func(*inference_inputs)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- targets = [
- (
- "cuda",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- # Disable on CI since it does not support spirv int8 dot product
- # (
- # "vulkan -from_device=0",
- # lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- # topi.cuda.schedule_conv2d_NCHWc_int8,
- # 4,
- # False,
- # ),
- ]
-
- build_only_aarch64 = platform.machine() != "aarch64"
-
- targets.append(
+@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
+@pytest.mark.parametrize(
+ "params",
+ [
+ # ResNet18 workloads where channels in / out are multiple of oc_block_factor
+ (1, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 56, 64, 1, 1, 0, 1, False, False),
+ (1, 64, 56, 128, 3, 2, 1, 1, False, False),
+ (1, 64, 56, 128, 1, 2, 0, 1, False, False),
+ (1, 128, 28, 128, 3, 1, 1, 1, False, False),
+ (1, 128, 28, 256, 3, 2, 1, 1, False, False),
+ (1, 128, 28, 256, 1, 2, 0, 1, False, False),
+ (1, 256, 14, 256, 3, 1, 1, 1, False, False),
+ (1, 256, 14, 512, 3, 2, 1, 1, False, False),
+ (1, 256, 14, 512, 1, 2, 0, 1, False, False),
+ (1, 512, 7, 512, 3, 1, 1, 1, False, False),
+ # bias, relu
+ (1, 64, 56, 64, 3, 1, 1, 1, False, True),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, False),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, True),
+ # dilation = 2
+ (1, 64, 56, 64, 3, 1, 1, 2, False, False),
+ # batch size
+ (4, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (9, 64, 56, 64, 3, 1, 1, 1, False, False),
+ # weird workloads
+ (4, 4, 4, 8, 4, 4, 4, 1, False, False),
+ # inception v3 workloads where channels in / out are multiple of oc_block_factor
+ (1, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (1, 32, 147, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 73, 80, 1, 1, 0, 1, False, False),
+ (1, 80, 73, 192, 3, 1, 0, 1, False, False),
+ (1, 192, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 192, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 48, 35, 64, 5, 1, 2, 1, False, False),
+ (1, 64, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 96, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 192, 35, 32, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 384, 3, 2, 0, 1, False, False),
+ (1, 96, 35, 96, 3, 2, 0, 1, False, False),
+ (1, 768, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 128, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 160, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 192, 17, 320, 3, 2, 0, 1, False, False),
+ (1, 192, 17, 192, 3, 2, 0, 1, False, False),
+ (1, 1280, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 1280, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 448, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 1024, 19, 88, 3, 1, 1, 1, False, False),
+ # batch > 1
+ (7, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (8, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (32, 32, 149, 32, 3, 1, 0, 1, False, False),
+ # Asymmetric padding
+ (1, 32, 35, 64, 7, 2, (0, 0, 1, 1), 1, False, False),
+ (1, 64, 8, 128, 3, 1, (3, 3, 2, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, (1, 2, 2, 1), 1, False, False),
+ (1, 64, 17, 192, 1, 1, (1, 2), 1, False, False),
+ (1, 64, 8, 64, 3, 1, (3, 1), 1, False, False),
+ (1, 128, 8, 384, 3, 1, (0, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, "VALID", 1, False, False),
+ (1, 392, 8, 64, 3, 1, "VALID", 1, False, False),
+ (1, 512, 19, 64, 1, 1, "SAME", 1, False, False),
+ (1, 64, 16, 32, 2, 1, "SAME", 1, False, False),
+ (1, 64, 8, 64, 3, 1, (1, 2, 2, 1), 1, False, True),
+ (1, 64, 8, 64, 5, 2, (1, 3), 1, True, False),
+ (1, 64, 56, 64, 3, 1, "VALID", 1, True, True),
+ (1, 64, 56, 64, 24, 1, "SAME", 1, True, True),
+ ],
+)
+def test_conv2d_NCHWc_int8(in_dtype, params):
+ with Int8Fallback():
(
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon,+v8.2a,+dotprod",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding,
+ dilation,
+ add_bias,
+ add_relu,
+ ) = params
+ pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
+ padding_sum = pad_top + pad_left + pad_bottom + pad_right
+ print(
+ "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
+ % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
)
- )
-
- if in_dtype == "int8":
- targets += [
- (
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
- ),
- (
- "rocm -mattr=+dotprod",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- ]
-
- for target, compute, schedule, oc_block_factor, build_only in targets:
- check_target(target, compute, schedule, oc_block_factor, build_only)
-
-
-def verify_conv2d_nchw_int8(
- in_dtype,
- batch,
- in_channel,
- in_size,
- num_filter,
- kernel,
- stride,
- padding,
- dilation=1,
- add_bias=False,
- add_relu=False,
-):
- pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
- padding_sum = pad_top + pad_left + pad_bottom + pad_right
- print(
- "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
- )
-
- in_height = in_width = in_size
-
- A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
- W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
- bias = te.placeholder((num_filter, 1, 1), name="bias", dtype=in_dtype)
-
- a_shape = get_const_tuple(A.shape)
- w_shape = get_const_tuple(W.shape)
- bias_shape = get_const_tuple(bias.shape)
- dtype = A.dtype
-
- @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
- def get_ref_data():
- a_np = np.random.randint(low=-128, high=127, size=a_shape).astype(dtype)
- w_np = np.random.randint(low=-128, high=128, size=w_shape).astype(dtype)
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
- c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(dtype)
-
- if add_bias:
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- c_np += b_np
- if add_relu:
- c_np = np.maximum(c_np, 0)
-
- return a_np, w_np, b_np, c_np
-
- a_np, w_np, b_np, c_np = get_ref_data()
-
- def verify_workload_padding():
- _, _, out_height, out_width = get_const_tuple(c_np.shape)
- wkl = _get_workload(A, W, (stride, stride), padding, dilation, dtype)
-
- # for testing functionality,
- # we choose arbitrary int32_lanes and num_int8_elements can divide the channel,
- # regardless of the performance.
- int32_lanes, num_int8_elements = num_filter, in_channel
- # check if tile_ow candidates are the factors of the right output weight.
- cfg = autotvm.get_config()
- fallback_schedule_cpu_common_int8(cfg, wkl, int32_lanes, num_int8_elements)
- ow_tile = np.prod(cfg["tile_ow"].size)
-
- tvm.testing.assert_allclose(ow_tile, out_width)
+ in_height = in_width = in_size
+
+ A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
+ W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
+
+ a_shape = get_const_tuple(A.shape)
+ w_shape = get_const_tuple(W.shape)
+ dtype = A.dtype
+ out_dtype = "int32" if in_dtype == "int8" else "uint32"
+ lo = -128 if in_dtype == "int8" else 0
+ hi = 127 if in_dtype == "int8" else 255
+
+ def check_target(target, compute, schedule, oc_block_factor, build_only):
+ dev = tvm.device(target, 0)
+ if not tvm.testing.device_enabled(target):
+ print("Skip because %s is not enabled" % target)
+ return
Review Comment:
Would be good to inform the test runner about this, even better would be to mark these as skipped earlier in the parameterize.
```suggestion
pytest.skip(reason="Skip because %s is not enabled" % target)
```
##########
tests/python/topi/python/test_topi_conv2d_int8.py:
##########
@@ -298,378 +175,462 @@ def get_ref_data():
a_np, w_np, b_np, c_np = get_ref_data()
- with tvm.target.Target(target):
- C = compute(
- A,
- W,
- (stride, stride),
- padding,
- (dilation, dilation),
- "NCHW",
- "NCHW",
- out_dtype,
- )
+ dev = tvm.device(target, 0)
+ with tvm.target.Target(target) as tvm_target:
+ C = compute(A, W, (stride, stride), padding, (dilation, dilation), dtype)
if add_bias:
C = topi.add(C, bias)
if add_relu:
C = topi.nn.relu(C)
s = schedule([C])
- a = tvm.nd.array(a_np.astype(dtype), dev)
- w = tvm.nd.array(w_np.astype(dtype), dev)
- b = tvm.nd.array(b_np.astype(out_dtype), dev)
- c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
-
- if add_bias:
- compile_args = [A, W, bias, C]
- run_args = [a, w, b, c]
- else:
- compile_args = [A, W, C]
- run_args = [a, w, c]
-
- func = tvm.build(
- s,
- compile_args,
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
+ a = tvm.nd.array(a_np, dev)
+ w = tvm.nd.array(w_np, dev)
+ b = tvm.nd.array(b_np, dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- if build_only:
- return
+ build_inputs = [A, W, bias, C] if add_bias else [A, W, C]
+ inference_inputs = (a, w, b, c) if add_bias else (a, w, c)
+
+ func = tvm.build(
+ s,
+ build_inputs,
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
- print("Running on target: %s" % target)
+ build_only = tvm_target.features.is_aarch64 and (platform.machine() != "aarch64")
- func(*run_args)
+ if not build_only:
+ print("Running on target: %s" % target)
+ func(*inference_inputs)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- targets = [
- (
- "cuda",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- # Disable on CI since it does not support spirv int8 dot product
- # (
- # "vulkan -from_device=0",
- # lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- # topi.cuda.schedule_conv2d_NCHWc_int8,
- # 4,
- # False,
- # ),
- ]
-
- build_only_aarch64 = platform.machine() != "aarch64"
-
- targets.append(
+@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
+@pytest.mark.parametrize(
+ "params",
+ [
+ # ResNet18 workloads where channels in / out are multiple of oc_block_factor
+ (1, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 56, 64, 1, 1, 0, 1, False, False),
+ (1, 64, 56, 128, 3, 2, 1, 1, False, False),
+ (1, 64, 56, 128, 1, 2, 0, 1, False, False),
+ (1, 128, 28, 128, 3, 1, 1, 1, False, False),
+ (1, 128, 28, 256, 3, 2, 1, 1, False, False),
+ (1, 128, 28, 256, 1, 2, 0, 1, False, False),
+ (1, 256, 14, 256, 3, 1, 1, 1, False, False),
+ (1, 256, 14, 512, 3, 2, 1, 1, False, False),
+ (1, 256, 14, 512, 1, 2, 0, 1, False, False),
+ (1, 512, 7, 512, 3, 1, 1, 1, False, False),
+ # bias, relu
+ (1, 64, 56, 64, 3, 1, 1, 1, False, True),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, False),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, True),
+ # dilation = 2
+ (1, 64, 56, 64, 3, 1, 1, 2, False, False),
+ # batch size
+ (4, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (9, 64, 56, 64, 3, 1, 1, 1, False, False),
+ # weird workloads
+ (4, 4, 4, 8, 4, 4, 4, 1, False, False),
+ # inception v3 workloads where channels in / out are multiple of oc_block_factor
+ (1, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (1, 32, 147, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 73, 80, 1, 1, 0, 1, False, False),
+ (1, 80, 73, 192, 3, 1, 0, 1, False, False),
+ (1, 192, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 192, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 48, 35, 64, 5, 1, 2, 1, False, False),
+ (1, 64, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 96, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 192, 35, 32, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 384, 3, 2, 0, 1, False, False),
+ (1, 96, 35, 96, 3, 2, 0, 1, False, False),
+ (1, 768, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 128, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 160, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 192, 17, 320, 3, 2, 0, 1, False, False),
+ (1, 192, 17, 192, 3, 2, 0, 1, False, False),
+ (1, 1280, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 1280, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 448, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 1024, 19, 88, 3, 1, 1, 1, False, False),
+ # batch > 1
+ (7, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (8, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (32, 32, 149, 32, 3, 1, 0, 1, False, False),
+ # Asymmetric padding
+ (1, 32, 35, 64, 7, 2, (0, 0, 1, 1), 1, False, False),
+ (1, 64, 8, 128, 3, 1, (3, 3, 2, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, (1, 2, 2, 1), 1, False, False),
+ (1, 64, 17, 192, 1, 1, (1, 2), 1, False, False),
+ (1, 64, 8, 64, 3, 1, (3, 1), 1, False, False),
+ (1, 128, 8, 384, 3, 1, (0, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, "VALID", 1, False, False),
+ (1, 392, 8, 64, 3, 1, "VALID", 1, False, False),
+ (1, 512, 19, 64, 1, 1, "SAME", 1, False, False),
+ (1, 64, 16, 32, 2, 1, "SAME", 1, False, False),
+ (1, 64, 8, 64, 3, 1, (1, 2, 2, 1), 1, False, True),
+ (1, 64, 8, 64, 5, 2, (1, 3), 1, True, False),
+ (1, 64, 56, 64, 3, 1, "VALID", 1, True, True),
+ (1, 64, 56, 64, 24, 1, "SAME", 1, True, True),
+ ],
+)
+def test_conv2d_NCHWc_int8(in_dtype, params):
+ with Int8Fallback():
(
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon,+v8.2a,+dotprod",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding,
+ dilation,
+ add_bias,
+ add_relu,
+ ) = params
+ pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
+ padding_sum = pad_top + pad_left + pad_bottom + pad_right
+ print(
+ "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
+ % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
)
- )
-
- if in_dtype == "int8":
- targets += [
- (
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
- ),
- (
- "rocm -mattr=+dotprod",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- ]
-
- for target, compute, schedule, oc_block_factor, build_only in targets:
- check_target(target, compute, schedule, oc_block_factor, build_only)
-
-
-def verify_conv2d_nchw_int8(
- in_dtype,
- batch,
- in_channel,
- in_size,
- num_filter,
- kernel,
- stride,
- padding,
- dilation=1,
- add_bias=False,
- add_relu=False,
-):
- pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
- padding_sum = pad_top + pad_left + pad_bottom + pad_right
- print(
- "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
- )
-
- in_height = in_width = in_size
-
- A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
- W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
- bias = te.placeholder((num_filter, 1, 1), name="bias", dtype=in_dtype)
-
- a_shape = get_const_tuple(A.shape)
- w_shape = get_const_tuple(W.shape)
- bias_shape = get_const_tuple(bias.shape)
- dtype = A.dtype
-
- @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
- def get_ref_data():
- a_np = np.random.randint(low=-128, high=127, size=a_shape).astype(dtype)
- w_np = np.random.randint(low=-128, high=128, size=w_shape).astype(dtype)
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
- c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(dtype)
-
- if add_bias:
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- c_np += b_np
- if add_relu:
- c_np = np.maximum(c_np, 0)
-
- return a_np, w_np, b_np, c_np
-
- a_np, w_np, b_np, c_np = get_ref_data()
-
- def verify_workload_padding():
- _, _, out_height, out_width = get_const_tuple(c_np.shape)
- wkl = _get_workload(A, W, (stride, stride), padding, dilation, dtype)
-
- # for testing functionality,
- # we choose arbitrary int32_lanes and num_int8_elements can divide the channel,
- # regardless of the performance.
- int32_lanes, num_int8_elements = num_filter, in_channel
- # check if tile_ow candidates are the factors of the right output weight.
- cfg = autotvm.get_config()
- fallback_schedule_cpu_common_int8(cfg, wkl, int32_lanes, num_int8_elements)
- ow_tile = np.prod(cfg["tile_ow"].size)
-
- tvm.testing.assert_allclose(ow_tile, out_width)
+ in_height = in_width = in_size
+
+ A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
+ W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
+
+ a_shape = get_const_tuple(A.shape)
+ w_shape = get_const_tuple(W.shape)
+ dtype = A.dtype
+ out_dtype = "int32" if in_dtype == "int8" else "uint32"
+ lo = -128 if in_dtype == "int8" else 0
+ hi = 127 if in_dtype == "int8" else 255
+
+ def check_target(target, compute, schedule, oc_block_factor, build_only):
+ dev = tvm.device(target, 0)
+ if not tvm.testing.device_enabled(target):
+ print("Skip because %s is not enabled" % target)
+ return
+ if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
+ print("Skip because int8 intrinsics are not available")
+ return
+
+ bias = te.placeholder(
+ (num_filter // oc_block_factor, 1, 1, oc_block_factor), name="bias", dtype=out_dtype
+ )
+ bias_shape = get_const_tuple(bias.shape)
- def check_target(target):
- dev = tvm.device(target, 0)
- if not tvm.testing.device_enabled(target):
- print("Skip because %s is not enabled" % target)
- return
- if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
- print("Skip because int8 intrinsics are not available")
- return
+ @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
+ def get_ref_data():
+ a_np = np.random.randint(low=lo, high=hi, size=a_shape).astype(out_dtype)
+ w_np = np.random.randint(low=lo, high=hi, size=w_shape).astype(out_dtype)
+ b_np = np.random.uniform(size=bias_shape).astype(out_dtype)
+ dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
+ c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(
+ out_dtype
+ )
+
+ # convert to NCHWc
+ _, _, out_height, out_width = c_np.shape
+ c_np = c_np.reshape(
+ (batch, num_filter // oc_block_factor, oc_block_factor, out_height, out_width)
+ ).transpose(0, 1, 3, 4, 2)
+
+ if add_bias:
+ b_np = np.random.uniform(size=bias_shape).astype(out_dtype)
+ c_np += b_np
+ if add_relu:
+ c_np = np.maximum(c_np, 0)
+
+ return a_np, w_np, b_np, c_np
+
+ a_np, w_np, b_np, c_np = get_ref_data()
+
+ with tvm.target.Target(target):
+ C = compute(
+ A,
+ W,
+ (stride, stride),
+ padding,
+ (dilation, dilation),
+ "NCHW",
+ "NCHW",
+ out_dtype,
+ )
+ if add_bias:
+ C = topi.add(C, bias)
+ if add_relu:
+ C = topi.nn.relu(C)
+ s = schedule([C])
+
+ a = tvm.nd.array(a_np.astype(dtype), dev)
+ w = tvm.nd.array(w_np.astype(dtype), dev)
+ b = tvm.nd.array(b_np.astype(out_dtype), dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- print("Running on target: %s" % target)
- with tvm.target.Target(target):
- C = topi.cuda.conv2d_nchw_int8(
- A, W, (stride, stride), padding, (dilation, dilation), dtype
- )
if add_bias:
- C = topi.add(C, bias)
- if add_relu:
- C = topi.nn.relu(C)
- s = topi.cuda.schedule_conv2d_nchw_int8([C])
-
- a = tvm.nd.array(a_np, dev)
- w = tvm.nd.array(w_np, dev)
- b = tvm.nd.array(b_np, dev)
- c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- if add_bias:
- tvm.build(
- s,
- [A, W, bias, C],
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
- func = tvm.build(
- s,
- [A, W, bias, C],
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
- func(a, w, b, c)
- else:
+ compile_args = [A, W, bias, C]
+ run_args = [a, w, b, c]
+ else:
+ compile_args = [A, W, C]
+ run_args = [a, w, c]
+
func = tvm.build(
s,
- [A, W, C],
+ compile_args,
target,
name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
% (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
)
- func(a, w, c)
- tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- verify_workload_padding()
+ if build_only:
+ return
- for target in ["cuda"]:
- check_target(target)
+ print("Running on target: %s" % target)
+ func(*run_args)
-@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
-def test_conv2d_nchw(in_dtype):
- with Int8Fallback():
- # ResNet18 workloads where channels in / out are multiple of oc_block_factor
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 128, 3, 2, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 128, 1, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 28, 128, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 28, 256, 3, 2, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 28, 256, 1, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 14, 256, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 14, 512, 3, 2, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 14, 512, 1, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 512, 7, 512, 3, 1, 1)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- # bias, relu
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_relu=True)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_bias=True)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_bias=True, add_relu=True)
+ targets = [
+ (
+ "cuda",
+ lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
+ topi.cuda.schedule_conv2d_NCHWc_int8,
+ 4,
+ False,
+ ),
+ # Disable on CI since it does not support spirv int8 dot product
+ # (
+ # "vulkan -from_device=0",
+ # lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
+ # topi.cuda.schedule_conv2d_NCHWc_int8,
+ # 4,
+ # False,
+ # ),
+ ]
- # dilation = 2
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, dilation=2)
+ build_only_aarch64 = platform.machine() != "aarch64"
- # batch size
- verify_conv2d_NCHWc_int8(in_dtype, 4, 64, 56, 64, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 9, 64, 56, 64, 3, 1, 1)
+ targets.append(
+ (
+ "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon,+v8.2a,+dotprod",
+ topi.arm_cpu.conv2d_NCHWc_int8,
+ topi.arm_cpu.schedule_conv2d_NCHWc_int8,
+ 8,
+ build_only_aarch64,
+ )
+ )
- # weird workloads
- verify_conv2d_NCHWc_int8(in_dtype, 4, 4, 4, 8, 4, 4, 4)
+ if in_dtype == "int8":
+ targets += [
+ (
+ "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon",
+ topi.arm_cpu.conv2d_NCHWc_int8,
+ topi.arm_cpu.schedule_conv2d_NCHWc_int8,
+ 8,
+ build_only_aarch64,
+ ),
+ (
+ "rocm -mattr=+dotprod",
+ lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(
+ a, w, s, p, d, l, o
+ ),
+ topi.cuda.schedule_conv2d_NCHWc_int8,
+ 4,
+ False,
+ ),
+ ]
+
+ for target, compute, schedule, oc_block_factor, build_only in targets:
+ check_target(target, compute, schedule, oc_block_factor, build_only)
+
+
+# Conv2d NCHW int8 schedule testing. Internally, it uses NCHWc schedule. So, just
+# performing basic testing - one test for all different scenarios - batch, dilation etc..
+@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
+@pytest.mark.parametrize(
+ "params",
+ [
+ (1, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 56, 64, 3, 1, 1, 1, False, True),
+ (1, 64, 56, 64, 3, 1, 1, 2, False, False),
+ (9, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (4, 4, 4, 4, 4, 4, 4, 1, False, False),
+ (1, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (7, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (1, 32, 35, 64, 7, 2, (0, 0, 1, 1), 1, False, False),
+ (1, 32, 35, 64, 7, 2, (0, 0, 2, 2), 1, False, False),
+ ],
+)
+def test_conv2d_nchw_int8(in_dtype, params):
+ with Int8Fallback():
+ (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding,
+ dilation,
+ add_bias,
+ add_relu,
+ ) = params
+ pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
+ padding_sum = pad_top + pad_left + pad_bottom + pad_right
+ print(
+ "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
+ % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
+ )
- # inception v3 workloads where channels in / out are multiple of oc_block_factor
- verify_conv2d_NCHWc_int8(in_dtype, 1, 32, 149, 32, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 32, 147, 64, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 73, 80, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 80, 73, 192, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 35, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 35, 48, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 48, 35, 64, 5, 1, 2)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 35, 96, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 96, 35, 96, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 35, 32, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 35, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 35, 48, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 288, 35, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 288, 35, 48, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 288, 35, 384, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 96, 35, 96, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 768, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 768, 17, 128, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 128, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 192, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 128, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 768, 17, 160, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 160, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 192, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 160, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 192, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 320, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 192, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 320, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 384, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 384, 8, 384, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 384, 8, 384, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 448, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 448, 8, 384, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 320, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 384, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 448, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1024, 19, 88, 3, 1, 1)
+ in_height = in_width = in_size
- # batch > 1
- verify_conv2d_NCHWc_int8(in_dtype, 7, 32, 149, 32, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 8, 32, 149, 32, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 32, 32, 149, 32, 3, 1, 0)
+ A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
+ W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
+ bias = te.placeholder((num_filter, 1, 1), name="bias", dtype=in_dtype)
- # Asymmetric padding
- verify_conv2d_NCHWc_int8(in_dtype, 1, 32, 35, 64, 7, 2, (0, 0, 1, 1))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 128, 3, 1, (3, 3, 2, 2))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 1, 1, (1, 2, 2, 1))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 17, 192, 1, 1, (1, 2))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 3, 1, (3, 1))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 8, 384, 3, 1, (0, 2))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 1, 1, "VALID")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 392, 8, 64, 3, 1, "VALID")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 512, 19, 64, 1, 1, "SAME")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 16, 32, 2, 1, "SAME")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 3, 1, (1, 2, 2, 1), add_relu=True)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 5, 2, (1, 3), add_bias=True)
- verify_conv2d_NCHWc_int8(
- in_dtype, 1, 64, 56, 64, 3, 1, "VALID", add_bias=True, add_relu=True
- )
- verify_conv2d_NCHWc_int8(
- in_dtype, 1, 64, 56, 64, 24, 1, "SAME", add_bias=True, add_relu=True
- )
+ a_shape = get_const_tuple(A.shape)
+ w_shape = get_const_tuple(W.shape)
+ bias_shape = get_const_tuple(bias.shape)
+ dtype = A.dtype
+
+ @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
+ def get_ref_data():
+ a_np = np.random.randint(low=-128, high=127, size=a_shape).astype(dtype)
+ w_np = np.random.randint(low=-128, high=128, size=w_shape).astype(dtype)
+ b_np = np.random.uniform(size=bias_shape).astype(dtype)
+ dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
+ c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(dtype)
- # Conv2d NCHW int8 schedule testing. Internally, it uses NCHWc schedule. So, just
- # performing basic testing - one test for all different scenarios - batch, dilation etc..
- verify_conv2d_nchw_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1)
- verify_conv2d_nchw_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_relu=True)
- verify_conv2d_nchw_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, dilation=2)
- verify_conv2d_nchw_int8(in_dtype, 9, 64, 56, 64, 3, 1, 1)
- verify_conv2d_nchw_int8(in_dtype, 4, 4, 4, 4, 4, 4, 4)
- verify_conv2d_nchw_int8(in_dtype, 1, 32, 149, 32, 3, 1, 0)
- verify_conv2d_nchw_int8(in_dtype, 7, 32, 149, 32, 3, 1, 0)
- verify_conv2d_nchw_int8(in_dtype, 1, 32, 35, 64, 7, 2, (0, 0, 1, 1))
- verify_conv2d_nchw_int8(in_dtype, 1, 32, 35, 64, 7, 2, (0, 0, 2, 2))
+ if add_bias:
+ b_np = np.random.uniform(size=bias_shape).astype(dtype)
+ c_np += b_np
+ if add_relu:
+ c_np = np.maximum(c_np, 0)
+ return a_np, w_np, b_np, c_np
-def test_conv2d_nhwc():
- with Int8Fallback():
- # Subset of inception v3 expanded (dilation > 1, batch > 1, 'VALID' padding)
- verify_conv2d_NHWC_gemm_int8(1, 3, 299, 32, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 32, 149, 32, 3, 1, "SAME", dilation=2)
- verify_conv2d_NHWC_gemm_int8(4, 32, 147, 64, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 64, 73, 80, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 80, 73, 192, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 35, 48, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 35, 64, 1, 1, "VALID")
- verify_conv2d_NHWC_gemm_int8(1, 192, 35, 32, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 48, 35, 64, 5, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 96, 35, 96, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 256, 35, 48, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 256, 35, 64, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 288, 35, 64, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 288, 35, 48, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 96, 35, 96, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 128, 17, 192, 7, 1, "SAME", dilation=2)
- verify_conv2d_NHWC_gemm_int8(1, 160, 17, 160, 7, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 160, 17, 192, 1, 1, "VALID")
- verify_conv2d_NHWC_gemm_int8(1, 192, 17, 192, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 768, 5, 128, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 17, 320, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 17, 192, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 192, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 384, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 320, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 448, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 384, 8, 384, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 384, 8, 384, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 448, 8, 384, 3, 1, "VALID")
- verify_conv2d_NHWC_gemm_int8(1, 2048, 8, 320, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 2048, 8, 448, 1, 1, "SAME", add_bias=True, add_relu=True)
- verify_conv2d_NHWC_gemm_int8(1, 2048, 8, 192, 1, 1, "SAME", add_bias=True)
-
- # Let's also verify that it compiles fine on AArch64 targets
- compile_conv2d_NHWC_gemm_int8_arm(1, 3, 299, 32, 3, 2, "SAME")
+ a_np, w_np, b_np, c_np = get_ref_data()
+
+ def verify_workload_padding():
+ _, _, out_height, out_width = get_const_tuple(c_np.shape)
+ wkl = _get_workload(A, W, (stride, stride), padding, dilation, dtype)
+
+ # for testing functionality,
+ # we choose arbitrary int32_lanes and num_int8_elements can divide the channel,
+ # regardless of the performance.
+ int32_lanes, num_int8_elements = num_filter, in_channel
+
+ # check if tile_ow candidates are the factors of the right output weight.
+ cfg = autotvm.get_config()
+ fallback_schedule_cpu_common_int8(cfg, wkl, int32_lanes, num_int8_elements)
+ ow_tile = np.prod(cfg["tile_ow"].size)
+
+ tvm.testing.assert_allclose(ow_tile, out_width)
+
+ def check_target(target):
+ dev = tvm.device(target, 0)
+ if not tvm.testing.device_enabled(target):
+ print("Skip because %s is not enabled" % target)
+ return
+ if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
+ print("Skip because int8 intrinsics are not available")
+ return
Review Comment:
`pytest.skip` as above, can we re-use the same function by hoisting it out of the test?
##########
tests/python/topi/python/test_topi_conv2d_int8.py:
##########
@@ -298,378 +175,462 @@ def get_ref_data():
a_np, w_np, b_np, c_np = get_ref_data()
- with tvm.target.Target(target):
- C = compute(
- A,
- W,
- (stride, stride),
- padding,
- (dilation, dilation),
- "NCHW",
- "NCHW",
- out_dtype,
- )
+ dev = tvm.device(target, 0)
+ with tvm.target.Target(target) as tvm_target:
+ C = compute(A, W, (stride, stride), padding, (dilation, dilation), dtype)
if add_bias:
C = topi.add(C, bias)
if add_relu:
C = topi.nn.relu(C)
s = schedule([C])
- a = tvm.nd.array(a_np.astype(dtype), dev)
- w = tvm.nd.array(w_np.astype(dtype), dev)
- b = tvm.nd.array(b_np.astype(out_dtype), dev)
- c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
-
- if add_bias:
- compile_args = [A, W, bias, C]
- run_args = [a, w, b, c]
- else:
- compile_args = [A, W, C]
- run_args = [a, w, c]
-
- func = tvm.build(
- s,
- compile_args,
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
+ a = tvm.nd.array(a_np, dev)
+ w = tvm.nd.array(w_np, dev)
+ b = tvm.nd.array(b_np, dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- if build_only:
- return
+ build_inputs = [A, W, bias, C] if add_bias else [A, W, C]
+ inference_inputs = (a, w, b, c) if add_bias else (a, w, c)
+
+ func = tvm.build(
+ s,
+ build_inputs,
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
- print("Running on target: %s" % target)
+ build_only = tvm_target.features.is_aarch64 and (platform.machine() != "aarch64")
- func(*run_args)
+ if not build_only:
+ print("Running on target: %s" % target)
+ func(*inference_inputs)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- targets = [
- (
- "cuda",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- # Disable on CI since it does not support spirv int8 dot product
- # (
- # "vulkan -from_device=0",
- # lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- # topi.cuda.schedule_conv2d_NCHWc_int8,
- # 4,
- # False,
- # ),
- ]
-
- build_only_aarch64 = platform.machine() != "aarch64"
-
- targets.append(
+@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
+@pytest.mark.parametrize(
+ "params",
+ [
+ # ResNet18 workloads where channels in / out are multiple of oc_block_factor
+ (1, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 56, 64, 1, 1, 0, 1, False, False),
+ (1, 64, 56, 128, 3, 2, 1, 1, False, False),
+ (1, 64, 56, 128, 1, 2, 0, 1, False, False),
+ (1, 128, 28, 128, 3, 1, 1, 1, False, False),
+ (1, 128, 28, 256, 3, 2, 1, 1, False, False),
+ (1, 128, 28, 256, 1, 2, 0, 1, False, False),
+ (1, 256, 14, 256, 3, 1, 1, 1, False, False),
+ (1, 256, 14, 512, 3, 2, 1, 1, False, False),
+ (1, 256, 14, 512, 1, 2, 0, 1, False, False),
+ (1, 512, 7, 512, 3, 1, 1, 1, False, False),
+ # bias, relu
+ (1, 64, 56, 64, 3, 1, 1, 1, False, True),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, False),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, True),
+ # dilation = 2
+ (1, 64, 56, 64, 3, 1, 1, 2, False, False),
+ # batch size
+ (4, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (9, 64, 56, 64, 3, 1, 1, 1, False, False),
+ # weird workloads
+ (4, 4, 4, 8, 4, 4, 4, 1, False, False),
+ # inception v3 workloads where channels in / out are multiple of oc_block_factor
+ (1, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (1, 32, 147, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 73, 80, 1, 1, 0, 1, False, False),
+ (1, 80, 73, 192, 3, 1, 0, 1, False, False),
+ (1, 192, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 192, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 48, 35, 64, 5, 1, 2, 1, False, False),
+ (1, 64, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 96, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 192, 35, 32, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 384, 3, 2, 0, 1, False, False),
+ (1, 96, 35, 96, 3, 2, 0, 1, False, False),
+ (1, 768, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 128, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 160, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 192, 17, 320, 3, 2, 0, 1, False, False),
+ (1, 192, 17, 192, 3, 2, 0, 1, False, False),
+ (1, 1280, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 1280, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 448, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 1024, 19, 88, 3, 1, 1, 1, False, False),
+ # batch > 1
+ (7, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (8, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (32, 32, 149, 32, 3, 1, 0, 1, False, False),
+ # Asymmetric padding
+ (1, 32, 35, 64, 7, 2, (0, 0, 1, 1), 1, False, False),
+ (1, 64, 8, 128, 3, 1, (3, 3, 2, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, (1, 2, 2, 1), 1, False, False),
+ (1, 64, 17, 192, 1, 1, (1, 2), 1, False, False),
+ (1, 64, 8, 64, 3, 1, (3, 1), 1, False, False),
+ (1, 128, 8, 384, 3, 1, (0, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, "VALID", 1, False, False),
+ (1, 392, 8, 64, 3, 1, "VALID", 1, False, False),
+ (1, 512, 19, 64, 1, 1, "SAME", 1, False, False),
+ (1, 64, 16, 32, 2, 1, "SAME", 1, False, False),
+ (1, 64, 8, 64, 3, 1, (1, 2, 2, 1), 1, False, True),
+ (1, 64, 8, 64, 5, 2, (1, 3), 1, True, False),
+ (1, 64, 56, 64, 3, 1, "VALID", 1, True, True),
+ (1, 64, 56, 64, 24, 1, "SAME", 1, True, True),
+ ],
+)
+def test_conv2d_NCHWc_int8(in_dtype, params):
+ with Int8Fallback():
(
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon,+v8.2a,+dotprod",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding,
+ dilation,
+ add_bias,
+ add_relu,
+ ) = params
+ pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
+ padding_sum = pad_top + pad_left + pad_bottom + pad_right
+ print(
+ "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
+ % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
)
- )
-
- if in_dtype == "int8":
- targets += [
- (
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
- ),
- (
- "rocm -mattr=+dotprod",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- ]
-
- for target, compute, schedule, oc_block_factor, build_only in targets:
- check_target(target, compute, schedule, oc_block_factor, build_only)
-
-
-def verify_conv2d_nchw_int8(
- in_dtype,
- batch,
- in_channel,
- in_size,
- num_filter,
- kernel,
- stride,
- padding,
- dilation=1,
- add_bias=False,
- add_relu=False,
-):
- pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
- padding_sum = pad_top + pad_left + pad_bottom + pad_right
- print(
- "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
- )
-
- in_height = in_width = in_size
-
- A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
- W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
- bias = te.placeholder((num_filter, 1, 1), name="bias", dtype=in_dtype)
-
- a_shape = get_const_tuple(A.shape)
- w_shape = get_const_tuple(W.shape)
- bias_shape = get_const_tuple(bias.shape)
- dtype = A.dtype
-
- @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
- def get_ref_data():
- a_np = np.random.randint(low=-128, high=127, size=a_shape).astype(dtype)
- w_np = np.random.randint(low=-128, high=128, size=w_shape).astype(dtype)
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
- c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(dtype)
-
- if add_bias:
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- c_np += b_np
- if add_relu:
- c_np = np.maximum(c_np, 0)
-
- return a_np, w_np, b_np, c_np
-
- a_np, w_np, b_np, c_np = get_ref_data()
-
- def verify_workload_padding():
- _, _, out_height, out_width = get_const_tuple(c_np.shape)
- wkl = _get_workload(A, W, (stride, stride), padding, dilation, dtype)
-
- # for testing functionality,
- # we choose arbitrary int32_lanes and num_int8_elements can divide the channel,
- # regardless of the performance.
- int32_lanes, num_int8_elements = num_filter, in_channel
- # check if tile_ow candidates are the factors of the right output weight.
- cfg = autotvm.get_config()
- fallback_schedule_cpu_common_int8(cfg, wkl, int32_lanes, num_int8_elements)
- ow_tile = np.prod(cfg["tile_ow"].size)
-
- tvm.testing.assert_allclose(ow_tile, out_width)
+ in_height = in_width = in_size
+
+ A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
+ W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
+
+ a_shape = get_const_tuple(A.shape)
+ w_shape = get_const_tuple(W.shape)
+ dtype = A.dtype
+ out_dtype = "int32" if in_dtype == "int8" else "uint32"
+ lo = -128 if in_dtype == "int8" else 0
+ hi = 127 if in_dtype == "int8" else 255
+
+ def check_target(target, compute, schedule, oc_block_factor, build_only):
+ dev = tvm.device(target, 0)
+ if not tvm.testing.device_enabled(target):
+ print("Skip because %s is not enabled" % target)
+ return
+ if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
+ print("Skip because int8 intrinsics are not available")
+ return
+
+ bias = te.placeholder(
+ (num_filter // oc_block_factor, 1, 1, oc_block_factor), name="bias", dtype=out_dtype
+ )
+ bias_shape = get_const_tuple(bias.shape)
- def check_target(target):
- dev = tvm.device(target, 0)
- if not tvm.testing.device_enabled(target):
- print("Skip because %s is not enabled" % target)
- return
- if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
- print("Skip because int8 intrinsics are not available")
- return
+ @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
+ def get_ref_data():
+ a_np = np.random.randint(low=lo, high=hi, size=a_shape).astype(out_dtype)
+ w_np = np.random.randint(low=lo, high=hi, size=w_shape).astype(out_dtype)
+ b_np = np.random.uniform(size=bias_shape).astype(out_dtype)
+ dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
+ c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(
+ out_dtype
+ )
+
+ # convert to NCHWc
+ _, _, out_height, out_width = c_np.shape
+ c_np = c_np.reshape(
+ (batch, num_filter // oc_block_factor, oc_block_factor, out_height, out_width)
+ ).transpose(0, 1, 3, 4, 2)
+
+ if add_bias:
+ b_np = np.random.uniform(size=bias_shape).astype(out_dtype)
+ c_np += b_np
+ if add_relu:
+ c_np = np.maximum(c_np, 0)
+
+ return a_np, w_np, b_np, c_np
+
+ a_np, w_np, b_np, c_np = get_ref_data()
+
+ with tvm.target.Target(target):
+ C = compute(
+ A,
+ W,
+ (stride, stride),
+ padding,
+ (dilation, dilation),
+ "NCHW",
+ "NCHW",
+ out_dtype,
+ )
+ if add_bias:
+ C = topi.add(C, bias)
+ if add_relu:
+ C = topi.nn.relu(C)
+ s = schedule([C])
+
+ a = tvm.nd.array(a_np.astype(dtype), dev)
+ w = tvm.nd.array(w_np.astype(dtype), dev)
+ b = tvm.nd.array(b_np.astype(out_dtype), dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- print("Running on target: %s" % target)
- with tvm.target.Target(target):
- C = topi.cuda.conv2d_nchw_int8(
- A, W, (stride, stride), padding, (dilation, dilation), dtype
- )
if add_bias:
- C = topi.add(C, bias)
- if add_relu:
- C = topi.nn.relu(C)
- s = topi.cuda.schedule_conv2d_nchw_int8([C])
-
- a = tvm.nd.array(a_np, dev)
- w = tvm.nd.array(w_np, dev)
- b = tvm.nd.array(b_np, dev)
- c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- if add_bias:
- tvm.build(
- s,
- [A, W, bias, C],
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
- func = tvm.build(
- s,
- [A, W, bias, C],
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
- func(a, w, b, c)
- else:
+ compile_args = [A, W, bias, C]
+ run_args = [a, w, b, c]
+ else:
+ compile_args = [A, W, C]
+ run_args = [a, w, c]
+
func = tvm.build(
s,
- [A, W, C],
+ compile_args,
target,
name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
% (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
)
- func(a, w, c)
- tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- verify_workload_padding()
+ if build_only:
+ return
- for target in ["cuda"]:
- check_target(target)
+ print("Running on target: %s" % target)
+ func(*run_args)
-@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
-def test_conv2d_nchw(in_dtype):
- with Int8Fallback():
- # ResNet18 workloads where channels in / out are multiple of oc_block_factor
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 128, 3, 2, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 128, 1, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 28, 128, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 28, 256, 3, 2, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 28, 256, 1, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 14, 256, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 14, 512, 3, 2, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 14, 512, 1, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 512, 7, 512, 3, 1, 1)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- # bias, relu
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_relu=True)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_bias=True)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_bias=True, add_relu=True)
+ targets = [
+ (
+ "cuda",
+ lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
+ topi.cuda.schedule_conv2d_NCHWc_int8,
+ 4,
+ False,
+ ),
+ # Disable on CI since it does not support spirv int8 dot product
+ # (
+ # "vulkan -from_device=0",
+ # lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
+ # topi.cuda.schedule_conv2d_NCHWc_int8,
+ # 4,
+ # False,
+ # ),
+ ]
- # dilation = 2
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, dilation=2)
+ build_only_aarch64 = platform.machine() != "aarch64"
- # batch size
- verify_conv2d_NCHWc_int8(in_dtype, 4, 64, 56, 64, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 9, 64, 56, 64, 3, 1, 1)
+ targets.append(
+ (
+ "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon,+v8.2a,+dotprod",
+ topi.arm_cpu.conv2d_NCHWc_int8,
+ topi.arm_cpu.schedule_conv2d_NCHWc_int8,
+ 8,
+ build_only_aarch64,
+ )
+ )
- # weird workloads
- verify_conv2d_NCHWc_int8(in_dtype, 4, 4, 4, 8, 4, 4, 4)
+ if in_dtype == "int8":
+ targets += [
+ (
+ "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon",
+ topi.arm_cpu.conv2d_NCHWc_int8,
+ topi.arm_cpu.schedule_conv2d_NCHWc_int8,
+ 8,
+ build_only_aarch64,
+ ),
+ (
+ "rocm -mattr=+dotprod",
+ lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(
+ a, w, s, p, d, l, o
+ ),
+ topi.cuda.schedule_conv2d_NCHWc_int8,
+ 4,
+ False,
+ ),
+ ]
+
+ for target, compute, schedule, oc_block_factor, build_only in targets:
+ check_target(target, compute, schedule, oc_block_factor, build_only)
+
+
+# Conv2d NCHW int8 schedule testing. Internally, it uses NCHWc schedule. So, just
+# performing basic testing - one test for all different scenarios - batch, dilation etc..
+@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
+@pytest.mark.parametrize(
+ "params",
+ [
+ (1, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 56, 64, 3, 1, 1, 1, False, True),
+ (1, 64, 56, 64, 3, 1, 1, 2, False, False),
+ (9, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (4, 4, 4, 4, 4, 4, 4, 1, False, False),
+ (1, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (7, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (1, 32, 35, 64, 7, 2, (0, 0, 1, 1), 1, False, False),
+ (1, 32, 35, 64, 7, 2, (0, 0, 2, 2), 1, False, False),
+ ],
+)
+def test_conv2d_nchw_int8(in_dtype, params):
+ with Int8Fallback():
+ (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding,
+ dilation,
+ add_bias,
+ add_relu,
+ ) = params
+ pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
+ padding_sum = pad_top + pad_left + pad_bottom + pad_right
+ print(
+ "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
+ % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
+ )
- # inception v3 workloads where channels in / out are multiple of oc_block_factor
- verify_conv2d_NCHWc_int8(in_dtype, 1, 32, 149, 32, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 32, 147, 64, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 73, 80, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 80, 73, 192, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 35, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 35, 48, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 48, 35, 64, 5, 1, 2)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 35, 96, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 96, 35, 96, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 35, 32, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 35, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 35, 48, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 288, 35, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 288, 35, 48, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 288, 35, 384, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 96, 35, 96, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 768, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 768, 17, 128, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 128, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 192, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 128, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 768, 17, 160, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 160, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 192, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 160, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 192, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 320, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 192, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 320, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 384, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 384, 8, 384, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 384, 8, 384, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 448, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 448, 8, 384, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 320, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 384, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 448, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1024, 19, 88, 3, 1, 1)
+ in_height = in_width = in_size
- # batch > 1
- verify_conv2d_NCHWc_int8(in_dtype, 7, 32, 149, 32, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 8, 32, 149, 32, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 32, 32, 149, 32, 3, 1, 0)
+ A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
+ W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
+ bias = te.placeholder((num_filter, 1, 1), name="bias", dtype=in_dtype)
- # Asymmetric padding
- verify_conv2d_NCHWc_int8(in_dtype, 1, 32, 35, 64, 7, 2, (0, 0, 1, 1))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 128, 3, 1, (3, 3, 2, 2))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 1, 1, (1, 2, 2, 1))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 17, 192, 1, 1, (1, 2))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 3, 1, (3, 1))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 8, 384, 3, 1, (0, 2))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 1, 1, "VALID")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 392, 8, 64, 3, 1, "VALID")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 512, 19, 64, 1, 1, "SAME")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 16, 32, 2, 1, "SAME")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 3, 1, (1, 2, 2, 1), add_relu=True)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 5, 2, (1, 3), add_bias=True)
- verify_conv2d_NCHWc_int8(
- in_dtype, 1, 64, 56, 64, 3, 1, "VALID", add_bias=True, add_relu=True
- )
- verify_conv2d_NCHWc_int8(
- in_dtype, 1, 64, 56, 64, 24, 1, "SAME", add_bias=True, add_relu=True
- )
+ a_shape = get_const_tuple(A.shape)
+ w_shape = get_const_tuple(W.shape)
+ bias_shape = get_const_tuple(bias.shape)
+ dtype = A.dtype
+
+ @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
+ def get_ref_data():
+ a_np = np.random.randint(low=-128, high=127, size=a_shape).astype(dtype)
+ w_np = np.random.randint(low=-128, high=128, size=w_shape).astype(dtype)
+ b_np = np.random.uniform(size=bias_shape).astype(dtype)
+ dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
+ c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(dtype)
- # Conv2d NCHW int8 schedule testing. Internally, it uses NCHWc schedule. So, just
- # performing basic testing - one test for all different scenarios - batch, dilation etc..
- verify_conv2d_nchw_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1)
- verify_conv2d_nchw_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_relu=True)
- verify_conv2d_nchw_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, dilation=2)
- verify_conv2d_nchw_int8(in_dtype, 9, 64, 56, 64, 3, 1, 1)
- verify_conv2d_nchw_int8(in_dtype, 4, 4, 4, 4, 4, 4, 4)
- verify_conv2d_nchw_int8(in_dtype, 1, 32, 149, 32, 3, 1, 0)
- verify_conv2d_nchw_int8(in_dtype, 7, 32, 149, 32, 3, 1, 0)
- verify_conv2d_nchw_int8(in_dtype, 1, 32, 35, 64, 7, 2, (0, 0, 1, 1))
- verify_conv2d_nchw_int8(in_dtype, 1, 32, 35, 64, 7, 2, (0, 0, 2, 2))
+ if add_bias:
+ b_np = np.random.uniform(size=bias_shape).astype(dtype)
+ c_np += b_np
+ if add_relu:
+ c_np = np.maximum(c_np, 0)
+ return a_np, w_np, b_np, c_np
-def test_conv2d_nhwc():
- with Int8Fallback():
- # Subset of inception v3 expanded (dilation > 1, batch > 1, 'VALID' padding)
- verify_conv2d_NHWC_gemm_int8(1, 3, 299, 32, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 32, 149, 32, 3, 1, "SAME", dilation=2)
- verify_conv2d_NHWC_gemm_int8(4, 32, 147, 64, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 64, 73, 80, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 80, 73, 192, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 35, 48, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 35, 64, 1, 1, "VALID")
- verify_conv2d_NHWC_gemm_int8(1, 192, 35, 32, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 48, 35, 64, 5, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 96, 35, 96, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 256, 35, 48, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 256, 35, 64, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 288, 35, 64, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 288, 35, 48, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 96, 35, 96, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 128, 17, 192, 7, 1, "SAME", dilation=2)
- verify_conv2d_NHWC_gemm_int8(1, 160, 17, 160, 7, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 160, 17, 192, 1, 1, "VALID")
- verify_conv2d_NHWC_gemm_int8(1, 192, 17, 192, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 768, 5, 128, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 17, 320, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 17, 192, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 192, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 384, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 320, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 448, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 384, 8, 384, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 384, 8, 384, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 448, 8, 384, 3, 1, "VALID")
- verify_conv2d_NHWC_gemm_int8(1, 2048, 8, 320, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 2048, 8, 448, 1, 1, "SAME", add_bias=True, add_relu=True)
- verify_conv2d_NHWC_gemm_int8(1, 2048, 8, 192, 1, 1, "SAME", add_bias=True)
-
- # Let's also verify that it compiles fine on AArch64 targets
- compile_conv2d_NHWC_gemm_int8_arm(1, 3, 299, 32, 3, 2, "SAME")
+ a_np, w_np, b_np, c_np = get_ref_data()
+
+ def verify_workload_padding():
+ _, _, out_height, out_width = get_const_tuple(c_np.shape)
+ wkl = _get_workload(A, W, (stride, stride), padding, dilation, dtype)
+
+ # for testing functionality,
+ # we choose arbitrary int32_lanes and num_int8_elements can divide the channel,
+ # regardless of the performance.
+ int32_lanes, num_int8_elements = num_filter, in_channel
+
+ # check if tile_ow candidates are the factors of the right output weight.
+ cfg = autotvm.get_config()
+ fallback_schedule_cpu_common_int8(cfg, wkl, int32_lanes, num_int8_elements)
+ ow_tile = np.prod(cfg["tile_ow"].size)
+
+ tvm.testing.assert_allclose(ow_tile, out_width)
+
+ def check_target(target):
+ dev = tvm.device(target, 0)
+ if not tvm.testing.device_enabled(target):
+ print("Skip because %s is not enabled" % target)
+ return
+ if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
+ print("Skip because int8 intrinsics are not available")
+ return
+
+ print("Running on target: %s" % target)
+ with tvm.target.Target(target):
+ C = topi.cuda.conv2d_nchw_int8(
+ A, W, (stride, stride), padding, (dilation, dilation), dtype
+ )
+ if add_bias:
+ C = topi.add(C, bias)
+ if add_relu:
+ C = topi.nn.relu(C)
+ s = topi.cuda.schedule_conv2d_nchw_int8([C])
+
+ a = tvm.nd.array(a_np, dev)
+ w = tvm.nd.array(w_np, dev)
+ b = tvm.nd.array(b_np, dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
+ if add_bias:
+ func = tvm.build(
+ s,
+ [A, W, bias, C],
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
+ func(a, w, b, c)
+ else:
+ func = tvm.build(
+ s,
+ [A, W, C],
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
+ func(a, w, c)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
+
+ verify_workload_padding()
+
+ for target in ["cuda"]:
+ check_target(target)
Review Comment:
```suggestion
check_target("cuda")
```
##########
tests/python/topi/python/test_topi_conv2d_int8.py:
##########
@@ -298,378 +175,462 @@ def get_ref_data():
a_np, w_np, b_np, c_np = get_ref_data()
- with tvm.target.Target(target):
- C = compute(
- A,
- W,
- (stride, stride),
- padding,
- (dilation, dilation),
- "NCHW",
- "NCHW",
- out_dtype,
- )
+ dev = tvm.device(target, 0)
+ with tvm.target.Target(target) as tvm_target:
+ C = compute(A, W, (stride, stride), padding, (dilation, dilation), dtype)
if add_bias:
C = topi.add(C, bias)
if add_relu:
C = topi.nn.relu(C)
s = schedule([C])
- a = tvm.nd.array(a_np.astype(dtype), dev)
- w = tvm.nd.array(w_np.astype(dtype), dev)
- b = tvm.nd.array(b_np.astype(out_dtype), dev)
- c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
-
- if add_bias:
- compile_args = [A, W, bias, C]
- run_args = [a, w, b, c]
- else:
- compile_args = [A, W, C]
- run_args = [a, w, c]
-
- func = tvm.build(
- s,
- compile_args,
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
+ a = tvm.nd.array(a_np, dev)
+ w = tvm.nd.array(w_np, dev)
+ b = tvm.nd.array(b_np, dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- if build_only:
- return
+ build_inputs = [A, W, bias, C] if add_bias else [A, W, C]
+ inference_inputs = (a, w, b, c) if add_bias else (a, w, c)
+
+ func = tvm.build(
+ s,
+ build_inputs,
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
- print("Running on target: %s" % target)
+ build_only = tvm_target.features.is_aarch64 and (platform.machine() != "aarch64")
- func(*run_args)
+ if not build_only:
+ print("Running on target: %s" % target)
+ func(*inference_inputs)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- targets = [
- (
- "cuda",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- # Disable on CI since it does not support spirv int8 dot product
- # (
- # "vulkan -from_device=0",
- # lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- # topi.cuda.schedule_conv2d_NCHWc_int8,
- # 4,
- # False,
- # ),
- ]
-
- build_only_aarch64 = platform.machine() != "aarch64"
-
- targets.append(
+@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
+@pytest.mark.parametrize(
+ "params",
+ [
+ # ResNet18 workloads where channels in / out are multiple of oc_block_factor
+ (1, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 56, 64, 1, 1, 0, 1, False, False),
+ (1, 64, 56, 128, 3, 2, 1, 1, False, False),
+ (1, 64, 56, 128, 1, 2, 0, 1, False, False),
+ (1, 128, 28, 128, 3, 1, 1, 1, False, False),
+ (1, 128, 28, 256, 3, 2, 1, 1, False, False),
+ (1, 128, 28, 256, 1, 2, 0, 1, False, False),
+ (1, 256, 14, 256, 3, 1, 1, 1, False, False),
+ (1, 256, 14, 512, 3, 2, 1, 1, False, False),
+ (1, 256, 14, 512, 1, 2, 0, 1, False, False),
+ (1, 512, 7, 512, 3, 1, 1, 1, False, False),
+ # bias, relu
+ (1, 64, 56, 64, 3, 1, 1, 1, False, True),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, False),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, True),
+ # dilation = 2
+ (1, 64, 56, 64, 3, 1, 1, 2, False, False),
+ # batch size
+ (4, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (9, 64, 56, 64, 3, 1, 1, 1, False, False),
+ # weird workloads
+ (4, 4, 4, 8, 4, 4, 4, 1, False, False),
+ # inception v3 workloads where channels in / out are multiple of oc_block_factor
+ (1, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (1, 32, 147, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 73, 80, 1, 1, 0, 1, False, False),
+ (1, 80, 73, 192, 3, 1, 0, 1, False, False),
+ (1, 192, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 192, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 48, 35, 64, 5, 1, 2, 1, False, False),
+ (1, 64, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 96, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 192, 35, 32, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 384, 3, 2, 0, 1, False, False),
+ (1, 96, 35, 96, 3, 2, 0, 1, False, False),
+ (1, 768, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 128, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 160, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 192, 17, 320, 3, 2, 0, 1, False, False),
+ (1, 192, 17, 192, 3, 2, 0, 1, False, False),
+ (1, 1280, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 1280, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 448, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 1024, 19, 88, 3, 1, 1, 1, False, False),
+ # batch > 1
+ (7, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (8, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (32, 32, 149, 32, 3, 1, 0, 1, False, False),
+ # Asymmetric padding
+ (1, 32, 35, 64, 7, 2, (0, 0, 1, 1), 1, False, False),
+ (1, 64, 8, 128, 3, 1, (3, 3, 2, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, (1, 2, 2, 1), 1, False, False),
+ (1, 64, 17, 192, 1, 1, (1, 2), 1, False, False),
+ (1, 64, 8, 64, 3, 1, (3, 1), 1, False, False),
+ (1, 128, 8, 384, 3, 1, (0, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, "VALID", 1, False, False),
+ (1, 392, 8, 64, 3, 1, "VALID", 1, False, False),
+ (1, 512, 19, 64, 1, 1, "SAME", 1, False, False),
+ (1, 64, 16, 32, 2, 1, "SAME", 1, False, False),
+ (1, 64, 8, 64, 3, 1, (1, 2, 2, 1), 1, False, True),
+ (1, 64, 8, 64, 5, 2, (1, 3), 1, True, False),
+ (1, 64, 56, 64, 3, 1, "VALID", 1, True, True),
+ (1, 64, 56, 64, 24, 1, "SAME", 1, True, True),
+ ],
+)
+def test_conv2d_NCHWc_int8(in_dtype, params):
+ with Int8Fallback():
(
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon,+v8.2a,+dotprod",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding,
+ dilation,
+ add_bias,
+ add_relu,
+ ) = params
+ pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
+ padding_sum = pad_top + pad_left + pad_bottom + pad_right
+ print(
+ "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
+ % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
)
- )
-
- if in_dtype == "int8":
- targets += [
- (
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
- ),
- (
- "rocm -mattr=+dotprod",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- ]
-
- for target, compute, schedule, oc_block_factor, build_only in targets:
- check_target(target, compute, schedule, oc_block_factor, build_only)
-
-
-def verify_conv2d_nchw_int8(
- in_dtype,
- batch,
- in_channel,
- in_size,
- num_filter,
- kernel,
- stride,
- padding,
- dilation=1,
- add_bias=False,
- add_relu=False,
-):
- pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
- padding_sum = pad_top + pad_left + pad_bottom + pad_right
- print(
- "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
- )
-
- in_height = in_width = in_size
-
- A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
- W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
- bias = te.placeholder((num_filter, 1, 1), name="bias", dtype=in_dtype)
-
- a_shape = get_const_tuple(A.shape)
- w_shape = get_const_tuple(W.shape)
- bias_shape = get_const_tuple(bias.shape)
- dtype = A.dtype
-
- @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
- def get_ref_data():
- a_np = np.random.randint(low=-128, high=127, size=a_shape).astype(dtype)
- w_np = np.random.randint(low=-128, high=128, size=w_shape).astype(dtype)
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
- c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(dtype)
-
- if add_bias:
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- c_np += b_np
- if add_relu:
- c_np = np.maximum(c_np, 0)
-
- return a_np, w_np, b_np, c_np
-
- a_np, w_np, b_np, c_np = get_ref_data()
-
- def verify_workload_padding():
- _, _, out_height, out_width = get_const_tuple(c_np.shape)
- wkl = _get_workload(A, W, (stride, stride), padding, dilation, dtype)
-
- # for testing functionality,
- # we choose arbitrary int32_lanes and num_int8_elements can divide the channel,
- # regardless of the performance.
- int32_lanes, num_int8_elements = num_filter, in_channel
- # check if tile_ow candidates are the factors of the right output weight.
- cfg = autotvm.get_config()
- fallback_schedule_cpu_common_int8(cfg, wkl, int32_lanes, num_int8_elements)
- ow_tile = np.prod(cfg["tile_ow"].size)
-
- tvm.testing.assert_allclose(ow_tile, out_width)
+ in_height = in_width = in_size
+
+ A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
+ W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
+
+ a_shape = get_const_tuple(A.shape)
+ w_shape = get_const_tuple(W.shape)
+ dtype = A.dtype
+ out_dtype = "int32" if in_dtype == "int8" else "uint32"
+ lo = -128 if in_dtype == "int8" else 0
+ hi = 127 if in_dtype == "int8" else 255
+
+ def check_target(target, compute, schedule, oc_block_factor, build_only):
+ dev = tvm.device(target, 0)
+ if not tvm.testing.device_enabled(target):
+ print("Skip because %s is not enabled" % target)
+ return
+ if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
+ print("Skip because int8 intrinsics are not available")
+ return
Review Comment:
```suggestion
pytest.skip(reason="Skip because int8 intrinsics are not available")
```
##########
tests/python/topi/python/test_topi_conv2d_int8.py:
##########
@@ -298,378 +175,462 @@ def get_ref_data():
a_np, w_np, b_np, c_np = get_ref_data()
- with tvm.target.Target(target):
- C = compute(
- A,
- W,
- (stride, stride),
- padding,
- (dilation, dilation),
- "NCHW",
- "NCHW",
- out_dtype,
- )
+ dev = tvm.device(target, 0)
+ with tvm.target.Target(target) as tvm_target:
+ C = compute(A, W, (stride, stride), padding, (dilation, dilation), dtype)
if add_bias:
C = topi.add(C, bias)
if add_relu:
C = topi.nn.relu(C)
s = schedule([C])
- a = tvm.nd.array(a_np.astype(dtype), dev)
- w = tvm.nd.array(w_np.astype(dtype), dev)
- b = tvm.nd.array(b_np.astype(out_dtype), dev)
- c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
-
- if add_bias:
- compile_args = [A, W, bias, C]
- run_args = [a, w, b, c]
- else:
- compile_args = [A, W, C]
- run_args = [a, w, c]
-
- func = tvm.build(
- s,
- compile_args,
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
+ a = tvm.nd.array(a_np, dev)
+ w = tvm.nd.array(w_np, dev)
+ b = tvm.nd.array(b_np, dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- if build_only:
- return
+ build_inputs = [A, W, bias, C] if add_bias else [A, W, C]
+ inference_inputs = (a, w, b, c) if add_bias else (a, w, c)
+
+ func = tvm.build(
+ s,
+ build_inputs,
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
- print("Running on target: %s" % target)
+ build_only = tvm_target.features.is_aarch64 and (platform.machine() != "aarch64")
- func(*run_args)
+ if not build_only:
+ print("Running on target: %s" % target)
+ func(*inference_inputs)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- targets = [
- (
- "cuda",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- # Disable on CI since it does not support spirv int8 dot product
- # (
- # "vulkan -from_device=0",
- # lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- # topi.cuda.schedule_conv2d_NCHWc_int8,
- # 4,
- # False,
- # ),
- ]
-
- build_only_aarch64 = platform.machine() != "aarch64"
-
- targets.append(
+@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
+@pytest.mark.parametrize(
+ "params",
+ [
+ # ResNet18 workloads where channels in / out are multiple of oc_block_factor
+ (1, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 56, 64, 1, 1, 0, 1, False, False),
+ (1, 64, 56, 128, 3, 2, 1, 1, False, False),
+ (1, 64, 56, 128, 1, 2, 0, 1, False, False),
+ (1, 128, 28, 128, 3, 1, 1, 1, False, False),
+ (1, 128, 28, 256, 3, 2, 1, 1, False, False),
+ (1, 128, 28, 256, 1, 2, 0, 1, False, False),
+ (1, 256, 14, 256, 3, 1, 1, 1, False, False),
+ (1, 256, 14, 512, 3, 2, 1, 1, False, False),
+ (1, 256, 14, 512, 1, 2, 0, 1, False, False),
+ (1, 512, 7, 512, 3, 1, 1, 1, False, False),
+ # bias, relu
+ (1, 64, 56, 64, 3, 1, 1, 1, False, True),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, False),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, True),
+ # dilation = 2
+ (1, 64, 56, 64, 3, 1, 1, 2, False, False),
+ # batch size
+ (4, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (9, 64, 56, 64, 3, 1, 1, 1, False, False),
+ # weird workloads
+ (4, 4, 4, 8, 4, 4, 4, 1, False, False),
+ # inception v3 workloads where channels in / out are multiple of oc_block_factor
+ (1, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (1, 32, 147, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 73, 80, 1, 1, 0, 1, False, False),
+ (1, 80, 73, 192, 3, 1, 0, 1, False, False),
+ (1, 192, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 192, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 48, 35, 64, 5, 1, 2, 1, False, False),
+ (1, 64, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 96, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 192, 35, 32, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 384, 3, 2, 0, 1, False, False),
+ (1, 96, 35, 96, 3, 2, 0, 1, False, False),
+ (1, 768, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 128, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 160, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 192, 17, 320, 3, 2, 0, 1, False, False),
+ (1, 192, 17, 192, 3, 2, 0, 1, False, False),
+ (1, 1280, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 1280, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 448, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 1024, 19, 88, 3, 1, 1, 1, False, False),
+ # batch > 1
+ (7, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (8, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (32, 32, 149, 32, 3, 1, 0, 1, False, False),
+ # Asymmetric padding
+ (1, 32, 35, 64, 7, 2, (0, 0, 1, 1), 1, False, False),
+ (1, 64, 8, 128, 3, 1, (3, 3, 2, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, (1, 2, 2, 1), 1, False, False),
+ (1, 64, 17, 192, 1, 1, (1, 2), 1, False, False),
+ (1, 64, 8, 64, 3, 1, (3, 1), 1, False, False),
+ (1, 128, 8, 384, 3, 1, (0, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, "VALID", 1, False, False),
+ (1, 392, 8, 64, 3, 1, "VALID", 1, False, False),
+ (1, 512, 19, 64, 1, 1, "SAME", 1, False, False),
+ (1, 64, 16, 32, 2, 1, "SAME", 1, False, False),
+ (1, 64, 8, 64, 3, 1, (1, 2, 2, 1), 1, False, True),
+ (1, 64, 8, 64, 5, 2, (1, 3), 1, True, False),
+ (1, 64, 56, 64, 3, 1, "VALID", 1, True, True),
+ (1, 64, 56, 64, 24, 1, "SAME", 1, True, True),
+ ],
+)
+def test_conv2d_NCHWc_int8(in_dtype, params):
+ with Int8Fallback():
(
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon,+v8.2a,+dotprod",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding,
+ dilation,
+ add_bias,
+ add_relu,
+ ) = params
+ pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
+ padding_sum = pad_top + pad_left + pad_bottom + pad_right
+ print(
+ "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
+ % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
)
- )
-
- if in_dtype == "int8":
- targets += [
- (
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
- ),
- (
- "rocm -mattr=+dotprod",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- ]
-
- for target, compute, schedule, oc_block_factor, build_only in targets:
- check_target(target, compute, schedule, oc_block_factor, build_only)
-
-
-def verify_conv2d_nchw_int8(
- in_dtype,
- batch,
- in_channel,
- in_size,
- num_filter,
- kernel,
- stride,
- padding,
- dilation=1,
- add_bias=False,
- add_relu=False,
-):
- pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
- padding_sum = pad_top + pad_left + pad_bottom + pad_right
- print(
- "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
- )
-
- in_height = in_width = in_size
-
- A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
- W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
- bias = te.placeholder((num_filter, 1, 1), name="bias", dtype=in_dtype)
-
- a_shape = get_const_tuple(A.shape)
- w_shape = get_const_tuple(W.shape)
- bias_shape = get_const_tuple(bias.shape)
- dtype = A.dtype
-
- @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
- def get_ref_data():
- a_np = np.random.randint(low=-128, high=127, size=a_shape).astype(dtype)
- w_np = np.random.randint(low=-128, high=128, size=w_shape).astype(dtype)
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
- c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(dtype)
-
- if add_bias:
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- c_np += b_np
- if add_relu:
- c_np = np.maximum(c_np, 0)
-
- return a_np, w_np, b_np, c_np
-
- a_np, w_np, b_np, c_np = get_ref_data()
-
- def verify_workload_padding():
- _, _, out_height, out_width = get_const_tuple(c_np.shape)
- wkl = _get_workload(A, W, (stride, stride), padding, dilation, dtype)
-
- # for testing functionality,
- # we choose arbitrary int32_lanes and num_int8_elements can divide the channel,
- # regardless of the performance.
- int32_lanes, num_int8_elements = num_filter, in_channel
- # check if tile_ow candidates are the factors of the right output weight.
- cfg = autotvm.get_config()
- fallback_schedule_cpu_common_int8(cfg, wkl, int32_lanes, num_int8_elements)
- ow_tile = np.prod(cfg["tile_ow"].size)
-
- tvm.testing.assert_allclose(ow_tile, out_width)
+ in_height = in_width = in_size
+
+ A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
+ W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
+
+ a_shape = get_const_tuple(A.shape)
+ w_shape = get_const_tuple(W.shape)
+ dtype = A.dtype
+ out_dtype = "int32" if in_dtype == "int8" else "uint32"
+ lo = -128 if in_dtype == "int8" else 0
+ hi = 127 if in_dtype == "int8" else 255
Review Comment:
I think we can use https://github.com/apache/tvm/blob/main/python/tvm/testing/aot.py#L904 here, we can move it out of `aot.py` later. I think we can also be a bit clearer with our variable naming
```suggestion
input_min, input_max = get_dtype_range(in_dtype)
```
##########
tests/python/topi/python/test_topi_conv2d_int8.py:
##########
@@ -298,378 +175,462 @@ def get_ref_data():
a_np, w_np, b_np, c_np = get_ref_data()
- with tvm.target.Target(target):
- C = compute(
- A,
- W,
- (stride, stride),
- padding,
- (dilation, dilation),
- "NCHW",
- "NCHW",
- out_dtype,
- )
+ dev = tvm.device(target, 0)
+ with tvm.target.Target(target) as tvm_target:
+ C = compute(A, W, (stride, stride), padding, (dilation, dilation), dtype)
if add_bias:
C = topi.add(C, bias)
if add_relu:
C = topi.nn.relu(C)
s = schedule([C])
- a = tvm.nd.array(a_np.astype(dtype), dev)
- w = tvm.nd.array(w_np.astype(dtype), dev)
- b = tvm.nd.array(b_np.astype(out_dtype), dev)
- c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
-
- if add_bias:
- compile_args = [A, W, bias, C]
- run_args = [a, w, b, c]
- else:
- compile_args = [A, W, C]
- run_args = [a, w, c]
-
- func = tvm.build(
- s,
- compile_args,
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
+ a = tvm.nd.array(a_np, dev)
+ w = tvm.nd.array(w_np, dev)
+ b = tvm.nd.array(b_np, dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- if build_only:
- return
+ build_inputs = [A, W, bias, C] if add_bias else [A, W, C]
+ inference_inputs = (a, w, b, c) if add_bias else (a, w, c)
+
+ func = tvm.build(
+ s,
+ build_inputs,
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
- print("Running on target: %s" % target)
+ build_only = tvm_target.features.is_aarch64 and (platform.machine() != "aarch64")
- func(*run_args)
+ if not build_only:
+ print("Running on target: %s" % target)
+ func(*inference_inputs)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- targets = [
- (
- "cuda",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- # Disable on CI since it does not support spirv int8 dot product
- # (
- # "vulkan -from_device=0",
- # lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- # topi.cuda.schedule_conv2d_NCHWc_int8,
- # 4,
- # False,
- # ),
- ]
-
- build_only_aarch64 = platform.machine() != "aarch64"
-
- targets.append(
+@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
+@pytest.mark.parametrize(
+ "params",
+ [
+ # ResNet18 workloads where channels in / out are multiple of oc_block_factor
+ (1, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 56, 64, 1, 1, 0, 1, False, False),
+ (1, 64, 56, 128, 3, 2, 1, 1, False, False),
+ (1, 64, 56, 128, 1, 2, 0, 1, False, False),
+ (1, 128, 28, 128, 3, 1, 1, 1, False, False),
+ (1, 128, 28, 256, 3, 2, 1, 1, False, False),
+ (1, 128, 28, 256, 1, 2, 0, 1, False, False),
+ (1, 256, 14, 256, 3, 1, 1, 1, False, False),
+ (1, 256, 14, 512, 3, 2, 1, 1, False, False),
+ (1, 256, 14, 512, 1, 2, 0, 1, False, False),
+ (1, 512, 7, 512, 3, 1, 1, 1, False, False),
+ # bias, relu
+ (1, 64, 56, 64, 3, 1, 1, 1, False, True),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, False),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, True),
+ # dilation = 2
+ (1, 64, 56, 64, 3, 1, 1, 2, False, False),
+ # batch size
+ (4, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (9, 64, 56, 64, 3, 1, 1, 1, False, False),
+ # weird workloads
+ (4, 4, 4, 8, 4, 4, 4, 1, False, False),
+ # inception v3 workloads where channels in / out are multiple of oc_block_factor
+ (1, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (1, 32, 147, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 73, 80, 1, 1, 0, 1, False, False),
+ (1, 80, 73, 192, 3, 1, 0, 1, False, False),
+ (1, 192, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 192, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 48, 35, 64, 5, 1, 2, 1, False, False),
+ (1, 64, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 96, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 192, 35, 32, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 384, 3, 2, 0, 1, False, False),
+ (1, 96, 35, 96, 3, 2, 0, 1, False, False),
+ (1, 768, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 128, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 160, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 192, 17, 320, 3, 2, 0, 1, False, False),
+ (1, 192, 17, 192, 3, 2, 0, 1, False, False),
+ (1, 1280, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 1280, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 448, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 1024, 19, 88, 3, 1, 1, 1, False, False),
+ # batch > 1
+ (7, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (8, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (32, 32, 149, 32, 3, 1, 0, 1, False, False),
+ # Asymmetric padding
+ (1, 32, 35, 64, 7, 2, (0, 0, 1, 1), 1, False, False),
+ (1, 64, 8, 128, 3, 1, (3, 3, 2, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, (1, 2, 2, 1), 1, False, False),
+ (1, 64, 17, 192, 1, 1, (1, 2), 1, False, False),
+ (1, 64, 8, 64, 3, 1, (3, 1), 1, False, False),
+ (1, 128, 8, 384, 3, 1, (0, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, "VALID", 1, False, False),
+ (1, 392, 8, 64, 3, 1, "VALID", 1, False, False),
+ (1, 512, 19, 64, 1, 1, "SAME", 1, False, False),
+ (1, 64, 16, 32, 2, 1, "SAME", 1, False, False),
+ (1, 64, 8, 64, 3, 1, (1, 2, 2, 1), 1, False, True),
+ (1, 64, 8, 64, 5, 2, (1, 3), 1, True, False),
+ (1, 64, 56, 64, 3, 1, "VALID", 1, True, True),
+ (1, 64, 56, 64, 24, 1, "SAME", 1, True, True),
+ ],
+)
+def test_conv2d_NCHWc_int8(in_dtype, params):
+ with Int8Fallback():
(
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon,+v8.2a,+dotprod",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding,
+ dilation,
+ add_bias,
+ add_relu,
+ ) = params
+ pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
+ padding_sum = pad_top + pad_left + pad_bottom + pad_right
+ print(
+ "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
+ % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
)
- )
-
- if in_dtype == "int8":
- targets += [
- (
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
- ),
- (
- "rocm -mattr=+dotprod",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- ]
-
- for target, compute, schedule, oc_block_factor, build_only in targets:
- check_target(target, compute, schedule, oc_block_factor, build_only)
-
-
-def verify_conv2d_nchw_int8(
- in_dtype,
- batch,
- in_channel,
- in_size,
- num_filter,
- kernel,
- stride,
- padding,
- dilation=1,
- add_bias=False,
- add_relu=False,
-):
- pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
- padding_sum = pad_top + pad_left + pad_bottom + pad_right
- print(
- "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
- )
-
- in_height = in_width = in_size
-
- A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
- W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
- bias = te.placeholder((num_filter, 1, 1), name="bias", dtype=in_dtype)
-
- a_shape = get_const_tuple(A.shape)
- w_shape = get_const_tuple(W.shape)
- bias_shape = get_const_tuple(bias.shape)
- dtype = A.dtype
-
- @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
- def get_ref_data():
- a_np = np.random.randint(low=-128, high=127, size=a_shape).astype(dtype)
- w_np = np.random.randint(low=-128, high=128, size=w_shape).astype(dtype)
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
- c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(dtype)
-
- if add_bias:
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- c_np += b_np
- if add_relu:
- c_np = np.maximum(c_np, 0)
-
- return a_np, w_np, b_np, c_np
-
- a_np, w_np, b_np, c_np = get_ref_data()
-
- def verify_workload_padding():
- _, _, out_height, out_width = get_const_tuple(c_np.shape)
- wkl = _get_workload(A, W, (stride, stride), padding, dilation, dtype)
-
- # for testing functionality,
- # we choose arbitrary int32_lanes and num_int8_elements can divide the channel,
- # regardless of the performance.
- int32_lanes, num_int8_elements = num_filter, in_channel
- # check if tile_ow candidates are the factors of the right output weight.
- cfg = autotvm.get_config()
- fallback_schedule_cpu_common_int8(cfg, wkl, int32_lanes, num_int8_elements)
- ow_tile = np.prod(cfg["tile_ow"].size)
-
- tvm.testing.assert_allclose(ow_tile, out_width)
+ in_height = in_width = in_size
+
+ A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
+ W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
+
+ a_shape = get_const_tuple(A.shape)
+ w_shape = get_const_tuple(W.shape)
+ dtype = A.dtype
+ out_dtype = "int32" if in_dtype == "int8" else "uint32"
+ lo = -128 if in_dtype == "int8" else 0
+ hi = 127 if in_dtype == "int8" else 255
+
+ def check_target(target, compute, schedule, oc_block_factor, build_only):
+ dev = tvm.device(target, 0)
+ if not tvm.testing.device_enabled(target):
+ print("Skip because %s is not enabled" % target)
+ return
+ if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
+ print("Skip because int8 intrinsics are not available")
+ return
+
+ bias = te.placeholder(
+ (num_filter // oc_block_factor, 1, 1, oc_block_factor), name="bias", dtype=out_dtype
+ )
+ bias_shape = get_const_tuple(bias.shape)
- def check_target(target):
- dev = tvm.device(target, 0)
- if not tvm.testing.device_enabled(target):
- print("Skip because %s is not enabled" % target)
- return
- if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
- print("Skip because int8 intrinsics are not available")
- return
+ @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
+ def get_ref_data():
+ a_np = np.random.randint(low=lo, high=hi, size=a_shape).astype(out_dtype)
+ w_np = np.random.randint(low=lo, high=hi, size=w_shape).astype(out_dtype)
+ b_np = np.random.uniform(size=bias_shape).astype(out_dtype)
+ dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
+ c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(
+ out_dtype
+ )
+
+ # convert to NCHWc
+ _, _, out_height, out_width = c_np.shape
+ c_np = c_np.reshape(
+ (batch, num_filter // oc_block_factor, oc_block_factor, out_height, out_width)
+ ).transpose(0, 1, 3, 4, 2)
+
+ if add_bias:
+ b_np = np.random.uniform(size=bias_shape).astype(out_dtype)
+ c_np += b_np
+ if add_relu:
+ c_np = np.maximum(c_np, 0)
+
+ return a_np, w_np, b_np, c_np
+
+ a_np, w_np, b_np, c_np = get_ref_data()
+
+ with tvm.target.Target(target):
+ C = compute(
+ A,
+ W,
+ (stride, stride),
+ padding,
+ (dilation, dilation),
+ "NCHW",
+ "NCHW",
+ out_dtype,
+ )
+ if add_bias:
+ C = topi.add(C, bias)
+ if add_relu:
+ C = topi.nn.relu(C)
+ s = schedule([C])
+
+ a = tvm.nd.array(a_np.astype(dtype), dev)
+ w = tvm.nd.array(w_np.astype(dtype), dev)
+ b = tvm.nd.array(b_np.astype(out_dtype), dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- print("Running on target: %s" % target)
- with tvm.target.Target(target):
- C = topi.cuda.conv2d_nchw_int8(
- A, W, (stride, stride), padding, (dilation, dilation), dtype
- )
if add_bias:
- C = topi.add(C, bias)
- if add_relu:
- C = topi.nn.relu(C)
- s = topi.cuda.schedule_conv2d_nchw_int8([C])
-
- a = tvm.nd.array(a_np, dev)
- w = tvm.nd.array(w_np, dev)
- b = tvm.nd.array(b_np, dev)
- c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- if add_bias:
- tvm.build(
- s,
- [A, W, bias, C],
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
- func = tvm.build(
- s,
- [A, W, bias, C],
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
- func(a, w, b, c)
- else:
+ compile_args = [A, W, bias, C]
+ run_args = [a, w, b, c]
+ else:
+ compile_args = [A, W, C]
+ run_args = [a, w, c]
+
func = tvm.build(
s,
- [A, W, C],
+ compile_args,
target,
name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
% (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
)
- func(a, w, c)
- tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- verify_workload_padding()
+ if build_only:
+ return
- for target in ["cuda"]:
- check_target(target)
+ print("Running on target: %s" % target)
+ func(*run_args)
-@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
-def test_conv2d_nchw(in_dtype):
- with Int8Fallback():
- # ResNet18 workloads where channels in / out are multiple of oc_block_factor
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 128, 3, 2, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 128, 1, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 28, 128, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 28, 256, 3, 2, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 28, 256, 1, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 14, 256, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 14, 512, 3, 2, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 14, 512, 1, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 512, 7, 512, 3, 1, 1)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- # bias, relu
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_relu=True)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_bias=True)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_bias=True, add_relu=True)
+ targets = [
+ (
+ "cuda",
+ lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
+ topi.cuda.schedule_conv2d_NCHWc_int8,
+ 4,
+ False,
+ ),
+ # Disable on CI since it does not support spirv int8 dot product
+ # (
+ # "vulkan -from_device=0",
+ # lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
+ # topi.cuda.schedule_conv2d_NCHWc_int8,
+ # 4,
+ # False,
+ # ),
+ ]
- # dilation = 2
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, dilation=2)
+ build_only_aarch64 = platform.machine() != "aarch64"
- # batch size
- verify_conv2d_NCHWc_int8(in_dtype, 4, 64, 56, 64, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 9, 64, 56, 64, 3, 1, 1)
+ targets.append(
+ (
+ "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon,+v8.2a,+dotprod",
+ topi.arm_cpu.conv2d_NCHWc_int8,
+ topi.arm_cpu.schedule_conv2d_NCHWc_int8,
+ 8,
+ build_only_aarch64,
+ )
+ )
- # weird workloads
- verify_conv2d_NCHWc_int8(in_dtype, 4, 4, 4, 8, 4, 4, 4)
+ if in_dtype == "int8":
+ targets += [
+ (
+ "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon",
+ topi.arm_cpu.conv2d_NCHWc_int8,
+ topi.arm_cpu.schedule_conv2d_NCHWc_int8,
+ 8,
+ build_only_aarch64,
+ ),
+ (
+ "rocm -mattr=+dotprod",
+ lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(
+ a, w, s, p, d, l, o
+ ),
+ topi.cuda.schedule_conv2d_NCHWc_int8,
+ 4,
+ False,
+ ),
+ ]
+
+ for target, compute, schedule, oc_block_factor, build_only in targets:
+ check_target(target, compute, schedule, oc_block_factor, build_only)
+
+
+# Conv2d NCHW int8 schedule testing. Internally, it uses NCHWc schedule. So, just
+# performing basic testing - one test for all different scenarios - batch, dilation etc..
+@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
+@pytest.mark.parametrize(
+ "params",
+ [
+ (1, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 56, 64, 3, 1, 1, 1, False, True),
+ (1, 64, 56, 64, 3, 1, 1, 2, False, False),
+ (9, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (4, 4, 4, 4, 4, 4, 4, 1, False, False),
+ (1, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (7, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (1, 32, 35, 64, 7, 2, (0, 0, 1, 1), 1, False, False),
+ (1, 32, 35, 64, 7, 2, (0, 0, 2, 2), 1, False, False),
+ ],
+)
+def test_conv2d_nchw_int8(in_dtype, params):
+ with Int8Fallback():
+ (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding,
+ dilation,
+ add_bias,
+ add_relu,
+ ) = params
+ pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
+ padding_sum = pad_top + pad_left + pad_bottom + pad_right
+ print(
+ "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
+ % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
+ )
- # inception v3 workloads where channels in / out are multiple of oc_block_factor
- verify_conv2d_NCHWc_int8(in_dtype, 1, 32, 149, 32, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 32, 147, 64, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 73, 80, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 80, 73, 192, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 35, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 35, 48, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 48, 35, 64, 5, 1, 2)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 35, 96, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 96, 35, 96, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 35, 32, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 35, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 35, 48, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 288, 35, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 288, 35, 48, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 288, 35, 384, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 96, 35, 96, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 768, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 768, 17, 128, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 128, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 192, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 128, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 768, 17, 160, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 160, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 192, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 160, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 192, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 320, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 192, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 320, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 384, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 384, 8, 384, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 384, 8, 384, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 448, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 448, 8, 384, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 320, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 384, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 448, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1024, 19, 88, 3, 1, 1)
+ in_height = in_width = in_size
- # batch > 1
- verify_conv2d_NCHWc_int8(in_dtype, 7, 32, 149, 32, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 8, 32, 149, 32, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 32, 32, 149, 32, 3, 1, 0)
+ A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
+ W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
+ bias = te.placeholder((num_filter, 1, 1), name="bias", dtype=in_dtype)
- # Asymmetric padding
- verify_conv2d_NCHWc_int8(in_dtype, 1, 32, 35, 64, 7, 2, (0, 0, 1, 1))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 128, 3, 1, (3, 3, 2, 2))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 1, 1, (1, 2, 2, 1))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 17, 192, 1, 1, (1, 2))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 3, 1, (3, 1))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 8, 384, 3, 1, (0, 2))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 1, 1, "VALID")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 392, 8, 64, 3, 1, "VALID")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 512, 19, 64, 1, 1, "SAME")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 16, 32, 2, 1, "SAME")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 3, 1, (1, 2, 2, 1), add_relu=True)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 5, 2, (1, 3), add_bias=True)
- verify_conv2d_NCHWc_int8(
- in_dtype, 1, 64, 56, 64, 3, 1, "VALID", add_bias=True, add_relu=True
- )
- verify_conv2d_NCHWc_int8(
- in_dtype, 1, 64, 56, 64, 24, 1, "SAME", add_bias=True, add_relu=True
- )
+ a_shape = get_const_tuple(A.shape)
+ w_shape = get_const_tuple(W.shape)
+ bias_shape = get_const_tuple(bias.shape)
+ dtype = A.dtype
+
+ @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
+ def get_ref_data():
+ a_np = np.random.randint(low=-128, high=127, size=a_shape).astype(dtype)
+ w_np = np.random.randint(low=-128, high=128, size=w_shape).astype(dtype)
+ b_np = np.random.uniform(size=bias_shape).astype(dtype)
+ dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
+ c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(dtype)
- # Conv2d NCHW int8 schedule testing. Internally, it uses NCHWc schedule. So, just
- # performing basic testing - one test for all different scenarios - batch, dilation etc..
- verify_conv2d_nchw_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1)
- verify_conv2d_nchw_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_relu=True)
- verify_conv2d_nchw_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, dilation=2)
- verify_conv2d_nchw_int8(in_dtype, 9, 64, 56, 64, 3, 1, 1)
- verify_conv2d_nchw_int8(in_dtype, 4, 4, 4, 4, 4, 4, 4)
- verify_conv2d_nchw_int8(in_dtype, 1, 32, 149, 32, 3, 1, 0)
- verify_conv2d_nchw_int8(in_dtype, 7, 32, 149, 32, 3, 1, 0)
- verify_conv2d_nchw_int8(in_dtype, 1, 32, 35, 64, 7, 2, (0, 0, 1, 1))
- verify_conv2d_nchw_int8(in_dtype, 1, 32, 35, 64, 7, 2, (0, 0, 2, 2))
+ if add_bias:
+ b_np = np.random.uniform(size=bias_shape).astype(dtype)
+ c_np += b_np
+ if add_relu:
+ c_np = np.maximum(c_np, 0)
+ return a_np, w_np, b_np, c_np
-def test_conv2d_nhwc():
- with Int8Fallback():
- # Subset of inception v3 expanded (dilation > 1, batch > 1, 'VALID' padding)
- verify_conv2d_NHWC_gemm_int8(1, 3, 299, 32, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 32, 149, 32, 3, 1, "SAME", dilation=2)
- verify_conv2d_NHWC_gemm_int8(4, 32, 147, 64, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 64, 73, 80, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 80, 73, 192, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 35, 48, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 35, 64, 1, 1, "VALID")
- verify_conv2d_NHWC_gemm_int8(1, 192, 35, 32, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 48, 35, 64, 5, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 96, 35, 96, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 256, 35, 48, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 256, 35, 64, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 288, 35, 64, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 288, 35, 48, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 96, 35, 96, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 128, 17, 192, 7, 1, "SAME", dilation=2)
- verify_conv2d_NHWC_gemm_int8(1, 160, 17, 160, 7, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 160, 17, 192, 1, 1, "VALID")
- verify_conv2d_NHWC_gemm_int8(1, 192, 17, 192, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 768, 5, 128, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 17, 320, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 17, 192, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 192, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 384, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 320, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 448, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 384, 8, 384, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 384, 8, 384, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 448, 8, 384, 3, 1, "VALID")
- verify_conv2d_NHWC_gemm_int8(1, 2048, 8, 320, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 2048, 8, 448, 1, 1, "SAME", add_bias=True, add_relu=True)
- verify_conv2d_NHWC_gemm_int8(1, 2048, 8, 192, 1, 1, "SAME", add_bias=True)
-
- # Let's also verify that it compiles fine on AArch64 targets
- compile_conv2d_NHWC_gemm_int8_arm(1, 3, 299, 32, 3, 2, "SAME")
+ a_np, w_np, b_np, c_np = get_ref_data()
+
+ def verify_workload_padding():
+ _, _, out_height, out_width = get_const_tuple(c_np.shape)
+ wkl = _get_workload(A, W, (stride, stride), padding, dilation, dtype)
+
+ # for testing functionality,
+ # we choose arbitrary int32_lanes and num_int8_elements can divide the channel,
+ # regardless of the performance.
+ int32_lanes, num_int8_elements = num_filter, in_channel
+
+ # check if tile_ow candidates are the factors of the right output weight.
+ cfg = autotvm.get_config()
+ fallback_schedule_cpu_common_int8(cfg, wkl, int32_lanes, num_int8_elements)
+ ow_tile = np.prod(cfg["tile_ow"].size)
+
+ tvm.testing.assert_allclose(ow_tile, out_width)
+
+ def check_target(target):
+ dev = tvm.device(target, 0)
+ if not tvm.testing.device_enabled(target):
+ print("Skip because %s is not enabled" % target)
+ return
+ if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
+ print("Skip because int8 intrinsics are not available")
+ return
+
+ print("Running on target: %s" % target)
+ with tvm.target.Target(target):
+ C = topi.cuda.conv2d_nchw_int8(
+ A, W, (stride, stride), padding, (dilation, dilation), dtype
+ )
+ if add_bias:
+ C = topi.add(C, bias)
+ if add_relu:
+ C = topi.nn.relu(C)
+ s = topi.cuda.schedule_conv2d_nchw_int8([C])
+
+ a = tvm.nd.array(a_np, dev)
+ w = tvm.nd.array(w_np, dev)
+ b = tvm.nd.array(b_np, dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
+ if add_bias:
+ func = tvm.build(
+ s,
+ [A, W, bias, C],
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
+ func(a, w, b, c)
+ else:
+ func = tvm.build(
+ s,
+ [A, W, C],
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
+ func(a, w, c)
Review Comment:
I think only a few things change here, so we can reduce the amount of duplication.
```suggestion
build_args = [A, W, bias, C] if add_bias else [A, W, C]
func = tvm.build(
s,
build_args,
target,
name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
% (
batch,
in_channel,
in_size,
num_filter,
kernel,
stride,
padding_sum,
dilation,
),
)
if add_bias:
func(a, w, b, c)
else:
func(a, w, c)
```
##########
tests/python/topi/python/test_topi_conv2d_int8.py:
##########
@@ -298,378 +175,462 @@ def get_ref_data():
a_np, w_np, b_np, c_np = get_ref_data()
- with tvm.target.Target(target):
- C = compute(
- A,
- W,
- (stride, stride),
- padding,
- (dilation, dilation),
- "NCHW",
- "NCHW",
- out_dtype,
- )
+ dev = tvm.device(target, 0)
+ with tvm.target.Target(target) as tvm_target:
+ C = compute(A, W, (stride, stride), padding, (dilation, dilation), dtype)
if add_bias:
C = topi.add(C, bias)
if add_relu:
C = topi.nn.relu(C)
s = schedule([C])
- a = tvm.nd.array(a_np.astype(dtype), dev)
- w = tvm.nd.array(w_np.astype(dtype), dev)
- b = tvm.nd.array(b_np.astype(out_dtype), dev)
- c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
-
- if add_bias:
- compile_args = [A, W, bias, C]
- run_args = [a, w, b, c]
- else:
- compile_args = [A, W, C]
- run_args = [a, w, c]
-
- func = tvm.build(
- s,
- compile_args,
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
+ a = tvm.nd.array(a_np, dev)
+ w = tvm.nd.array(w_np, dev)
+ b = tvm.nd.array(b_np, dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- if build_only:
- return
+ build_inputs = [A, W, bias, C] if add_bias else [A, W, C]
+ inference_inputs = (a, w, b, c) if add_bias else (a, w, c)
+
+ func = tvm.build(
+ s,
+ build_inputs,
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
- print("Running on target: %s" % target)
+ build_only = tvm_target.features.is_aarch64 and (platform.machine() != "aarch64")
- func(*run_args)
+ if not build_only:
+ print("Running on target: %s" % target)
+ func(*inference_inputs)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- targets = [
- (
- "cuda",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- # Disable on CI since it does not support spirv int8 dot product
- # (
- # "vulkan -from_device=0",
- # lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- # topi.cuda.schedule_conv2d_NCHWc_int8,
- # 4,
- # False,
- # ),
- ]
-
- build_only_aarch64 = platform.machine() != "aarch64"
-
- targets.append(
+@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
+@pytest.mark.parametrize(
+ "params",
+ [
+ # ResNet18 workloads where channels in / out are multiple of oc_block_factor
+ (1, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 56, 64, 1, 1, 0, 1, False, False),
+ (1, 64, 56, 128, 3, 2, 1, 1, False, False),
+ (1, 64, 56, 128, 1, 2, 0, 1, False, False),
+ (1, 128, 28, 128, 3, 1, 1, 1, False, False),
+ (1, 128, 28, 256, 3, 2, 1, 1, False, False),
+ (1, 128, 28, 256, 1, 2, 0, 1, False, False),
+ (1, 256, 14, 256, 3, 1, 1, 1, False, False),
+ (1, 256, 14, 512, 3, 2, 1, 1, False, False),
+ (1, 256, 14, 512, 1, 2, 0, 1, False, False),
+ (1, 512, 7, 512, 3, 1, 1, 1, False, False),
+ # bias, relu
+ (1, 64, 56, 64, 3, 1, 1, 1, False, True),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, False),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, True),
+ # dilation = 2
+ (1, 64, 56, 64, 3, 1, 1, 2, False, False),
+ # batch size
+ (4, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (9, 64, 56, 64, 3, 1, 1, 1, False, False),
+ # weird workloads
+ (4, 4, 4, 8, 4, 4, 4, 1, False, False),
+ # inception v3 workloads where channels in / out are multiple of oc_block_factor
+ (1, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (1, 32, 147, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 73, 80, 1, 1, 0, 1, False, False),
+ (1, 80, 73, 192, 3, 1, 0, 1, False, False),
+ (1, 192, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 192, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 48, 35, 64, 5, 1, 2, 1, False, False),
+ (1, 64, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 96, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 192, 35, 32, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 384, 3, 2, 0, 1, False, False),
+ (1, 96, 35, 96, 3, 2, 0, 1, False, False),
+ (1, 768, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 128, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 160, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 192, 17, 320, 3, 2, 0, 1, False, False),
+ (1, 192, 17, 192, 3, 2, 0, 1, False, False),
+ (1, 1280, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 1280, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 448, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 1024, 19, 88, 3, 1, 1, 1, False, False),
+ # batch > 1
+ (7, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (8, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (32, 32, 149, 32, 3, 1, 0, 1, False, False),
+ # Asymmetric padding
+ (1, 32, 35, 64, 7, 2, (0, 0, 1, 1), 1, False, False),
+ (1, 64, 8, 128, 3, 1, (3, 3, 2, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, (1, 2, 2, 1), 1, False, False),
+ (1, 64, 17, 192, 1, 1, (1, 2), 1, False, False),
+ (1, 64, 8, 64, 3, 1, (3, 1), 1, False, False),
+ (1, 128, 8, 384, 3, 1, (0, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, "VALID", 1, False, False),
+ (1, 392, 8, 64, 3, 1, "VALID", 1, False, False),
+ (1, 512, 19, 64, 1, 1, "SAME", 1, False, False),
+ (1, 64, 16, 32, 2, 1, "SAME", 1, False, False),
+ (1, 64, 8, 64, 3, 1, (1, 2, 2, 1), 1, False, True),
+ (1, 64, 8, 64, 5, 2, (1, 3), 1, True, False),
+ (1, 64, 56, 64, 3, 1, "VALID", 1, True, True),
+ (1, 64, 56, 64, 24, 1, "SAME", 1, True, True),
+ ],
+)
+def test_conv2d_NCHWc_int8(in_dtype, params):
+ with Int8Fallback():
(
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon,+v8.2a,+dotprod",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding,
+ dilation,
+ add_bias,
+ add_relu,
+ ) = params
+ pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
+ padding_sum = pad_top + pad_left + pad_bottom + pad_right
+ print(
+ "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
+ % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
)
- )
-
- if in_dtype == "int8":
- targets += [
- (
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
- ),
- (
- "rocm -mattr=+dotprod",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- ]
-
- for target, compute, schedule, oc_block_factor, build_only in targets:
- check_target(target, compute, schedule, oc_block_factor, build_only)
-
-
-def verify_conv2d_nchw_int8(
- in_dtype,
- batch,
- in_channel,
- in_size,
- num_filter,
- kernel,
- stride,
- padding,
- dilation=1,
- add_bias=False,
- add_relu=False,
-):
- pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
- padding_sum = pad_top + pad_left + pad_bottom + pad_right
- print(
- "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
- )
-
- in_height = in_width = in_size
-
- A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
- W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
- bias = te.placeholder((num_filter, 1, 1), name="bias", dtype=in_dtype)
-
- a_shape = get_const_tuple(A.shape)
- w_shape = get_const_tuple(W.shape)
- bias_shape = get_const_tuple(bias.shape)
- dtype = A.dtype
-
- @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
- def get_ref_data():
- a_np = np.random.randint(low=-128, high=127, size=a_shape).astype(dtype)
- w_np = np.random.randint(low=-128, high=128, size=w_shape).astype(dtype)
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
- c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(dtype)
-
- if add_bias:
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- c_np += b_np
- if add_relu:
- c_np = np.maximum(c_np, 0)
-
- return a_np, w_np, b_np, c_np
-
- a_np, w_np, b_np, c_np = get_ref_data()
-
- def verify_workload_padding():
- _, _, out_height, out_width = get_const_tuple(c_np.shape)
- wkl = _get_workload(A, W, (stride, stride), padding, dilation, dtype)
-
- # for testing functionality,
- # we choose arbitrary int32_lanes and num_int8_elements can divide the channel,
- # regardless of the performance.
- int32_lanes, num_int8_elements = num_filter, in_channel
- # check if tile_ow candidates are the factors of the right output weight.
- cfg = autotvm.get_config()
- fallback_schedule_cpu_common_int8(cfg, wkl, int32_lanes, num_int8_elements)
- ow_tile = np.prod(cfg["tile_ow"].size)
-
- tvm.testing.assert_allclose(ow_tile, out_width)
+ in_height = in_width = in_size
+
+ A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
+ W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
+
+ a_shape = get_const_tuple(A.shape)
+ w_shape = get_const_tuple(W.shape)
+ dtype = A.dtype
+ out_dtype = "int32" if in_dtype == "int8" else "uint32"
+ lo = -128 if in_dtype == "int8" else 0
+ hi = 127 if in_dtype == "int8" else 255
+
+ def check_target(target, compute, schedule, oc_block_factor, build_only):
+ dev = tvm.device(target, 0)
+ if not tvm.testing.device_enabled(target):
+ print("Skip because %s is not enabled" % target)
+ return
+ if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
+ print("Skip because int8 intrinsics are not available")
+ return
+
+ bias = te.placeholder(
+ (num_filter // oc_block_factor, 1, 1, oc_block_factor), name="bias", dtype=out_dtype
+ )
+ bias_shape = get_const_tuple(bias.shape)
- def check_target(target):
- dev = tvm.device(target, 0)
- if not tvm.testing.device_enabled(target):
- print("Skip because %s is not enabled" % target)
- return
- if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
- print("Skip because int8 intrinsics are not available")
- return
+ @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
+ def get_ref_data():
+ a_np = np.random.randint(low=lo, high=hi, size=a_shape).astype(out_dtype)
+ w_np = np.random.randint(low=lo, high=hi, size=w_shape).astype(out_dtype)
+ b_np = np.random.uniform(size=bias_shape).astype(out_dtype)
+ dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
+ c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(
+ out_dtype
+ )
+
+ # convert to NCHWc
+ _, _, out_height, out_width = c_np.shape
+ c_np = c_np.reshape(
+ (batch, num_filter // oc_block_factor, oc_block_factor, out_height, out_width)
+ ).transpose(0, 1, 3, 4, 2)
+
+ if add_bias:
+ b_np = np.random.uniform(size=bias_shape).astype(out_dtype)
+ c_np += b_np
+ if add_relu:
+ c_np = np.maximum(c_np, 0)
+
+ return a_np, w_np, b_np, c_np
+
+ a_np, w_np, b_np, c_np = get_ref_data()
+
+ with tvm.target.Target(target):
+ C = compute(
+ A,
+ W,
+ (stride, stride),
+ padding,
+ (dilation, dilation),
+ "NCHW",
+ "NCHW",
+ out_dtype,
+ )
+ if add_bias:
+ C = topi.add(C, bias)
+ if add_relu:
+ C = topi.nn.relu(C)
+ s = schedule([C])
+
+ a = tvm.nd.array(a_np.astype(dtype), dev)
+ w = tvm.nd.array(w_np.astype(dtype), dev)
+ b = tvm.nd.array(b_np.astype(out_dtype), dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- print("Running on target: %s" % target)
- with tvm.target.Target(target):
- C = topi.cuda.conv2d_nchw_int8(
- A, W, (stride, stride), padding, (dilation, dilation), dtype
- )
if add_bias:
- C = topi.add(C, bias)
- if add_relu:
- C = topi.nn.relu(C)
- s = topi.cuda.schedule_conv2d_nchw_int8([C])
-
- a = tvm.nd.array(a_np, dev)
- w = tvm.nd.array(w_np, dev)
- b = tvm.nd.array(b_np, dev)
- c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- if add_bias:
- tvm.build(
- s,
- [A, W, bias, C],
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
- func = tvm.build(
- s,
- [A, W, bias, C],
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
- func(a, w, b, c)
- else:
+ compile_args = [A, W, bias, C]
+ run_args = [a, w, b, c]
+ else:
+ compile_args = [A, W, C]
+ run_args = [a, w, c]
+
func = tvm.build(
s,
- [A, W, C],
+ compile_args,
target,
name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
% (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
)
- func(a, w, c)
- tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- verify_workload_padding()
+ if build_only:
+ return
- for target in ["cuda"]:
- check_target(target)
+ print("Running on target: %s" % target)
+ func(*run_args)
-@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
-def test_conv2d_nchw(in_dtype):
- with Int8Fallback():
- # ResNet18 workloads where channels in / out are multiple of oc_block_factor
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 128, 3, 2, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 128, 1, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 28, 128, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 28, 256, 3, 2, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 28, 256, 1, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 14, 256, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 14, 512, 3, 2, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 14, 512, 1, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 512, 7, 512, 3, 1, 1)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- # bias, relu
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_relu=True)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_bias=True)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_bias=True, add_relu=True)
+ targets = [
+ (
+ "cuda",
+ lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
+ topi.cuda.schedule_conv2d_NCHWc_int8,
+ 4,
+ False,
+ ),
+ # Disable on CI since it does not support spirv int8 dot product
+ # (
+ # "vulkan -from_device=0",
+ # lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
+ # topi.cuda.schedule_conv2d_NCHWc_int8,
+ # 4,
+ # False,
+ # ),
+ ]
- # dilation = 2
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, dilation=2)
+ build_only_aarch64 = platform.machine() != "aarch64"
- # batch size
- verify_conv2d_NCHWc_int8(in_dtype, 4, 64, 56, 64, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 9, 64, 56, 64, 3, 1, 1)
+ targets.append(
+ (
+ "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon,+v8.2a,+dotprod",
+ topi.arm_cpu.conv2d_NCHWc_int8,
+ topi.arm_cpu.schedule_conv2d_NCHWc_int8,
+ 8,
+ build_only_aarch64,
+ )
+ )
- # weird workloads
- verify_conv2d_NCHWc_int8(in_dtype, 4, 4, 4, 8, 4, 4, 4)
+ if in_dtype == "int8":
+ targets += [
+ (
+ "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon",
+ topi.arm_cpu.conv2d_NCHWc_int8,
+ topi.arm_cpu.schedule_conv2d_NCHWc_int8,
+ 8,
+ build_only_aarch64,
+ ),
+ (
+ "rocm -mattr=+dotprod",
+ lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(
+ a, w, s, p, d, l, o
+ ),
+ topi.cuda.schedule_conv2d_NCHWc_int8,
+ 4,
+ False,
+ ),
+ ]
+
+ for target, compute, schedule, oc_block_factor, build_only in targets:
+ check_target(target, compute, schedule, oc_block_factor, build_only)
+
+
+# Conv2d NCHW int8 schedule testing. Internally, it uses NCHWc schedule. So, just
+# performing basic testing - one test for all different scenarios - batch, dilation etc..
+@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
+@pytest.mark.parametrize(
+ "params",
+ [
+ (1, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 56, 64, 3, 1, 1, 1, False, True),
+ (1, 64, 56, 64, 3, 1, 1, 2, False, False),
+ (9, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (4, 4, 4, 4, 4, 4, 4, 1, False, False),
+ (1, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (7, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (1, 32, 35, 64, 7, 2, (0, 0, 1, 1), 1, False, False),
+ (1, 32, 35, 64, 7, 2, (0, 0, 2, 2), 1, False, False),
+ ],
+)
+def test_conv2d_nchw_int8(in_dtype, params):
+ with Int8Fallback():
+ (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding,
+ dilation,
+ add_bias,
+ add_relu,
+ ) = params
+ pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
+ padding_sum = pad_top + pad_left + pad_bottom + pad_right
+ print(
+ "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
+ % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
+ )
- # inception v3 workloads where channels in / out are multiple of oc_block_factor
- verify_conv2d_NCHWc_int8(in_dtype, 1, 32, 149, 32, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 32, 147, 64, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 73, 80, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 80, 73, 192, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 35, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 35, 48, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 48, 35, 64, 5, 1, 2)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 35, 96, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 96, 35, 96, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 35, 32, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 35, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 35, 48, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 288, 35, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 288, 35, 48, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 288, 35, 384, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 96, 35, 96, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 768, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 768, 17, 128, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 128, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 192, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 128, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 768, 17, 160, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 160, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 192, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 160, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 192, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 320, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 192, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 320, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 384, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 384, 8, 384, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 384, 8, 384, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 448, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 448, 8, 384, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 320, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 384, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 448, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1024, 19, 88, 3, 1, 1)
+ in_height = in_width = in_size
- # batch > 1
- verify_conv2d_NCHWc_int8(in_dtype, 7, 32, 149, 32, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 8, 32, 149, 32, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 32, 32, 149, 32, 3, 1, 0)
+ A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
+ W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
+ bias = te.placeholder((num_filter, 1, 1), name="bias", dtype=in_dtype)
- # Asymmetric padding
- verify_conv2d_NCHWc_int8(in_dtype, 1, 32, 35, 64, 7, 2, (0, 0, 1, 1))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 128, 3, 1, (3, 3, 2, 2))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 1, 1, (1, 2, 2, 1))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 17, 192, 1, 1, (1, 2))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 3, 1, (3, 1))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 8, 384, 3, 1, (0, 2))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 1, 1, "VALID")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 392, 8, 64, 3, 1, "VALID")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 512, 19, 64, 1, 1, "SAME")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 16, 32, 2, 1, "SAME")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 3, 1, (1, 2, 2, 1), add_relu=True)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 5, 2, (1, 3), add_bias=True)
- verify_conv2d_NCHWc_int8(
- in_dtype, 1, 64, 56, 64, 3, 1, "VALID", add_bias=True, add_relu=True
- )
- verify_conv2d_NCHWc_int8(
- in_dtype, 1, 64, 56, 64, 24, 1, "SAME", add_bias=True, add_relu=True
- )
+ a_shape = get_const_tuple(A.shape)
+ w_shape = get_const_tuple(W.shape)
+ bias_shape = get_const_tuple(bias.shape)
+ dtype = A.dtype
+
+ @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
+ def get_ref_data():
+ a_np = np.random.randint(low=-128, high=127, size=a_shape).astype(dtype)
+ w_np = np.random.randint(low=-128, high=128, size=w_shape).astype(dtype)
+ b_np = np.random.uniform(size=bias_shape).astype(dtype)
+ dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
+ c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(dtype)
- # Conv2d NCHW int8 schedule testing. Internally, it uses NCHWc schedule. So, just
- # performing basic testing - one test for all different scenarios - batch, dilation etc..
- verify_conv2d_nchw_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1)
- verify_conv2d_nchw_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_relu=True)
- verify_conv2d_nchw_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, dilation=2)
- verify_conv2d_nchw_int8(in_dtype, 9, 64, 56, 64, 3, 1, 1)
- verify_conv2d_nchw_int8(in_dtype, 4, 4, 4, 4, 4, 4, 4)
- verify_conv2d_nchw_int8(in_dtype, 1, 32, 149, 32, 3, 1, 0)
- verify_conv2d_nchw_int8(in_dtype, 7, 32, 149, 32, 3, 1, 0)
- verify_conv2d_nchw_int8(in_dtype, 1, 32, 35, 64, 7, 2, (0, 0, 1, 1))
- verify_conv2d_nchw_int8(in_dtype, 1, 32, 35, 64, 7, 2, (0, 0, 2, 2))
+ if add_bias:
+ b_np = np.random.uniform(size=bias_shape).astype(dtype)
+ c_np += b_np
+ if add_relu:
+ c_np = np.maximum(c_np, 0)
+ return a_np, w_np, b_np, c_np
-def test_conv2d_nhwc():
- with Int8Fallback():
- # Subset of inception v3 expanded (dilation > 1, batch > 1, 'VALID' padding)
- verify_conv2d_NHWC_gemm_int8(1, 3, 299, 32, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 32, 149, 32, 3, 1, "SAME", dilation=2)
- verify_conv2d_NHWC_gemm_int8(4, 32, 147, 64, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 64, 73, 80, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 80, 73, 192, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 35, 48, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 35, 64, 1, 1, "VALID")
- verify_conv2d_NHWC_gemm_int8(1, 192, 35, 32, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 48, 35, 64, 5, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 96, 35, 96, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 256, 35, 48, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 256, 35, 64, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 288, 35, 64, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 288, 35, 48, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 96, 35, 96, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 128, 17, 192, 7, 1, "SAME", dilation=2)
- verify_conv2d_NHWC_gemm_int8(1, 160, 17, 160, 7, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 160, 17, 192, 1, 1, "VALID")
- verify_conv2d_NHWC_gemm_int8(1, 192, 17, 192, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 768, 5, 128, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 17, 320, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 17, 192, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 192, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 384, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 320, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 448, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 384, 8, 384, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 384, 8, 384, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 448, 8, 384, 3, 1, "VALID")
- verify_conv2d_NHWC_gemm_int8(1, 2048, 8, 320, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 2048, 8, 448, 1, 1, "SAME", add_bias=True, add_relu=True)
- verify_conv2d_NHWC_gemm_int8(1, 2048, 8, 192, 1, 1, "SAME", add_bias=True)
-
- # Let's also verify that it compiles fine on AArch64 targets
- compile_conv2d_NHWC_gemm_int8_arm(1, 3, 299, 32, 3, 2, "SAME")
+ a_np, w_np, b_np, c_np = get_ref_data()
+
+ def verify_workload_padding():
+ _, _, out_height, out_width = get_const_tuple(c_np.shape)
+ wkl = _get_workload(A, W, (stride, stride), padding, dilation, dtype)
+
+ # for testing functionality,
+ # we choose arbitrary int32_lanes and num_int8_elements can divide the channel,
+ # regardless of the performance.
+ int32_lanes, num_int8_elements = num_filter, in_channel
+
+ # check if tile_ow candidates are the factors of the right output weight.
+ cfg = autotvm.get_config()
+ fallback_schedule_cpu_common_int8(cfg, wkl, int32_lanes, num_int8_elements)
+ ow_tile = np.prod(cfg["tile_ow"].size)
+
+ tvm.testing.assert_allclose(ow_tile, out_width)
+
+ def check_target(target):
+ dev = tvm.device(target, 0)
+ if not tvm.testing.device_enabled(target):
+ print("Skip because %s is not enabled" % target)
+ return
+ if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
+ print("Skip because int8 intrinsics are not available")
+ return
+
+ print("Running on target: %s" % target)
+ with tvm.target.Target(target):
+ C = topi.cuda.conv2d_nchw_int8(
+ A, W, (stride, stride), padding, (dilation, dilation), dtype
+ )
+ if add_bias:
+ C = topi.add(C, bias)
+ if add_relu:
+ C = topi.nn.relu(C)
+ s = topi.cuda.schedule_conv2d_nchw_int8([C])
+
+ a = tvm.nd.array(a_np, dev)
+ w = tvm.nd.array(w_np, dev)
+ b = tvm.nd.array(b_np, dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
+ if add_bias:
+ func = tvm.build(
+ s,
+ [A, W, bias, C],
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
+ func(a, w, b, c)
+ else:
+ func = tvm.build(
+ s,
+ [A, W, C],
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
+ func(a, w, c)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
+
+ verify_workload_padding()
+
+ for target in ["cuda"]:
+ check_target(target)
Review Comment:
```suggestion
check_target("cuda")
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [tvm] ekalda commented on pull request #13669: [TOPI][bugfix] Fix a bug in arm_cpu int8 dotprod schedule and modernize tests
Posted by GitBox <gi...@apache.org>.
ekalda commented on PR #13669:
URL: https://github.com/apache/tvm/pull/13669#issuecomment-1366519663
cc @leandron @Mousius
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [tvm] tvm-bot commented on pull request #13669: [TOPI][bugfix] Fix a bug in arm_cpu int8 dotprod schedule and modernize tests
Posted by GitBox <gi...@apache.org>.
tvm-bot commented on PR #13669:
URL: https://github.com/apache/tvm/pull/13669#issuecomment-1366519043
<!---bot-comment-->
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from [Reviewers](https://github.com/apache/incubator-tvm/blob/master/CONTRIBUTORS.md#reviewers) by @-ing them in a comment.
<!--bot-comment-ccs-start-->
* No users to tag found in teams: `topi`, `bugfix` <sub>See [#10317](https://github.com/apache/tvm/issues/10317) for details</sub><!--bot-comment-ccs-end-->
<sub>Generated by [tvm-bot](https://github.com/apache/tvm/blob/main/ci/README.md#github-actions)</sub>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [tvm] ekalda commented on a diff in pull request #13669: [TOPI][bugfix] Fix a bug in arm_cpu int8 dotprod schedule and modernize tests
Posted by GitBox <gi...@apache.org>.
ekalda commented on code in PR #13669:
URL: https://github.com/apache/tvm/pull/13669#discussion_r1058905667
##########
tests/python/topi/python/test_topi_conv2d_int8.py:
##########
@@ -298,378 +175,462 @@ def get_ref_data():
a_np, w_np, b_np, c_np = get_ref_data()
- with tvm.target.Target(target):
- C = compute(
- A,
- W,
- (stride, stride),
- padding,
- (dilation, dilation),
- "NCHW",
- "NCHW",
- out_dtype,
- )
+ dev = tvm.device(target, 0)
+ with tvm.target.Target(target) as tvm_target:
+ C = compute(A, W, (stride, stride), padding, (dilation, dilation), dtype)
if add_bias:
C = topi.add(C, bias)
if add_relu:
C = topi.nn.relu(C)
s = schedule([C])
- a = tvm.nd.array(a_np.astype(dtype), dev)
- w = tvm.nd.array(w_np.astype(dtype), dev)
- b = tvm.nd.array(b_np.astype(out_dtype), dev)
- c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
-
- if add_bias:
- compile_args = [A, W, bias, C]
- run_args = [a, w, b, c]
- else:
- compile_args = [A, W, C]
- run_args = [a, w, c]
-
- func = tvm.build(
- s,
- compile_args,
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
+ a = tvm.nd.array(a_np, dev)
+ w = tvm.nd.array(w_np, dev)
+ b = tvm.nd.array(b_np, dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- if build_only:
- return
+ build_inputs = [A, W, bias, C] if add_bias else [A, W, C]
+ inference_inputs = (a, w, b, c) if add_bias else (a, w, c)
+
+ func = tvm.build(
+ s,
+ build_inputs,
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
- print("Running on target: %s" % target)
+ build_only = tvm_target.features.is_aarch64 and (platform.machine() != "aarch64")
- func(*run_args)
+ if not build_only:
+ print("Running on target: %s" % target)
+ func(*inference_inputs)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- targets = [
- (
- "cuda",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- # Disable on CI since it does not support spirv int8 dot product
- # (
- # "vulkan -from_device=0",
- # lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- # topi.cuda.schedule_conv2d_NCHWc_int8,
- # 4,
- # False,
- # ),
- ]
-
- build_only_aarch64 = platform.machine() != "aarch64"
-
- targets.append(
+@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
+@pytest.mark.parametrize(
+ "params",
+ [
+ # ResNet18 workloads where channels in / out are multiple of oc_block_factor
+ (1, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 56, 64, 1, 1, 0, 1, False, False),
+ (1, 64, 56, 128, 3, 2, 1, 1, False, False),
+ (1, 64, 56, 128, 1, 2, 0, 1, False, False),
+ (1, 128, 28, 128, 3, 1, 1, 1, False, False),
+ (1, 128, 28, 256, 3, 2, 1, 1, False, False),
+ (1, 128, 28, 256, 1, 2, 0, 1, False, False),
+ (1, 256, 14, 256, 3, 1, 1, 1, False, False),
+ (1, 256, 14, 512, 3, 2, 1, 1, False, False),
+ (1, 256, 14, 512, 1, 2, 0, 1, False, False),
+ (1, 512, 7, 512, 3, 1, 1, 1, False, False),
+ # bias, relu
+ (1, 64, 56, 64, 3, 1, 1, 1, False, True),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, False),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, True),
+ # dilation = 2
+ (1, 64, 56, 64, 3, 1, 1, 2, False, False),
+ # batch size
+ (4, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (9, 64, 56, 64, 3, 1, 1, 1, False, False),
+ # weird workloads
+ (4, 4, 4, 8, 4, 4, 4, 1, False, False),
+ # inception v3 workloads where channels in / out are multiple of oc_block_factor
+ (1, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (1, 32, 147, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 73, 80, 1, 1, 0, 1, False, False),
+ (1, 80, 73, 192, 3, 1, 0, 1, False, False),
+ (1, 192, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 192, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 48, 35, 64, 5, 1, 2, 1, False, False),
+ (1, 64, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 96, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 192, 35, 32, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 384, 3, 2, 0, 1, False, False),
+ (1, 96, 35, 96, 3, 2, 0, 1, False, False),
+ (1, 768, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 128, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 160, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 192, 17, 320, 3, 2, 0, 1, False, False),
+ (1, 192, 17, 192, 3, 2, 0, 1, False, False),
+ (1, 1280, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 1280, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 448, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 1024, 19, 88, 3, 1, 1, 1, False, False),
+ # batch > 1
+ (7, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (8, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (32, 32, 149, 32, 3, 1, 0, 1, False, False),
+ # Asymmetric padding
+ (1, 32, 35, 64, 7, 2, (0, 0, 1, 1), 1, False, False),
+ (1, 64, 8, 128, 3, 1, (3, 3, 2, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, (1, 2, 2, 1), 1, False, False),
+ (1, 64, 17, 192, 1, 1, (1, 2), 1, False, False),
+ (1, 64, 8, 64, 3, 1, (3, 1), 1, False, False),
+ (1, 128, 8, 384, 3, 1, (0, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, "VALID", 1, False, False),
+ (1, 392, 8, 64, 3, 1, "VALID", 1, False, False),
+ (1, 512, 19, 64, 1, 1, "SAME", 1, False, False),
+ (1, 64, 16, 32, 2, 1, "SAME", 1, False, False),
+ (1, 64, 8, 64, 3, 1, (1, 2, 2, 1), 1, False, True),
+ (1, 64, 8, 64, 5, 2, (1, 3), 1, True, False),
+ (1, 64, 56, 64, 3, 1, "VALID", 1, True, True),
+ (1, 64, 56, 64, 24, 1, "SAME", 1, True, True),
+ ],
+)
+def test_conv2d_NCHWc_int8(in_dtype, params):
+ with Int8Fallback():
(
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon,+v8.2a,+dotprod",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding,
+ dilation,
+ add_bias,
+ add_relu,
+ ) = params
+ pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
+ padding_sum = pad_top + pad_left + pad_bottom + pad_right
+ print(
+ "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
+ % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
)
- )
-
- if in_dtype == "int8":
- targets += [
- (
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
- ),
- (
- "rocm -mattr=+dotprod",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- ]
-
- for target, compute, schedule, oc_block_factor, build_only in targets:
- check_target(target, compute, schedule, oc_block_factor, build_only)
-
-
-def verify_conv2d_nchw_int8(
- in_dtype,
- batch,
- in_channel,
- in_size,
- num_filter,
- kernel,
- stride,
- padding,
- dilation=1,
- add_bias=False,
- add_relu=False,
-):
- pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
- padding_sum = pad_top + pad_left + pad_bottom + pad_right
- print(
- "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
- )
-
- in_height = in_width = in_size
-
- A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
- W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
- bias = te.placeholder((num_filter, 1, 1), name="bias", dtype=in_dtype)
-
- a_shape = get_const_tuple(A.shape)
- w_shape = get_const_tuple(W.shape)
- bias_shape = get_const_tuple(bias.shape)
- dtype = A.dtype
-
- @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
- def get_ref_data():
- a_np = np.random.randint(low=-128, high=127, size=a_shape).astype(dtype)
- w_np = np.random.randint(low=-128, high=128, size=w_shape).astype(dtype)
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
- c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(dtype)
-
- if add_bias:
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- c_np += b_np
- if add_relu:
- c_np = np.maximum(c_np, 0)
-
- return a_np, w_np, b_np, c_np
-
- a_np, w_np, b_np, c_np = get_ref_data()
-
- def verify_workload_padding():
- _, _, out_height, out_width = get_const_tuple(c_np.shape)
- wkl = _get_workload(A, W, (stride, stride), padding, dilation, dtype)
-
- # for testing functionality,
- # we choose arbitrary int32_lanes and num_int8_elements can divide the channel,
- # regardless of the performance.
- int32_lanes, num_int8_elements = num_filter, in_channel
- # check if tile_ow candidates are the factors of the right output weight.
- cfg = autotvm.get_config()
- fallback_schedule_cpu_common_int8(cfg, wkl, int32_lanes, num_int8_elements)
- ow_tile = np.prod(cfg["tile_ow"].size)
-
- tvm.testing.assert_allclose(ow_tile, out_width)
+ in_height = in_width = in_size
+
+ A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
+ W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
+
+ a_shape = get_const_tuple(A.shape)
+ w_shape = get_const_tuple(W.shape)
+ dtype = A.dtype
+ out_dtype = "int32" if in_dtype == "int8" else "uint32"
+ lo = -128 if in_dtype == "int8" else 0
+ hi = 127 if in_dtype == "int8" else 255
+
+ def check_target(target, compute, schedule, oc_block_factor, build_only):
+ dev = tvm.device(target, 0)
+ if not tvm.testing.device_enabled(target):
+ print("Skip because %s is not enabled" % target)
+ return
+ if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
+ print("Skip because int8 intrinsics are not available")
+ return
+
+ bias = te.placeholder(
+ (num_filter // oc_block_factor, 1, 1, oc_block_factor), name="bias", dtype=out_dtype
+ )
+ bias_shape = get_const_tuple(bias.shape)
- def check_target(target):
- dev = tvm.device(target, 0)
- if not tvm.testing.device_enabled(target):
- print("Skip because %s is not enabled" % target)
- return
- if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
- print("Skip because int8 intrinsics are not available")
- return
+ @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
+ def get_ref_data():
+ a_np = np.random.randint(low=lo, high=hi, size=a_shape).astype(out_dtype)
+ w_np = np.random.randint(low=lo, high=hi, size=w_shape).astype(out_dtype)
+ b_np = np.random.uniform(size=bias_shape).astype(out_dtype)
+ dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
+ c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(
+ out_dtype
+ )
+
+ # convert to NCHWc
+ _, _, out_height, out_width = c_np.shape
+ c_np = c_np.reshape(
+ (batch, num_filter // oc_block_factor, oc_block_factor, out_height, out_width)
+ ).transpose(0, 1, 3, 4, 2)
+
+ if add_bias:
+ b_np = np.random.uniform(size=bias_shape).astype(out_dtype)
+ c_np += b_np
+ if add_relu:
+ c_np = np.maximum(c_np, 0)
+
+ return a_np, w_np, b_np, c_np
+
+ a_np, w_np, b_np, c_np = get_ref_data()
+
+ with tvm.target.Target(target):
+ C = compute(
+ A,
+ W,
+ (stride, stride),
+ padding,
+ (dilation, dilation),
+ "NCHW",
+ "NCHW",
+ out_dtype,
+ )
+ if add_bias:
+ C = topi.add(C, bias)
+ if add_relu:
+ C = topi.nn.relu(C)
+ s = schedule([C])
+
+ a = tvm.nd.array(a_np.astype(dtype), dev)
+ w = tvm.nd.array(w_np.astype(dtype), dev)
+ b = tvm.nd.array(b_np.astype(out_dtype), dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- print("Running on target: %s" % target)
- with tvm.target.Target(target):
- C = topi.cuda.conv2d_nchw_int8(
- A, W, (stride, stride), padding, (dilation, dilation), dtype
- )
if add_bias:
- C = topi.add(C, bias)
- if add_relu:
- C = topi.nn.relu(C)
- s = topi.cuda.schedule_conv2d_nchw_int8([C])
-
- a = tvm.nd.array(a_np, dev)
- w = tvm.nd.array(w_np, dev)
- b = tvm.nd.array(b_np, dev)
- c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- if add_bias:
- tvm.build(
- s,
- [A, W, bias, C],
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
- func = tvm.build(
- s,
- [A, W, bias, C],
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
- func(a, w, b, c)
- else:
+ compile_args = [A, W, bias, C]
+ run_args = [a, w, b, c]
+ else:
+ compile_args = [A, W, C]
+ run_args = [a, w, c]
+
func = tvm.build(
s,
- [A, W, C],
+ compile_args,
target,
name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
% (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
)
- func(a, w, c)
- tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- verify_workload_padding()
+ if build_only:
+ return
- for target in ["cuda"]:
- check_target(target)
+ print("Running on target: %s" % target)
+ func(*run_args)
-@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
-def test_conv2d_nchw(in_dtype):
- with Int8Fallback():
- # ResNet18 workloads where channels in / out are multiple of oc_block_factor
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 128, 3, 2, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 128, 1, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 28, 128, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 28, 256, 3, 2, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 28, 256, 1, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 14, 256, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 14, 512, 3, 2, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 14, 512, 1, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 512, 7, 512, 3, 1, 1)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- # bias, relu
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_relu=True)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_bias=True)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_bias=True, add_relu=True)
+ targets = [
+ (
+ "cuda",
+ lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
+ topi.cuda.schedule_conv2d_NCHWc_int8,
+ 4,
+ False,
+ ),
+ # Disable on CI since it does not support spirv int8 dot product
+ # (
+ # "vulkan -from_device=0",
+ # lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
+ # topi.cuda.schedule_conv2d_NCHWc_int8,
+ # 4,
+ # False,
+ # ),
+ ]
- # dilation = 2
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, dilation=2)
+ build_only_aarch64 = platform.machine() != "aarch64"
- # batch size
- verify_conv2d_NCHWc_int8(in_dtype, 4, 64, 56, 64, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 9, 64, 56, 64, 3, 1, 1)
+ targets.append(
+ (
+ "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon,+v8.2a,+dotprod",
+ topi.arm_cpu.conv2d_NCHWc_int8,
+ topi.arm_cpu.schedule_conv2d_NCHWc_int8,
+ 8,
+ build_only_aarch64,
+ )
+ )
- # weird workloads
- verify_conv2d_NCHWc_int8(in_dtype, 4, 4, 4, 8, 4, 4, 4)
+ if in_dtype == "int8":
+ targets += [
+ (
+ "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon",
+ topi.arm_cpu.conv2d_NCHWc_int8,
+ topi.arm_cpu.schedule_conv2d_NCHWc_int8,
+ 8,
+ build_only_aarch64,
+ ),
+ (
+ "rocm -mattr=+dotprod",
+ lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(
+ a, w, s, p, d, l, o
+ ),
+ topi.cuda.schedule_conv2d_NCHWc_int8,
+ 4,
+ False,
+ ),
+ ]
+
+ for target, compute, schedule, oc_block_factor, build_only in targets:
+ check_target(target, compute, schedule, oc_block_factor, build_only)
+
+
+# Conv2d NCHW int8 schedule testing. Internally, it uses NCHWc schedule. So, just
+# performing basic testing - one test for all different scenarios - batch, dilation etc..
+@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
+@pytest.mark.parametrize(
+ "params",
+ [
+ (1, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 56, 64, 3, 1, 1, 1, False, True),
+ (1, 64, 56, 64, 3, 1, 1, 2, False, False),
+ (9, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (4, 4, 4, 4, 4, 4, 4, 1, False, False),
+ (1, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (7, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (1, 32, 35, 64, 7, 2, (0, 0, 1, 1), 1, False, False),
+ (1, 32, 35, 64, 7, 2, (0, 0, 2, 2), 1, False, False),
+ ],
+)
+def test_conv2d_nchw_int8(in_dtype, params):
+ with Int8Fallback():
+ (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding,
+ dilation,
+ add_bias,
+ add_relu,
+ ) = params
+ pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
+ padding_sum = pad_top + pad_left + pad_bottom + pad_right
+ print(
+ "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
+ % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
+ )
- # inception v3 workloads where channels in / out are multiple of oc_block_factor
- verify_conv2d_NCHWc_int8(in_dtype, 1, 32, 149, 32, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 32, 147, 64, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 73, 80, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 80, 73, 192, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 35, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 35, 48, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 48, 35, 64, 5, 1, 2)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 35, 96, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 96, 35, 96, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 35, 32, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 35, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 35, 48, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 288, 35, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 288, 35, 48, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 288, 35, 384, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 96, 35, 96, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 768, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 768, 17, 128, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 128, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 192, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 128, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 768, 17, 160, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 160, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 192, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 160, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 192, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 320, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 192, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 320, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 384, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 384, 8, 384, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 384, 8, 384, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 448, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 448, 8, 384, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 320, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 384, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 448, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1024, 19, 88, 3, 1, 1)
+ in_height = in_width = in_size
- # batch > 1
- verify_conv2d_NCHWc_int8(in_dtype, 7, 32, 149, 32, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 8, 32, 149, 32, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 32, 32, 149, 32, 3, 1, 0)
+ A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
+ W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
+ bias = te.placeholder((num_filter, 1, 1), name="bias", dtype=in_dtype)
- # Asymmetric padding
- verify_conv2d_NCHWc_int8(in_dtype, 1, 32, 35, 64, 7, 2, (0, 0, 1, 1))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 128, 3, 1, (3, 3, 2, 2))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 1, 1, (1, 2, 2, 1))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 17, 192, 1, 1, (1, 2))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 3, 1, (3, 1))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 8, 384, 3, 1, (0, 2))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 1, 1, "VALID")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 392, 8, 64, 3, 1, "VALID")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 512, 19, 64, 1, 1, "SAME")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 16, 32, 2, 1, "SAME")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 3, 1, (1, 2, 2, 1), add_relu=True)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 5, 2, (1, 3), add_bias=True)
- verify_conv2d_NCHWc_int8(
- in_dtype, 1, 64, 56, 64, 3, 1, "VALID", add_bias=True, add_relu=True
- )
- verify_conv2d_NCHWc_int8(
- in_dtype, 1, 64, 56, 64, 24, 1, "SAME", add_bias=True, add_relu=True
- )
+ a_shape = get_const_tuple(A.shape)
+ w_shape = get_const_tuple(W.shape)
+ bias_shape = get_const_tuple(bias.shape)
+ dtype = A.dtype
+
+ @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
+ def get_ref_data():
+ a_np = np.random.randint(low=-128, high=127, size=a_shape).astype(dtype)
+ w_np = np.random.randint(low=-128, high=128, size=w_shape).astype(dtype)
+ b_np = np.random.uniform(size=bias_shape).astype(dtype)
+ dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
+ c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(dtype)
- # Conv2d NCHW int8 schedule testing. Internally, it uses NCHWc schedule. So, just
- # performing basic testing - one test for all different scenarios - batch, dilation etc..
- verify_conv2d_nchw_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1)
- verify_conv2d_nchw_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_relu=True)
- verify_conv2d_nchw_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, dilation=2)
- verify_conv2d_nchw_int8(in_dtype, 9, 64, 56, 64, 3, 1, 1)
- verify_conv2d_nchw_int8(in_dtype, 4, 4, 4, 4, 4, 4, 4)
- verify_conv2d_nchw_int8(in_dtype, 1, 32, 149, 32, 3, 1, 0)
- verify_conv2d_nchw_int8(in_dtype, 7, 32, 149, 32, 3, 1, 0)
- verify_conv2d_nchw_int8(in_dtype, 1, 32, 35, 64, 7, 2, (0, 0, 1, 1))
- verify_conv2d_nchw_int8(in_dtype, 1, 32, 35, 64, 7, 2, (0, 0, 2, 2))
+ if add_bias:
+ b_np = np.random.uniform(size=bias_shape).astype(dtype)
+ c_np += b_np
+ if add_relu:
+ c_np = np.maximum(c_np, 0)
+ return a_np, w_np, b_np, c_np
-def test_conv2d_nhwc():
- with Int8Fallback():
- # Subset of inception v3 expanded (dilation > 1, batch > 1, 'VALID' padding)
- verify_conv2d_NHWC_gemm_int8(1, 3, 299, 32, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 32, 149, 32, 3, 1, "SAME", dilation=2)
- verify_conv2d_NHWC_gemm_int8(4, 32, 147, 64, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 64, 73, 80, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 80, 73, 192, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 35, 48, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 35, 64, 1, 1, "VALID")
- verify_conv2d_NHWC_gemm_int8(1, 192, 35, 32, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 48, 35, 64, 5, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 96, 35, 96, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 256, 35, 48, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 256, 35, 64, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 288, 35, 64, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 288, 35, 48, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 96, 35, 96, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 128, 17, 192, 7, 1, "SAME", dilation=2)
- verify_conv2d_NHWC_gemm_int8(1, 160, 17, 160, 7, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 160, 17, 192, 1, 1, "VALID")
- verify_conv2d_NHWC_gemm_int8(1, 192, 17, 192, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 768, 5, 128, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 17, 320, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 17, 192, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 192, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 384, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 320, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 448, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 384, 8, 384, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 384, 8, 384, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 448, 8, 384, 3, 1, "VALID")
- verify_conv2d_NHWC_gemm_int8(1, 2048, 8, 320, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 2048, 8, 448, 1, 1, "SAME", add_bias=True, add_relu=True)
- verify_conv2d_NHWC_gemm_int8(1, 2048, 8, 192, 1, 1, "SAME", add_bias=True)
-
- # Let's also verify that it compiles fine on AArch64 targets
- compile_conv2d_NHWC_gemm_int8_arm(1, 3, 299, 32, 3, 2, "SAME")
+ a_np, w_np, b_np, c_np = get_ref_data()
+
+ def verify_workload_padding():
+ _, _, out_height, out_width = get_const_tuple(c_np.shape)
+ wkl = _get_workload(A, W, (stride, stride), padding, dilation, dtype)
+
+ # for testing functionality,
+ # we choose arbitrary int32_lanes and num_int8_elements can divide the channel,
+ # regardless of the performance.
+ int32_lanes, num_int8_elements = num_filter, in_channel
+
+ # check if tile_ow candidates are the factors of the right output weight.
+ cfg = autotvm.get_config()
+ fallback_schedule_cpu_common_int8(cfg, wkl, int32_lanes, num_int8_elements)
+ ow_tile = np.prod(cfg["tile_ow"].size)
+
+ tvm.testing.assert_allclose(ow_tile, out_width)
+
+ def check_target(target):
+ dev = tvm.device(target, 0)
+ if not tvm.testing.device_enabled(target):
+ print("Skip because %s is not enabled" % target)
+ return
+ if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
+ print("Skip because int8 intrinsics are not available")
+ return
Review Comment:
I added the `pytest.skip`. I experimented with hoisting out the functions since all the tests in that file do something similar, but annoyingly the functions are all subtly different and depend on pretty much all the parameters passed to the test and defined in the test, also compute definitions, schedules, utility functions etc which would all need to be passed as arguments, so it didn't look like it was worth it.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [tvm] ekalda commented on a diff in pull request #13669: [TOPI][bugfix] Fix a bug in arm_cpu int8 dotprod schedule and modernize tests
Posted by GitBox <gi...@apache.org>.
ekalda commented on code in PR #13669:
URL: https://github.com/apache/tvm/pull/13669#discussion_r1058904165
##########
tests/python/topi/python/test_topi_conv2d_int8.py:
##########
@@ -298,378 +175,462 @@ def get_ref_data():
a_np, w_np, b_np, c_np = get_ref_data()
- with tvm.target.Target(target):
- C = compute(
- A,
- W,
- (stride, stride),
- padding,
- (dilation, dilation),
- "NCHW",
- "NCHW",
- out_dtype,
- )
+ dev = tvm.device(target, 0)
+ with tvm.target.Target(target) as tvm_target:
+ C = compute(A, W, (stride, stride), padding, (dilation, dilation), dtype)
if add_bias:
C = topi.add(C, bias)
if add_relu:
C = topi.nn.relu(C)
s = schedule([C])
- a = tvm.nd.array(a_np.astype(dtype), dev)
- w = tvm.nd.array(w_np.astype(dtype), dev)
- b = tvm.nd.array(b_np.astype(out_dtype), dev)
- c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
-
- if add_bias:
- compile_args = [A, W, bias, C]
- run_args = [a, w, b, c]
- else:
- compile_args = [A, W, C]
- run_args = [a, w, c]
-
- func = tvm.build(
- s,
- compile_args,
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
+ a = tvm.nd.array(a_np, dev)
+ w = tvm.nd.array(w_np, dev)
+ b = tvm.nd.array(b_np, dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- if build_only:
- return
+ build_inputs = [A, W, bias, C] if add_bias else [A, W, C]
+ inference_inputs = (a, w, b, c) if add_bias else (a, w, c)
+
+ func = tvm.build(
+ s,
+ build_inputs,
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
- print("Running on target: %s" % target)
+ build_only = tvm_target.features.is_aarch64 and (platform.machine() != "aarch64")
- func(*run_args)
+ if not build_only:
+ print("Running on target: %s" % target)
+ func(*inference_inputs)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- targets = [
- (
- "cuda",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- # Disable on CI since it does not support spirv int8 dot product
- # (
- # "vulkan -from_device=0",
- # lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- # topi.cuda.schedule_conv2d_NCHWc_int8,
- # 4,
- # False,
- # ),
- ]
-
- build_only_aarch64 = platform.machine() != "aarch64"
-
- targets.append(
+@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
+@pytest.mark.parametrize(
+ "params",
+ [
+ # ResNet18 workloads where channels in / out are multiple of oc_block_factor
+ (1, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 56, 64, 1, 1, 0, 1, False, False),
+ (1, 64, 56, 128, 3, 2, 1, 1, False, False),
+ (1, 64, 56, 128, 1, 2, 0, 1, False, False),
+ (1, 128, 28, 128, 3, 1, 1, 1, False, False),
+ (1, 128, 28, 256, 3, 2, 1, 1, False, False),
+ (1, 128, 28, 256, 1, 2, 0, 1, False, False),
+ (1, 256, 14, 256, 3, 1, 1, 1, False, False),
+ (1, 256, 14, 512, 3, 2, 1, 1, False, False),
+ (1, 256, 14, 512, 1, 2, 0, 1, False, False),
+ (1, 512, 7, 512, 3, 1, 1, 1, False, False),
+ # bias, relu
+ (1, 64, 56, 64, 3, 1, 1, 1, False, True),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, False),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, True),
+ # dilation = 2
+ (1, 64, 56, 64, 3, 1, 1, 2, False, False),
+ # batch size
+ (4, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (9, 64, 56, 64, 3, 1, 1, 1, False, False),
+ # weird workloads
+ (4, 4, 4, 8, 4, 4, 4, 1, False, False),
+ # inception v3 workloads where channels in / out are multiple of oc_block_factor
+ (1, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (1, 32, 147, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 73, 80, 1, 1, 0, 1, False, False),
+ (1, 80, 73, 192, 3, 1, 0, 1, False, False),
+ (1, 192, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 192, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 48, 35, 64, 5, 1, 2, 1, False, False),
+ (1, 64, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 96, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 192, 35, 32, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 384, 3, 2, 0, 1, False, False),
+ (1, 96, 35, 96, 3, 2, 0, 1, False, False),
+ (1, 768, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 128, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 160, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 192, 17, 320, 3, 2, 0, 1, False, False),
+ (1, 192, 17, 192, 3, 2, 0, 1, False, False),
+ (1, 1280, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 1280, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 448, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 1024, 19, 88, 3, 1, 1, 1, False, False),
+ # batch > 1
+ (7, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (8, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (32, 32, 149, 32, 3, 1, 0, 1, False, False),
+ # Asymmetric padding
+ (1, 32, 35, 64, 7, 2, (0, 0, 1, 1), 1, False, False),
+ (1, 64, 8, 128, 3, 1, (3, 3, 2, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, (1, 2, 2, 1), 1, False, False),
+ (1, 64, 17, 192, 1, 1, (1, 2), 1, False, False),
+ (1, 64, 8, 64, 3, 1, (3, 1), 1, False, False),
+ (1, 128, 8, 384, 3, 1, (0, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, "VALID", 1, False, False),
+ (1, 392, 8, 64, 3, 1, "VALID", 1, False, False),
+ (1, 512, 19, 64, 1, 1, "SAME", 1, False, False),
+ (1, 64, 16, 32, 2, 1, "SAME", 1, False, False),
+ (1, 64, 8, 64, 3, 1, (1, 2, 2, 1), 1, False, True),
+ (1, 64, 8, 64, 5, 2, (1, 3), 1, True, False),
+ (1, 64, 56, 64, 3, 1, "VALID", 1, True, True),
+ (1, 64, 56, 64, 24, 1, "SAME", 1, True, True),
+ ],
+)
+def test_conv2d_NCHWc_int8(in_dtype, params):
+ with Int8Fallback():
(
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon,+v8.2a,+dotprod",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding,
+ dilation,
+ add_bias,
+ add_relu,
+ ) = params
+ pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
+ padding_sum = pad_top + pad_left + pad_bottom + pad_right
+ print(
+ "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
+ % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
)
- )
-
- if in_dtype == "int8":
- targets += [
- (
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
- ),
- (
- "rocm -mattr=+dotprod",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- ]
-
- for target, compute, schedule, oc_block_factor, build_only in targets:
- check_target(target, compute, schedule, oc_block_factor, build_only)
-
-
-def verify_conv2d_nchw_int8(
- in_dtype,
- batch,
- in_channel,
- in_size,
- num_filter,
- kernel,
- stride,
- padding,
- dilation=1,
- add_bias=False,
- add_relu=False,
-):
- pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
- padding_sum = pad_top + pad_left + pad_bottom + pad_right
- print(
- "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
- )
-
- in_height = in_width = in_size
-
- A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
- W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
- bias = te.placeholder((num_filter, 1, 1), name="bias", dtype=in_dtype)
-
- a_shape = get_const_tuple(A.shape)
- w_shape = get_const_tuple(W.shape)
- bias_shape = get_const_tuple(bias.shape)
- dtype = A.dtype
-
- @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
- def get_ref_data():
- a_np = np.random.randint(low=-128, high=127, size=a_shape).astype(dtype)
- w_np = np.random.randint(low=-128, high=128, size=w_shape).astype(dtype)
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
- c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(dtype)
-
- if add_bias:
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- c_np += b_np
- if add_relu:
- c_np = np.maximum(c_np, 0)
-
- return a_np, w_np, b_np, c_np
-
- a_np, w_np, b_np, c_np = get_ref_data()
-
- def verify_workload_padding():
- _, _, out_height, out_width = get_const_tuple(c_np.shape)
- wkl = _get_workload(A, W, (stride, stride), padding, dilation, dtype)
-
- # for testing functionality,
- # we choose arbitrary int32_lanes and num_int8_elements can divide the channel,
- # regardless of the performance.
- int32_lanes, num_int8_elements = num_filter, in_channel
- # check if tile_ow candidates are the factors of the right output weight.
- cfg = autotvm.get_config()
- fallback_schedule_cpu_common_int8(cfg, wkl, int32_lanes, num_int8_elements)
- ow_tile = np.prod(cfg["tile_ow"].size)
-
- tvm.testing.assert_allclose(ow_tile, out_width)
+ in_height = in_width = in_size
+
+ A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
+ W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
+
+ a_shape = get_const_tuple(A.shape)
+ w_shape = get_const_tuple(W.shape)
+ dtype = A.dtype
+ out_dtype = "int32" if in_dtype == "int8" else "uint32"
+ lo = -128 if in_dtype == "int8" else 0
+ hi = 127 if in_dtype == "int8" else 255
Review Comment:
Good point! Done
##########
tests/python/topi/python/test_topi_conv2d_int8.py:
##########
@@ -298,378 +175,462 @@ def get_ref_data():
a_np, w_np, b_np, c_np = get_ref_data()
- with tvm.target.Target(target):
- C = compute(
- A,
- W,
- (stride, stride),
- padding,
- (dilation, dilation),
- "NCHW",
- "NCHW",
- out_dtype,
- )
+ dev = tvm.device(target, 0)
+ with tvm.target.Target(target) as tvm_target:
+ C = compute(A, W, (stride, stride), padding, (dilation, dilation), dtype)
if add_bias:
C = topi.add(C, bias)
if add_relu:
C = topi.nn.relu(C)
s = schedule([C])
- a = tvm.nd.array(a_np.astype(dtype), dev)
- w = tvm.nd.array(w_np.astype(dtype), dev)
- b = tvm.nd.array(b_np.astype(out_dtype), dev)
- c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
-
- if add_bias:
- compile_args = [A, W, bias, C]
- run_args = [a, w, b, c]
- else:
- compile_args = [A, W, C]
- run_args = [a, w, c]
-
- func = tvm.build(
- s,
- compile_args,
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
+ a = tvm.nd.array(a_np, dev)
+ w = tvm.nd.array(w_np, dev)
+ b = tvm.nd.array(b_np, dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- if build_only:
- return
+ build_inputs = [A, W, bias, C] if add_bias else [A, W, C]
+ inference_inputs = (a, w, b, c) if add_bias else (a, w, c)
+
+ func = tvm.build(
+ s,
+ build_inputs,
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
- print("Running on target: %s" % target)
+ build_only = tvm_target.features.is_aarch64 and (platform.machine() != "aarch64")
- func(*run_args)
+ if not build_only:
+ print("Running on target: %s" % target)
+ func(*inference_inputs)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- targets = [
- (
- "cuda",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- # Disable on CI since it does not support spirv int8 dot product
- # (
- # "vulkan -from_device=0",
- # lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- # topi.cuda.schedule_conv2d_NCHWc_int8,
- # 4,
- # False,
- # ),
- ]
-
- build_only_aarch64 = platform.machine() != "aarch64"
-
- targets.append(
+@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
+@pytest.mark.parametrize(
+ "params",
+ [
+ # ResNet18 workloads where channels in / out are multiple of oc_block_factor
+ (1, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 56, 64, 1, 1, 0, 1, False, False),
+ (1, 64, 56, 128, 3, 2, 1, 1, False, False),
+ (1, 64, 56, 128, 1, 2, 0, 1, False, False),
+ (1, 128, 28, 128, 3, 1, 1, 1, False, False),
+ (1, 128, 28, 256, 3, 2, 1, 1, False, False),
+ (1, 128, 28, 256, 1, 2, 0, 1, False, False),
+ (1, 256, 14, 256, 3, 1, 1, 1, False, False),
+ (1, 256, 14, 512, 3, 2, 1, 1, False, False),
+ (1, 256, 14, 512, 1, 2, 0, 1, False, False),
+ (1, 512, 7, 512, 3, 1, 1, 1, False, False),
+ # bias, relu
+ (1, 64, 56, 64, 3, 1, 1, 1, False, True),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, False),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, True),
+ # dilation = 2
+ (1, 64, 56, 64, 3, 1, 1, 2, False, False),
+ # batch size
+ (4, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (9, 64, 56, 64, 3, 1, 1, 1, False, False),
+ # weird workloads
+ (4, 4, 4, 8, 4, 4, 4, 1, False, False),
+ # inception v3 workloads where channels in / out are multiple of oc_block_factor
+ (1, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (1, 32, 147, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 73, 80, 1, 1, 0, 1, False, False),
+ (1, 80, 73, 192, 3, 1, 0, 1, False, False),
+ (1, 192, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 192, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 48, 35, 64, 5, 1, 2, 1, False, False),
+ (1, 64, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 96, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 192, 35, 32, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 384, 3, 2, 0, 1, False, False),
+ (1, 96, 35, 96, 3, 2, 0, 1, False, False),
+ (1, 768, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 128, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 160, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 192, 17, 320, 3, 2, 0, 1, False, False),
+ (1, 192, 17, 192, 3, 2, 0, 1, False, False),
+ (1, 1280, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 1280, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 448, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 1024, 19, 88, 3, 1, 1, 1, False, False),
+ # batch > 1
+ (7, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (8, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (32, 32, 149, 32, 3, 1, 0, 1, False, False),
+ # Asymmetric padding
+ (1, 32, 35, 64, 7, 2, (0, 0, 1, 1), 1, False, False),
+ (1, 64, 8, 128, 3, 1, (3, 3, 2, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, (1, 2, 2, 1), 1, False, False),
+ (1, 64, 17, 192, 1, 1, (1, 2), 1, False, False),
+ (1, 64, 8, 64, 3, 1, (3, 1), 1, False, False),
+ (1, 128, 8, 384, 3, 1, (0, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, "VALID", 1, False, False),
+ (1, 392, 8, 64, 3, 1, "VALID", 1, False, False),
+ (1, 512, 19, 64, 1, 1, "SAME", 1, False, False),
+ (1, 64, 16, 32, 2, 1, "SAME", 1, False, False),
+ (1, 64, 8, 64, 3, 1, (1, 2, 2, 1), 1, False, True),
+ (1, 64, 8, 64, 5, 2, (1, 3), 1, True, False),
+ (1, 64, 56, 64, 3, 1, "VALID", 1, True, True),
+ (1, 64, 56, 64, 24, 1, "SAME", 1, True, True),
+ ],
+)
+def test_conv2d_NCHWc_int8(in_dtype, params):
+ with Int8Fallback():
(
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon,+v8.2a,+dotprod",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding,
+ dilation,
+ add_bias,
+ add_relu,
+ ) = params
+ pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
+ padding_sum = pad_top + pad_left + pad_bottom + pad_right
+ print(
+ "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
+ % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
)
- )
-
- if in_dtype == "int8":
- targets += [
- (
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
- ),
- (
- "rocm -mattr=+dotprod",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- ]
-
- for target, compute, schedule, oc_block_factor, build_only in targets:
- check_target(target, compute, schedule, oc_block_factor, build_only)
-
-
-def verify_conv2d_nchw_int8(
- in_dtype,
- batch,
- in_channel,
- in_size,
- num_filter,
- kernel,
- stride,
- padding,
- dilation=1,
- add_bias=False,
- add_relu=False,
-):
- pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
- padding_sum = pad_top + pad_left + pad_bottom + pad_right
- print(
- "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
- )
-
- in_height = in_width = in_size
-
- A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
- W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
- bias = te.placeholder((num_filter, 1, 1), name="bias", dtype=in_dtype)
-
- a_shape = get_const_tuple(A.shape)
- w_shape = get_const_tuple(W.shape)
- bias_shape = get_const_tuple(bias.shape)
- dtype = A.dtype
-
- @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
- def get_ref_data():
- a_np = np.random.randint(low=-128, high=127, size=a_shape).astype(dtype)
- w_np = np.random.randint(low=-128, high=128, size=w_shape).astype(dtype)
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
- c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(dtype)
-
- if add_bias:
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- c_np += b_np
- if add_relu:
- c_np = np.maximum(c_np, 0)
-
- return a_np, w_np, b_np, c_np
-
- a_np, w_np, b_np, c_np = get_ref_data()
-
- def verify_workload_padding():
- _, _, out_height, out_width = get_const_tuple(c_np.shape)
- wkl = _get_workload(A, W, (stride, stride), padding, dilation, dtype)
-
- # for testing functionality,
- # we choose arbitrary int32_lanes and num_int8_elements can divide the channel,
- # regardless of the performance.
- int32_lanes, num_int8_elements = num_filter, in_channel
- # check if tile_ow candidates are the factors of the right output weight.
- cfg = autotvm.get_config()
- fallback_schedule_cpu_common_int8(cfg, wkl, int32_lanes, num_int8_elements)
- ow_tile = np.prod(cfg["tile_ow"].size)
-
- tvm.testing.assert_allclose(ow_tile, out_width)
+ in_height = in_width = in_size
+
+ A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
+ W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
+
+ a_shape = get_const_tuple(A.shape)
+ w_shape = get_const_tuple(W.shape)
+ dtype = A.dtype
+ out_dtype = "int32" if in_dtype == "int8" else "uint32"
+ lo = -128 if in_dtype == "int8" else 0
+ hi = 127 if in_dtype == "int8" else 255
+
+ def check_target(target, compute, schedule, oc_block_factor, build_only):
+ dev = tvm.device(target, 0)
+ if not tvm.testing.device_enabled(target):
+ print("Skip because %s is not enabled" % target)
+ return
+ if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
+ print("Skip because int8 intrinsics are not available")
+ return
Review Comment:
Done
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [tvm] Mousius commented on a diff in pull request #13669: [TOPI][bugfix] Fix a bug in arm_cpu int8 dotprod schedule and modernize tests
Posted by GitBox <gi...@apache.org>.
Mousius commented on code in PR #13669:
URL: https://github.com/apache/tvm/pull/13669#discussion_r1058367284
##########
tests/python/topi/python/test_topi_conv2d_int8.py:
##########
@@ -298,378 +175,462 @@ def get_ref_data():
a_np, w_np, b_np, c_np = get_ref_data()
- with tvm.target.Target(target):
- C = compute(
- A,
- W,
- (stride, stride),
- padding,
- (dilation, dilation),
- "NCHW",
- "NCHW",
- out_dtype,
- )
+ dev = tvm.device(target, 0)
+ with tvm.target.Target(target) as tvm_target:
+ C = compute(A, W, (stride, stride), padding, (dilation, dilation), dtype)
if add_bias:
C = topi.add(C, bias)
if add_relu:
C = topi.nn.relu(C)
s = schedule([C])
- a = tvm.nd.array(a_np.astype(dtype), dev)
- w = tvm.nd.array(w_np.astype(dtype), dev)
- b = tvm.nd.array(b_np.astype(out_dtype), dev)
- c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
-
- if add_bias:
- compile_args = [A, W, bias, C]
- run_args = [a, w, b, c]
- else:
- compile_args = [A, W, C]
- run_args = [a, w, c]
-
- func = tvm.build(
- s,
- compile_args,
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
+ a = tvm.nd.array(a_np, dev)
+ w = tvm.nd.array(w_np, dev)
+ b = tvm.nd.array(b_np, dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- if build_only:
- return
+ build_inputs = [A, W, bias, C] if add_bias else [A, W, C]
+ inference_inputs = (a, w, b, c) if add_bias else (a, w, c)
+
+ func = tvm.build(
+ s,
+ build_inputs,
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
- print("Running on target: %s" % target)
+ build_only = tvm_target.features.is_aarch64 and (platform.machine() != "aarch64")
- func(*run_args)
+ if not build_only:
+ print("Running on target: %s" % target)
+ func(*inference_inputs)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- targets = [
- (
- "cuda",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- # Disable on CI since it does not support spirv int8 dot product
- # (
- # "vulkan -from_device=0",
- # lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- # topi.cuda.schedule_conv2d_NCHWc_int8,
- # 4,
- # False,
- # ),
- ]
-
- build_only_aarch64 = platform.machine() != "aarch64"
-
- targets.append(
+@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
+@pytest.mark.parametrize(
+ "params",
+ [
+ # ResNet18 workloads where channels in / out are multiple of oc_block_factor
+ (1, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 56, 64, 1, 1, 0, 1, False, False),
+ (1, 64, 56, 128, 3, 2, 1, 1, False, False),
+ (1, 64, 56, 128, 1, 2, 0, 1, False, False),
+ (1, 128, 28, 128, 3, 1, 1, 1, False, False),
+ (1, 128, 28, 256, 3, 2, 1, 1, False, False),
+ (1, 128, 28, 256, 1, 2, 0, 1, False, False),
+ (1, 256, 14, 256, 3, 1, 1, 1, False, False),
+ (1, 256, 14, 512, 3, 2, 1, 1, False, False),
+ (1, 256, 14, 512, 1, 2, 0, 1, False, False),
+ (1, 512, 7, 512, 3, 1, 1, 1, False, False),
+ # bias, relu
+ (1, 64, 56, 64, 3, 1, 1, 1, False, True),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, False),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, True),
+ # dilation = 2
+ (1, 64, 56, 64, 3, 1, 1, 2, False, False),
+ # batch size
+ (4, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (9, 64, 56, 64, 3, 1, 1, 1, False, False),
+ # weird workloads
+ (4, 4, 4, 8, 4, 4, 4, 1, False, False),
+ # inception v3 workloads where channels in / out are multiple of oc_block_factor
+ (1, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (1, 32, 147, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 73, 80, 1, 1, 0, 1, False, False),
+ (1, 80, 73, 192, 3, 1, 0, 1, False, False),
+ (1, 192, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 192, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 48, 35, 64, 5, 1, 2, 1, False, False),
+ (1, 64, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 96, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 192, 35, 32, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 384, 3, 2, 0, 1, False, False),
+ (1, 96, 35, 96, 3, 2, 0, 1, False, False),
+ (1, 768, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 128, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 160, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 192, 17, 320, 3, 2, 0, 1, False, False),
+ (1, 192, 17, 192, 3, 2, 0, 1, False, False),
+ (1, 1280, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 1280, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 448, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 1024, 19, 88, 3, 1, 1, 1, False, False),
+ # batch > 1
+ (7, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (8, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (32, 32, 149, 32, 3, 1, 0, 1, False, False),
+ # Asymmetric padding
+ (1, 32, 35, 64, 7, 2, (0, 0, 1, 1), 1, False, False),
+ (1, 64, 8, 128, 3, 1, (3, 3, 2, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, (1, 2, 2, 1), 1, False, False),
+ (1, 64, 17, 192, 1, 1, (1, 2), 1, False, False),
+ (1, 64, 8, 64, 3, 1, (3, 1), 1, False, False),
+ (1, 128, 8, 384, 3, 1, (0, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, "VALID", 1, False, False),
+ (1, 392, 8, 64, 3, 1, "VALID", 1, False, False),
+ (1, 512, 19, 64, 1, 1, "SAME", 1, False, False),
+ (1, 64, 16, 32, 2, 1, "SAME", 1, False, False),
+ (1, 64, 8, 64, 3, 1, (1, 2, 2, 1), 1, False, True),
+ (1, 64, 8, 64, 5, 2, (1, 3), 1, True, False),
+ (1, 64, 56, 64, 3, 1, "VALID", 1, True, True),
+ (1, 64, 56, 64, 24, 1, "SAME", 1, True, True),
+ ],
+)
+def test_conv2d_NCHWc_int8(in_dtype, params):
+ with Int8Fallback():
(
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon,+v8.2a,+dotprod",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding,
+ dilation,
+ add_bias,
+ add_relu,
+ ) = params
+ pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
+ padding_sum = pad_top + pad_left + pad_bottom + pad_right
+ print(
+ "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
+ % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
)
- )
-
- if in_dtype == "int8":
- targets += [
- (
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
- ),
- (
- "rocm -mattr=+dotprod",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- ]
-
- for target, compute, schedule, oc_block_factor, build_only in targets:
- check_target(target, compute, schedule, oc_block_factor, build_only)
-
-
-def verify_conv2d_nchw_int8(
- in_dtype,
- batch,
- in_channel,
- in_size,
- num_filter,
- kernel,
- stride,
- padding,
- dilation=1,
- add_bias=False,
- add_relu=False,
-):
- pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
- padding_sum = pad_top + pad_left + pad_bottom + pad_right
- print(
- "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
- )
-
- in_height = in_width = in_size
-
- A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
- W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
- bias = te.placeholder((num_filter, 1, 1), name="bias", dtype=in_dtype)
-
- a_shape = get_const_tuple(A.shape)
- w_shape = get_const_tuple(W.shape)
- bias_shape = get_const_tuple(bias.shape)
- dtype = A.dtype
-
- @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
- def get_ref_data():
- a_np = np.random.randint(low=-128, high=127, size=a_shape).astype(dtype)
- w_np = np.random.randint(low=-128, high=128, size=w_shape).astype(dtype)
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
- c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(dtype)
-
- if add_bias:
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- c_np += b_np
- if add_relu:
- c_np = np.maximum(c_np, 0)
-
- return a_np, w_np, b_np, c_np
-
- a_np, w_np, b_np, c_np = get_ref_data()
-
- def verify_workload_padding():
- _, _, out_height, out_width = get_const_tuple(c_np.shape)
- wkl = _get_workload(A, W, (stride, stride), padding, dilation, dtype)
-
- # for testing functionality,
- # we choose arbitrary int32_lanes and num_int8_elements can divide the channel,
- # regardless of the performance.
- int32_lanes, num_int8_elements = num_filter, in_channel
- # check if tile_ow candidates are the factors of the right output weight.
- cfg = autotvm.get_config()
- fallback_schedule_cpu_common_int8(cfg, wkl, int32_lanes, num_int8_elements)
- ow_tile = np.prod(cfg["tile_ow"].size)
-
- tvm.testing.assert_allclose(ow_tile, out_width)
+ in_height = in_width = in_size
+
+ A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
+ W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
+
+ a_shape = get_const_tuple(A.shape)
+ w_shape = get_const_tuple(W.shape)
+ dtype = A.dtype
+ out_dtype = "int32" if in_dtype == "int8" else "uint32"
+ lo = -128 if in_dtype == "int8" else 0
+ hi = 127 if in_dtype == "int8" else 255
+
+ def check_target(target, compute, schedule, oc_block_factor, build_only):
+ dev = tvm.device(target, 0)
+ if not tvm.testing.device_enabled(target):
+ print("Skip because %s is not enabled" % target)
+ return
+ if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
+ print("Skip because int8 intrinsics are not available")
+ return
+
+ bias = te.placeholder(
+ (num_filter // oc_block_factor, 1, 1, oc_block_factor), name="bias", dtype=out_dtype
+ )
+ bias_shape = get_const_tuple(bias.shape)
- def check_target(target):
- dev = tvm.device(target, 0)
- if not tvm.testing.device_enabled(target):
- print("Skip because %s is not enabled" % target)
- return
- if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
- print("Skip because int8 intrinsics are not available")
- return
+ @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
+ def get_ref_data():
+ a_np = np.random.randint(low=lo, high=hi, size=a_shape).astype(out_dtype)
+ w_np = np.random.randint(low=lo, high=hi, size=w_shape).astype(out_dtype)
+ b_np = np.random.uniform(size=bias_shape).astype(out_dtype)
+ dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
+ c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(
+ out_dtype
+ )
+
+ # convert to NCHWc
+ _, _, out_height, out_width = c_np.shape
+ c_np = c_np.reshape(
+ (batch, num_filter // oc_block_factor, oc_block_factor, out_height, out_width)
+ ).transpose(0, 1, 3, 4, 2)
+
+ if add_bias:
+ b_np = np.random.uniform(size=bias_shape).astype(out_dtype)
+ c_np += b_np
+ if add_relu:
+ c_np = np.maximum(c_np, 0)
+
+ return a_np, w_np, b_np, c_np
+
+ a_np, w_np, b_np, c_np = get_ref_data()
+
+ with tvm.target.Target(target):
+ C = compute(
+ A,
+ W,
+ (stride, stride),
+ padding,
+ (dilation, dilation),
+ "NCHW",
+ "NCHW",
+ out_dtype,
+ )
+ if add_bias:
+ C = topi.add(C, bias)
+ if add_relu:
+ C = topi.nn.relu(C)
+ s = schedule([C])
+
+ a = tvm.nd.array(a_np.astype(dtype), dev)
+ w = tvm.nd.array(w_np.astype(dtype), dev)
+ b = tvm.nd.array(b_np.astype(out_dtype), dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- print("Running on target: %s" % target)
- with tvm.target.Target(target):
- C = topi.cuda.conv2d_nchw_int8(
- A, W, (stride, stride), padding, (dilation, dilation), dtype
- )
if add_bias:
- C = topi.add(C, bias)
- if add_relu:
- C = topi.nn.relu(C)
- s = topi.cuda.schedule_conv2d_nchw_int8([C])
-
- a = tvm.nd.array(a_np, dev)
- w = tvm.nd.array(w_np, dev)
- b = tvm.nd.array(b_np, dev)
- c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- if add_bias:
- tvm.build(
- s,
- [A, W, bias, C],
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
- func = tvm.build(
- s,
- [A, W, bias, C],
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
- func(a, w, b, c)
- else:
+ compile_args = [A, W, bias, C]
+ run_args = [a, w, b, c]
+ else:
+ compile_args = [A, W, C]
+ run_args = [a, w, c]
+
func = tvm.build(
s,
- [A, W, C],
+ compile_args,
target,
name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
% (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
)
- func(a, w, c)
- tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- verify_workload_padding()
+ if build_only:
+ return
- for target in ["cuda"]:
- check_target(target)
+ print("Running on target: %s" % target)
+ func(*run_args)
-@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
-def test_conv2d_nchw(in_dtype):
- with Int8Fallback():
- # ResNet18 workloads where channels in / out are multiple of oc_block_factor
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 128, 3, 2, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 128, 1, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 28, 128, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 28, 256, 3, 2, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 28, 256, 1, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 14, 256, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 14, 512, 3, 2, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 14, 512, 1, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 512, 7, 512, 3, 1, 1)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- # bias, relu
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_relu=True)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_bias=True)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_bias=True, add_relu=True)
+ targets = [
+ (
+ "cuda",
+ lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
+ topi.cuda.schedule_conv2d_NCHWc_int8,
+ 4,
+ False,
+ ),
+ # Disable on CI since it does not support spirv int8 dot product
+ # (
+ # "vulkan -from_device=0",
+ # lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
+ # topi.cuda.schedule_conv2d_NCHWc_int8,
+ # 4,
+ # False,
+ # ),
+ ]
- # dilation = 2
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, dilation=2)
+ build_only_aarch64 = platform.machine() != "aarch64"
- # batch size
- verify_conv2d_NCHWc_int8(in_dtype, 4, 64, 56, 64, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 9, 64, 56, 64, 3, 1, 1)
+ targets.append(
+ (
+ "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon,+v8.2a,+dotprod",
+ topi.arm_cpu.conv2d_NCHWc_int8,
+ topi.arm_cpu.schedule_conv2d_NCHWc_int8,
+ 8,
+ build_only_aarch64,
+ )
+ )
- # weird workloads
- verify_conv2d_NCHWc_int8(in_dtype, 4, 4, 4, 8, 4, 4, 4)
+ if in_dtype == "int8":
+ targets += [
+ (
+ "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon",
+ topi.arm_cpu.conv2d_NCHWc_int8,
+ topi.arm_cpu.schedule_conv2d_NCHWc_int8,
+ 8,
+ build_only_aarch64,
+ ),
+ (
+ "rocm -mattr=+dotprod",
+ lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(
+ a, w, s, p, d, l, o
+ ),
+ topi.cuda.schedule_conv2d_NCHWc_int8,
+ 4,
+ False,
+ ),
+ ]
+
+ for target, compute, schedule, oc_block_factor, build_only in targets:
+ check_target(target, compute, schedule, oc_block_factor, build_only)
+
+
+# Conv2d NCHW int8 schedule testing. Internally, it uses NCHWc schedule. So, just
+# performing basic testing - one test for all different scenarios - batch, dilation etc..
+@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
+@pytest.mark.parametrize(
+ "params",
+ [
+ (1, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 56, 64, 3, 1, 1, 1, False, True),
+ (1, 64, 56, 64, 3, 1, 1, 2, False, False),
+ (9, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (4, 4, 4, 4, 4, 4, 4, 1, False, False),
+ (1, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (7, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (1, 32, 35, 64, 7, 2, (0, 0, 1, 1), 1, False, False),
+ (1, 32, 35, 64, 7, 2, (0, 0, 2, 2), 1, False, False),
+ ],
+)
+def test_conv2d_nchw_int8(in_dtype, params):
+ with Int8Fallback():
+ (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding,
+ dilation,
+ add_bias,
+ add_relu,
+ ) = params
+ pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
+ padding_sum = pad_top + pad_left + pad_bottom + pad_right
+ print(
+ "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
+ % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
+ )
- # inception v3 workloads where channels in / out are multiple of oc_block_factor
- verify_conv2d_NCHWc_int8(in_dtype, 1, 32, 149, 32, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 32, 147, 64, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 73, 80, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 80, 73, 192, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 35, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 35, 48, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 48, 35, 64, 5, 1, 2)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 35, 96, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 96, 35, 96, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 35, 32, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 35, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 35, 48, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 288, 35, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 288, 35, 48, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 288, 35, 384, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 96, 35, 96, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 768, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 768, 17, 128, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 128, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 192, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 128, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 768, 17, 160, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 160, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 192, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 160, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 192, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 320, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 192, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 320, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 384, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 384, 8, 384, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 384, 8, 384, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 448, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 448, 8, 384, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 320, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 384, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 448, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1024, 19, 88, 3, 1, 1)
+ in_height = in_width = in_size
- # batch > 1
- verify_conv2d_NCHWc_int8(in_dtype, 7, 32, 149, 32, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 8, 32, 149, 32, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 32, 32, 149, 32, 3, 1, 0)
+ A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
+ W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
+ bias = te.placeholder((num_filter, 1, 1), name="bias", dtype=in_dtype)
- # Asymmetric padding
- verify_conv2d_NCHWc_int8(in_dtype, 1, 32, 35, 64, 7, 2, (0, 0, 1, 1))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 128, 3, 1, (3, 3, 2, 2))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 1, 1, (1, 2, 2, 1))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 17, 192, 1, 1, (1, 2))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 3, 1, (3, 1))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 8, 384, 3, 1, (0, 2))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 1, 1, "VALID")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 392, 8, 64, 3, 1, "VALID")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 512, 19, 64, 1, 1, "SAME")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 16, 32, 2, 1, "SAME")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 3, 1, (1, 2, 2, 1), add_relu=True)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 5, 2, (1, 3), add_bias=True)
- verify_conv2d_NCHWc_int8(
- in_dtype, 1, 64, 56, 64, 3, 1, "VALID", add_bias=True, add_relu=True
- )
- verify_conv2d_NCHWc_int8(
- in_dtype, 1, 64, 56, 64, 24, 1, "SAME", add_bias=True, add_relu=True
- )
+ a_shape = get_const_tuple(A.shape)
+ w_shape = get_const_tuple(W.shape)
+ bias_shape = get_const_tuple(bias.shape)
+ dtype = A.dtype
+
+ @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
+ def get_ref_data():
+ a_np = np.random.randint(low=-128, high=127, size=a_shape).astype(dtype)
+ w_np = np.random.randint(low=-128, high=128, size=w_shape).astype(dtype)
+ b_np = np.random.uniform(size=bias_shape).astype(dtype)
+ dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
+ c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(dtype)
- # Conv2d NCHW int8 schedule testing. Internally, it uses NCHWc schedule. So, just
- # performing basic testing - one test for all different scenarios - batch, dilation etc..
- verify_conv2d_nchw_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1)
- verify_conv2d_nchw_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_relu=True)
- verify_conv2d_nchw_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, dilation=2)
- verify_conv2d_nchw_int8(in_dtype, 9, 64, 56, 64, 3, 1, 1)
- verify_conv2d_nchw_int8(in_dtype, 4, 4, 4, 4, 4, 4, 4)
- verify_conv2d_nchw_int8(in_dtype, 1, 32, 149, 32, 3, 1, 0)
- verify_conv2d_nchw_int8(in_dtype, 7, 32, 149, 32, 3, 1, 0)
- verify_conv2d_nchw_int8(in_dtype, 1, 32, 35, 64, 7, 2, (0, 0, 1, 1))
- verify_conv2d_nchw_int8(in_dtype, 1, 32, 35, 64, 7, 2, (0, 0, 2, 2))
+ if add_bias:
+ b_np = np.random.uniform(size=bias_shape).astype(dtype)
+ c_np += b_np
+ if add_relu:
+ c_np = np.maximum(c_np, 0)
+ return a_np, w_np, b_np, c_np
-def test_conv2d_nhwc():
- with Int8Fallback():
- # Subset of inception v3 expanded (dilation > 1, batch > 1, 'VALID' padding)
- verify_conv2d_NHWC_gemm_int8(1, 3, 299, 32, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 32, 149, 32, 3, 1, "SAME", dilation=2)
- verify_conv2d_NHWC_gemm_int8(4, 32, 147, 64, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 64, 73, 80, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 80, 73, 192, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 35, 48, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 35, 64, 1, 1, "VALID")
- verify_conv2d_NHWC_gemm_int8(1, 192, 35, 32, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 48, 35, 64, 5, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 96, 35, 96, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 256, 35, 48, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 256, 35, 64, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 288, 35, 64, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 288, 35, 48, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 96, 35, 96, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 128, 17, 192, 7, 1, "SAME", dilation=2)
- verify_conv2d_NHWC_gemm_int8(1, 160, 17, 160, 7, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 160, 17, 192, 1, 1, "VALID")
- verify_conv2d_NHWC_gemm_int8(1, 192, 17, 192, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 768, 5, 128, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 17, 320, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 17, 192, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 192, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 384, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 320, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 448, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 384, 8, 384, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 384, 8, 384, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 448, 8, 384, 3, 1, "VALID")
- verify_conv2d_NHWC_gemm_int8(1, 2048, 8, 320, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 2048, 8, 448, 1, 1, "SAME", add_bias=True, add_relu=True)
- verify_conv2d_NHWC_gemm_int8(1, 2048, 8, 192, 1, 1, "SAME", add_bias=True)
-
- # Let's also verify that it compiles fine on AArch64 targets
- compile_conv2d_NHWC_gemm_int8_arm(1, 3, 299, 32, 3, 2, "SAME")
+ a_np, w_np, b_np, c_np = get_ref_data()
+
+ def verify_workload_padding():
+ _, _, out_height, out_width = get_const_tuple(c_np.shape)
+ wkl = _get_workload(A, W, (stride, stride), padding, dilation, dtype)
+
+ # for testing functionality,
+ # we choose arbitrary int32_lanes and num_int8_elements can divide the channel,
+ # regardless of the performance.
+ int32_lanes, num_int8_elements = num_filter, in_channel
+
+ # check if tile_ow candidates are the factors of the right output weight.
+ cfg = autotvm.get_config()
+ fallback_schedule_cpu_common_int8(cfg, wkl, int32_lanes, num_int8_elements)
+ ow_tile = np.prod(cfg["tile_ow"].size)
+
+ tvm.testing.assert_allclose(ow_tile, out_width)
+
+ def check_target(target):
+ dev = tvm.device(target, 0)
+ if not tvm.testing.device_enabled(target):
+ print("Skip because %s is not enabled" % target)
+ return
+ if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
+ print("Skip because int8 intrinsics are not available")
+ return
+
+ print("Running on target: %s" % target)
+ with tvm.target.Target(target):
+ C = topi.cuda.conv2d_nchw_int8(
+ A, W, (stride, stride), padding, (dilation, dilation), dtype
+ )
+ if add_bias:
+ C = topi.add(C, bias)
+ if add_relu:
+ C = topi.nn.relu(C)
+ s = topi.cuda.schedule_conv2d_nchw_int8([C])
+
+ a = tvm.nd.array(a_np, dev)
+ w = tvm.nd.array(w_np, dev)
+ b = tvm.nd.array(b_np, dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
+ if add_bias:
+ func = tvm.build(
+ s,
+ [A, W, bias, C],
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
+ func(a, w, b, c)
+ else:
+ func = tvm.build(
+ s,
+ [A, W, C],
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
+ func(a, w, c)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
+
+ verify_workload_padding()
+
+ for target in ["cuda"]:
+ check_target(target)
Review Comment:
```suggestion
check_target("cuda")
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [tvm] ekalda commented on a diff in pull request #13669: [TOPI][bugfix] Fix a bug in arm_cpu int8 dotprod schedule and modernize tests
Posted by GitBox <gi...@apache.org>.
ekalda commented on code in PR #13669:
URL: https://github.com/apache/tvm/pull/13669#discussion_r1058905849
##########
tests/python/topi/python/test_topi_conv2d_int8.py:
##########
@@ -298,378 +175,462 @@ def get_ref_data():
a_np, w_np, b_np, c_np = get_ref_data()
- with tvm.target.Target(target):
- C = compute(
- A,
- W,
- (stride, stride),
- padding,
- (dilation, dilation),
- "NCHW",
- "NCHW",
- out_dtype,
- )
+ dev = tvm.device(target, 0)
+ with tvm.target.Target(target) as tvm_target:
+ C = compute(A, W, (stride, stride), padding, (dilation, dilation), dtype)
if add_bias:
C = topi.add(C, bias)
if add_relu:
C = topi.nn.relu(C)
s = schedule([C])
- a = tvm.nd.array(a_np.astype(dtype), dev)
- w = tvm.nd.array(w_np.astype(dtype), dev)
- b = tvm.nd.array(b_np.astype(out_dtype), dev)
- c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
-
- if add_bias:
- compile_args = [A, W, bias, C]
- run_args = [a, w, b, c]
- else:
- compile_args = [A, W, C]
- run_args = [a, w, c]
-
- func = tvm.build(
- s,
- compile_args,
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
+ a = tvm.nd.array(a_np, dev)
+ w = tvm.nd.array(w_np, dev)
+ b = tvm.nd.array(b_np, dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- if build_only:
- return
+ build_inputs = [A, W, bias, C] if add_bias else [A, W, C]
+ inference_inputs = (a, w, b, c) if add_bias else (a, w, c)
+
+ func = tvm.build(
+ s,
+ build_inputs,
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
- print("Running on target: %s" % target)
+ build_only = tvm_target.features.is_aarch64 and (platform.machine() != "aarch64")
- func(*run_args)
+ if not build_only:
+ print("Running on target: %s" % target)
+ func(*inference_inputs)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- targets = [
- (
- "cuda",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- # Disable on CI since it does not support spirv int8 dot product
- # (
- # "vulkan -from_device=0",
- # lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- # topi.cuda.schedule_conv2d_NCHWc_int8,
- # 4,
- # False,
- # ),
- ]
-
- build_only_aarch64 = platform.machine() != "aarch64"
-
- targets.append(
+@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
+@pytest.mark.parametrize(
+ "params",
+ [
+ # ResNet18 workloads where channels in / out are multiple of oc_block_factor
+ (1, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 56, 64, 1, 1, 0, 1, False, False),
+ (1, 64, 56, 128, 3, 2, 1, 1, False, False),
+ (1, 64, 56, 128, 1, 2, 0, 1, False, False),
+ (1, 128, 28, 128, 3, 1, 1, 1, False, False),
+ (1, 128, 28, 256, 3, 2, 1, 1, False, False),
+ (1, 128, 28, 256, 1, 2, 0, 1, False, False),
+ (1, 256, 14, 256, 3, 1, 1, 1, False, False),
+ (1, 256, 14, 512, 3, 2, 1, 1, False, False),
+ (1, 256, 14, 512, 1, 2, 0, 1, False, False),
+ (1, 512, 7, 512, 3, 1, 1, 1, False, False),
+ # bias, relu
+ (1, 64, 56, 64, 3, 1, 1, 1, False, True),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, False),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, True),
+ # dilation = 2
+ (1, 64, 56, 64, 3, 1, 1, 2, False, False),
+ # batch size
+ (4, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (9, 64, 56, 64, 3, 1, 1, 1, False, False),
+ # weird workloads
+ (4, 4, 4, 8, 4, 4, 4, 1, False, False),
+ # inception v3 workloads where channels in / out are multiple of oc_block_factor
+ (1, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (1, 32, 147, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 73, 80, 1, 1, 0, 1, False, False),
+ (1, 80, 73, 192, 3, 1, 0, 1, False, False),
+ (1, 192, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 192, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 48, 35, 64, 5, 1, 2, 1, False, False),
+ (1, 64, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 96, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 192, 35, 32, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 384, 3, 2, 0, 1, False, False),
+ (1, 96, 35, 96, 3, 2, 0, 1, False, False),
+ (1, 768, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 128, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 160, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 192, 17, 320, 3, 2, 0, 1, False, False),
+ (1, 192, 17, 192, 3, 2, 0, 1, False, False),
+ (1, 1280, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 1280, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 448, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 1024, 19, 88, 3, 1, 1, 1, False, False),
+ # batch > 1
+ (7, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (8, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (32, 32, 149, 32, 3, 1, 0, 1, False, False),
+ # Asymmetric padding
+ (1, 32, 35, 64, 7, 2, (0, 0, 1, 1), 1, False, False),
+ (1, 64, 8, 128, 3, 1, (3, 3, 2, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, (1, 2, 2, 1), 1, False, False),
+ (1, 64, 17, 192, 1, 1, (1, 2), 1, False, False),
+ (1, 64, 8, 64, 3, 1, (3, 1), 1, False, False),
+ (1, 128, 8, 384, 3, 1, (0, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, "VALID", 1, False, False),
+ (1, 392, 8, 64, 3, 1, "VALID", 1, False, False),
+ (1, 512, 19, 64, 1, 1, "SAME", 1, False, False),
+ (1, 64, 16, 32, 2, 1, "SAME", 1, False, False),
+ (1, 64, 8, 64, 3, 1, (1, 2, 2, 1), 1, False, True),
+ (1, 64, 8, 64, 5, 2, (1, 3), 1, True, False),
+ (1, 64, 56, 64, 3, 1, "VALID", 1, True, True),
+ (1, 64, 56, 64, 24, 1, "SAME", 1, True, True),
+ ],
+)
+def test_conv2d_NCHWc_int8(in_dtype, params):
+ with Int8Fallback():
(
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon,+v8.2a,+dotprod",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding,
+ dilation,
+ add_bias,
+ add_relu,
+ ) = params
+ pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
+ padding_sum = pad_top + pad_left + pad_bottom + pad_right
+ print(
+ "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
+ % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
)
- )
-
- if in_dtype == "int8":
- targets += [
- (
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
- ),
- (
- "rocm -mattr=+dotprod",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- ]
-
- for target, compute, schedule, oc_block_factor, build_only in targets:
- check_target(target, compute, schedule, oc_block_factor, build_only)
-
-
-def verify_conv2d_nchw_int8(
- in_dtype,
- batch,
- in_channel,
- in_size,
- num_filter,
- kernel,
- stride,
- padding,
- dilation=1,
- add_bias=False,
- add_relu=False,
-):
- pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
- padding_sum = pad_top + pad_left + pad_bottom + pad_right
- print(
- "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
- )
-
- in_height = in_width = in_size
-
- A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
- W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
- bias = te.placeholder((num_filter, 1, 1), name="bias", dtype=in_dtype)
-
- a_shape = get_const_tuple(A.shape)
- w_shape = get_const_tuple(W.shape)
- bias_shape = get_const_tuple(bias.shape)
- dtype = A.dtype
-
- @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
- def get_ref_data():
- a_np = np.random.randint(low=-128, high=127, size=a_shape).astype(dtype)
- w_np = np.random.randint(low=-128, high=128, size=w_shape).astype(dtype)
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
- c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(dtype)
-
- if add_bias:
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- c_np += b_np
- if add_relu:
- c_np = np.maximum(c_np, 0)
-
- return a_np, w_np, b_np, c_np
-
- a_np, w_np, b_np, c_np = get_ref_data()
-
- def verify_workload_padding():
- _, _, out_height, out_width = get_const_tuple(c_np.shape)
- wkl = _get_workload(A, W, (stride, stride), padding, dilation, dtype)
-
- # for testing functionality,
- # we choose arbitrary int32_lanes and num_int8_elements can divide the channel,
- # regardless of the performance.
- int32_lanes, num_int8_elements = num_filter, in_channel
- # check if tile_ow candidates are the factors of the right output weight.
- cfg = autotvm.get_config()
- fallback_schedule_cpu_common_int8(cfg, wkl, int32_lanes, num_int8_elements)
- ow_tile = np.prod(cfg["tile_ow"].size)
-
- tvm.testing.assert_allclose(ow_tile, out_width)
+ in_height = in_width = in_size
+
+ A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
+ W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
+
+ a_shape = get_const_tuple(A.shape)
+ w_shape = get_const_tuple(W.shape)
+ dtype = A.dtype
+ out_dtype = "int32" if in_dtype == "int8" else "uint32"
+ lo = -128 if in_dtype == "int8" else 0
+ hi = 127 if in_dtype == "int8" else 255
+
+ def check_target(target, compute, schedule, oc_block_factor, build_only):
+ dev = tvm.device(target, 0)
+ if not tvm.testing.device_enabled(target):
+ print("Skip because %s is not enabled" % target)
+ return
+ if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
+ print("Skip because int8 intrinsics are not available")
+ return
+
+ bias = te.placeholder(
+ (num_filter // oc_block_factor, 1, 1, oc_block_factor), name="bias", dtype=out_dtype
+ )
+ bias_shape = get_const_tuple(bias.shape)
- def check_target(target):
- dev = tvm.device(target, 0)
- if not tvm.testing.device_enabled(target):
- print("Skip because %s is not enabled" % target)
- return
- if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
- print("Skip because int8 intrinsics are not available")
- return
+ @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
+ def get_ref_data():
+ a_np = np.random.randint(low=lo, high=hi, size=a_shape).astype(out_dtype)
+ w_np = np.random.randint(low=lo, high=hi, size=w_shape).astype(out_dtype)
+ b_np = np.random.uniform(size=bias_shape).astype(out_dtype)
+ dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
+ c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(
+ out_dtype
+ )
+
+ # convert to NCHWc
+ _, _, out_height, out_width = c_np.shape
+ c_np = c_np.reshape(
+ (batch, num_filter // oc_block_factor, oc_block_factor, out_height, out_width)
+ ).transpose(0, 1, 3, 4, 2)
+
+ if add_bias:
+ b_np = np.random.uniform(size=bias_shape).astype(out_dtype)
+ c_np += b_np
+ if add_relu:
+ c_np = np.maximum(c_np, 0)
+
+ return a_np, w_np, b_np, c_np
+
+ a_np, w_np, b_np, c_np = get_ref_data()
+
+ with tvm.target.Target(target):
+ C = compute(
+ A,
+ W,
+ (stride, stride),
+ padding,
+ (dilation, dilation),
+ "NCHW",
+ "NCHW",
+ out_dtype,
+ )
+ if add_bias:
+ C = topi.add(C, bias)
+ if add_relu:
+ C = topi.nn.relu(C)
+ s = schedule([C])
+
+ a = tvm.nd.array(a_np.astype(dtype), dev)
+ w = tvm.nd.array(w_np.astype(dtype), dev)
+ b = tvm.nd.array(b_np.astype(out_dtype), dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- print("Running on target: %s" % target)
- with tvm.target.Target(target):
- C = topi.cuda.conv2d_nchw_int8(
- A, W, (stride, stride), padding, (dilation, dilation), dtype
- )
if add_bias:
- C = topi.add(C, bias)
- if add_relu:
- C = topi.nn.relu(C)
- s = topi.cuda.schedule_conv2d_nchw_int8([C])
-
- a = tvm.nd.array(a_np, dev)
- w = tvm.nd.array(w_np, dev)
- b = tvm.nd.array(b_np, dev)
- c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- if add_bias:
- tvm.build(
- s,
- [A, W, bias, C],
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
- func = tvm.build(
- s,
- [A, W, bias, C],
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
- func(a, w, b, c)
- else:
+ compile_args = [A, W, bias, C]
+ run_args = [a, w, b, c]
+ else:
+ compile_args = [A, W, C]
+ run_args = [a, w, c]
+
func = tvm.build(
s,
- [A, W, C],
+ compile_args,
target,
name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
% (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
)
- func(a, w, c)
- tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- verify_workload_padding()
+ if build_only:
+ return
- for target in ["cuda"]:
- check_target(target)
+ print("Running on target: %s" % target)
+ func(*run_args)
-@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
-def test_conv2d_nchw(in_dtype):
- with Int8Fallback():
- # ResNet18 workloads where channels in / out are multiple of oc_block_factor
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 128, 3, 2, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 128, 1, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 28, 128, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 28, 256, 3, 2, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 28, 256, 1, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 14, 256, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 14, 512, 3, 2, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 14, 512, 1, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 512, 7, 512, 3, 1, 1)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- # bias, relu
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_relu=True)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_bias=True)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_bias=True, add_relu=True)
+ targets = [
+ (
+ "cuda",
+ lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
+ topi.cuda.schedule_conv2d_NCHWc_int8,
+ 4,
+ False,
+ ),
+ # Disable on CI since it does not support spirv int8 dot product
+ # (
+ # "vulkan -from_device=0",
+ # lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
+ # topi.cuda.schedule_conv2d_NCHWc_int8,
+ # 4,
+ # False,
+ # ),
+ ]
- # dilation = 2
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, dilation=2)
+ build_only_aarch64 = platform.machine() != "aarch64"
- # batch size
- verify_conv2d_NCHWc_int8(in_dtype, 4, 64, 56, 64, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 9, 64, 56, 64, 3, 1, 1)
+ targets.append(
+ (
+ "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon,+v8.2a,+dotprod",
+ topi.arm_cpu.conv2d_NCHWc_int8,
+ topi.arm_cpu.schedule_conv2d_NCHWc_int8,
+ 8,
+ build_only_aarch64,
+ )
+ )
- # weird workloads
- verify_conv2d_NCHWc_int8(in_dtype, 4, 4, 4, 8, 4, 4, 4)
+ if in_dtype == "int8":
+ targets += [
+ (
+ "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon",
+ topi.arm_cpu.conv2d_NCHWc_int8,
+ topi.arm_cpu.schedule_conv2d_NCHWc_int8,
+ 8,
+ build_only_aarch64,
+ ),
+ (
+ "rocm -mattr=+dotprod",
+ lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(
+ a, w, s, p, d, l, o
+ ),
+ topi.cuda.schedule_conv2d_NCHWc_int8,
+ 4,
+ False,
+ ),
+ ]
+
+ for target, compute, schedule, oc_block_factor, build_only in targets:
+ check_target(target, compute, schedule, oc_block_factor, build_only)
+
+
+# Conv2d NCHW int8 schedule testing. Internally, it uses NCHWc schedule. So, just
+# performing basic testing - one test for all different scenarios - batch, dilation etc..
+@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
+@pytest.mark.parametrize(
+ "params",
+ [
+ (1, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 56, 64, 3, 1, 1, 1, False, True),
+ (1, 64, 56, 64, 3, 1, 1, 2, False, False),
+ (9, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (4, 4, 4, 4, 4, 4, 4, 1, False, False),
+ (1, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (7, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (1, 32, 35, 64, 7, 2, (0, 0, 1, 1), 1, False, False),
+ (1, 32, 35, 64, 7, 2, (0, 0, 2, 2), 1, False, False),
+ ],
+)
+def test_conv2d_nchw_int8(in_dtype, params):
+ with Int8Fallback():
+ (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding,
+ dilation,
+ add_bias,
+ add_relu,
+ ) = params
+ pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
+ padding_sum = pad_top + pad_left + pad_bottom + pad_right
+ print(
+ "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
+ % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
+ )
- # inception v3 workloads where channels in / out are multiple of oc_block_factor
- verify_conv2d_NCHWc_int8(in_dtype, 1, 32, 149, 32, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 32, 147, 64, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 73, 80, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 80, 73, 192, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 35, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 35, 48, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 48, 35, 64, 5, 1, 2)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 35, 96, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 96, 35, 96, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 35, 32, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 35, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 35, 48, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 288, 35, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 288, 35, 48, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 288, 35, 384, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 96, 35, 96, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 768, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 768, 17, 128, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 128, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 192, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 128, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 768, 17, 160, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 160, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 192, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 160, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 192, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 320, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 192, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 320, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 384, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 384, 8, 384, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 384, 8, 384, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 448, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 448, 8, 384, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 320, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 384, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 448, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1024, 19, 88, 3, 1, 1)
+ in_height = in_width = in_size
- # batch > 1
- verify_conv2d_NCHWc_int8(in_dtype, 7, 32, 149, 32, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 8, 32, 149, 32, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 32, 32, 149, 32, 3, 1, 0)
+ A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
+ W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
+ bias = te.placeholder((num_filter, 1, 1), name="bias", dtype=in_dtype)
- # Asymmetric padding
- verify_conv2d_NCHWc_int8(in_dtype, 1, 32, 35, 64, 7, 2, (0, 0, 1, 1))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 128, 3, 1, (3, 3, 2, 2))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 1, 1, (1, 2, 2, 1))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 17, 192, 1, 1, (1, 2))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 3, 1, (3, 1))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 8, 384, 3, 1, (0, 2))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 1, 1, "VALID")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 392, 8, 64, 3, 1, "VALID")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 512, 19, 64, 1, 1, "SAME")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 16, 32, 2, 1, "SAME")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 3, 1, (1, 2, 2, 1), add_relu=True)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 5, 2, (1, 3), add_bias=True)
- verify_conv2d_NCHWc_int8(
- in_dtype, 1, 64, 56, 64, 3, 1, "VALID", add_bias=True, add_relu=True
- )
- verify_conv2d_NCHWc_int8(
- in_dtype, 1, 64, 56, 64, 24, 1, "SAME", add_bias=True, add_relu=True
- )
+ a_shape = get_const_tuple(A.shape)
+ w_shape = get_const_tuple(W.shape)
+ bias_shape = get_const_tuple(bias.shape)
+ dtype = A.dtype
+
+ @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
+ def get_ref_data():
+ a_np = np.random.randint(low=-128, high=127, size=a_shape).astype(dtype)
+ w_np = np.random.randint(low=-128, high=128, size=w_shape).astype(dtype)
+ b_np = np.random.uniform(size=bias_shape).astype(dtype)
+ dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
+ c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(dtype)
- # Conv2d NCHW int8 schedule testing. Internally, it uses NCHWc schedule. So, just
- # performing basic testing - one test for all different scenarios - batch, dilation etc..
- verify_conv2d_nchw_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1)
- verify_conv2d_nchw_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_relu=True)
- verify_conv2d_nchw_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, dilation=2)
- verify_conv2d_nchw_int8(in_dtype, 9, 64, 56, 64, 3, 1, 1)
- verify_conv2d_nchw_int8(in_dtype, 4, 4, 4, 4, 4, 4, 4)
- verify_conv2d_nchw_int8(in_dtype, 1, 32, 149, 32, 3, 1, 0)
- verify_conv2d_nchw_int8(in_dtype, 7, 32, 149, 32, 3, 1, 0)
- verify_conv2d_nchw_int8(in_dtype, 1, 32, 35, 64, 7, 2, (0, 0, 1, 1))
- verify_conv2d_nchw_int8(in_dtype, 1, 32, 35, 64, 7, 2, (0, 0, 2, 2))
+ if add_bias:
+ b_np = np.random.uniform(size=bias_shape).astype(dtype)
+ c_np += b_np
+ if add_relu:
+ c_np = np.maximum(c_np, 0)
+ return a_np, w_np, b_np, c_np
-def test_conv2d_nhwc():
- with Int8Fallback():
- # Subset of inception v3 expanded (dilation > 1, batch > 1, 'VALID' padding)
- verify_conv2d_NHWC_gemm_int8(1, 3, 299, 32, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 32, 149, 32, 3, 1, "SAME", dilation=2)
- verify_conv2d_NHWC_gemm_int8(4, 32, 147, 64, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 64, 73, 80, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 80, 73, 192, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 35, 48, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 35, 64, 1, 1, "VALID")
- verify_conv2d_NHWC_gemm_int8(1, 192, 35, 32, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 48, 35, 64, 5, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 96, 35, 96, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 256, 35, 48, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 256, 35, 64, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 288, 35, 64, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 288, 35, 48, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 96, 35, 96, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 128, 17, 192, 7, 1, "SAME", dilation=2)
- verify_conv2d_NHWC_gemm_int8(1, 160, 17, 160, 7, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 160, 17, 192, 1, 1, "VALID")
- verify_conv2d_NHWC_gemm_int8(1, 192, 17, 192, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 768, 5, 128, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 17, 320, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 17, 192, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 192, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 384, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 320, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 448, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 384, 8, 384, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 384, 8, 384, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 448, 8, 384, 3, 1, "VALID")
- verify_conv2d_NHWC_gemm_int8(1, 2048, 8, 320, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 2048, 8, 448, 1, 1, "SAME", add_bias=True, add_relu=True)
- verify_conv2d_NHWC_gemm_int8(1, 2048, 8, 192, 1, 1, "SAME", add_bias=True)
-
- # Let's also verify that it compiles fine on AArch64 targets
- compile_conv2d_NHWC_gemm_int8_arm(1, 3, 299, 32, 3, 2, "SAME")
+ a_np, w_np, b_np, c_np = get_ref_data()
+
+ def verify_workload_padding():
+ _, _, out_height, out_width = get_const_tuple(c_np.shape)
+ wkl = _get_workload(A, W, (stride, stride), padding, dilation, dtype)
+
+ # for testing functionality,
+ # we choose arbitrary int32_lanes and num_int8_elements can divide the channel,
+ # regardless of the performance.
+ int32_lanes, num_int8_elements = num_filter, in_channel
+
+ # check if tile_ow candidates are the factors of the right output weight.
+ cfg = autotvm.get_config()
+ fallback_schedule_cpu_common_int8(cfg, wkl, int32_lanes, num_int8_elements)
+ ow_tile = np.prod(cfg["tile_ow"].size)
+
+ tvm.testing.assert_allclose(ow_tile, out_width)
+
+ def check_target(target):
+ dev = tvm.device(target, 0)
+ if not tvm.testing.device_enabled(target):
+ print("Skip because %s is not enabled" % target)
+ return
+ if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
+ print("Skip because int8 intrinsics are not available")
+ return
+
+ print("Running on target: %s" % target)
+ with tvm.target.Target(target):
+ C = topi.cuda.conv2d_nchw_int8(
+ A, W, (stride, stride), padding, (dilation, dilation), dtype
+ )
+ if add_bias:
+ C = topi.add(C, bias)
+ if add_relu:
+ C = topi.nn.relu(C)
+ s = topi.cuda.schedule_conv2d_nchw_int8([C])
+
+ a = tvm.nd.array(a_np, dev)
+ w = tvm.nd.array(w_np, dev)
+ b = tvm.nd.array(b_np, dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
+ if add_bias:
+ func = tvm.build(
+ s,
+ [A, W, bias, C],
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
+ func(a, w, b, c)
+ else:
+ func = tvm.build(
+ s,
+ [A, W, C],
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
+ func(a, w, c)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
+
+ verify_workload_padding()
+
+ for target in ["cuda"]:
+ check_target(target)
Review Comment:
Done
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [tvm] ekalda commented on a diff in pull request #13669: [TOPI][bugfix] Fix a bug in arm_cpu int8 dotprod schedule and modernize tests
Posted by GitBox <gi...@apache.org>.
ekalda commented on code in PR #13669:
URL: https://github.com/apache/tvm/pull/13669#discussion_r1058905748
##########
tests/python/topi/python/test_topi_conv2d_int8.py:
##########
@@ -298,378 +175,462 @@ def get_ref_data():
a_np, w_np, b_np, c_np = get_ref_data()
- with tvm.target.Target(target):
- C = compute(
- A,
- W,
- (stride, stride),
- padding,
- (dilation, dilation),
- "NCHW",
- "NCHW",
- out_dtype,
- )
+ dev = tvm.device(target, 0)
+ with tvm.target.Target(target) as tvm_target:
+ C = compute(A, W, (stride, stride), padding, (dilation, dilation), dtype)
if add_bias:
C = topi.add(C, bias)
if add_relu:
C = topi.nn.relu(C)
s = schedule([C])
- a = tvm.nd.array(a_np.astype(dtype), dev)
- w = tvm.nd.array(w_np.astype(dtype), dev)
- b = tvm.nd.array(b_np.astype(out_dtype), dev)
- c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
-
- if add_bias:
- compile_args = [A, W, bias, C]
- run_args = [a, w, b, c]
- else:
- compile_args = [A, W, C]
- run_args = [a, w, c]
-
- func = tvm.build(
- s,
- compile_args,
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
+ a = tvm.nd.array(a_np, dev)
+ w = tvm.nd.array(w_np, dev)
+ b = tvm.nd.array(b_np, dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- if build_only:
- return
+ build_inputs = [A, W, bias, C] if add_bias else [A, W, C]
+ inference_inputs = (a, w, b, c) if add_bias else (a, w, c)
+
+ func = tvm.build(
+ s,
+ build_inputs,
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
- print("Running on target: %s" % target)
+ build_only = tvm_target.features.is_aarch64 and (platform.machine() != "aarch64")
- func(*run_args)
+ if not build_only:
+ print("Running on target: %s" % target)
+ func(*inference_inputs)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- targets = [
- (
- "cuda",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- # Disable on CI since it does not support spirv int8 dot product
- # (
- # "vulkan -from_device=0",
- # lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- # topi.cuda.schedule_conv2d_NCHWc_int8,
- # 4,
- # False,
- # ),
- ]
-
- build_only_aarch64 = platform.machine() != "aarch64"
-
- targets.append(
+@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
+@pytest.mark.parametrize(
+ "params",
+ [
+ # ResNet18 workloads where channels in / out are multiple of oc_block_factor
+ (1, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 56, 64, 1, 1, 0, 1, False, False),
+ (1, 64, 56, 128, 3, 2, 1, 1, False, False),
+ (1, 64, 56, 128, 1, 2, 0, 1, False, False),
+ (1, 128, 28, 128, 3, 1, 1, 1, False, False),
+ (1, 128, 28, 256, 3, 2, 1, 1, False, False),
+ (1, 128, 28, 256, 1, 2, 0, 1, False, False),
+ (1, 256, 14, 256, 3, 1, 1, 1, False, False),
+ (1, 256, 14, 512, 3, 2, 1, 1, False, False),
+ (1, 256, 14, 512, 1, 2, 0, 1, False, False),
+ (1, 512, 7, 512, 3, 1, 1, 1, False, False),
+ # bias, relu
+ (1, 64, 56, 64, 3, 1, 1, 1, False, True),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, False),
+ (1, 64, 56, 64, 3, 1, 1, 1, True, True),
+ # dilation = 2
+ (1, 64, 56, 64, 3, 1, 1, 2, False, False),
+ # batch size
+ (4, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (9, 64, 56, 64, 3, 1, 1, 1, False, False),
+ # weird workloads
+ (4, 4, 4, 8, 4, 4, 4, 1, False, False),
+ # inception v3 workloads where channels in / out are multiple of oc_block_factor
+ (1, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (1, 32, 147, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 73, 80, 1, 1, 0, 1, False, False),
+ (1, 80, 73, 192, 3, 1, 0, 1, False, False),
+ (1, 192, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 192, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 48, 35, 64, 5, 1, 2, 1, False, False),
+ (1, 64, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 96, 35, 96, 3, 1, 1, 1, False, False),
+ (1, 192, 35, 32, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 256, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 64, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 48, 1, 1, 0, 1, False, False),
+ (1, 288, 35, 384, 3, 2, 0, 1, False, False),
+ (1, 96, 35, 96, 3, 2, 0, 1, False, False),
+ (1, 768, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 128, 1, 1, 0, 1, False, False),
+ (1, 128, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 128, 7, 1, 3, 1, False, False),
+ (1, 128, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 768, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 160, 1, 1, 0, 1, False, False),
+ (1, 160, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 160, 7, 1, 3, 1, False, False),
+ (1, 160, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 1, 1, 0, 1, False, False),
+ (1, 192, 17, 192, 7, 1, 3, 1, False, False),
+ (1, 192, 17, 320, 3, 2, 0, 1, False, False),
+ (1, 192, 17, 192, 3, 2, 0, 1, False, False),
+ (1, 1280, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 1280, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 384, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 448, 8, 384, 3, 1, 1, 1, False, False),
+ (1, 1280, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 320, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 384, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 448, 1, 1, 0, 1, False, False),
+ (1, 2048, 8, 192, 1, 1, 0, 1, False, False),
+ (1, 1024, 19, 88, 3, 1, 1, 1, False, False),
+ # batch > 1
+ (7, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (8, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (32, 32, 149, 32, 3, 1, 0, 1, False, False),
+ # Asymmetric padding
+ (1, 32, 35, 64, 7, 2, (0, 0, 1, 1), 1, False, False),
+ (1, 64, 8, 128, 3, 1, (3, 3, 2, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, (1, 2, 2, 1), 1, False, False),
+ (1, 64, 17, 192, 1, 1, (1, 2), 1, False, False),
+ (1, 64, 8, 64, 3, 1, (3, 1), 1, False, False),
+ (1, 128, 8, 384, 3, 1, (0, 2), 1, False, False),
+ (1, 64, 8, 64, 1, 1, "VALID", 1, False, False),
+ (1, 392, 8, 64, 3, 1, "VALID", 1, False, False),
+ (1, 512, 19, 64, 1, 1, "SAME", 1, False, False),
+ (1, 64, 16, 32, 2, 1, "SAME", 1, False, False),
+ (1, 64, 8, 64, 3, 1, (1, 2, 2, 1), 1, False, True),
+ (1, 64, 8, 64, 5, 2, (1, 3), 1, True, False),
+ (1, 64, 56, 64, 3, 1, "VALID", 1, True, True),
+ (1, 64, 56, 64, 24, 1, "SAME", 1, True, True),
+ ],
+)
+def test_conv2d_NCHWc_int8(in_dtype, params):
+ with Int8Fallback():
(
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon,+v8.2a,+dotprod",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding,
+ dilation,
+ add_bias,
+ add_relu,
+ ) = params
+ pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
+ padding_sum = pad_top + pad_left + pad_bottom + pad_right
+ print(
+ "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
+ % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
)
- )
-
- if in_dtype == "int8":
- targets += [
- (
- "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon",
- topi.arm_cpu.conv2d_NCHWc_int8,
- topi.arm_cpu.schedule_conv2d_NCHWc_int8,
- 8,
- build_only_aarch64,
- ),
- (
- "rocm -mattr=+dotprod",
- lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
- topi.cuda.schedule_conv2d_NCHWc_int8,
- 4,
- False,
- ),
- ]
-
- for target, compute, schedule, oc_block_factor, build_only in targets:
- check_target(target, compute, schedule, oc_block_factor, build_only)
-
-
-def verify_conv2d_nchw_int8(
- in_dtype,
- batch,
- in_channel,
- in_size,
- num_filter,
- kernel,
- stride,
- padding,
- dilation=1,
- add_bias=False,
- add_relu=False,
-):
- pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
- padding_sum = pad_top + pad_left + pad_bottom + pad_right
- print(
- "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
- )
-
- in_height = in_width = in_size
-
- A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
- W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
- bias = te.placeholder((num_filter, 1, 1), name="bias", dtype=in_dtype)
-
- a_shape = get_const_tuple(A.shape)
- w_shape = get_const_tuple(W.shape)
- bias_shape = get_const_tuple(bias.shape)
- dtype = A.dtype
-
- @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
- def get_ref_data():
- a_np = np.random.randint(low=-128, high=127, size=a_shape).astype(dtype)
- w_np = np.random.randint(low=-128, high=128, size=w_shape).astype(dtype)
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
- c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(dtype)
-
- if add_bias:
- b_np = np.random.uniform(size=bias_shape).astype(dtype)
- c_np += b_np
- if add_relu:
- c_np = np.maximum(c_np, 0)
-
- return a_np, w_np, b_np, c_np
-
- a_np, w_np, b_np, c_np = get_ref_data()
-
- def verify_workload_padding():
- _, _, out_height, out_width = get_const_tuple(c_np.shape)
- wkl = _get_workload(A, W, (stride, stride), padding, dilation, dtype)
-
- # for testing functionality,
- # we choose arbitrary int32_lanes and num_int8_elements can divide the channel,
- # regardless of the performance.
- int32_lanes, num_int8_elements = num_filter, in_channel
- # check if tile_ow candidates are the factors of the right output weight.
- cfg = autotvm.get_config()
- fallback_schedule_cpu_common_int8(cfg, wkl, int32_lanes, num_int8_elements)
- ow_tile = np.prod(cfg["tile_ow"].size)
-
- tvm.testing.assert_allclose(ow_tile, out_width)
+ in_height = in_width = in_size
+
+ A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
+ W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
+
+ a_shape = get_const_tuple(A.shape)
+ w_shape = get_const_tuple(W.shape)
+ dtype = A.dtype
+ out_dtype = "int32" if in_dtype == "int8" else "uint32"
+ lo = -128 if in_dtype == "int8" else 0
+ hi = 127 if in_dtype == "int8" else 255
+
+ def check_target(target, compute, schedule, oc_block_factor, build_only):
+ dev = tvm.device(target, 0)
+ if not tvm.testing.device_enabled(target):
+ print("Skip because %s is not enabled" % target)
+ return
+ if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
+ print("Skip because int8 intrinsics are not available")
+ return
+
+ bias = te.placeholder(
+ (num_filter // oc_block_factor, 1, 1, oc_block_factor), name="bias", dtype=out_dtype
+ )
+ bias_shape = get_const_tuple(bias.shape)
- def check_target(target):
- dev = tvm.device(target, 0)
- if not tvm.testing.device_enabled(target):
- print("Skip because %s is not enabled" % target)
- return
- if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
- print("Skip because int8 intrinsics are not available")
- return
+ @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
+ def get_ref_data():
+ a_np = np.random.randint(low=lo, high=hi, size=a_shape).astype(out_dtype)
+ w_np = np.random.randint(low=lo, high=hi, size=w_shape).astype(out_dtype)
+ b_np = np.random.uniform(size=bias_shape).astype(out_dtype)
+ dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
+ c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(
+ out_dtype
+ )
+
+ # convert to NCHWc
+ _, _, out_height, out_width = c_np.shape
+ c_np = c_np.reshape(
+ (batch, num_filter // oc_block_factor, oc_block_factor, out_height, out_width)
+ ).transpose(0, 1, 3, 4, 2)
+
+ if add_bias:
+ b_np = np.random.uniform(size=bias_shape).astype(out_dtype)
+ c_np += b_np
+ if add_relu:
+ c_np = np.maximum(c_np, 0)
+
+ return a_np, w_np, b_np, c_np
+
+ a_np, w_np, b_np, c_np = get_ref_data()
+
+ with tvm.target.Target(target):
+ C = compute(
+ A,
+ W,
+ (stride, stride),
+ padding,
+ (dilation, dilation),
+ "NCHW",
+ "NCHW",
+ out_dtype,
+ )
+ if add_bias:
+ C = topi.add(C, bias)
+ if add_relu:
+ C = topi.nn.relu(C)
+ s = schedule([C])
+
+ a = tvm.nd.array(a_np.astype(dtype), dev)
+ w = tvm.nd.array(w_np.astype(dtype), dev)
+ b = tvm.nd.array(b_np.astype(out_dtype), dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- print("Running on target: %s" % target)
- with tvm.target.Target(target):
- C = topi.cuda.conv2d_nchw_int8(
- A, W, (stride, stride), padding, (dilation, dilation), dtype
- )
if add_bias:
- C = topi.add(C, bias)
- if add_relu:
- C = topi.nn.relu(C)
- s = topi.cuda.schedule_conv2d_nchw_int8([C])
-
- a = tvm.nd.array(a_np, dev)
- w = tvm.nd.array(w_np, dev)
- b = tvm.nd.array(b_np, dev)
- c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
- if add_bias:
- tvm.build(
- s,
- [A, W, bias, C],
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
- func = tvm.build(
- s,
- [A, W, bias, C],
- target,
- name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
- % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
- )
- func(a, w, b, c)
- else:
+ compile_args = [A, W, bias, C]
+ run_args = [a, w, b, c]
+ else:
+ compile_args = [A, W, C]
+ run_args = [a, w, c]
+
func = tvm.build(
s,
- [A, W, C],
+ compile_args,
target,
name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
% (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation),
)
- func(a, w, c)
- tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- verify_workload_padding()
+ if build_only:
+ return
- for target in ["cuda"]:
- check_target(target)
+ print("Running on target: %s" % target)
+ func(*run_args)
-@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
-def test_conv2d_nchw(in_dtype):
- with Int8Fallback():
- # ResNet18 workloads where channels in / out are multiple of oc_block_factor
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 128, 3, 2, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 128, 1, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 28, 128, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 28, 256, 3, 2, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 28, 256, 1, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 14, 256, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 14, 512, 3, 2, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 14, 512, 1, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 512, 7, 512, 3, 1, 1)
+ tvm.testing.assert_allclose(c.numpy(), c_np, rtol=1e-5)
- # bias, relu
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_relu=True)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_bias=True)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_bias=True, add_relu=True)
+ targets = [
+ (
+ "cuda",
+ lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
+ topi.cuda.schedule_conv2d_NCHWc_int8,
+ 4,
+ False,
+ ),
+ # Disable on CI since it does not support spirv int8 dot product
+ # (
+ # "vulkan -from_device=0",
+ # lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(a, w, s, p, d, l, o),
+ # topi.cuda.schedule_conv2d_NCHWc_int8,
+ # 4,
+ # False,
+ # ),
+ ]
- # dilation = 2
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, dilation=2)
+ build_only_aarch64 = platform.machine() != "aarch64"
- # batch size
- verify_conv2d_NCHWc_int8(in_dtype, 4, 64, 56, 64, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 9, 64, 56, 64, 3, 1, 1)
+ targets.append(
+ (
+ "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon,+v8.2a,+dotprod",
+ topi.arm_cpu.conv2d_NCHWc_int8,
+ topi.arm_cpu.schedule_conv2d_NCHWc_int8,
+ 8,
+ build_only_aarch64,
+ )
+ )
- # weird workloads
- verify_conv2d_NCHWc_int8(in_dtype, 4, 4, 4, 8, 4, 4, 4)
+ if in_dtype == "int8":
+ targets += [
+ (
+ "llvm -device arm_cpu -mtriple aarch64-linux-gnu -mattr=+neon",
+ topi.arm_cpu.conv2d_NCHWc_int8,
+ topi.arm_cpu.schedule_conv2d_NCHWc_int8,
+ 8,
+ build_only_aarch64,
+ ),
+ (
+ "rocm -mattr=+dotprod",
+ lambda a, w, s, p, d, l, ol, o: topi.cuda.conv2d_NCHWc_int8(
+ a, w, s, p, d, l, o
+ ),
+ topi.cuda.schedule_conv2d_NCHWc_int8,
+ 4,
+ False,
+ ),
+ ]
+
+ for target, compute, schedule, oc_block_factor, build_only in targets:
+ check_target(target, compute, schedule, oc_block_factor, build_only)
+
+
+# Conv2d NCHW int8 schedule testing. Internally, it uses NCHWc schedule. So, just
+# performing basic testing - one test for all different scenarios - batch, dilation etc..
+@pytest.mark.parametrize("in_dtype", ["int8", "uint8"])
+@pytest.mark.parametrize(
+ "params",
+ [
+ (1, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (1, 64, 56, 64, 3, 1, 1, 1, False, True),
+ (1, 64, 56, 64, 3, 1, 1, 2, False, False),
+ (9, 64, 56, 64, 3, 1, 1, 1, False, False),
+ (4, 4, 4, 4, 4, 4, 4, 1, False, False),
+ (1, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (7, 32, 149, 32, 3, 1, 0, 1, False, False),
+ (1, 32, 35, 64, 7, 2, (0, 0, 1, 1), 1, False, False),
+ (1, 32, 35, 64, 7, 2, (0, 0, 2, 2), 1, False, False),
+ ],
+)
+def test_conv2d_nchw_int8(in_dtype, params):
+ with Int8Fallback():
+ (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding,
+ dilation,
+ add_bias,
+ add_relu,
+ ) = params
+ pad_top, pad_left, pad_bottom, pad_right = get_pad_tuple(padding, (kernel, kernel))
+ padding_sum = pad_top + pad_left + pad_bottom + pad_right
+ print(
+ "Workload: (%d, %d, %d, %d, %d, %d, %d, %d)"
+ % (batch, in_channel, in_size, num_filter, kernel, stride, padding_sum, dilation)
+ )
- # inception v3 workloads where channels in / out are multiple of oc_block_factor
- verify_conv2d_NCHWc_int8(in_dtype, 1, 32, 149, 32, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 32, 147, 64, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 73, 80, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 80, 73, 192, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 35, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 35, 48, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 48, 35, 64, 5, 1, 2)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 35, 96, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 96, 35, 96, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 35, 32, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 35, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 256, 35, 48, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 288, 35, 64, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 288, 35, 48, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 288, 35, 384, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 96, 35, 96, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 768, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 768, 17, 128, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 128, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 192, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 128, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 768, 17, 160, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 160, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 192, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 160, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 160, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 192, 7, 1, 3)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 320, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 192, 17, 192, 3, 2, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 320, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 384, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 384, 8, 384, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 384, 8, 384, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 448, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 448, 8, 384, 3, 1, 1)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1280, 8, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 320, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 384, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 448, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 2048, 8, 192, 1, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 1024, 19, 88, 3, 1, 1)
+ in_height = in_width = in_size
- # batch > 1
- verify_conv2d_NCHWc_int8(in_dtype, 7, 32, 149, 32, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 8, 32, 149, 32, 3, 1, 0)
- verify_conv2d_NCHWc_int8(in_dtype, 32, 32, 149, 32, 3, 1, 0)
+ A = te.placeholder((batch, in_channel, in_height, in_width), name="A", dtype=in_dtype)
+ W = te.placeholder((num_filter, in_channel, kernel, kernel), name="W", dtype=in_dtype)
+ bias = te.placeholder((num_filter, 1, 1), name="bias", dtype=in_dtype)
- # Asymmetric padding
- verify_conv2d_NCHWc_int8(in_dtype, 1, 32, 35, 64, 7, 2, (0, 0, 1, 1))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 128, 3, 1, (3, 3, 2, 2))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 1, 1, (1, 2, 2, 1))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 17, 192, 1, 1, (1, 2))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 3, 1, (3, 1))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 128, 8, 384, 3, 1, (0, 2))
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 1, 1, "VALID")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 392, 8, 64, 3, 1, "VALID")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 512, 19, 64, 1, 1, "SAME")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 16, 32, 2, 1, "SAME")
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 3, 1, (1, 2, 2, 1), add_relu=True)
- verify_conv2d_NCHWc_int8(in_dtype, 1, 64, 8, 64, 5, 2, (1, 3), add_bias=True)
- verify_conv2d_NCHWc_int8(
- in_dtype, 1, 64, 56, 64, 3, 1, "VALID", add_bias=True, add_relu=True
- )
- verify_conv2d_NCHWc_int8(
- in_dtype, 1, 64, 56, 64, 24, 1, "SAME", add_bias=True, add_relu=True
- )
+ a_shape = get_const_tuple(A.shape)
+ w_shape = get_const_tuple(W.shape)
+ bias_shape = get_const_tuple(bias.shape)
+ dtype = A.dtype
+
+ @memoize("topi.tests.test_topi_conv2d_int8.verify_conv2d_nchw")
+ def get_ref_data():
+ a_np = np.random.randint(low=-128, high=127, size=a_shape).astype(dtype)
+ w_np = np.random.randint(low=-128, high=128, size=w_shape).astype(dtype)
+ b_np = np.random.uniform(size=bias_shape).astype(dtype)
+ dw_np = tvm.topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
+ c_np = tvm.topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding).astype(dtype)
- # Conv2d NCHW int8 schedule testing. Internally, it uses NCHWc schedule. So, just
- # performing basic testing - one test for all different scenarios - batch, dilation etc..
- verify_conv2d_nchw_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1)
- verify_conv2d_nchw_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, add_relu=True)
- verify_conv2d_nchw_int8(in_dtype, 1, 64, 56, 64, 3, 1, 1, dilation=2)
- verify_conv2d_nchw_int8(in_dtype, 9, 64, 56, 64, 3, 1, 1)
- verify_conv2d_nchw_int8(in_dtype, 4, 4, 4, 4, 4, 4, 4)
- verify_conv2d_nchw_int8(in_dtype, 1, 32, 149, 32, 3, 1, 0)
- verify_conv2d_nchw_int8(in_dtype, 7, 32, 149, 32, 3, 1, 0)
- verify_conv2d_nchw_int8(in_dtype, 1, 32, 35, 64, 7, 2, (0, 0, 1, 1))
- verify_conv2d_nchw_int8(in_dtype, 1, 32, 35, 64, 7, 2, (0, 0, 2, 2))
+ if add_bias:
+ b_np = np.random.uniform(size=bias_shape).astype(dtype)
+ c_np += b_np
+ if add_relu:
+ c_np = np.maximum(c_np, 0)
+ return a_np, w_np, b_np, c_np
-def test_conv2d_nhwc():
- with Int8Fallback():
- # Subset of inception v3 expanded (dilation > 1, batch > 1, 'VALID' padding)
- verify_conv2d_NHWC_gemm_int8(1, 3, 299, 32, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 32, 149, 32, 3, 1, "SAME", dilation=2)
- verify_conv2d_NHWC_gemm_int8(4, 32, 147, 64, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 64, 73, 80, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 80, 73, 192, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 35, 48, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 35, 64, 1, 1, "VALID")
- verify_conv2d_NHWC_gemm_int8(1, 192, 35, 32, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 48, 35, 64, 5, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 96, 35, 96, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 256, 35, 48, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 256, 35, 64, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 288, 35, 64, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 288, 35, 48, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 96, 35, 96, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 128, 17, 192, 7, 1, "SAME", dilation=2)
- verify_conv2d_NHWC_gemm_int8(1, 160, 17, 160, 7, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 160, 17, 192, 1, 1, "VALID")
- verify_conv2d_NHWC_gemm_int8(1, 192, 17, 192, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 768, 5, 128, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 17, 320, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 192, 17, 192, 3, 2, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 192, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 384, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 320, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 1280, 8, 448, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 384, 8, 384, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 384, 8, 384, 3, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 448, 8, 384, 3, 1, "VALID")
- verify_conv2d_NHWC_gemm_int8(1, 2048, 8, 320, 1, 1, "SAME")
- verify_conv2d_NHWC_gemm_int8(1, 2048, 8, 448, 1, 1, "SAME", add_bias=True, add_relu=True)
- verify_conv2d_NHWC_gemm_int8(1, 2048, 8, 192, 1, 1, "SAME", add_bias=True)
-
- # Let's also verify that it compiles fine on AArch64 targets
- compile_conv2d_NHWC_gemm_int8_arm(1, 3, 299, 32, 3, 2, "SAME")
+ a_np, w_np, b_np, c_np = get_ref_data()
+
+ def verify_workload_padding():
+ _, _, out_height, out_width = get_const_tuple(c_np.shape)
+ wkl = _get_workload(A, W, (stride, stride), padding, dilation, dtype)
+
+ # for testing functionality,
+ # we choose arbitrary int32_lanes and num_int8_elements can divide the channel,
+ # regardless of the performance.
+ int32_lanes, num_int8_elements = num_filter, in_channel
+
+ # check if tile_ow candidates are the factors of the right output weight.
+ cfg = autotvm.get_config()
+ fallback_schedule_cpu_common_int8(cfg, wkl, int32_lanes, num_int8_elements)
+ ow_tile = np.prod(cfg["tile_ow"].size)
+
+ tvm.testing.assert_allclose(ow_tile, out_width)
+
+ def check_target(target):
+ dev = tvm.device(target, 0)
+ if not tvm.testing.device_enabled(target):
+ print("Skip because %s is not enabled" % target)
+ return
+ if target == "cuda" and not tvm.contrib.nvcc.have_int8(dev.compute_version):
+ print("Skip because int8 intrinsics are not available")
+ return
+
+ print("Running on target: %s" % target)
+ with tvm.target.Target(target):
+ C = topi.cuda.conv2d_nchw_int8(
+ A, W, (stride, stride), padding, (dilation, dilation), dtype
+ )
+ if add_bias:
+ C = topi.add(C, bias)
+ if add_relu:
+ C = topi.nn.relu(C)
+ s = topi.cuda.schedule_conv2d_nchw_int8([C])
+
+ a = tvm.nd.array(a_np, dev)
+ w = tvm.nd.array(w_np, dev)
+ b = tvm.nd.array(b_np, dev)
+ c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), dev)
+ if add_bias:
+ func = tvm.build(
+ s,
+ [A, W, bias, C],
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
+ func(a, w, b, c)
+ else:
+ func = tvm.build(
+ s,
+ [A, W, C],
+ target,
+ name="relu_%d_%d_%d_%d_%d_%d_%d_%d"
+ % (
+ batch,
+ in_channel,
+ in_size,
+ num_filter,
+ kernel,
+ stride,
+ padding_sum,
+ dilation,
+ ),
+ )
+ func(a, w, c)
Review Comment:
Done
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [tvm] Mousius merged pull request #13669: [TOPI][bugfix] Fix a bug in arm_cpu int8 dotprod schedule and modernize tests
Posted by GitBox <gi...@apache.org>.
Mousius merged PR #13669:
URL: https://github.com/apache/tvm/pull/13669
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org