Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2022/12/07 09:05:16 UTC

[GitHub] [tvm] echuraev commented on a diff in pull request #13573: [Adreno] Add global pooling schedule

echuraev commented on code in PR #13573:
URL: https://github.com/apache/tvm/pull/13573#discussion_r1041923962


##########
python/tvm/topi/adreno/pooling.py:
##########
@@ -19,6 +19,115 @@
 import tvm
 from tvm import te
 from .. import tag
+from .utils import get_div
+
+
+def schedule_adaptive_pool(outs, layout="NCHW"):
+    """Schedule for adaptive_pool.
+
+    Parameters
+    ----------
+    outs: Array of Tensor
+        The computation graph description of adaptive_pool
+        in the format of an array of tensors.
+
+    Returns
+    -------
+    s: Schedule
+        The computation schedule for adaptive_pool.
+    """
+    outs = [outs] if isinstance(outs, te.tensor.Tensor) else outs
+    s = te.create_schedule([x.op for x in outs])
+
+    def _schedule_global(Pool, layout):
+        # examples of latest pool op is global max pool and non latest is global avg pooling
+        # OL - an Expr will be used for rfactor
+        # Out - programming of the parallelizm on the global level
+        # shared is not required, local could be enough but shared scope gives quite significant
+        # perf boost
+        if Pool.op in s.outputs:
+            Out = Pool
+            OL = s.cache_write(Pool, "shared")
+        else:
+            Out = outs[0].op.output(0)
+            s[Pool].set_scope("shared")
+            OL = Pool
+
+        PaddedInput = Pool.op.input_tensors[0]
+
+        # detect axis for later reorder and binding of batch/chennel to blocks and

Review Comment:
   ```suggestion
           # detect axis for later reorder and binding of batch/channel to blocks and
   ```



##########
python/tvm/topi/adreno/pooling.py:
##########
@@ -19,6 +19,115 @@
 import tvm
 from tvm import te
 from .. import tag
+from .utils import get_div
+
+
+def schedule_adaptive_pool(outs, layout="NCHW"):
+    """Schedule for adaptive_pool.
+
+    Parameters
+    ----------
+    outs: Array of Tensor
+        The computation graph description of adaptive_pool
+        in the format of an array of tensors.
+
+    Returns
+    -------
+    s: Schedule
+        The computation schedule for adaptive_pool.
+    """
+    outs = [outs] if isinstance(outs, te.tensor.Tensor) else outs
+    s = te.create_schedule([x.op for x in outs])
+
+    def _schedule_global(Pool, layout):
+        # examples of latest pool op is global max pool and non latest is global avg pooling
+        # OL - an Expr will be used for rfactor
+        # Out - programming of the parallelizm on the global level
+        # shared is not required, local could be enough but shared scope gives quite significant
+        # perf boost
+        if Pool.op in s.outputs:
+            Out = Pool
+            OL = s.cache_write(Pool, "shared")
+        else:
+            Out = outs[0].op.output(0)
+            s[Pool].set_scope("shared")
+            OL = Pool
+
+        PaddedInput = Pool.op.input_tensors[0]
+
+        # detect axis for later reorder and binding of batch/chennel to blocks and
+        # spatial to threads
+        if layout in ("NCHW", "NCHW4c"):
+            channel_index = 1
+            height_index = 2
+            width_index = 3
+        else:
+            channel_index = 3
+            height_index = 1
+            width_index = 2
+
+        if isinstance(PaddedInput.op, tvm.te.ComputeOp):
+            s[PaddedInput].compute_inline()
+
+        fused_reduce = s[OL].fuse(
+            *[s[OL].op.reduce_axis[i] for i in range(len(s[OL].op.reduce_axis))]
+        )
+
+        spatial = PaddedInput.shape[height_index].value * PaddedInput.shape[width_index].value
+        max_threads = spatial // 25 if spatial > 25 else 1
+        max_threads = 256 if max_threads > 256 else max_threads

Review Comment:
   Why 25 and 256?
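
   For reference, a plausible reading of the heuristic in the hunk above is that it gives each thread roughly 25 spatial elements to reduce and caps the thread count at 256 (a common workgroup-size limit), though the PR does not state this. A tiny standalone illustration (mine, not from the PR) of how the two constants behave:

   ```python
   # standalone sketch of the thread-count heuristic quoted above:
   # roughly 25 reduced elements per thread, capped at 256 threads
   def max_threads_for(spatial):
       max_threads = spatial // 25 if spatial > 25 else 1
       return 256 if max_threads > 256 else max_threads

   for hw in (7 * 7, 56 * 56, 112 * 112):
       print(hw, max_threads_for(hw))  # 49 -> 1, 3136 -> 125, 12544 -> 256
   ```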



##########
python/tvm/relay/op/strategy/adreno.py:
##########
@@ -215,6 +215,13 @@ def schedule_reduce_adreno(attrs, outs, target):
         return topi.adreno.schedule_reduce(outs)
 
 
+@schedule_adaptive_pool.register(["adreno"])
+def schedule_adaptive_pool_cuda(attrs, outs, target):
+    """schedule adaptive pooling ops for cuda"""

Review Comment:
   ```suggestion
   def schedule_adaptive_pool_adreno(attrs, outs, target):
       """schedule adaptive pooling ops for adreno"""
   ```
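
   With the rename applied, the registration would presumably look like the sketch below. The function body is not part of the quoted hunk, so the `topi.adreno.schedule_adaptive_pool(outs, attrs.layout)` call is an assumption based on how other GPU targets register this schedule in `python/tvm/relay/op/strategy/`.

   ```python
   # hypothetical sketch for python/tvm/relay/op/strategy/adreno.py;
   # assumes the usual strategy-file imports (topi, generic schedule wrappers)
   @schedule_adaptive_pool.register(["adreno"])
   def schedule_adaptive_pool_adreno(attrs, outs, target):
       """schedule adaptive pooling ops for adreno"""
       with target:
           # layout comes from the op attributes, matching the new
           # topi.adreno.schedule_adaptive_pool(outs, layout="NCHW") signature
           return topi.adreno.schedule_adaptive_pool(outs, attrs.layout)
   ```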



##########
python/tvm/topi/adreno/pooling.py:
##########
@@ -19,6 +19,115 @@
 import tvm
 from tvm import te
 from .. import tag
+from .utils import get_div
+
+
+def schedule_adaptive_pool(outs, layout="NCHW"):
+    """Schedule for adaptive_pool.
+
+    Parameters
+    ----------
+    outs: Array of Tensor
+        The computation graph description of adaptive_pool
+        in the format of an array of tensors.
+
+    Returns
+    -------
+    s: Schedule
+        The computation schedule for adaptive_pool.
+    """
+    outs = [outs] if isinstance(outs, te.tensor.Tensor) else outs
+    s = te.create_schedule([x.op for x in outs])
+
+    def _schedule_global(Pool, layout):
+        # examples of latest pool op is global max pool and non latest is global avg pooling
+        # OL - an Expr will be used for rfactor
+        # Out - programming of the parallelizm on the global level
+        # shared is not required, local could be enough but shared scope gives quite significant
+        # perf boost
+        if Pool.op in s.outputs:
+            Out = Pool
+            OL = s.cache_write(Pool, "shared")
+        else:
+            Out = outs[0].op.output(0)
+            s[Pool].set_scope("shared")
+            OL = Pool
+
+        PaddedInput = Pool.op.input_tensors[0]
+
+        # detect axis for later reorder and binding of batch/chennel to blocks and
+        # spatial to threads
+        if layout in ("NCHW", "NCHW4c"):
+            channel_index = 1
+            height_index = 2
+            width_index = 3
+        else:
+            channel_index = 3
+            height_index = 1
+            width_index = 2
+
+        if isinstance(PaddedInput.op, tvm.te.ComputeOp):
+            s[PaddedInput].compute_inline()
+
+        fused_reduce = s[OL].fuse(
+            *[s[OL].op.reduce_axis[i] for i in range(len(s[OL].op.reduce_axis))]
+        )

Review Comment:
   ```suggestion
           fused_reduce = s[OL].fuse(*s[OL].op.reduce_axis)
   ```
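
   The two forms are equivalent because `op.reduce_axis` is already a list of `IterVar`s and `fuse` accepts them as positional arguments, so the index-based list comprehension adds nothing. A minimal standalone example (not taken from the PR):

   ```python
   import tvm
   from tvm import te

   # toy global-pooling-like reduction with two reduce axes
   A = te.placeholder((1, 8, 32, 32), name="A")
   rh = te.reduce_axis((0, 32), name="rh")
   rw = te.reduce_axis((0, 32), name="rw")
   B = te.compute((1, 8), lambda n, c: te.sum(A[n, c, rh, rw], axis=[rh, rw]), name="B")
   s = te.create_schedule(B.op)

   # unpacking the axis list directly is the same as indexing it element by element
   fused = s[B].fuse(*s[B].op.reduce_axis)
   ```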



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org