Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2021/02/11 01:55:46 UTC

[GitHub] [tvm] ymwangg opened a new pull request #7441: [Relay][Tensorflow] Add unique operator

ymwangg opened a new pull request #7441:
URL: https://github.com/apache/tvm/pull/7441


   This PR adds the TensorFlow `unique` operator as described in https://www.tensorflow.org/api_docs/python/tf/unique.
   
   I'm not sure I'm following best practices here; comments and suggestions are welcome. @yongwww @kevinthesun @codeislife99
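   
   For reference, a minimal sketch of the semantics being added, following the example in the linked TensorFlow docs:
   
   ```python
   import tensorflow as tf
   
   # tf.unique returns the unique values in order of first occurrence, plus the
   # index of every input element in that unique array
   x = tf.constant([1, 1, 2, 4, 4, 4, 7, 8, 8])
   y, idx = tf.unique(x)
   # y   == [1, 2, 4, 7, 8]
   # idx == [0, 0, 1, 2, 2, 2, 3, 4, 4]
   ```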





[GitHub] [tvm] masahi edited a comment on pull request #7441: [Frontend][Tensorflow] Add unique operator

masahi edited a comment on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-780284547


   It can be a lot simpler than that. Unique is basically sort + adjacent difference + exclusive scan. If you don't understand that statement, the following example should help. We have an exclusive scan for CPU (the `cumsum` op with `exclusive=True`) and for GPU (see https://github.com/apache/tvm/pull/7303).
   
   If we implement unique this way, the same code runs on both CPU and GPU.
   ```python
   import numpy as np
   
   
   def exclusive_scan(arr):
       return np.cumsum(arr) - arr
   
   
   inp = np.random.randint(0, 10, size=(15,))
   argsort_indices = np.argsort(inp)
   sorted_inp = np.array([inp[i] for i in argsort_indices])
   print("sorted input:", sorted_inp)
   
   adj_diff = np.concatenate([[sorted_inp[0]],  np.diff(sorted_inp)])
   print("adjacent difference:", adj_diff)
   
   non_zero = adj_diff != 0
   non_zero[0] = True  # the first element is always selected
   ex_scan = exclusive_scan(non_zero)
   print("exclusive scan:", ex_scan)
   
   unique = np.zeros(inp.shape[0], dtype=int)  # np.int is deprecated; use the builtin int
   
   for i in range(inp.shape[0]):
       if non_zero[i] != 0:
           unique[ex_scan[i]] = inp[argsort_indices[i]]
   
   print("num unique element:", ex_scan[-1] + 1)
   print("unique:", unique)
   ```
   
   Output:
   ```
   sorted input: [0 0 0 4 5 5 6 6 6 6 6 7 8 8 9]
   adjacent difference: [0 0 0 4 1 0 1 0 0 0 0 1 1 0 1]
   exclusive scan: [0 1 1 1 2 3 3 4 4 4 4 4 5 6 6]
   num unique element: 7
   unique: [0 4 5 6 7 8 9 0 0 0 0 0 0 0 0]
   ```
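   
   As a side note (my own extension of the sketch above, not part of the PR), the per-element counts fall out of the same boundary information:
   
   ```python
   # group start positions in sorted order; group sizes by adjacent difference
   boundaries = np.nonzero(non_zero)[0]
   ends = np.append(boundaries[1:], len(inp))
   counts = ends - boundaries  # counts[i] = occurrences of the i-th unique value
   print("counts:", counts)    # [3 1 2 5 1 2 1] for the input above
   ```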





[GitHub] [tvm] masahi edited a comment on pull request #7441: [Frontend][Tensorflow] Add unique operator

masahi edited a comment on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-777181989


   Thanks, I was planning to work on unique next week, happy to collaborate.
   
   I can add a TIR unique impl for both CPU and GPU later. We can add the Relay boilerplate, a temporary impl in C++, and tests in this PR.





[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582684929



##########
File path: python/tvm/relay/frontend/tensorflow.py
##########
@@ -2324,6 +2324,39 @@ def _impl(inputs, attr, params, mod):
     return _impl
 
 
+def _unique(return_counts=True):
+    def _impl(inputs, attr, params, mod):
+        assert len(inputs) == 1
+        data = inputs[0]
+        if return_counts:
+            [unique, indices, num_uniq, counts] = _op.unique(
+                data, is_sorted=False, return_counts=True
+            )
+            unique_sliced = _op.strided_slice(unique, begin=[0], end=num_uniq, slice_mode="size")
+            counts_sliced = _op.strided_slice(counts, begin=[0], end=num_uniq, slice_mode="size")
+            return _expr.TupleWrapper(
+                _expr.Tuple([unique_sliced, indices, counts_sliced]),
+                3,
+            )
+        else:
+            [unique, indices, num_uniq] = _op.unique(data, is_sorted=False, return_counts=False)
+            unique_sliced = _op.strided_slice(unique, begin=[0], end=num_uniq, slice_mode="size")
+            return _expr.TupleWrapper(
+                _expr.Tuple([unique_sliced, indices]),
+                2,
+            )
+
+    return _impl
+
+
+def _unique_with_counts():
+    def _impl(inputs, attr, params, mod):

Review comment:
       Remove this







[GitHub] [tvm] masahi commented on pull request #7441: [Frontend][Tensorflow] Add unique operator

masahi commented on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-782545316


   Can you also add the PyTorch frontend? Not all options need to be supported. It's likely the same as the TF conversion.
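   
   For reference, the PyTorch op in question (per the PyTorch docs; `dim` is presumably among the options that can be skipped):
   
   ```python
   import torch
   
   x = torch.tensor([1, 1, 2, 4, 4, 4, 7, 8, 8])
   output, inverse, counts = torch.unique(
       x, sorted=True, return_inverse=True, return_counts=True
   )
   # output  == [1, 2, 4, 7, 8]
   # inverse == [0, 0, 1, 2, 2, 2, 3, 4, 4]
   # counts  == [2, 1, 3, 1, 2]
   ```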





[GitHub] [tvm] ymwangg commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

ymwangg commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r574883250



##########
File path: python/tvm/relay/op/strategy/generic.py
##########
@@ -1395,3 +1395,28 @@ def cumsum_strategy(attrs, inputs, out_type, target):
         name="cumsum.generic",
     )
     return strategy
+
+
+def wrap_compute_unique(topi_compute):
+    """Wrap unique topi compute"""
+
+    def _compute_unique(attrs, inputs, _):
+        return topi_compute(inputs[0])

Review comment:
       Since `output[0]`, `output[1]`, and `output[2]` have the same shape as `input[0]`, and `output[3]` has shape `[1]`, we don't need the shape function in the compute function.







[GitHub] [tvm] ymwangg edited a comment on pull request #7441: [Frontend][Tensorflow] Add unique operator

ymwangg edited a comment on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-782545404


   @masahi Yeah, I only added the CPU version in this PR. I'm not very familiar with the GPU IR yet, but I can do it later. If the overall structure looks good, I can add `unique_with_counts` in a future PR, since their implementations are very similar.
   
   I'll add the PyTorch frontend in this PR.





[GitHub] [tvm] ymwangg commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

ymwangg commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r581373778



##########
File path: python/tvm/relay/frontend/pytorch.py
##########
@@ -2164,6 +2164,21 @@ def is_floating_point(self, inputs, input_types):
         is_float = input_type in ["float32", "float64", "float16", "bfloat16"]
         return _expr.const(is_float)
 
+    def unique(self, inputs, input_types):
+        assert len(inputs) == 4
+        [data, is_sorted, return_inverse, return_counts] = inputs
+        if return_counts:

Review comment:
       changed







[GitHub] [tvm] masahi edited a comment on pull request #7441: [Frontend][Tensorflow] Add unique operator

masahi edited a comment on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-780440200


   For your first implementation, a combination-based approach is OK. But `unique` is important enough that I think it deserves its own operator. An implementation written directly in the IR builder will also likely be faster, and supporting the other options will be easier that way.
   
   So use the IR builder if you are comfortable with it; otherwise a combination of Relay ops is fine. Performance and support for the other options can be done later (by me).
   
   Don't worry about `unique_by_key`. Last time I checked the PyTorch implementation, I concluded that we can do everything PyTorch does via the IR builder.





[GitHub] [tvm] ymwangg edited a comment on pull request #7441: [Frontend][Tensorflow] Add unique operator

ymwangg edited a comment on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-782545404


   @masahi Yeah, I only added the CPU version in this PR. I'm not very familiar with the GPU IR yet, but I can do it later. If the overall structure looks good, I can add `unique_with_counts`, since their implementations are very similar.
   
   I'll add the PyTorch frontend in this PR.





[GitHub] [tvm] ymwangg commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

ymwangg commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r583081232



##########
File path: python/tvm/topi/cuda/unique.py
##########
@@ -0,0 +1,394 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name
+"""Unique operator"""
+import tvm
+from tvm import te, tir
+from ...te import hybrid
+from .scan import cumsum
+from .sort import sort, argsort
+from ..utils import ceil_div
+
+
+def _calc_adjacent_diff_ir(data, output, binop=tir.Sub):
+    """Low level IR to calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    output: Buffer
+        A buffer to store adjacent difference, of the same shape as data. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+
+    binop: function, optional
+        A binary associative op to use for calculating adjacent difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    output_ptr = ib.buffer_ptr(output)
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            with ib.if_scope(tid == 0):
+                output_ptr[tid] = 0
+            with ib.else_scope():
+                output_ptr[tid] = tir.Cast(output.dtype, binop(data_ptr[tid], data_ptr[tid - 1]))
+    return ib.get()
+
+
+def _calc_adjacent_diff(data, out_dtype="int32", binop=tir.Sub):
+    """Function calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : tvm.te.Tensor
+        Input 1-D tensor.
+
+    out_dtype : str
+        The output tensor data type.
+
+    binop: function, optional
+        A binary associative op to use for calculating difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+
+    Returns
+    -------
+    output : tvm.te.Tensor
+        1-D tensor storing the adjacent difference of the input tensor. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+    """
+    data_buf = tir.decl_buffer(data.shape, data.dtype, "sorted_data_buf", data_alignment=8)
+    output_buf = tir.decl_buffer(data.shape, out_dtype, "output_buf", data_alignment=8)
+    return te.extern(
+        [data.shape],
+        [data],
+        lambda ins, outs: _calc_adjacent_diff_ir(ins[0], outs[0], binop=binop),
+        dtype=[out_dtype],
+        in_buffers=[data_buf],
+        out_buffers=[output_buf],
+        name="_calc_adjacent_diff",
+        tag="_calc_adjacent_diff_gpu",
+    )
+
+
+@hybrid.script
+def _calc_num_unique(inc_scan):
+    """Helper function to get the number of unique elements fron inc_scan tensor"""
+    output = output_tensor((1,), "int32")
+    for i in bind("threadIdx.x", 1):
+        output[i] = inc_scan[inc_scan.shape[0] - 1] + int32(1)
+    return output
+
+
+def _calc_unique_ir(
+    data, argsorted_indices, inc_scan, index_converter, unique_elements, indices, counts
+):
+    """Low level IR to calculate unique elements, inverse indices, and counts (optional) of
+    unique elements of 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    argsorted_indices : Buffer
+        A buffer that stores the argsorted indices of the input data.
+
+    inc_scan : Buffer
+        A buffer that stores the inclusive scan of the binary tir.NE adjacent difference
+        of the sorted data.
+
+    index_converter (optional) : Buffer
+        An optional index converter that transforms the unique element index
+        such that new_idx = index_converter[old_idx].
+
+    unique_elements : Buffer
+        A buffer that stores the unique elements.
+
+    indices : Buffer
+        A buffer that stores the index of each input data element in the unique element array.
+
+    counts (optional) : Buffer
+        A buffer that stores the count of each unique element.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    unique_elements_ptr = ib.buffer_ptr(unique_elements)
+    indices_ptr = ib.buffer_ptr(indices)
+
+    index_converter_ptr = None
+    if isinstance(index_converter, tir.Buffer):
+        index_converter_ptr = ib.buffer_ptr(index_converter)
+
+    if isinstance(counts, tir.Buffer):
+        counts_ptr = ib.buffer_ptr(counts)
+        arange_ptr = ib.allocate(counts_ptr.dtype, counts.shape, name="arange_buf", scope="global")
+
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+
+    # calculate unique elements and inverse indices
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            data_idx = argsorted_indices_ptr[tid]
+            unique_idx = (
+                inc_scan_ptr[tid]
+                if not index_converter_ptr
+                else index_converter_ptr[inc_scan_ptr[tid]]
+            )
+            indices_ptr[data_idx] = unique_idx
+            with ib.if_scope(tid == 0):
+                unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+
+    # if need to return counts
+    if isinstance(counts, tir.Buffer):
+        num_unique = inc_scan_ptr[inc_scan.shape[0] - 1] + 1
+        num_elements = data.shape[0]
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < batch_size):
+                with ib.if_scope(tid == 0):
+                    arange_ptr[num_unique - 1] = num_elements
+                with ib.else_scope():
+                    with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                        arange_ptr[inc_scan_ptr[tid] - 1] = tid
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < num_unique):
+                unique_idx = tid if not index_converter_ptr else index_converter_ptr[tid]
+                with ib.if_scope(tid == 0):
+                    counts_ptr[unique_idx] = arange_ptr[tid]
+                with ib.else_scope():

Review comment:
       Right, the main problem is that for a `tid` satisfying `inc_scan[tid] != inc_scan[tid - 1]`, we also need to know the previous `tid` that satisfied the same condition. It's possible to use `count` to store the cumulative sum, but getting the adjacent diff in-place requires a global sync.
   
   I think we can use `indices_ptr` as a temp buffer to avoid allocating a new array.
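   
   A numpy model of the two-pass scheme under discussion (my sketch with hypothetical names; `inc_scan` is the inclusive scan of the tir.NE adjacent difference of the sorted data):
   
   ```python
   import numpy as np
   
   # e.g. for sorted data [0,0,0,4,5,5,6,6,6,6,6,7,8,8,9]
   inc_scan = np.array([0, 0, 0, 1, 2, 2, 3, 3, 3, 3, 3, 4, 5, 5, 6])
   n = len(inc_scan)
   num_unique = inc_scan[-1] + 1
   
   # pass 1: every boundary thread writes the (exclusive) end position of the
   # group it closes; thread 0 writes the sentinel for the last group
   ends = np.empty(num_unique, dtype=np.int64)
   ends[num_unique - 1] = n
   for tid in range(1, n):
       if inc_scan[tid] != inc_scan[tid - 1]:
           ends[inc_scan[tid] - 1] = tid
   
   # pass 2: counts are adjacent differences of the end positions
   counts = np.empty(num_unique, dtype=np.int64)
   counts[0] = ends[0]
   counts[1:] = ends[1:] - ends[:-1]
   print(counts)  # [3 1 2 5 1 2 1]
   ```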







[GitHub] [tvm] ymwangg commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

ymwangg commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582551446



##########
File path: python/tvm/relay/frontend/tensorflow.py
##########
@@ -2324,6 +2324,35 @@ def _impl(inputs, attr, params, mod):
     return _impl
 
 
+def _unique():
+    def _impl(inputs, attr, params, mod):
+        assert len(inputs) == 1
+        data = inputs[0]
+        [unique, indices, num_uniq] = _op.unique(data, is_sorted=False, return_counts=False)
+        unique_sliced = _op.strided_slice(unique, begin=[0], end=num_uniq, slice_mode="size")
+        return _expr.TupleWrapper(
+            _expr.Tuple([unique_sliced, indices]),
+            2,
+        )
+
+    return _impl
+
+
+def _unique_with_counts():
+    def _impl(inputs, attr, params, mod):
+        assert len(inputs) == 1
+        data = inputs[0]
+        [unique, indices, num_uniq, counts] = _op.unique(data, is_sorted=False, return_counts=True)
+        unique_sliced = _op.strided_slice(unique, begin=[0], end=num_uniq, slice_mode="size")
+        counts_sliced = _op.strided_slice(counts, begin=[0], end=num_uniq, slice_mode="size")
+        return _expr.TupleWrapper(
+            _expr.Tuple([unique_sliced, indices, counts_sliced]),
+            3,
+        )
+
+    return _impl
+

Review comment:
       good suggestion, changed.







[GitHub] [tvm] masahi commented on pull request #7441: [Frontend][Tensorflow] Add unique operator

masahi commented on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-777861757


   @ymwangg For a general op like `unique`, we should follow the NumPy API rather than being too specific to TF. PyTorch unique should be supported by the same API. Framework-specific details should go into the frontend.
   
   NumPy and PyTorch support a `dim` argument to do unique on multidimensional input, but I don't think it's a good idea. So restricting to 1-D, at least for the first implementation, sounds good to me.
   
   We can implement `unique` via sorting and cumsum. If implemented this way, the same code works on both CPU and GPU. That's what I'm planning to do, but if you feel brave, you can try that in this PR :slightly_smiling_face:
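   
   For concreteness, the NumPy API referred to above:
   
   ```python
   import numpy as np
   
   x = np.array([4, 5, 5, 1, 4])
   uniq, index, inverse, counts = np.unique(
       x, return_index=True, return_inverse=True, return_counts=True
   )
   # uniq    == [1, 4, 5]        sorted unique values
   # index   == [3, 0, 1]        first-occurrence position of each unique value
   # inverse == [1, 2, 2, 0, 1]  position of each element of x in uniq
   # counts  == [1, 2, 2]
   ```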





[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582593774



##########
File path: python/tvm/topi/cuda/unique.py
##########
@@ -0,0 +1,394 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name
+"""Unique operator"""
+import tvm
+from tvm import te, tir
+from ...te import hybrid
+from .scan import cumsum
+from .sort import sort, argsort
+from ..utils import ceil_div
+
+
+def _calc_adjacent_diff_ir(data, output, binop=tir.Sub):
+    """Low level IR to calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    output: Buffer
+        A buffer to store adjacent difference, of the same shape as data. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+
+    binop: function, optional
+        A binary associative op to use for calculating adjacent difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    output_ptr = ib.buffer_ptr(output)
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            with ib.if_scope(tid == 0):
+                output_ptr[tid] = 0
+            with ib.else_scope():
+                output_ptr[tid] = tir.Cast(output.dtype, binop(data_ptr[tid], data_ptr[tid - 1]))
+    return ib.get()
+
+
+def _calc_adjacent_diff(data, out_dtype="int32", binop=tir.Sub):
+    """Function calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : tvm.te.Tensor
+        Input 1-D tensor.
+
+    out_dtype : str
+        The output tensor data type.
+
+    binop: function, optional
+        A binary associative op to use for calculating difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+
+    Returns
+    -------
+    output : tvm.te.Tensor
+        1-D tensor storing the adjacent difference of the input tensor. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+    """
+    data_buf = tir.decl_buffer(data.shape, data.dtype, "sorted_data_buf", data_alignment=8)
+    output_buf = tir.decl_buffer(data.shape, out_dtype, "output_buf", data_alignment=8)
+    return te.extern(
+        [data.shape],
+        [data],
+        lambda ins, outs: _calc_adjacent_diff_ir(ins[0], outs[0], binop=binop),
+        dtype=[out_dtype],
+        in_buffers=[data_buf],
+        out_buffers=[output_buf],
+        name="_calc_adjacent_diff",
+        tag="_calc_adjacent_diff_gpu",
+    )
+
+
+@hybrid.script
+def _calc_num_unique(inc_scan):
+    """Helper function to get the number of unique elements fron inc_scan tensor"""
+    output = output_tensor((1,), "int32")
+    for i in bind("threadIdx.x", 1):
+        output[i] = inc_scan[inc_scan.shape[0] - 1] + int32(1)
+    return output
+
+
+def _calc_unique_ir(
+    data, argsorted_indices, inc_scan, index_converter, unique_elements, indices, counts
+):
+    """Low level IR to calculate unique elements, inverse indices, and counts (optional) of
+    unique elements of 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    argsorted_indices : Buffer
+        A buffer that stores the argsorted indices of the input data.
+
+    inc_scan : Buffer
+        A buffer that stores the inclusive scan of the binary tir.NE adjacent difference
+        of the sorted data.
+
+    index_converter (optional) : Buffer
+        An optional index converter that transforms the unique element index
+        such that new_idx = index_converter[old_idx].
+
+    unique_elements : Buffer
+        A buffer that stores the unique elements.
+
+    indices : Buffer
+        A buffer that stores the index of each input data element in the unique element array.
+
+    counts (optional) : Buffer
+        A buffer that stores the count of each unique element.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    unique_elements_ptr = ib.buffer_ptr(unique_elements)
+    indices_ptr = ib.buffer_ptr(indices)
+
+    index_converter_ptr = None
+    if isinstance(index_converter, tir.Buffer):
+        index_converter_ptr = ib.buffer_ptr(index_converter)
+
+    if isinstance(counts, tir.Buffer):
+        counts_ptr = ib.buffer_ptr(counts)
+        arange_ptr = ib.allocate(counts_ptr.dtype, counts.shape, name="arange_buf", scope="global")
+
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+
+    # calculate unique elements and inverse indices
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            data_idx = argsorted_indices_ptr[tid]
+            unique_idx = (
+                inc_scan_ptr[tid]
+                if not index_converter_ptr
+                else index_converter_ptr[inc_scan_ptr[tid]]
+            )
+            indices_ptr[data_idx] = unique_idx
+            with ib.if_scope(tid == 0):
+                unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+
+    # if need to return counts
+    if isinstance(counts, tir.Buffer):
+        num_unique = inc_scan_ptr[inc_scan.shape[0] - 1] + 1
+        num_elements = data.shape[0]
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < batch_size):
+                with ib.if_scope(tid == 0):
+                    arange_ptr[num_unique - 1] = num_elements
+                with ib.else_scope():
+                    with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                        arange_ptr[inc_scan_ptr[tid] - 1] = tid
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < num_unique):
+                unique_idx = tid if not index_converter_ptr else index_converter_ptr[tid]
+                with ib.if_scope(tid == 0):
+                    counts_ptr[unique_idx] = arange_ptr[tid]
+                with ib.else_scope():

Review comment:
       I'm imagining that we can have something like (it is probably wrong, but hopefully it conveys my idea)
   
   `counts_ptr[inc_scan_ptr[tid] - 1] = tid - indices_ptr[inc_scan_ptr[tid - 1]]`
   
   Since the contents of `arange` seem redundant given the other information you already have here, I have a strong feeling that we don't need to materialize `arange`.







[GitHub] [tvm] codeislife99 commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

codeislife99 commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r574219984



##########
File path: python/tvm/relay/op/_transform.py
##########
@@ -885,3 +886,29 @@ def where_shape_func(attrs, inputs, _):
     out_shape = _broadcast_shape_tensors(bcast_shape, cond_shape)
 
     return [out_shape]
+
+
+register_strategy("unique", strategy.unique_strategy)
+register_pattern("unique", OpPattern.OPAQUE)
+
+
+@script
+def _unique_shape_1(data_shape):
+    shape_tensor = output_tensor((1,), "int64")
+    shape_tensor[0] = int64(data_shape[0])
+    return shape_tensor
+
+
+@script
+def _unique_shape_2(inputs):
+    shape_tensor = output_tensor((1,), "int64")
+    shape_tensor[0] = int64(1)
+    return shape_tensor
+
+
+@_reg.register_shape_func("unique", False)
+def unique_shape_func(attrs, inputs, _):
+    """
+    Shape func for unique operator.
+    """
+    return [_unique_shape_1(inputs[0]), _unique_shape_1(inputs[0]), _unique_shape_1(inputs[0]), _unique_shape_2(inputs[0])]

Review comment:
       You can put this in the same shape func: return a tuple from the shape func and drop the list from this function.
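   
   Something like the following sketch (names are mine, untested):
   
   ```python
   @script
   def _unique_shape(data_shape):
       unique_shape = output_tensor((1,), "int64")
       indices_shape = output_tensor((1,), "int64")
       counts_shape = output_tensor((1,), "int64")
       num_unique_shape = output_tensor((1,), "int64")
       unique_shape[0] = data_shape[0]
       indices_shape[0] = data_shape[0]
       counts_shape[0] = data_shape[0]
       num_unique_shape[0] = int64(1)
       return (unique_shape, indices_shape, counts_shape, num_unique_shape)
   
   
   @_reg.register_shape_func("unique", False)
   def unique_shape_func(attrs, inputs, _):
       """Shape func for unique operator."""
       return _unique_shape(inputs[0])
   ```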







[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r581459643



##########
File path: python/tvm/topi/cuda/unique.py
##########
@@ -0,0 +1,384 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name, no-else-return
+"""Unique operator"""
+from tvm import te, tir
+import tvm
+
+from ...te import hybrid
+from .scan import cumsum
+from .sort import sort, argsort
+from ..utils import ceil_div
+
+
+def _calc_adjacent_diff_ir(data, adjacent_diff):
+    ib = tvm.tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    adjacent_diff_ptr = ib.buffer_ptr(adjacent_diff)
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            with ib.if_scope(tid == 0):
+                adjacent_diff_ptr[tid] = 0
+            with ib.else_scope():
+                with ib.if_scope(data_ptr[tid] != data_ptr[tid - 1]):
+                    adjacent_diff_ptr[tid] = 1
+                with ib.else_scope():
+                    adjacent_diff_ptr[tid] = 0
+    return ib.get()
+
+
+@hybrid.script
+def _calc_num_unique(data):
+    output = output_tensor((1,), "int32")
+    for i in bind("threadIdx.x", 1):
+        output[i] = data[data.shape[0] - 1] + int32(1)
+    return output
+
+
+def _calc_unique_sorted_ir(data, argsorted_indices, inc_scan, unique_elements, indices):
+    ib = tvm.tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    unique_elements_ptr = ib.buffer_ptr(unique_elements)
+    indices_ptr = ib.buffer_ptr(indices)
+
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            indices_ptr[argsorted_indices_ptr[tid]] = inc_scan_ptr[tid]
+            with ib.if_scope(tid == 0):
+                unique_elements_ptr[inc_scan_ptr[tid]] = data_ptr[argsorted_indices_ptr[tid]]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    unique_elements_ptr[inc_scan_ptr[tid]] = data_ptr[argsorted_indices_ptr[tid]]
+    return ib.get()
+
+
+def _calc_counts_sorted_ir(inc_scan, counts):
+    ib = tvm.tir.ir_builder.create()
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    counts_ptr = ib.buffer_ptr(counts)
+
+    batch_size = inc_scan.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            counts_ptr[tid] = 0
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        atomic_add_return = ib.allocate(counts.dtype, (1,), name="atomic_add_return", scope="local")
+        with ib.if_scope(tid < batch_size):
+            index = inc_scan_ptr[tid]
+            atomic_add_return[0] = tvm.tir.call_intrin(
+                counts.dtype,
+                "tir.atomic_add",
+                tvm.tir.call_intrin("handle", "tir.address_of", counts_ptr[index]),
+                1,
+            )
+    return ib.get()
+
+
+def _calc_first_occurence_ir(argsorted_indices, inc_scan, first_occurence):
+    ib = tvm.tir.ir_builder.create()
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    first_occurence_ptr = ib.buffer_ptr(first_occurence)
+    batch_size = argsorted_indices.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            first_occurence_ptr[tid] = batch_size
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            with ib.if_scope(tid == 0):
+                first_occurence_ptr[inc_scan_ptr[tid]] = argsorted_indices_ptr[tid]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    first_occurence_ptr[inc_scan_ptr[tid]] = argsorted_indices_ptr[tid]
+    return ib.get()
+
+
+def _calc_unique_unsorted_ir(
+    data, argsorted_indices, inc_scan, index_converter, unique_elements, indices
+):
+    ib = tvm.tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    index_converter_ptr = ib.buffer_ptr(index_converter)
+    unique_elements_ptr = ib.buffer_ptr(unique_elements)
+    indices_ptr = ib.buffer_ptr(indices)
+
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            indices_ptr[argsorted_indices_ptr[tid]] = index_converter_ptr[inc_scan_ptr[tid]]
+            with ib.if_scope(tid == 0):
+                unique_elements_ptr[index_converter_ptr[inc_scan_ptr[tid]]] = data_ptr[
+                    argsorted_indices_ptr[tid]
+                ]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    unique_elements_ptr[index_converter_ptr[inc_scan_ptr[tid]]] = data_ptr[
+                        argsorted_indices_ptr[tid]
+                    ]
+    return ib.get()
+
+
+def _calc_counts_unsorted_ir(inc_scan, index_converter, counts):

Review comment:
       This looks similar to `_calc_counts_sorted_ir`; maybe you can do some trick around `index_converter` to share the implementation?







[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r579566303



##########
File path: tests/python/relay/test_op_level3.py
##########
@@ -1453,5 +1453,53 @@ def verify_scatter_nd_with_stack(data_np, indices_np, shape, ref_res, rtol=1e-5,
     verify_scatter_nd_with_stack(data, indices, shape, out)
 
 
+@tvm.testing.uses_gpu
+def test_unique():
+    def calc_numpy_unique(data, is_sorted=False):
+        uniq, index, inverse, counts = np.unique(
+            data, return_index=True, return_inverse=True, return_counts=True
+        )
+        num_uniq = np.array([len(uniq)]).astype("int32")
+        if not is_sorted:
+            order = np.argsort(index)
+            reverse_order = np.argsort(order)
+            uniq = uniq[order].astype(data.dtype)
+            inverse = np.array([reverse_order[i] for i in inverse]).astype("int32")
+            counts = counts[order].astype("int32")
+        return [uniq.astype(data.dtype), inverse.astype("int32"), counts, num_uniq]
+
+    def verify_unique(n, dtype, is_dyn=False, is_sorted=False):
+        if is_dyn:
+            x = relay.var("x", relay.TensorType([relay.Any()], dtype))
+        else:
+            x = relay.var("x", relay.TensorType([n], dtype))
+        outs = relay.unique(x, is_sorted)
+        outs = outs.astuple()
+        func = relay.Function([x], outs)
+        x_data = np.random.randint(50, size=n).astype(dtype)
+
+        if is_dyn:
+            backends = ["vm", "debug"]
+        else:
+            backends = ["graph", "debug"]
+        for target, ctx in tvm.testing.enabled_targets():

Review comment:
       This will probably try to run on GPU
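   
   One possible guard until the CUDA kernel lands (a sketch; the exact predicate is my assumption):
   
   ```python
   for target, ctx in tvm.testing.enabled_targets():
       if "cuda" in target or "gpu" in target:
           continue  # skip GPU targets until a GPU unique kernel exists
       # ... run the CPU checks as before
   ```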







[GitHub] [tvm] ymwangg commented on pull request #7441: [Frontend][Tensorflow] Add unique operator

ymwangg commented on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-780229483


   @masahi Thanks for your comment.
   Here's the algorithm that I came up with based on your suggestions.
   ```python
   # topi
   def unique(data, data_sorted, data_argsorted):
       output = [0] * len(data)
       count = [0] * len(data)
       first_occurrence = [len(data)] * len(data)
       inverse_indices = [0] * len(data)
       num_unique = 0
       # ir_builder
       for i in range(len(data)):
           if i == 0 or data_sorted[i] != data_sorted[i-1]:
               num_unique += 1
               output[num_unique-1] = data_sorted[i]
               first_occurrence[num_unique-1] = min(first_occurrence[num_unique-1], data_argsorted[i])
           count[num_unique-1] += 1
           inverse_indices[data_argsorted[i]] = num_unique - 1
       return output, count, first_occurrence, inverse_indices, num_unique
   
   # tf front end
   def tf_unique(data):
       output, count, first_occurrence, inverse_indices, num_unique = unique(data, np.sort(data), np.argsort(data))
       sorted_occurrence_indices = np.argsort(first_occurrence) # relay.argsort
       new_output = [output[sorted_occurrence_indices[i]] for i in range(num_unique)] # relay.take
       index_converter = np.argsort(sorted_occurrence_indices) # relay.argsort
       new_inverse_indices = [index_converter[i] for i in inverse_indices] # relay.take
       return new_output, new_inverse_indices
   ```
   It defines a topi function that is similar to `np.unique` but requires the sorted data and the argsort of the data. In the frontend, it needs to do argsort twice if we want to keep the unique elements in the order of their first occurrence.
   
   Does this look good to you?
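   
   A quick sanity check of the sketch (assuming `numpy` is imported as `np`):
   
   ```python
   data = np.array([4, 5, 1, 5, 4, 4])
   out, inv = tf_unique(data)
   print(out)  # [4, 5, 1]          unique values, in order of first occurrence
   print(inv)  # [0, 1, 2, 1, 0, 0] index of each element of data in out
   ```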
   





[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582495221



##########
File path: python/tvm/topi/cuda/unique.py
##########
@@ -0,0 +1,394 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name
+"""Unique operator"""
+import tvm
+from tvm import te, tir
+from ...te import hybrid
+from .scan import cumsum
+from .sort import sort, argsort
+from ..utils import ceil_div
+
+
+def _calc_adjacent_diff_ir(data, output, binop=tir.Sub):
+    """Low level IR to calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    output: Buffer
+        A buffer to store adjacent difference, of the same shape as data. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+
+    binop: function, optional
+        A binary associative op to use for calculating adjacent difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    output_ptr = ib.buffer_ptr(output)
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            with ib.if_scope(tid == 0):
+                output_ptr[tid] = 0
+            with ib.else_scope():
+                output_ptr[tid] = tir.Cast(output.dtype, binop(data_ptr[tid], data_ptr[tid - 1]))
+    return ib.get()
+
+
+def _calc_adjacent_diff(data, out_dtype="int32", binop=tir.Sub):
+    """Function calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : tvm.te.Tensor
+        Input 1-D tensor.
+
+    out_dtype : str
+        The output tensor data type.
+
+    binop: function, optional
+        A binary associative op to use for calculating difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+
+    Returns
+    -------
+    output : tvm.te.Tensor
+        1-D tensor storing the adjacent difference of the input tensor. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+    """
+    data_buf = tir.decl_buffer(data.shape, data.dtype, "sorted_data_buf", data_alignment=8)
+    output_buf = tir.decl_buffer(data.shape, out_dtype, "output_buf", data_alignment=8)
+    return te.extern(
+        [data.shape],
+        [data],
+        lambda ins, outs: _calc_adjacent_diff_ir(ins[0], outs[0], binop=binop),
+        dtype=[out_dtype],
+        in_buffers=[data_buf],
+        out_buffers=[output_buf],
+        name="_calc_adjacent_diff",
+        tag="_calc_adjacent_diff_gpu",
+    )
+
+
+@hybrid.script
+def _calc_num_unique(inc_scan):
+    """Helper function to get the number of unique elements fron inc_scan tensor"""
+    output = output_tensor((1,), "int32")
+    for i in bind("threadIdx.x", 1):
+        output[i] = inc_scan[inc_scan.shape[0] - 1] + int32(1)
+    return output
+
+
+def _calc_unique_ir(
+    data, argsorted_indices, inc_scan, index_converter, unique_elements, indices, counts
+):
+    """Low level IR to calculate unique elements, inverse indices, and counts (optional) of
+    unique elements of 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    argsorted_indices : Buffer
+        A buffer that stores the argsorted indices of the input data.
+
+    inc_scan : Buffer
+        A buffer that stores the inclusive scan of the binary tir.NE adjacent difference
+        of the sorted data.
+
+    index_converter (optional) : Buffer
+        An optional index converter that transforms the unique element index
+        such that new_idx = index_converter[old_idx].
+
+    unique_elements : Buffer
+        A buffer that stores the unique elements.
+
+    indices : Buffer
+        A buffer that stores the index of each input data element in the unique element array.
+
+    counts (optional) : Buffer
+        A buffer that stores the count of each unique element.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    unique_elements_ptr = ib.buffer_ptr(unique_elements)
+    indices_ptr = ib.buffer_ptr(indices)
+
+    index_converter_ptr = None
+    if isinstance(index_converter, tir.Buffer):
+        index_converter_ptr = ib.buffer_ptr(index_converter)
+
+    if isinstance(counts, tir.Buffer):
+        counts_ptr = ib.buffer_ptr(counts)
+        arange_ptr = ib.allocate(counts_ptr.dtype, counts.shape, name="arange_buf", scope="global")
+
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+
+    # calculate unique elements and inverse indices
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            data_idx = argsorted_indices_ptr[tid]
+            unique_idx = (
+                inc_scan_ptr[tid]
+                if not index_converter_ptr
+                else index_converter_ptr[inc_scan_ptr[tid]]
+            )
+            indices_ptr[data_idx] = unique_idx
+            with ib.if_scope(tid == 0):
+                unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+
+    # if need to return counts
+    if isinstance(counts, tir.Buffer):
+        num_unique = inc_scan_ptr[inc_scan.shape[0] - 1] + 1
+        num_elements = data.shape[0]
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < batch_size):
+                with ib.if_scope(tid == 0):
+                    arange_ptr[num_unique - 1] = num_elements
+                with ib.else_scope():
+                    with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                        arange_ptr[inc_scan_ptr[tid] - 1] = tid
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < num_unique):
+                unique_idx = tid if not index_converter_ptr else index_converter_ptr[tid]
+                with ib.if_scope(tid == 0):
+                    counts_ptr[unique_idx] = arange_ptr[tid]
+                with ib.else_scope():

Review comment:
       Probably we don't need to allocate the `arange` array explicitly. Just assigning `tid` should work?




----------------------------------------------------------------



[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582495221



##########
File path: python/tvm/topi/cuda/unique.py
##########
@@ -0,0 +1,394 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name
+"""Unique operator"""
+import tvm
+from tvm import te, tir
+from ...te import hybrid
+from .scan import cumsum
+from .sort import sort, argsort
+from ..utils import ceil_div
+
+
+def _calc_adjacent_diff_ir(data, output, binop=tir.Sub):
+    """Low level IR to calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    output: Buffer
+        A buffer to store adjacent difference, of the same shape as data. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+
+    binop: function, optional
+        A binary associative op to use for calculating adjacent difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    output_ptr = ib.buffer_ptr(output)
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            with ib.if_scope(tid == 0):
+                output_ptr[tid] = 0
+            with ib.else_scope():
+                output_ptr[tid] = tir.Cast(output.dtype, binop(data_ptr[tid], data_ptr[tid - 1]))
+    return ib.get()
+
+
+def _calc_adjacent_diff(data, out_dtype="int32", binop=tir.Sub):
+    """Function calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : tvm.te.Tensor
+        Input 1-D tensor.
+
+    out_dtype : str
+        The output tensor data type.
+
+    binop: function, optional
+        A binary associative op to use for calculating difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+
+    Returns
+    -------
+    output : tvm.te.Tensor
+        1-D tensor storing the adjacent difference of the input tensor. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+    """
+    data_buf = tir.decl_buffer(data.shape, data.dtype, "sorted_data_buf", data_alignment=8)
+    output_buf = tir.decl_buffer(data.shape, out_dtype, "output_buf", data_alignment=8)
+    return te.extern(
+        [data.shape],
+        [data],
+        lambda ins, outs: _calc_adjacent_diff_ir(ins[0], outs[0], binop=binop),
+        dtype=[out_dtype],
+        in_buffers=[data_buf],
+        out_buffers=[output_buf],
+        name="_calc_adjacent_diff",
+        tag="_calc_adjacent_diff_gpu",
+    )
+
+
+@hybrid.script
+def _calc_num_unique(inc_scan):
+    """Helper function to get the number of unique elements fron inc_scan tensor"""
+    output = output_tensor((1,), "int32")
+    for i in bind("threadIdx.x", 1):
+        output[i] = inc_scan[inc_scan.shape[0] - 1] + int32(1)
+    return output
+
+
+def _calc_unique_ir(
+    data, argsorted_indices, inc_scan, index_converter, unique_elements, indices, counts
+):
+    """Low level IR to calculate unique elements, inverse indices, and counts (optional) of
+    unique elements of 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    argsorted_indices : Buffer
+        A buffer that stores the argsorted indices of the input data.
+
+    inc_scan : Buffer
+        A buffer that stores the inclusive scan of the binary tir.NE adjacent difference
+        of the sorted data.
+
+    index_converter (optional) : Buffer
+        An optional index converter that transforms the unique element index
+        such that new_idx = index_converter[old_idx].
+
+    unique_elements : Buffer
+        A buffer that stores the unique elements.
+
+    indices : Buffer
+        A buffer that stores the index of each input data element in the unique element array.
+
+    counts (optional) : Buffer
+        A buffer that stores the count of each unique element.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    unique_elements_ptr = ib.buffer_ptr(unique_elements)
+    indices_ptr = ib.buffer_ptr(indices)
+
+    index_converter_ptr = None
+    if isinstance(index_converter, tir.Buffer):
+        index_converter_ptr = ib.buffer_ptr(index_converter)
+
+    if isinstance(counts, tir.Buffer):
+        counts_ptr = ib.buffer_ptr(counts)
+        arange_ptr = ib.allocate(counts_ptr.dtype, counts.shape, name="arange_buf", scope="global")
+
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+
+    # calculate unique elements and inverse indices
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            data_idx = argsorted_indices_ptr[tid]
+            unique_idx = (
+                inc_scan_ptr[tid]
+                if not index_converter_ptr
+                else index_converter_ptr[inc_scan_ptr[tid]]
+            )
+            indices_ptr[data_idx] = unique_idx
+            with ib.if_scope(tid == 0):
+                unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+
+    # if need to return counts
+    if isinstance(counts, tir.Buffer):
+        num_unique = inc_scan_ptr[inc_scan.shape[0] - 1] + 1
+        num_elements = data.shape[0]
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < batch_size):
+                with ib.if_scope(tid == 0):
+                    arange_ptr[num_unique - 1] = num_elements
+                with ib.else_scope():
+                    with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                        arange_ptr[inc_scan_ptr[tid] - 1] = tid
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < num_unique):
+                unique_idx = tid if not index_converter_ptr else index_converter_ptr[tid]
+                with ib.if_scope(tid == 0):
+                    counts_ptr[unique_idx] = arange_ptr[tid]
+                with ib.else_scope():

Review comment:
       Probably we don't need to allocate the `arange` array explicitly. Just assigning `tid` should work? I didn't put much thought into this, so I could be wrong.




----------------------------------------------------------------



[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582593774



##########
File path: python/tvm/topi/cuda/unique.py
##########
@@ -0,0 +1,394 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name
+"""Unique operator"""
+import tvm
+from tvm import te, tir
+from ...te import hybrid
+from .scan import cumsum
+from .sort import sort, argsort
+from ..utils import ceil_div
+
+
+def _calc_adjacent_diff_ir(data, output, binop=tir.Sub):
+    """Low level IR to calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    output: Buffer
+        A buffer to store adjacent difference, of the same shape as data. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+
+    binop: function, optional
+        A binary associative op to use for calculating adjacent difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    output_ptr = ib.buffer_ptr(output)
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            with ib.if_scope(tid == 0):
+                output_ptr[tid] = 0
+            with ib.else_scope():
+                output_ptr[tid] = tir.Cast(output.dtype, binop(data_ptr[tid], data_ptr[tid - 1]))
+    return ib.get()
+
+
+def _calc_adjacent_diff(data, out_dtype="int32", binop=tir.Sub):
+    """Function calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : tvm.te.Tensor
+        Input 1-D tensor.
+
+    out_dtype : str
+        The output tensor data type.
+
+    binop: function, optional
+        A binary associative op to use for calculating difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+
+    Returns
+    -------
+    output : tvm.te.Tensor
+        1-D tensor storing the adjacent difference of the input tensor. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+    """
+    data_buf = tir.decl_buffer(data.shape, data.dtype, "sorted_data_buf", data_alignment=8)
+    output_buf = tir.decl_buffer(data.shape, out_dtype, "output_buf", data_alignment=8)
+    return te.extern(
+        [data.shape],
+        [data],
+        lambda ins, outs: _calc_adjacent_diff_ir(ins[0], outs[0], binop=binop),
+        dtype=[out_dtype],
+        in_buffers=[data_buf],
+        out_buffers=[output_buf],
+        name="_calc_adjacent_diff",
+        tag="_calc_adjacent_diff_gpu",
+    )
+
+
+@hybrid.script
+def _calc_num_unique(inc_scan):
+    """Helper function to get the number of unique elements fron inc_scan tensor"""
+    output = output_tensor((1,), "int32")
+    for i in bind("threadIdx.x", 1):
+        output[i] = inc_scan[inc_scan.shape[0] - 1] + int32(1)
+    return output
+
+
+def _calc_unique_ir(
+    data, argsorted_indices, inc_scan, index_converter, unique_elements, indices, counts
+):
+    """Low level IR to calculate unique elements, inverse indices, and counts (optional) of
+    unique elements of 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    argsorted_indices : Buffer
+        A buffer that stores the argsorted indices of the input data.
+
+    inc_scan : Buffer
+        A buffer that stores the inclusive scan of the binary tir.NE adjacent difference
+        of the sorted data.
+
+    index_converter (optional) : Buffer
+        An optional index converter that transforms the unique element index
+        such that new_idx = index_converter[old_idx].
+
+    unique_elements : Buffer
+        A buffer that stores the unique elements.
+
+    indices : Buffer
+        A buffer that stores the index of each input data element in the unique element array.
+
+    counts (optional) : Buffer
+        A buffer that stores the count of each unique element.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    unique_elements_ptr = ib.buffer_ptr(unique_elements)
+    indices_ptr = ib.buffer_ptr(indices)
+
+    index_converter_ptr = None
+    if isinstance(index_converter, tir.Buffer):
+        index_converter_ptr = ib.buffer_ptr(index_converter)
+
+    if isinstance(counts, tir.Buffer):
+        counts_ptr = ib.buffer_ptr(counts)
+        arange_ptr = ib.allocate(counts_ptr.dtype, counts.shape, name="arange_buf", scope="global")
+
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+
+    # calculate unique elements and inverse indices
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            data_idx = argsorted_indices_ptr[tid]
+            unique_idx = (
+                inc_scan_ptr[tid]
+                if not index_converter_ptr
+                else index_converter_ptr[inc_scan_ptr[tid]]
+            )
+            indices_ptr[data_idx] = unique_idx
+            with ib.if_scope(tid == 0):
+                unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+
+    # if need to return counts
+    if isinstance(counts, tir.Buffer):
+        num_unique = inc_scan_ptr[inc_scan.shape[0] - 1] + 1
+        num_elements = data.shape[0]
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < batch_size):
+                with ib.if_scope(tid == 0):
+                    arange_ptr[num_unique - 1] = num_elements
+                with ib.else_scope():
+                    with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                        arange_ptr[inc_scan_ptr[tid] - 1] = tid
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < num_unique):
+                unique_idx = tid if not index_converter_ptr else index_converter_ptr[tid]
+                with ib.if_scope(tid == 0):
+                    counts_ptr[unique_idx] = arange_ptr[tid]
+                with ib.else_scope():

Review comment:
       I'm imagining that we can have something like
   
   `counts_ptr[inc_scan_ptr[tid] - 1] = tid - indices_ptr[inc_scan_ptr[tid - 1]]`
   
   since the contents of `arange` seem redundant given the other information you already have here. I have a strong feeling that we don't need to materialize `arange`.
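   
   A minimal numpy sketch of that idea (illustration only, not the TIR this PR emits; each loop is independent per iteration, standing in for one parallel kernel, and `inc_scan` is a small example input):
   
   ```
   import numpy as np
   
   # Pass 1: every thread sitting at the *end* of a run writes that run's
   # exclusive end directly into `counts`, reusing it as scratch -- no
   # separate arange buffer is materialized.
   inc_scan = np.array([0, 0, 1, 1, 2, 2, 2, 2, 2, 3])
   n = len(inc_scan)
   counts = np.zeros(inc_scan[-1] + 1, dtype=np.int64)
   for tid in range(n):
       if tid == n - 1 or inc_scan[tid + 1] != inc_scan[tid]:
           counts[inc_scan[tid]] = tid + 1
   
   # Pass 2: adjacent difference of the cumulative ends yields the counts.
   ends = counts.copy()
   for k in range(len(counts)):
       counts[k] = ends[k] - (ends[k - 1] if k > 0 else 0)
   
   print(counts)  # [2 2 5 1]
   ```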




----------------------------------------------------------------



[GitHub] [tvm] masahi commented on pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-786356125


   Thanks @ymwangg @codeislife99, this is really great work!


----------------------------------------------------------------



[GitHub] [tvm] ymwangg commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
ymwangg commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582455822



##########
File path: python/tvm/topi/cuda/unique.py
##########
@@ -0,0 +1,384 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name, no-else-return
+"""Unique operator"""
+from tvm import te, tir
+import tvm
+
+from ...te import hybrid
+from .scan import cumsum
+from .sort import sort, argsort
+from ..utils import ceil_div
+
+
+def _calc_adjacent_diff_ir(data, adjacent_diff):
+    ib = tvm.tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    adjacent_diff_ptr = ib.buffer_ptr(adjacent_diff)
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            with ib.if_scope(tid == 0):
+                adjacent_diff_ptr[tid] = 0
+            with ib.else_scope():
+                with ib.if_scope(data_ptr[tid] != data_ptr[tid - 1]):
+                    adjacent_diff_ptr[tid] = 1
+                with ib.else_scope():
+                    adjacent_diff_ptr[tid] = 0
+    return ib.get()
+
+
+@hybrid.script
+def _calc_num_unique(data):
+    output = output_tensor((1,), "int32")
+    for i in bind("threadIdx.x", 1):
+        output[i] = data[data.shape[0] - 1] + int32(1)
+    return output
+
+
+def _calc_unique_sorted_ir(data, argsorted_indices, inc_scan, unique_elements, indices):
+    ib = tvm.tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    unique_elements_ptr = ib.buffer_ptr(unique_elements)
+    indices_ptr = ib.buffer_ptr(indices)
+
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            indices_ptr[argsorted_indices_ptr[tid]] = inc_scan_ptr[tid]
+            with ib.if_scope(tid == 0):
+                unique_elements_ptr[inc_scan_ptr[tid]] = data_ptr[argsorted_indices_ptr[tid]]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    unique_elements_ptr[inc_scan_ptr[tid]] = data_ptr[argsorted_indices_ptr[tid]]
+    return ib.get()
+
+
+def _calc_counts_sorted_ir(inc_scan, counts):
+    ib = tvm.tir.ir_builder.create()
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    counts_ptr = ib.buffer_ptr(counts)
+
+    batch_size = inc_scan.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            counts_ptr[tid] = 0
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        atomic_add_return = ib.allocate(counts.dtype, (1,), name="atomic_add_return", scope="local")
+        with ib.if_scope(tid < batch_size):
+            index = inc_scan_ptr[tid]
+            atomic_add_return[0] = tvm.tir.call_intrin(
+                counts.dtype,
+                "tir.atomic_add",

Review comment:
       Removed it in favor of the `arange` approach.




----------------------------------------------------------------



[GitHub] [tvm] ymwangg commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
ymwangg commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582554234



##########
File path: python/tvm/topi/cuda/unique.py
##########
@@ -0,0 +1,394 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name
+"""Unique operator"""
+import tvm
+from tvm import te, tir
+from ...te import hybrid
+from .scan import cumsum
+from .sort import sort, argsort
+from ..utils import ceil_div
+
+
+def _calc_adjacent_diff_ir(data, output, binop=tir.Sub):
+    """Low level IR to calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    output: Buffer
+        A buffer to store adjacent difference, of the same shape as data. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+
+    binop: function, optional
+        A binary associative op to use for calculating adjacent difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    output_ptr = ib.buffer_ptr(output)
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            with ib.if_scope(tid == 0):
+                output_ptr[tid] = 0
+            with ib.else_scope():
+                output_ptr[tid] = tir.Cast(output.dtype, binop(data_ptr[tid], data_ptr[tid - 1]))
+    return ib.get()
+
+
+def _calc_adjacent_diff(data, out_dtype="int32", binop=tir.Sub):
+    """Function calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : tvm.te.Tensor
+        Input 1-D tensor.
+
+    out_dtype : str
+        The output tensor data type.
+
+    binop: function, optional
+        A binary associative op to use for calculating difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+
+    Returns
+    -------
+    output : tvm.te.Tensor
+        1-D tensor storing the adjacent difference of the input tensor. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+    """
+    data_buf = tir.decl_buffer(data.shape, data.dtype, "sorted_data_buf", data_alignment=8)
+    output_buf = tir.decl_buffer(data.shape, out_dtype, "output_buf", data_alignment=8)
+    return te.extern(
+        [data.shape],
+        [data],
+        lambda ins, outs: _calc_adjacent_diff_ir(ins[0], outs[0], binop=binop),
+        dtype=[out_dtype],
+        in_buffers=[data_buf],
+        out_buffers=[output_buf],
+        name="_calc_adjacent_diff",
+        tag="_calc_adjacent_diff_gpu",
+    )
+
+
+@hybrid.script
+def _calc_num_unique(inc_scan):
+    """Helper function to get the number of unique elements fron inc_scan tensor"""
+    output = output_tensor((1,), "int32")
+    for i in bind("threadIdx.x", 1):
+        output[i] = inc_scan[inc_scan.shape[0] - 1] + int32(1)
+    return output
+
+
+def _calc_unique_ir(
+    data, argsorted_indices, inc_scan, index_converter, unique_elements, indices, counts
+):
+    """Low level IR to calculate unique elements, inverse indices, and counts (optional) of
+    unique elements of 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    argsorted_indices : Buffer
+        A buffer that stores the argsorted indices of the input data.
+
+    inc_scan : Buffer
+        A buffer that stores the inclusive scan of the binary tir.NE adjacent difference
+        of the sorted data.
+
+    index_converter (optional) : Buffer
+        An optional index converter that transforms the unique element index
+        such that new_idx = index_converter[old_idx].
+
+    unique_elements : Buffer
+        A buffer that stores the unique elements.
+
+    indices : Buffer
+        A buffer that stores the index of each input data element in the unique element array.
+
+    counts (optional) : Buffer
+        A buffer that stores the count of each unique element.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    unique_elements_ptr = ib.buffer_ptr(unique_elements)
+    indices_ptr = ib.buffer_ptr(indices)
+
+    index_converter_ptr = None
+    if isinstance(index_converter, tir.Buffer):
+        index_converter_ptr = ib.buffer_ptr(index_converter)
+
+    if isinstance(counts, tir.Buffer):
+        counts_ptr = ib.buffer_ptr(counts)
+        arange_ptr = ib.allocate(counts_ptr.dtype, counts.shape, name="arange_buf", scope="global")
+
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+
+    # calculate unique elements and inverse indices
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            data_idx = argsorted_indices_ptr[tid]
+            unique_idx = (
+                inc_scan_ptr[tid]
+                if not index_converter_ptr
+                else index_converter_ptr[inc_scan_ptr[tid]]
+            )
+            indices_ptr[data_idx] = unique_idx
+            with ib.if_scope(tid == 0):
+                unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+
+    # if need to return counts
+    if isinstance(counts, tir.Buffer):
+        num_unique = inc_scan_ptr[inc_scan.shape[0] - 1] + 1
+        num_elements = data.shape[0]
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < batch_size):
+                with ib.if_scope(tid == 0):
+                    arange_ptr[num_unique - 1] = num_elements
+                with ib.else_scope():
+                    with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                        arange_ptr[inc_scan_ptr[tid] - 1] = tid
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < num_unique):
+                unique_idx = tid if not index_converter_ptr else index_converter_ptr[tid]
+                with ib.if_scope(tid == 0):
+                    counts_ptr[unique_idx] = arange_ptr[tid]
+                with ib.else_scope():

Review comment:
       Sorry, the name may be confusing. Actually, the `arange_ptr` here records the cumulative counts of the unique elements, and it goes like:
   Step 1: inc_scan   = [0, 0, 1, 1, 2, 2, 2, 2, 2, 3]
           tids       = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
   Step 2: arange_ptr = [2, 4, 9, 10]
   Step 3: counts_ptr = [2, 2, 5, 1]
   
   Maybe we can calculate `counts` first and use `indices_ptr` to replace `arange_ptr`?
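   
   For reference, the three steps above reproduced in numpy (illustration only):
   
   ```
   import numpy as np
   
   inc_scan = np.array([0, 0, 1, 1, 2, 2, 2, 2, 2, 3])  # Step 1
   
   # Step 2: exclusive end of each run of equal inc_scan values.
   arange_buf = np.append(np.flatnonzero(np.diff(inc_scan)) + 1, len(inc_scan))
   print(arange_buf)  # [ 2  4  9 10]
   
   # Step 3: adjacent difference of the cumulative ends.
   counts = np.diff(arange_buf, prepend=0)
   print(counts)      # [2 2 5 1]
   ```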




----------------------------------------------------------------



[GitHub] [tvm] codeislife99 commented on pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
codeislife99 commented on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-780307901


   I see. I was interested in the `counts` option, and you might be as well, because `SparseSegmentSqrtN` or other variants like `SparseLengthsSum` / `EmbeddingBag` can be written as a combination of `take`, `expand`, `repeat`, `scatter_add`, and `unique_count`. So I was interested in whether we could do this either in this PR or in a parallel PR (and later merge them).
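   
   As a rough numpy sketch of that composition (the function name and call signature here are illustrative assumptions, not actual Relay ops):
   
   ```
   import numpy as np
   
   def sparse_segment_sqrt_n(data, indices, segment_ids):
       # take: gather the referenced rows
       rows = data[indices]
       num_segments = segment_ids.max() + 1
       out = np.zeros((num_segments, data.shape[1]), dtype=data.dtype)
       # scatter_add: accumulate the gathered rows into their segments
       np.add.at(out, segment_ids, rows)
       # per-segment counts play the role of the unique/counts output
       counts = np.bincount(segment_ids, minlength=num_segments)
       return out / np.sqrt(np.maximum(counts, 1))[:, None]
   
   data = np.arange(12, dtype=np.float32).reshape(4, 3)
   print(sparse_segment_sqrt_n(data, np.array([0, 1, 2, 3]), np.array([0, 0, 1, 1])))
   ```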


----------------------------------------------------------------



[GitHub] [tvm] codeislife99 commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
codeislife99 commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r574220931



##########
File path: python/tvm/relay/op/strategy/generic.py
##########
@@ -1395,3 +1395,28 @@ def cumsum_strategy(attrs, inputs, out_type, target):
         name="cumsum.generic",
     )
     return strategy
+
+
+def wrap_compute_unique(topi_compute):
+    """Wrap unique topi compute"""
+
+    def _compute_unique(attrs, inputs, _):
+        return topi_compute(inputs[0])

Review comment:
       Any reason the shape func is defined but not being used here?  




----------------------------------------------------------------



[GitHub] [tvm] ymwangg commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
ymwangg commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r583124937



##########
File path: python/tvm/topi/cuda/unique.py
##########
@@ -0,0 +1,394 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name
+"""Unique operator"""
+import tvm
+from tvm import te, tir
+from ...te import hybrid
+from .scan import cumsum
+from .sort import sort, argsort
+from ..utils import ceil_div
+
+
+def _calc_adjacent_diff_ir(data, output, binop=tir.Sub):
+    """Low level IR to calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    output: Buffer
+        A buffer to store adjacent difference, of the same shape as data. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+
+    binop: function, optional
+        A binary associative op to use for calculating adjacent difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    output_ptr = ib.buffer_ptr(output)
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            with ib.if_scope(tid == 0):
+                output_ptr[tid] = 0
+            with ib.else_scope():
+                output_ptr[tid] = tir.Cast(output.dtype, binop(data_ptr[tid], data_ptr[tid - 1]))
+    return ib.get()
+
+
+def _calc_adjacent_diff(data, out_dtype="int32", binop=tir.Sub):
+    """Function calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : tvm.te.Tensor
+        Input 1-D tensor.
+
+    out_dtype : str
+        The output tensor data type.
+
+    binop: function, optional
+        A binary associative op to use for calculating difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+
+    Returns
+    -------
+    output : tvm.te.Tensor
+        1-D tensor storing the adjacent difference of the input tensor. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+    """
+    data_buf = tir.decl_buffer(data.shape, data.dtype, "sorted_data_buf", data_alignment=8)
+    output_buf = tir.decl_buffer(data.shape, out_dtype, "output_buf", data_alignment=8)
+    return te.extern(
+        [data.shape],
+        [data],
+        lambda ins, outs: _calc_adjacent_diff_ir(ins[0], outs[0], binop=binop),
+        dtype=[out_dtype],
+        in_buffers=[data_buf],
+        out_buffers=[output_buf],
+        name="_calc_adjacent_diff",
+        tag="_calc_adjacent_diff_gpu",
+    )
+
+
+@hybrid.script
+def _calc_num_unique(inc_scan):
+    """Helper function to get the number of unique elements fron inc_scan tensor"""
+    output = output_tensor((1,), "int32")
+    for i in bind("threadIdx.x", 1):
+        output[i] = inc_scan[inc_scan.shape[0] - 1] + int32(1)
+    return output
+
+
+def _calc_unique_ir(
+    data, argsorted_indices, inc_scan, index_converter, unique_elements, indices, counts
+):
+    """Low level IR to calculate unique elements, inverse indices, and counts (optional) of
+    unique elements of 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    argsorted_indices : Buffer
+        A buffer that stores the argsorted indices of the input data.
+
+    inc_scan : Buffer
+        A buffer that stores the inclusive scan of the binary tir.NE adjacent difference
+        of the sorted data.
+
+    index_converter (optional) : Buffer
+        An optional index converter that transforms the unique element index
+        such that new_idx = index_converter[old_idx].
+
+    unique_elements : Buffer
+        A buffer that stores the unique elements.
+
+    indices : Buffer
+        A buffer that stores the index of each input data element in the unique element array.
+
+    counts (optional) : Buffer
+        A buffer that stores the count of each unique element.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    unique_elements_ptr = ib.buffer_ptr(unique_elements)
+    indices_ptr = ib.buffer_ptr(indices)
+
+    index_converter_ptr = None
+    if isinstance(index_converter, tir.Buffer):
+        index_converter_ptr = ib.buffer_ptr(index_converter)
+
+    if isinstance(counts, tir.Buffer):
+        counts_ptr = ib.buffer_ptr(counts)
+        arange_ptr = ib.allocate(counts_ptr.dtype, counts.shape, name="arange_buf", scope="global")
+
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+
+    # calculate unique elements and inverse indices
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            data_idx = argsorted_indices_ptr[tid]
+            unique_idx = (
+                inc_scan_ptr[tid]
+                if not index_converter_ptr
+                else index_converter_ptr[inc_scan_ptr[tid]]
+            )
+            indices_ptr[data_idx] = unique_idx
+            with ib.if_scope(tid == 0):
+                unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+
+    # if need to return counts
+    if isinstance(counts, tir.Buffer):
+        num_unique = inc_scan_ptr[inc_scan.shape[0] - 1] + 1
+        num_elements = data.shape[0]
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < batch_size):
+                with ib.if_scope(tid == 0):
+                    arange_ptr[num_unique - 1] = num_elements
+                with ib.else_scope():
+                    with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                        arange_ptr[inc_scan_ptr[tid] - 1] = tid
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < num_unique):
+                unique_idx = tid if not index_converter_ptr else index_converter_ptr[tid]
+                with ib.if_scope(tid == 0):
+                    counts_ptr[unique_idx] = arange_ptr[tid]
+                with ib.else_scope():

Review comment:
       I checked the CUDA kernels, and if they are executed in the order they appear in the IR (I think they are), then I don't see any potential issues.




----------------------------------------------------------------



[GitHub] [tvm] masahi edited a comment on pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi edited a comment on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-780284547


   It can be a lot simpler than that. Unique is basically sort + adjacent difference + exclusive scan. If you don't understand that statement, the following example should help. We have exclusive scan for CPU (`cumsum` op with `exclusive=True`), and GPU (see https://github.com/apache/tvm/pull/7303).
   
   If we implement unique this way, the same code runs on both CPU and GPU.
   ```
   import numpy as np
   
   
   def exclusive_scan(arr):
       return np.cumsum(arr) - arr
   
   
   inp = np.random.randint(0, 10, size=(15,))
   argsort_indices = np.argsort(inp)
   sorted_inp = np.array([inp[i] for i in argsort_indices])
   print("sorted input:", sorted_inp)
   
   adj_diff = np.concatenate([[sorted_inp[0]],  np.diff(sorted_inp)])
   print("adjacent difference:", adj_diff)
   
   non_zero = adj_diff != 0
   non_zero[0] = True  # the first element is always selected
   ex_scan = exclusive_scan(non_zero)
   print("exclusive scan:", ex_scan)
   
   unique = np.zeros(inp.shape[0], dtype=np.int)
   
   for i in range(inp.shape[0]):
       if non_zero[i] != 0:
           unique[ex_scan[i]] = inp[argsort_indices[i]]
   
   print("num unique element:", ex_scan[-1] + 1)
   print("unique:", unique)
   ```
   
   Output:
   ```
   sorted input: [0 0 0 4 5 5 6 6 6 6 6 7 8 8 9]
   adjacent difference: [0 0 0 4 1 0 1 0 0 0 0 1 1 0 1]
   exclusive scan: [0 1 1 1 2 3 3 4 4 4 4 4 5 6 6]
   num unique element: 7
   unique: [0 4 5 6 7 8 9 0 0 0 0 0 0 0 0]
   ```


----------------------------------------------------------------



[GitHub] [tvm] ymwangg commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
ymwangg commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582455245



##########
File path: python/tvm/relay/op/transform.py
##########
@@ -1463,3 +1463,48 @@ def cumsum(data, axis=None, dtype=None, exclusive=None):
         -> [1, 1, 2, 2, 3, 4, 4]
     """
     return _make.cumsum(data, axis, dtype, exclusive)
+
+
+def unique(data, is_sorted=True, return_counts=False):
+    """
+    Find the unique elements of a tensor
+    Parameters
+    ----------
+    data : relay.Expr
+        A 1-D tensor of integers
+    is_sorted : bool
+        Whether to sort the unique elements in ascending order before returning as output
+    return_counts : bool
+        Whether to return the array with count of each unique element
+    Returns
+    -------
+    output : relay.Expr
+        A 1-D tensor containing the unique elements of the input data tensor
+    indices : relay.Expr
+        A 1-D tensor containing the index of each data element in the output tensor
+    num_unique : relay.Expr
+        A 0-D tensor containing the number of unique elements in the input data tensor
+    counts (optional) : relay.Expr
+        A 1-D tensor containing the count of each unique element in the output
+    Examples
+    --------

Review comment:
       Added.
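   
   For reference, the semantics described in the docstring can be illustrated with numpy (a hypothetical example, not necessarily the one added to the PR; note the relay op returns fixed-size, padded outputs, unlike numpy):
   
   ```
   import numpy as np
   
   data = np.array([4, 5, 1, 2, 3, 3, 4, 5])
   # np.unique sorts its result, matching the relay op with is_sorted=True.
   uniq, indices, counts = np.unique(data, return_inverse=True, return_counts=True)
   print(uniq)       # [1 2 3 4 5]        -> leading num_unique entries of output
   print(indices)    # [3 4 0 1 2 2 3 4]  -> indices
   print(counts)     # [1 1 2 2 2]        -> counts (when return_counts=True)
   print(len(uniq))  # 5                  -> num_unique
   ```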




----------------------------------------------------------------



[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582593774



##########
File path: python/tvm/topi/cuda/unique.py
##########
@@ -0,0 +1,394 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name
+"""Unique operator"""
+import tvm
+from tvm import te, tir
+from ...te import hybrid
+from .scan import cumsum
+from .sort import sort, argsort
+from ..utils import ceil_div
+
+
+def _calc_adjacent_diff_ir(data, output, binop=tir.Sub):
+    """Low level IR to calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    output: Buffer
+        A buffer to store adjacent difference, of the same shape as data. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+
+    binop: function, optional
+        A binary associative op to use for calculating adjacent difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    output_ptr = ib.buffer_ptr(output)
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            with ib.if_scope(tid == 0):
+                output_ptr[tid] = 0
+            with ib.else_scope():
+                output_ptr[tid] = tir.Cast(output.dtype, binop(data_ptr[tid], data_ptr[tid - 1]))
+    return ib.get()
+
+
+def _calc_adjacent_diff(data, out_dtype="int32", binop=tir.Sub):
+    """Function calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : tvm.te.Tensor
+        Input 1-D tensor.
+
+    out_dtype : str
+        The output tensor data type.
+
+    binop: function, optional
+        A binary associative op to use for calculating difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+
+    Returns
+    -------
+    output : tvm.te.Tensor
+        1-D tensor storing the adjacent difference of the input tensor. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+    """
+    data_buf = tir.decl_buffer(data.shape, data.dtype, "sorted_data_buf", data_alignment=8)
+    output_buf = tir.decl_buffer(data.shape, out_dtype, "output_buf", data_alignment=8)
+    return te.extern(
+        [data.shape],
+        [data],
+        lambda ins, outs: _calc_adjacent_diff_ir(ins[0], outs[0], binop=binop),
+        dtype=[out_dtype],
+        in_buffers=[data_buf],
+        out_buffers=[output_buf],
+        name="_calc_adjacent_diff",
+        tag="_calc_adjacent_diff_gpu",
+    )
+
+
+@hybrid.script
+def _calc_num_unique(inc_scan):
+    """Helper function to get the number of unique elements fron inc_scan tensor"""
+    output = output_tensor((1,), "int32")
+    for i in bind("threadIdx.x", 1):
+        output[i] = inc_scan[inc_scan.shape[0] - 1] + int32(1)
+    return output
+
+
+def _calc_unique_ir(
+    data, argsorted_indices, inc_scan, index_converter, unique_elements, indices, counts
+):
+    """Low level IR to calculate unique elements, inverse indices, and counts (optional) of
+    unique elements of 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    argsorted_indices : Buffer
+        A buffer that stores the argsorted indices of the input data.
+
+    inc_scan : Buffer
+        A buffer that stores the inclusive scan of the binary tir.NE adjacent difference
+        of the sorted data.
+
+    index_converter (optional) : Buffer
+        An optional index converter that transforms the unique element index
+        such that new_idx = index_converter[old_idx].
+
+    unique_elements : Buffer
+        A buffer that stores the unique elements.
+
+    indices : Buffer
+        A buffer that stores the index of each input data element in the unique element array.
+
+    counts (optional) : Buffer
+        A buffer that stores the count of each unique element.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    unique_elements_ptr = ib.buffer_ptr(unique_elements)
+    indices_ptr = ib.buffer_ptr(indices)
+
+    index_converter_ptr = None
+    if isinstance(index_converter, tir.Buffer):
+        index_converter_ptr = ib.buffer_ptr(index_converter)
+
+    if isinstance(counts, tir.Buffer):
+        counts_ptr = ib.buffer_ptr(counts)
+        arange_ptr = ib.allocate(counts_ptr.dtype, counts.shape, name="arange_buf", scope="global")
+
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+
+    # calculate unique elements and inverse indices
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            data_idx = argsorted_indices_ptr[tid]
+            unique_idx = (
+                inc_scan_ptr[tid]
+                if not index_converter_ptr
+                else index_converter_ptr[inc_scan_ptr[tid]]
+            )
+            indices_ptr[data_idx] = unique_idx
+            with ib.if_scope(tid == 0):
+                unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+
+    # if we need to return counts
+    if isinstance(counts, tir.Buffer):
+        num_unique = inc_scan_ptr[inc_scan.shape[0] - 1] + 1
+        num_elements = data.shape[0]
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < batch_size):
+                with ib.if_scope(tid == 0):
+                    arange_ptr[num_unique - 1] = num_elements
+                with ib.else_scope():
+                    with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                        arange_ptr[inc_scan_ptr[tid] - 1] = tid
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < num_unique):
+                unique_idx = tid if not index_converter_ptr else index_converter_ptr[tid]
+                with ib.if_scope(tid == 0):
+                    counts_ptr[unique_idx] = arange_ptr[tid]
+                with ib.else_scope():

Review comment:
       I'm imagining that we can have something like the following (it is probably wrong, but hopefully it conveys my idea: basically we want to find the tid that maps to `arange_ptr[tid - 1]` in your current impl, and subtracting that index from `tid` is the expression we want):
   
   `counts_ptr[inc_scan_ptr[tid] - 1] = tid - indices_ptr[inc_scan_ptr[tid - 1]]`
   
   Since the contents of `arange` seem redundant given the other information you already have here, I have a strong feeling that we don't need to materialize `arange`.
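   
   As a rough NumPy sketch of the counts-from-boundaries relationship under discussion (my own illustration with made-up names, not the PR code):
   
   ```
   import numpy as np
   
   sorted_inp = np.array([0, 0, 0, 4, 5, 5, 6])
   adj = np.concatenate([[0], (np.diff(sorted_inp) != 0).astype(int)])
   inc_scan = np.cumsum(adj)             # [0 0 0 1 2 2 3]
   boundaries = np.flatnonzero(adj)      # positions where a new unique value starts
   edges = np.concatenate([[0], boundaries, [len(sorted_inp)]])
   counts = np.diff(edges)               # [3 1 2 1]
   print(counts)
   ```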




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] ymwangg commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
ymwangg commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582456386



##########
File path: python/tvm/topi/cuda/unique.py
##########
@@ -0,0 +1,384 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name, no-else-return
+"""Unique operator"""
+from tvm import te, tir
+import tvm
+
+from ...te import hybrid
+from .scan import cumsum
+from .sort import sort, argsort
+from ..utils import ceil_div
+
+
+def _calc_adjacent_diff_ir(data, adjacent_diff):
+    ib = tvm.tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    adjacent_diff_ptr = ib.buffer_ptr(adjacent_diff)
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            with ib.if_scope(tid == 0):
+                adjacent_diff_ptr[tid] = 0
+            with ib.else_scope():
+                with ib.if_scope(data_ptr[tid] != data_ptr[tid - 1]):
+                    adjacent_diff_ptr[tid] = 1
+                with ib.else_scope():
+                    adjacent_diff_ptr[tid] = 0
+    return ib.get()
+
+
+@hybrid.script
+def _calc_num_unique(data):
+    output = output_tensor((1,), "int32")
+    for i in bind("threadIdx.x", 1):
+        output[i] = data[data.shape[0] - 1] + int32(1)
+    return output
+
+
+def _calc_unique_sorted_ir(data, argsorted_indices, inc_scan, unique_elements, indices):
+    ib = tvm.tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    unique_elements_ptr = ib.buffer_ptr(unique_elements)
+    indices_ptr = ib.buffer_ptr(indices)
+
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            indices_ptr[argsorted_indices_ptr[tid]] = inc_scan_ptr[tid]
+            with ib.if_scope(tid == 0):
+                unique_elements_ptr[inc_scan_ptr[tid]] = data_ptr[argsorted_indices_ptr[tid]]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    unique_elements_ptr[inc_scan_ptr[tid]] = data_ptr[argsorted_indices_ptr[tid]]
+    return ib.get()
+
+
+def _calc_counts_sorted_ir(inc_scan, counts):
+    ib = tvm.tir.ir_builder.create()
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    counts_ptr = ib.buffer_ptr(counts)
+
+    batch_size = inc_scan.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            counts_ptr[tid] = 0
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        atomic_add_return = ib.allocate(counts.dtype, (1,), name="atomic_add_return", scope="local")
+        with ib.if_scope(tid < batch_size):
+            index = inc_scan_ptr[tid]
+            atomic_add_return[0] = tvm.tir.call_intrin(
+                counts.dtype,
+                "tir.atomic_add",
+                tvm.tir.call_intrin("handle", "tir.address_of", counts_ptr[index]),
+                1,
+            )
+    return ib.get()
+
+
+def _calc_first_occurence_ir(argsorted_indices, inc_scan, first_occurence):
+    ib = tvm.tir.ir_builder.create()
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    first_occurence_ptr = ib.buffer_ptr(first_occurence)
+    batch_size = argsorted_indices.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            first_occurence_ptr[tid] = batch_size
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            with ib.if_scope(tid == 0):
+                first_occurence_ptr[inc_scan_ptr[tid]] = argsorted_indices_ptr[tid]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    first_occurence_ptr[inc_scan_ptr[tid]] = argsorted_indices_ptr[tid]
+    return ib.get()
+
+
+def _calc_unique_unsorted_ir(
+    data, argsorted_indices, inc_scan, index_converter, unique_elements, indices
+):
+    ib = tvm.tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    index_converter_ptr = ib.buffer_ptr(index_converter)
+    unique_elements_ptr = ib.buffer_ptr(unique_elements)
+    indices_ptr = ib.buffer_ptr(indices)
+
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            indices_ptr[argsorted_indices_ptr[tid]] = index_converter_ptr[inc_scan_ptr[tid]]
+            with ib.if_scope(tid == 0):
+                unique_elements_ptr[index_converter_ptr[inc_scan_ptr[tid]]] = data_ptr[
+                    argsorted_indices_ptr[tid]
+                ]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    unique_elements_ptr[index_converter_ptr[inc_scan_ptr[tid]]] = data_ptr[
+                        argsorted_indices_ptr[tid]
+                    ]
+    return ib.get()
+
+
+def _calc_counts_unsorted_ir(inc_scan, index_converter, counts):

Review comment:
       Good suggestion. I was able to combine everything into a single IR function.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] masahi commented on pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-784615965


   @ymwangg @codeislife99 I found a neat trick PyTorch uses for `count`. https://github.com/pytorch/pytorch/blob/22a34bcf4e5eaa348f0117c414c3dd760ec64b13/aten/src/ATen/native/cuda/Unique.cu#L60-L68
   
   Basically, after you get the ex scan, instead of copying the original inputs, you copy from an array [0, 1, 2, ...]. This will give you something like [0, 2, 5], and taking the adjacent difference of it directly gives the counts. Does this make sense? It should be much faster than atomics.
   
   PyTorch uses a separate `unique_by_key` call to compute counts, but since we have ex scan outputs lying around, we don't need this separate call. So we can be faster than PyTorch.
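   
   To make the trick concrete, here is a rough NumPy sketch of my reading of it (not PyTorch's actual code; the names are made up):
   
   ```
   import numpy as np
   
   sorted_inp = np.array([0, 0, 4, 5, 5, 6])
   is_new = np.concatenate([[0], (np.diff(sorted_inp) != 0).astype(int)])
   inc_scan = np.cumsum(is_new)              # [0 0 1 2 2 3]
   
   num_unique = inc_scan[-1] + 1
   boundaries = np.zeros(num_unique + 1, dtype=int)
   boundaries[num_unique] = len(sorted_inp)  # sentinel: one past the last element
   for tid in range(1, len(sorted_inp)):
       if inc_scan[tid] != inc_scan[tid - 1]:
           boundaries[inc_scan[tid]] = tid   # copy position from [0, 1, 2, ...]
   
   print(boundaries)           # [0 2 3 5 6]
   print(np.diff(boundaries))  # counts: [2 1 2 1]
   ```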


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r579910453



##########
File path: python/tvm/relay/frontend/pytorch.py
##########
@@ -2164,6 +2164,21 @@ def is_floating_point(self, inputs, input_types):
         is_float = input_type in ["float32", "float64", "float16", "bfloat16"]
         return _expr.const(is_float)
 
+    def unique(self, inputs, input_types):
+        assert len(inputs) == 4
+        [data, is_sorted, return_inverse, return_counts] = inputs
+        if return_counts:

Review comment:
       Yeah, the PyTorch doc https://pytorch.org/docs/stable/generated/torch.unique.html says their CUDA unique always returns sorted outputs regardless of the `sorted` arg. So I think we can also assume that `is_sorted` is always True and add a warning if the `is_sorted` arg is False.
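   
   A minimal sketch of what that could look like in the frontend (hypothetical helper, not the PR code):
   
   ```
   import logging
   
   def convert_unique(inputs):
       # hypothetical converter sketch: data plus the three torch.unique flags
       data, is_sorted, return_inverse, return_counts = inputs
       if not is_sorted:
           # outputs will be sorted anyway, so warn rather than fail
           logging.warning("unique: is_sorted=False requested, but outputs are always sorted")
       # ... emit the relay op here ...
   ```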




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] ymwangg commented on pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
ymwangg commented on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-784643576


   @masahi thanks. I'll try using `arange` and `adjacent_difference` to compute the counts rather than counting with atomic adds.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582731554



##########
File path: python/tvm/topi/cuda/unique.py
##########
@@ -0,0 +1,394 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name
+"""Unique operator"""
+import tvm
+from tvm import te, tir
+from ...te import hybrid
+from .scan import cumsum
+from .sort import sort, argsort
+from ..utils import ceil_div
+
+
+def _calc_adjacent_diff_ir(data, output, binop=tir.Sub):
+    """Low level IR to calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    output: Buffer
+        A buffer to store adjacent difference, of the same shape as data. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+
+    binop: function, optional
+        A binary op to use for calculating the adjacent difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    output_ptr = ib.buffer_ptr(output)
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            with ib.if_scope(tid == 0):
+                output_ptr[tid] = 0
+            with ib.else_scope():
+                output_ptr[tid] = tir.Cast(output.dtype, binop(data_ptr[tid], data_ptr[tid - 1]))
+    return ib.get()
+
+
+def _calc_adjacent_diff(data, out_dtype="int32", binop=tir.Sub):
+    """Function calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : tvm.te.Tensor
+        Input 1-D tensor.
+
+    out_dtype : str
+        The output tensor data type.
+
+    binop: function, optional
+        A binary op to use for calculating the adjacent difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+
+    Returns
+    -------
+    output : tvm.te.Tensor
+        1-D tensor storing the adjacent difference of the input tensor. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+    """
+    data_buf = tir.decl_buffer(data.shape, data.dtype, "sorted_data_buf", data_alignment=8)
+    output_buf = tir.decl_buffer(data.shape, out_dtype, "output_buf", data_alignment=8)
+    return te.extern(
+        [data.shape],
+        [data],
+        lambda ins, outs: _calc_adjacent_diff_ir(ins[0], outs[0], binop=binop),
+        dtype=[out_dtype],
+        in_buffers=[data_buf],
+        out_buffers=[output_buf],
+        name="_calc_adjacent_diff",
+        tag="_calc_adjacent_diff_gpu",
+    )
+
+
+@hybrid.script
+def _calc_num_unique(inc_scan):
+    """Helper function to get the number of unique elements fron inc_scan tensor"""
+    output = output_tensor((1,), "int32")
+    for i in bind("threadIdx.x", 1):
+        output[i] = inc_scan[inc_scan.shape[0] - 1] + int32(1)
+    return output
+
+
+def _calc_unique_ir(
+    data, argsorted_indices, inc_scan, index_converter, unique_elements, indices, counts
+):
+    """Low level IR to calculate unique elements, inverse indices, and counts (optional) of
+    unique elements of 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    argsorted_indices : Buffer
+        A buffer that stores the argsorted indices of the input data.
+
+    inc_scan : Buffer
+        A buffer that stores the inclusive scan of the binary tir.NE adjacent difference
+        of the sorted data.
+
+    index_converter (optional) : Buffer
+        An optional index converter that transforms the unique element index
+        such that new_idx = index_converter[old_idx].
+
+    unique_elements : Buffer
+        A buffer that stores the unique elements.
+
+    indices : Buffer
+        A buffer that stores the index of each input data element in the unique element array.
+
+    counts (optional) : Buffer
+        A buffer that stores the count of each unique element.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    unique_elements_ptr = ib.buffer_ptr(unique_elements)
+    indices_ptr = ib.buffer_ptr(indices)
+
+    index_converter_ptr = None
+    if isinstance(index_converter, tir.Buffer):
+        index_converter_ptr = ib.buffer_ptr(index_converter)
+
+    if isinstance(counts, tir.Buffer):
+        counts_ptr = ib.buffer_ptr(counts)
+        arange_ptr = ib.allocate(counts_ptr.dtype, counts.shape, name="arange_buf", scope="global")
+
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+
+    # calculate unique elements and inverse indices
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            data_idx = argsorted_indices_ptr[tid]
+            unique_idx = (
+                inc_scan_ptr[tid]
+                if not index_converter_ptr
+                else index_converter_ptr[inc_scan_ptr[tid]]
+            )
+            indices_ptr[data_idx] = unique_idx
+            with ib.if_scope(tid == 0):
+                unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+
+    # if we need to return counts
+    if isinstance(counts, tir.Buffer):
+        num_unique = inc_scan_ptr[inc_scan.shape[0] - 1] + 1
+        num_elements = data.shape[0]
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < batch_size):
+                with ib.if_scope(tid == 0):
+                    arange_ptr[num_unique - 1] = num_elements
+                with ib.else_scope():
+                    with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                        arange_ptr[inc_scan_ptr[tid] - 1] = tid
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < num_unique):
+                unique_idx = tid if not index_converter_ptr else index_converter_ptr[tid]
+                with ib.if_scope(tid == 0):
+                    counts_ptr[unique_idx] = arange_ptr[tid]
+                with ib.else_scope():

Review comment:
       @ymwangg I made a lot of noise above, but after some thought I realized that we probably cannot remove `arange`: we cannot tell which previous index in `arange` corresponds to the previous unique element without materializing the `arange` array and doing a global sync. In-place adjacent diff obviously doesn't work without some synchronization.
   
   So maybe we should rename `arange` to something like `unique_seq_indices` and keep everything else as is; would that be good?
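   
   For the record, a tiny NumPy illustration (my own sketch) of the ordering dependency that makes an unsynchronized in-place diff unsafe:
   
   ```
   import numpy as np
   
   # end positions of each unique value's run, as stored in arange
   a = np.array([2, 3, 5, 6])
   # in-place adjacent diff: each slot needs its left neighbor's *old* value,
   # so updates must be ordered (right-to-left here); with many threads racing,
   # a neighbor may already be overwritten, hence the need for a global sync.
   for i in range(len(a) - 1, 0, -1):
       a[i] -= a[i - 1]
   print(a)  # [2 1 2 1]
   ```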




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r581457161



##########
File path: python/tvm/relay/frontend/pytorch.py
##########
@@ -2164,6 +2164,24 @@ def is_floating_point(self, inputs, input_types):
         is_float = input_type in ["float32", "float64", "float16", "bfloat16"]
         return _expr.const(is_float)
 
+    def unique(self, inputs, input_types):
+        assert len(inputs) == 4
+        [data, is_sorted, return_inverse, return_counts] = inputs

Review comment:
       Raise an error if `return_inverse` is True, since we don't support it




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] ymwangg commented on pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
ymwangg commented on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-782545404


   @masahi Yeah, I only added the CPU version in this PR. I'm not very familiar with GPU IR yet, but I can do it later. If the overall structure looks good, I can add `unique_with_counts` in the future, since their implementations are very similar.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] ymwangg commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
ymwangg commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r579581782



##########
File path: python/tvm/topi/unique.py
##########
@@ -0,0 +1,118 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name
+"""Unique operator"""
+from ..te import hybrid
+from .cumsum import cumsum
+from .sort import sort, argsort
+
+
+@hybrid.script
+def _calc_adjacent_diff(data):
+    output = output_tensor(data.shape, "int32")
+    output[0] = int32(0)
+    for i in range(1, data.shape[0]):
+        output[i] = int32(1) if data[i] != data[i - 1] else int32(0)
+    return output
+
+
+@hybrid.script
+def _calc_num_unique(data):
+    output = output_tensor((1,), "int32")
+    output[0] = data[data.shape[0] - 1] + 1
+    return output
+
+
+@hybrid.script
+def _calc_unique_sorted(data, argsorted_indices, inc_scan):
+    unique_elements = output_tensor(data.shape, data.dtype)
+    indices = output_tensor(data.shape, "int32")
+    for i in range(data.shape[0]):
+        indices[argsorted_indices[i]] = inc_scan[i]

Review comment:
       Actually, all loops can be done in parallel; I changed them all to parallel.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582601606



##########
File path: python/tvm/topi/cuda/unique.py
##########
@@ -0,0 +1,394 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name
+"""Unique operator"""
+import tvm
+from tvm import te, tir
+from ...te import hybrid
+from .scan import cumsum
+from .sort import sort, argsort
+from ..utils import ceil_div
+
+
+def _calc_adjacent_diff_ir(data, output, binop=tir.Sub):
+    """Low level IR to calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    output: Buffer
+        A buffer to store adjacent difference, of the same shape as data. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+
+    binop: function, optional
+        A binary op to use for calculating the adjacent difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    output_ptr = ib.buffer_ptr(output)
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            with ib.if_scope(tid == 0):
+                output_ptr[tid] = 0
+            with ib.else_scope():
+                output_ptr[tid] = tir.Cast(output.dtype, binop(data_ptr[tid], data_ptr[tid - 1]))
+    return ib.get()
+
+
+def _calc_adjacent_diff(data, out_dtype="int32", binop=tir.Sub):
+    """Function calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : tvm.te.Tensor
+        Input 1-D tensor.
+
+    out_dtype : str
+        The output tensor data type.
+
+    binop: function, optional
+        A binary op to use for calculating the adjacent difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+
+    Returns
+    -------
+    output : tvm.te.Tensor
+        1-D tensor storing the adjacent difference of the input tensor. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+    """
+    data_buf = tir.decl_buffer(data.shape, data.dtype, "sorted_data_buf", data_alignment=8)
+    output_buf = tir.decl_buffer(data.shape, out_dtype, "output_buf", data_alignment=8)
+    return te.extern(
+        [data.shape],
+        [data],
+        lambda ins, outs: _calc_adjacent_diff_ir(ins[0], outs[0], binop=binop),
+        dtype=[out_dtype],
+        in_buffers=[data_buf],
+        out_buffers=[output_buf],
+        name="_calc_adjacent_diff",
+        tag="_calc_adjacent_diff_gpu",
+    )
+
+
+@hybrid.script
+def _calc_num_unique(inc_scan):
+    """Helper function to get the number of unique elements fron inc_scan tensor"""
+    output = output_tensor((1,), "int32")
+    for i in bind("threadIdx.x", 1):
+        output[i] = inc_scan[inc_scan.shape[0] - 1] + int32(1)
+    return output
+
+
+def _calc_unique_ir(
+    data, argsorted_indices, inc_scan, index_converter, unique_elements, indices, counts
+):
+    """Low level IR to calculate unique elements, inverse indices, and counts (optional) of
+    unique elements of 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    argsorted_indices : Buffer
+        A buffer that stores the argsorted indices of the input data.
+
+    inc_scan : Buffer
+        A buffer that stores the inclusive scan of the binary tir.NE adjacent difference
+        of the sorted data.
+
+    index_converter (optional) : Buffer
+        An optional index converter that transforms the unique element index
+        such that new_idx = index_converter[old_idx].
+
+    unique_elements : Buffer
+        A buffer that stores the unique elements.
+
+    indices : Buffer
+        A buffer that stores the index of each input data element in the unique element array.
+
+    counts (optional) : Buffer
+        A buffer that stores the count of each unique element.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    unique_elements_ptr = ib.buffer_ptr(unique_elements)
+    indices_ptr = ib.buffer_ptr(indices)
+
+    index_converter_ptr = None
+    if isinstance(index_converter, tir.Buffer):
+        index_converter_ptr = ib.buffer_ptr(index_converter)
+
+    if isinstance(counts, tir.Buffer):
+        counts_ptr = ib.buffer_ptr(counts)
+        arange_ptr = ib.allocate(counts_ptr.dtype, counts.shape, name="arange_buf", scope="global")
+
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+
+    # calculate unique elements and inverse indices
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            data_idx = argsorted_indices_ptr[tid]
+            unique_idx = (
+                inc_scan_ptr[tid]
+                if not index_converter_ptr
+                else index_converter_ptr[inc_scan_ptr[tid]]
+            )
+            indices_ptr[data_idx] = unique_idx
+            with ib.if_scope(tid == 0):
+                unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+
+    # if we need to return counts
+    if isinstance(counts, tir.Buffer):
+        num_unique = inc_scan_ptr[inc_scan.shape[0] - 1] + 1
+        num_elements = data.shape[0]
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < batch_size):
+                with ib.if_scope(tid == 0):
+                    arange_ptr[num_unique - 1] = num_elements
+                with ib.else_scope():
+                    with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                        arange_ptr[inc_scan_ptr[tid] - 1] = tid
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < num_unique):
+                unique_idx = tid if not index_converter_ptr else index_converter_ptr[tid]
+                with ib.if_scope(tid == 0):
+                    counts_ptr[unique_idx] = arange_ptr[tid]
+                with ib.else_scope():

Review comment:
       We can also consider filling in the `counts` array just like you fill in `arange`, and doing the in-place `_calc_adjacent_diff` on it directly.
   
   Anyway, as long as we remove the `arange` array, I'm happy.
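   
   Roughly what I mean, as a NumPy sketch (my own illustration; it assumes a synchronization point between the two steps, which the separate kernel scopes provide):
   
   ```
   import numpy as np
   
   # step 1: write each unique value's end position directly into counts
   counts = np.array([2, 3, 5, 6])  # end positions, as arange holds in the PR
   # step 2: adjacent difference turns end positions into per-value counts
   counts = np.diff(np.concatenate([[0], counts]))
   print(counts)  # [2 1 2 1]
   ```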




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r579575906



##########
File path: python/tvm/topi/unique.py
##########
@@ -0,0 +1,118 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name
+"""Unique operator"""
+from ..te import hybrid
+from .cumsum import cumsum
+from .sort import sort, argsort
+
+
+@hybrid.script
+def _calc_adjacent_diff(data):
+    output = output_tensor(data.shape, "int32")
+    output[0] = int32(0)
+    for i in range(1, data.shape[0]):
+        output[i] = int32(1) if data[i] != data[i - 1] else int32(0)
+    return output
+
+
+@hybrid.script
+def _calc_num_unique(data):
+    output = output_tensor((1,), "int32")
+    output[0] = data[data.shape[0] - 1] + 1
+    return output
+
+
+@hybrid.script
+def _calc_unique_sorted(data, argsorted_indices, inc_scan):
+    unique_elements = output_tensor(data.shape, data.dtype)
+    indices = output_tensor(data.shape, "int32")
+    for i in range(data.shape[0]):
+        indices[argsorted_indices[i]] = inc_scan[i]

Review comment:
       We can do this loop in parallel




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] codeislife99 commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
codeislife99 commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r574221285



##########
File path: src/relay/op/algorithm/unique.cc
##########
@@ -0,0 +1,147 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*!
+ * \file unique.cc
+ * \brief The unique operator
+ */
+#include <dlpack/dlpack.h>
+#include <tvm/relay/attrs/algorithm.h>
+#include <tvm/relay/op.h>
+#include <tvm/relay/op_attr_types.h>
+#include <tvm/runtime/data_type.h>
+
+namespace tvm {
+namespace relay {
+
+bool UniqueRel(const Array<Type>& types, int num_inputs, const Attrs& attrs,
+               const TypeReporter& reporter) {
+  // types: [data, result]
+  ICHECK_EQ(types.size(), 2) << "Unique: expect 2 types but " << types.size() << " provided";
+  ICHECK_EQ(num_inputs, 1) << "Unique: expect 1 inputs but " << num_inputs << " provided";
+  auto data = types[0].as<TensorTypeNode>();
+  if (data == nullptr) {
+    ICHECK(types[0].as<IncompleteTypeNode>())
+        << "Unique: expect input type to be TensorType but get " << types[0];
+    return false;
+  }
+  std::vector<Type> fields;
+  fields.push_back(TensorType(data->shape, data->dtype));

Review comment:
       Instead of having a full tensor and then following it with a strided slice, how about we make it fully dynamic instead?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582488750



##########
File path: python/tvm/relay/frontend/pytorch.py
##########
@@ -2164,6 +2164,24 @@ def is_floating_point(self, inputs, input_types):
         is_float = input_type in ["float32", "float64", "float16", "bfloat16"]
         return _expr.const(is_float)
 
+    def unique(self, inputs, input_types):
+        assert len(inputs) == 4
+        [data, is_sorted, return_inverse, return_counts] = inputs

Review comment:
       Ah sorry, you are right. So in TorchScript, `unique` always returns two outputs regardless of the `return_inverse` option? The Python one https://pytorch.org/docs/stable/generated/torch.unique.html can return only one output.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582587901



##########
File path: python/tvm/topi/cuda/unique.py
##########
@@ -0,0 +1,394 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name
+"""Unique operator"""
+import tvm
+from tvm import te, tir
+from ...te import hybrid
+from .scan import cumsum
+from .sort import sort, argsort
+from ..utils import ceil_div
+
+
+def _calc_adjacent_diff_ir(data, output, binop=tir.Sub):
+    """Low level IR to calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    output: Buffer
+        A buffer to store adjacent difference, of the same shape as data. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+
+    binop: function, optional
+        A binary op to use for calculating the adjacent difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    output_ptr = ib.buffer_ptr(output)
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            with ib.if_scope(tid == 0):
+                output_ptr[tid] = 0
+            with ib.else_scope():
+                output_ptr[tid] = tir.Cast(output.dtype, binop(data_ptr[tid], data_ptr[tid - 1]))
+    return ib.get()
+
+
+def _calc_adjacent_diff(data, out_dtype="int32", binop=tir.Sub):
+    """Function calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : tvm.te.Tensor
+        Input 1-D tensor.
+
+    out_dtype : str
+        The output tensor data type.
+
+    binop: function, optional
+        A binary op to use for calculating the adjacent difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+
+    Returns
+    -------
+    output : tvm.te.Tensor
+        1-D tensor storing the adjacent difference of the input tensor. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+    """
+    data_buf = tir.decl_buffer(data.shape, data.dtype, "sorted_data_buf", data_alignment=8)
+    output_buf = tir.decl_buffer(data.shape, out_dtype, "output_buf", data_alignment=8)
+    return te.extern(
+        [data.shape],
+        [data],
+        lambda ins, outs: _calc_adjacent_diff_ir(ins[0], outs[0], binop=binop),
+        dtype=[out_dtype],
+        in_buffers=[data_buf],
+        out_buffers=[output_buf],
+        name="_calc_adjacent_diff",
+        tag="_calc_adjacent_diff_gpu",
+    )
+
+
+@hybrid.script
+def _calc_num_unique(inc_scan):
+    """Helper function to get the number of unique elements fron inc_scan tensor"""
+    output = output_tensor((1,), "int32")
+    for i in bind("threadIdx.x", 1):
+        output[i] = inc_scan[inc_scan.shape[0] - 1] + int32(1)
+    return output
+
+
+def _calc_unique_ir(
+    data, argsorted_indices, inc_scan, index_converter, unique_elements, indices, counts
+):
+    """Low level IR to calculate unique elements, inverse indices, and counts (optional) of
+    unique elements of 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    argsorted_indices : Buffer
+        A buffer that stores the argsorted indices of the input data.
+
+    inc_scan : Buffer
+        A buffer that stores the inclusive scan of the binary tir.NE adjacent difference
+        of the sorted data.
+
+    index_converter (optional) : Buffer
+        An optional index converter that transforms the unique element index
+        such that new_idx = index_converter[old_idx].
+
+    unique_elements : Buffer
+        A buffer that stores the unique elements.
+
+    indices : Buffer
+        A buffer that stores the index of each input data element in the unique element array.
+
+    counts (optional) : Buffer
+        A buffer that stores the count of each unique element.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    unique_elements_ptr = ib.buffer_ptr(unique_elements)
+    indices_ptr = ib.buffer_ptr(indices)
+
+    index_converter_ptr = None
+    if isinstance(index_converter, tir.Buffer):
+        index_converter_ptr = ib.buffer_ptr(index_converter)
+
+    if isinstance(counts, tir.Buffer):
+        counts_ptr = ib.buffer_ptr(counts)
+        arange_ptr = ib.allocate(counts_ptr.dtype, counts.shape, name="arange_buf", scope="global")
+
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+
+    # calculate unique elements and inverse indices
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            data_idx = argsorted_indices_ptr[tid]
+            unique_idx = (
+                inc_scan_ptr[tid]
+                if not index_converter_ptr
+                else index_converter_ptr[inc_scan_ptr[tid]]
+            )
+            indices_ptr[data_idx] = unique_idx
+            with ib.if_scope(tid == 0):
+                unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+
+    # if we need to return counts
+    if isinstance(counts, tir.Buffer):
+        num_unique = inc_scan_ptr[inc_scan.shape[0] - 1] + 1
+        num_elements = data.shape[0]
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < batch_size):
+                with ib.if_scope(tid == 0):
+                    arange_ptr[num_unique - 1] = num_elements
+                with ib.else_scope():
+                    with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                        arange_ptr[inc_scan_ptr[tid] - 1] = tid
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < num_unique):
+                unique_idx = tid if not index_converter_ptr else index_converter_ptr[tid]
+                with ib.if_scope(tid == 0):
+                    counts_ptr[unique_idx] = arange_ptr[tid]
+                with ib.else_scope():

Review comment:
       I didn't get why you need to fill in `arange_ptr` with accumulated counts or need a temp buffer. If you move the condition `ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1])` at L200 here, can't you inline the difference computation without materializing `arange_ptr`? I guess you would be writing at index `inc_scan_ptr[tid] - 1` instead of `unique_idx`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] ymwangg commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
ymwangg commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582457034



##########
File path: python/tvm/topi/unique.py
##########
@@ -0,0 +1,181 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name, no-else-return
+"""Unique operator"""
+from ..te import hybrid
+from .cumsum import cumsum
+from .sort import sort, argsort
+
+
+@hybrid.script
+def _calc_adjacent_diff(data):
+    output = output_tensor(data.shape, "int32")
+    output[0] = int32(0)
+    for i in parallel(1, data.shape[0]):
+        output[i] = int32(1) if data[i] != data[i - 1] else int32(0)
+    return output
+
+
+@hybrid.script
+def _calc_num_unique(data):
+    output = output_tensor((1,), "int32")
+    output[0] = data[data.shape[0] - 1] + int32(1)
+    return output
+
+
+@hybrid.script
+def _calc_unique_sorted(data, argsorted_indices, inc_scan):
+    unique_elements = output_tensor(data.shape, data.dtype)
+    indices = output_tensor(data.shape, "int32")
+    for i in parallel(data.shape[0]):
+        indices[argsorted_indices[i]] = inc_scan[i]
+        if i == 0 or inc_scan[i] != inc_scan[i - 1]:
+            unique_elements[inc_scan[i]] = data[argsorted_indices[i]]
+    return unique_elements, indices
+
+
+@hybrid.script
+def _calc_unique_sorted_with_counts(data, argsorted_indices, inc_scan):
+    unique_elements = output_tensor(data.shape, data.dtype)
+    indices = output_tensor(data.shape, "int32")
+    counts = output_tensor(data.shape, "int32")
+    for i in parallel(data.shape[0]):
+        counts[i] = int32(0)
+    for i in parallel(data.shape[0]):
+        indices[argsorted_indices[i]] = inc_scan[i]
+        if i == 0 or inc_scan[i] != inc_scan[i - 1]:
+            unique_elements[inc_scan[i]] = data[argsorted_indices[i]]
+    for i in range(data.shape[0]):
+        counts[inc_scan[i]] += int32(1)
+    return unique_elements, indices, counts
+
+
+@hybrid.script
+def _calc_first_occurence(argsorted_indices, inc_scan):
+    first_occurence = output_tensor(argsorted_indices.shape, "int32")
+    for i in parallel(argsorted_indices.shape[0]):
+        first_occurence[i] = argsorted_indices.shape[0]
+    for i in parallel(argsorted_indices.shape[0]):
+        if i == 0 or inc_scan[i] != inc_scan[i - 1]:
+            first_occurence[inc_scan[i]] = argsorted_indices[i]
+    return first_occurence
+
+
+@hybrid.script
+def _calc_unique_unsorted(data, argsorted_indices, inc_scan, index_converter):
+    unique_elements = output_tensor(data.shape, data.dtype)
+    indices = output_tensor(data.shape, "int32")
+    for i in parallel(data.shape[0]):
+        new_unique_idx = index_converter[inc_scan[i]]
+        new_data_idx = argsorted_indices[i]
+        indices[new_data_idx] = new_unique_idx
+        if i == 0 or inc_scan[i] != inc_scan[i - 1]:
+            unique_elements[new_unique_idx] = data[new_data_idx]
+    return unique_elements, indices
+
+
+@hybrid.script
+def _calc_unique_unsorted_with_counts(data, argsorted_indices, inc_scan, index_converter):
+    unique_elements = output_tensor(data.shape, data.dtype)
+    indices = output_tensor(data.shape, "int32")
+    counts = output_tensor(data.shape, "int32")
+    for i in parallel(data.shape[0]):
+        counts[i] = int32(0)
+    for i in parallel(data.shape[0]):
+        new_unique_idx = index_converter[inc_scan[i]]
+        new_data_idx = argsorted_indices[i]
+        indices[new_data_idx] = new_unique_idx
+        if i == 0 or inc_scan[i] != inc_scan[i - 1]:
+            unique_elements[new_unique_idx] = data[new_data_idx]
+    for i in range(data.shape[0]):
+        idx = index_converter[inc_scan[i]]
+        counts[idx] += int32(1)
+    return unique_elements, indices, counts
+

Review comment:
       I have changed the hybrid script to ir builder, similar to the GPU implementation.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] ymwangg commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
ymwangg commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r579580281



##########
File path: tests/python/relay/test_op_level3.py
##########
@@ -1453,5 +1453,53 @@ def verify_scatter_nd_with_stack(data_np, indices_np, shape, ref_res, rtol=1e-5,
     verify_scatter_nd_with_stack(data, indices, shape, out)
 
 
+@tvm.testing.uses_gpu
+def test_unique():
+    def calc_numpy_unique(data, is_sorted=False):
+        uniq, index, inverse, counts = np.unique(
+            data, return_index=True, return_inverse=True, return_counts=True
+        )
+        num_uniq = np.array([len(uniq)]).astype("int32")
+        if not is_sorted:
+            order = np.argsort(index)
+            reverse_order = np.argsort(order)
+            uniq = uniq[order].astype(data.dtype)
+            inverse = np.array([reverse_order[i] for i in inverse]).astype("int32")
+            counts = counts[order].astype("int32")
+        return [uniq.astype(data.dtype), inverse.astype("int32"), counts, num_uniq]
+
+    def verify_unique(n, dtype, is_dyn=False, is_sorted=False):
+        if is_dyn:
+            x = relay.var("x", relay.TensorType([relay.Any()], dtype))
+        else:
+            x = relay.var("x", relay.TensorType([n], dtype))
+        outs = relay.unique(x, is_sorted)
+        outs = outs.astuple()
+        func = relay.Function([x], outs)
+        x_data = np.random.randint(50, size=n).astype(dtype)
+
+        if is_dyn:
+            backends = ["vm", "debug"]
+        else:
+            backends = ["graph", "debug"]
+        for target, ctx in tvm.testing.enabled_targets():

Review comment:
       thanks, will fix it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] ymwangg commented on pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
ymwangg commented on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-786357373


   @masahi thanks for making this such an interesting project!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] masahi edited a comment on pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi edited a comment on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-777861757


   @ymwangg For a general op like `unique`, we should follow the numpy API rather than being too specific to TF. PyTorch unique should be supported by the same API. Framework-specific details should go into the frontend.
   
   Numpy and PyTorch support a `dim` argument to do unique on multidimensional input, but I don't think it's a good idea. So restricting to 1D, at least for the first implementation, sounds good to me.
   
   We can implement `unique` via sorting and cumsum (without a hash table). If implemented this way, the same code works on both CPU and GPU. That's what I'm planning to do, but if you feel brave, you can try that in this PR :slightly_smiling_face:  But it is likely not going to be faster than the hash-table-based implementation, since it requires multiple passes over the input. This could be useful if the hash-based impl cannot be used for some reason.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] ymwangg commented on pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
ymwangg commented on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-782585022


   @masahi I added the `return_counts` option to the `topi.unique` operator. I also added the pytorch frontend. Interestingly, it looks like pytorch returns unique elements in random order when `sorted=False`.
   
   I'll work on the GPU version of `unique` next week.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] ymwangg commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
ymwangg commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582549403



##########
File path: python/tvm/relay/frontend/pytorch.py
##########
@@ -2164,6 +2164,24 @@ def is_floating_point(self, inputs, input_types):
         is_float = input_type in ["float32", "float64", "float16", "bfloat16"]
         return _expr.const(is_float)
 
+    def unique(self, inputs, input_types):
+        assert len(inputs) == 4
+        [data, is_sorted, return_inverse, return_counts] = inputs

Review comment:
       Yes, I found that `aten::_unique2`, the function that `torch.unique` dispatches to when `dim=None`, always returns indices: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/Unique.cpp#L69




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] masahi commented on pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-780305205


   Yes, it's possible but a bit complicated. PyTorch also has a `return_counts` option: https://pytorch.org/docs/stable/generated/torch.unique.html
   
   I think for the first PR, not all options need to be implemented. We can follow up later.
   
   I'm using the PyTorch GPU impl as a reference; see for example below how they support counts:
   https://github.com/pytorch/pytorch/blob/22a34bcf4e5eaa348f0117c414c3dd760ec64b13/aten/src/ATen/native/cuda/Unique.cu#L60-L68
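
   As a side note, a minimal numpy sketch (illustrative only, not the PyTorch code) of how counts drop out of the same sort + scan recipe: the inclusive scan of the adjacent-difference flags maps each sorted element to its unique index, so the counts are just the occupancy of each index. `np.bincount` stands in for the scatter/boundary pass a GPU kernel would actually use.

   ```
   import numpy as np

   sorted_inp = np.array([0, 0, 0, 4, 5, 5, 6, 6, 6, 6, 6, 7, 8, 8, 9])
   # flag positions where the sorted value changes (first element gets 0)
   adj_flags = np.concatenate([[0], (np.diff(sorted_inp) != 0).astype(np.int64)])
   inc_scan = np.cumsum(adj_flags)  # unique index of each sorted element
   counts = np.bincount(inc_scan)   # occupancy of each unique index
   print(counts)  # [3 1 2 5 1 2 1]
   ```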


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582488750



##########
File path: python/tvm/relay/frontend/pytorch.py
##########
@@ -2164,6 +2164,24 @@ def is_floating_point(self, inputs, input_types):
         is_float = input_type in ["float32", "float64", "float16", "bfloat16"]
         return _expr.const(is_float)
 
+    def unique(self, inputs, input_types):
+        assert len(inputs) == 4
+        [data, is_sorted, return_inverse, return_counts] = inputs

Review comment:
       ah sorry, you are right. So in TorchScript `unique` always returns indices, regardless of the `return_inverse` option? The Python one https://pytorch.org/docs/stable/generated/torch.unique.html can return only one output.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] masahi commented on pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-782546478


   I can do the GPU version. It will likely require ir builder. But let me know if you *want* to do the GPU version as well; you can certainly do it. The idea is identical to the CPU version, just using different parallelization.
   
   If `unique_with_counts` can be supported by adding another option to `unique`, that sounds good. We shouldn't add `relay.unique_with_counts` or `topi.unique_with_counts`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] ymwangg commented on pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
ymwangg commented on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-777695666


   > Thanks, I was planning to work on unique next week, happy to collaborate.
   > 
   > I can add TIR unique impl for both cpu and gpu later. We can add relay boilerplate, a temp impl in cpp, and tests in this PR.
   
   That would be great!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] ymwangg commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
ymwangg commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r574883580



##########
File path: python/tvm/relay/op/_transform.py
##########
@@ -885,3 +886,29 @@ def where_shape_func(attrs, inputs, _):
     out_shape = _broadcast_shape_tensors(bcast_shape, cond_shape)
 
     return [out_shape]
+
+
+register_strategy("unique", strategy.unique_strategy)
+register_pattern("unique", OpPattern.OPAQUE)
+
+
+@script
+def _unique_shape_1(data_shape):
+    shape_tensor = output_tensor((1,), "int64")
+    shape_tensor[0] = int64(data_shape[0])
+    return shape_tensor
+
+
+@script
+def _unique_shape_2(inputs):
+    shape_tensor = output_tensor((1,), "int64")
+    shape_tensor[0] = int64(1)
+    return shape_tensor
+
+
+@_reg.register_shape_func("unique", False)
+def unique_shape_func(attrs, inputs, _):
+    """
+    Shape func for unique operator.
+    """
+    return [_unique_shape_1(inputs[0]), _unique_shape_1(inputs[0]), _unique_shape_1(inputs[0]), _unique_shape_2(inputs[0])]

Review comment:
       Sure, I'll make it simpler.
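
       For instance, one plausible simplification (a hypothetical sketch, not the final code) is a single hybrid script that emits all four shape tensors, so the shape func body reduces to one call:

       ```
       @script
       def _unique_shape(data_shape):
           unique_shape = output_tensor((1,), "int64")
           indices_shape = output_tensor((1,), "int64")
           inverse_indices_shape = output_tensor((1,), "int64")
           num_unique_shape = output_tensor((1,), "int64")
           unique_shape[0] = int64(data_shape[0])
           indices_shape[0] = int64(data_shape[0])
           inverse_indices_shape[0] = int64(data_shape[0])
           num_unique_shape[0] = int64(1)
           return (unique_shape, indices_shape, inverse_indices_shape, num_unique_shape)


       @_reg.register_shape_func("unique", False)
       def unique_shape_func(attrs, inputs, _):
           """Shape func for unique operator."""
           return _unique_shape(inputs[0])
       ```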




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] masahi edited a comment on pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi edited a comment on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-777861757


   @ymwangg For a general op like `unique`, we should follow the numpy API rather than being too specific to TF. PyTorch unique should be supported by the same API. Framework-specific details should go into the frontend.
   
   Numpy and PyTorch support a `dim` argument to do unique on multidimensional input, but I don't think it's a good idea. So restricting to 1D, at least for the first implementation, sounds good to me.
   
   We can implement `unique` via sorting and cumsum (without a hash table). If implemented this way, the same code works on both CPU and GPU. That's what I'm planning to do, but if you feel brave, you can try that in this PR :slightly_smiling_face:  But I'm not sure whether it is going to be faster than the hash-table-based implementation, since it requires multiple passes over the input.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] masahi commented on pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-780284547


   It can be a lot simpler than that. Unique is basically sort + adjacent difference + exclusive scan. If you don't understand that statement, the following example should help. We have exclusive scan for CPU (`cumsum` op with `exclusive=True`), and GPU (see https://github.com/apache/tvm/pull/7303).
   
   ```
   import numpy as np
   
   
   def exclusive_scan(arr):
       return np.cumsum(arr) - arr
   
   
   inp = np.random.randint(0, 10, size=(15,))
   argsort_indices = np.argsort(inp)
   sorted_inp = np.array([inp[i] for i in argsort_indices])
   print("sorted input:", sorted_inp)
   adj_diff = np.concatenate([[sorted_inp[0]],  np.diff(sorted_inp)])
   print("adjacent difference:", adj_diff)
   non_zero = adj_diff != 0
   non_zero[0] = True  # the first element is always selected
   ex_scan = exclusive_scan(non_zero)
   print("exclusive scan:", ex_scan)
   
    unique = np.zeros(inp.shape[0], dtype=np.int64)
   for i in range(inp.shape[0]):
       if non_zero[i] != 0:
           unique[ex_scan[i]] = inp[argsort_indices[i]]
   
   print("num unique element:", ex_scan[-1] + 1)
   print("unique:", unique)
   ```
   
   Output:
   ```
   sorted input: [0 0 0 4 5 5 6 6 6 6 6 7 8 8 9]
   adjacent difference: [0 0 0 4 1 0 1 0 0 0 0 1 1 0 1]
   exclusive scan: [0 1 1 1 2 3 3 4 4 4 4 4 5 6 6]
   num unique element: 7
   unique: [0 4 5 6 7 8 9 0 0 0 0 0 0 0 0]
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r579910453



##########
File path: python/tvm/relay/frontend/pytorch.py
##########
@@ -2164,6 +2164,21 @@ def is_floating_point(self, inputs, input_types):
         is_float = input_type in ["float32", "float64", "float16", "bfloat16"]
         return _expr.const(is_float)
 
+    def unique(self, inputs, input_types):
+        assert len(inputs) == 4
+        [data, is_sorted, return_inverse, return_counts] = inputs
+        if return_counts:

Review comment:
       yeah, the PyTorch doc https://pytorch.org/docs/stable/generated/torch.unique.html says their CUDA unique always returns sorted output regardless of the `sorted` arg. So I think we can also assume that `is_sorted` is always True and add a warning if the `sorted` arg is False.
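
       A minimal sketch of what that frontend handling might look like (hypothetical helper, not the PR code):

       ```
       import logging

       def effective_sorted_flag(is_sorted):
           # TVM's unique always produces sorted output, so warn and
           # ignore a sorted=False request instead of failing.
           if not is_sorted:
               logging.warning(
                   "TVM always returns sorted output for unique; "
                   "sorted=False is ignored"
               )
           return True
       ```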




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r579575955



##########
File path: python/tvm/topi/unique.py
##########
@@ -0,0 +1,118 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name
+"""Unique operator"""
+from ..te import hybrid
+from .cumsum import cumsum
+from .sort import sort, argsort
+
+
+@hybrid.script
+def _calc_adjacent_diff(data):
+    output = output_tensor(data.shape, "int32")
+    output[0] = int32(0)
+    for i in range(1, data.shape[0]):

Review comment:
       Parallel
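
       i.e., presumably, switch the serial loop to the hybrid-script `parallel` form, as the updated code later in the thread does:

       ```
       for i in parallel(1, data.shape[0]):
           output[i] = int32(1) if data[i] != data[i - 1] else int32(0)
       ```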




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r581460354



##########
File path: python/tvm/topi/cuda/unique.py
##########
@@ -0,0 +1,384 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name, no-else-return
+"""Unique operator"""
+from tvm import te, tir
+import tvm
+
+from ...te import hybrid
+from .scan import cumsum
+from .sort import sort, argsort
+from ..utils import ceil_div
+
+
+def _calc_adjacent_diff_ir(data, adjacent_diff):
+    ib = tvm.tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    adjacent_diff_ptr = ib.buffer_ptr(adjacent_diff)
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            with ib.if_scope(tid == 0):
+                adjacent_diff_ptr[tid] = 0
+            with ib.else_scope():
+                with ib.if_scope(data_ptr[tid] != data_ptr[tid - 1]):
+                    adjacent_diff_ptr[tid] = 1
+                with ib.else_scope():
+                    adjacent_diff_ptr[tid] = 0
+    return ib.get()
+
+
+@hybrid.script
+def _calc_num_unique(data):
+    output = output_tensor((1,), "int32")
+    for i in bind("threadIdx.x", 1):
+        output[i] = data[data.shape[0] - 1] + int32(1)
+    return output
+
+
+def _calc_unique_sorted_ir(data, argsorted_indices, inc_scan, unique_elements, indices):
+    ib = tvm.tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    unique_elements_ptr = ib.buffer_ptr(unique_elements)
+    indices_ptr = ib.buffer_ptr(indices)
+
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            indices_ptr[argsorted_indices_ptr[tid]] = inc_scan_ptr[tid]
+            with ib.if_scope(tid == 0):
+                unique_elements_ptr[inc_scan_ptr[tid]] = data_ptr[argsorted_indices_ptr[tid]]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    unique_elements_ptr[inc_scan_ptr[tid]] = data_ptr[argsorted_indices_ptr[tid]]
+    return ib.get()
+
+
+def _calc_counts_sorted_ir(inc_scan, counts):
+    ib = tvm.tir.ir_builder.create()
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    counts_ptr = ib.buffer_ptr(counts)
+
+    batch_size = inc_scan.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            counts_ptr[tid] = 0
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        atomic_add_return = ib.allocate(counts.dtype, (1,), name="atomic_add_return", scope="local")
+        with ib.if_scope(tid < batch_size):
+            index = inc_scan_ptr[tid]
+            atomic_add_return[0] = tvm.tir.call_intrin(
+                counts.dtype,
+                "tir.atomic_add",

Review comment:
       hmm, I haven't looked into the details at all, but an atomic sounds like a code smell... I'll think about whether there is a way to avoid it




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r583092737



##########
File path: python/tvm/topi/cuda/unique.py
##########
@@ -0,0 +1,394 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name
+"""Unique operator"""
+import tvm
+from tvm import te, tir
+from ...te import hybrid
+from .scan import cumsum
+from .sort import sort, argsort
+from ..utils import ceil_div
+
+
+def _calc_adjacent_diff_ir(data, output, binop=tir.Sub):
+    """Low level IR to calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    output: Buffer
+        A buffer to store adjacent difference, of the same shape as data. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+
+    binop: function, optional
+        A binary associative op to use for calculating adjacent difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    output_ptr = ib.buffer_ptr(output)
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            with ib.if_scope(tid == 0):
+                output_ptr[tid] = 0
+            with ib.else_scope():
+                output_ptr[tid] = tir.Cast(output.dtype, binop(data_ptr[tid], data_ptr[tid - 1]))
+    return ib.get()
+
+
+def _calc_adjacent_diff(data, out_dtype="int32", binop=tir.Sub):
+    """Function calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : tvm.te.Tensor
+        Input 1-D tensor.
+
+    out_dtype : str
+        The output tensor data type.
+
+    binop: function, optional
+        A binary associative op to use for calculating difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+
+    Returns
+    -------
+    output : tvm.te.Tensor
+        1-D tensor storing the adjacent difference of the input tensor. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+    """
+    data_buf = tir.decl_buffer(data.shape, data.dtype, "sorted_data_buf", data_alignment=8)
+    output_buf = tir.decl_buffer(data.shape, out_dtype, "output_buf", data_alignment=8)
+    return te.extern(
+        [data.shape],
+        [data],
+        lambda ins, outs: _calc_adjacent_diff_ir(ins[0], outs[0], binop=binop),
+        dtype=[out_dtype],
+        in_buffers=[data_buf],
+        out_buffers=[output_buf],
+        name="_calc_adjacent_diff",
+        tag="_calc_adjacent_diff_gpu",
+    )
+
+
+@hybrid.script
+def _calc_num_unique(inc_scan):
+    """Helper function to get the number of unique elements fron inc_scan tensor"""
+    output = output_tensor((1,), "int32")
+    for i in bind("threadIdx.x", 1):
+        output[i] = inc_scan[inc_scan.shape[0] - 1] + int32(1)
+    return output
+
+
+def _calc_unique_ir(
+    data, argsorted_indices, inc_scan, index_converter, unique_elements, indices, counts
+):
+    """Low level IR to calculate unique elements, inverse indices, and counts (optional) of
+    unique elements of a 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    argsorted_indices : Buffer
+        A buffer that stores the argsorted indices of the input data.
+
+    inc_scan : Buffer
+        A buffer that stores the inclusive scan of the binary tir.NE adjacent difference
+        of the sorted data.
+
+    index_converter (optional) : Buffer
+        An optional index converter that transforms the unique element index
+        such that new_idx = index_converter[old_idx].
+
+    unique_elements : Buffer
+        A buffer that stores the unique elements.
+
+    indices : Buffer
+        A buffer that stores the index of each input data element in the unique element array.
+
+    counts (optional) : Buffer
+        A buffer that stores the count of each unique element.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    unique_elements_ptr = ib.buffer_ptr(unique_elements)
+    indices_ptr = ib.buffer_ptr(indices)
+
+    index_converter_ptr = None
+    if isinstance(index_converter, tir.Buffer):
+        index_converter_ptr = ib.buffer_ptr(index_converter)
+
+    if isinstance(counts, tir.Buffer):
+        counts_ptr = ib.buffer_ptr(counts)
+        arange_ptr = ib.allocate(counts_ptr.dtype, counts.shape, name="arange_buf", scope="global")
+
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+
+    # calculate unique elements and inverse indices
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            data_idx = argsorted_indices_ptr[tid]
+            unique_idx = (
+                inc_scan_ptr[tid]
+                if not index_converter_ptr
+                else index_converter_ptr[inc_scan_ptr[tid]]
+            )
+            indices_ptr[data_idx] = unique_idx
+            with ib.if_scope(tid == 0):
+                unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+
+    # if we need to return counts
+    if isinstance(counts, tir.Buffer):
+        num_unique = inc_scan_ptr[inc_scan.shape[0] - 1] + 1
+        num_elements = data.shape[0]
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < batch_size):
+                with ib.if_scope(tid == 0):
+                    arange_ptr[num_unique - 1] = num_elements
+                with ib.else_scope():
+                    with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                        arange_ptr[inc_scan_ptr[tid] - 1] = tid
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < num_unique):
+                unique_idx = tid if not index_converter_ptr else index_converter_ptr[tid]
+                with ib.if_scope(tid == 0):
+                    counts_ptr[unique_idx] = arange_ptr[tid]
+                with ib.else_scope():

Review comment:
       I'd hope that TVM could automatically do such a buffer-reuse optimization, but I'm not sure whether it works in this case (one can check by dumping the CUDA source). I'm also ok with keeping `arange` explicit for clarity.
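
       For reference, a sketch of how to check that by dumping the generated CUDA source (any small kernel built with target="cuda" works the same way):

       ```
       import tvm
       from tvm import te

       n = te.var("n")
       A = te.placeholder((n,), name="A")
       B = te.compute((n,), lambda i: A[i] + 1.0, name="B")
       s = te.create_schedule(B.op)
       bx, tx = s[B].split(B.op.axis[0], factor=64)
       s[B].bind(bx, te.thread_axis("blockIdx.x"))
       s[B].bind(tx, te.thread_axis("threadIdx.x"))
       f = tvm.build(s, [A, B], target="cuda")
       print(f.imported_modules[0].get_source())  # generated CUDA C
       ```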




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] masahi edited a comment on pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi edited a comment on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-777861757


   @ymwangg For a general op like `unique`, we should follow the numpy API rather than being too specific to TF. PyTorch unique should be supported by the same API. Framework-specific details should go into the frontend.
   
   Numpy and PyTorch support a `dim` argument to do unique on multidimensional input, but I don't think it's a good idea. So restricting to 1D, at least for the first implementation, sounds good to me.
   
   We can implement `unique` via sorting and cumsum (without a hash table). If implemented this way, the same code works on both CPU and GPU. That's what I'm planning to do, but if you feel brave, you can try that in this PR :slightly_smiling_face:


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] masahi commented on pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-780440200


   For your first implementation, a combination-based approach is ok. But `unique` is important enough that I think it deserves its own operator. Also, an implementation directly in ir builder will likely be faster, and supporting other options will also be easier if we write it in ir builder.
   
   So use ir builder if you are comfortable with it; otherwise a combination of relay ops is fine. Performance + support for options can be done later (by me).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] ymwangg commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
ymwangg commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r579581566



##########
File path: python/tvm/topi/unique.py
##########
@@ -0,0 +1,118 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name
+"""Unique operator"""
+from ..te import hybrid
+from .cumsum import cumsum
+from .sort import sort, argsort
+
+
+@hybrid.script
+def _calc_adjacent_diff(data):
+    output = output_tensor(data.shape, "int32")
+    output[0] = int32(0)
+    for i in range(1, data.shape[0]):

Review comment:
       changed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582731554



##########
File path: python/tvm/topi/cuda/unique.py
##########
@@ -0,0 +1,394 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name
+"""Unique operator"""
+import tvm
+from tvm import te, tir
+from ...te import hybrid
+from .scan import cumsum
+from .sort import sort, argsort
+from ..utils import ceil_div
+
+
+def _calc_adjacent_diff_ir(data, output, binop=tir.Sub):
+    """Low level IR to calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    output: Buffer
+        A buffer to store adjacent difference, of the same shape as data. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+
+    binop: function, optional
+        A binary associative op to use for calculating adjacent difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    output_ptr = ib.buffer_ptr(output)
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            with ib.if_scope(tid == 0):
+                output_ptr[tid] = 0
+            with ib.else_scope():
+                output_ptr[tid] = tir.Cast(output.dtype, binop(data_ptr[tid], data_ptr[tid - 1]))
+    return ib.get()
+
+
+def _calc_adjacent_diff(data, out_dtype="int32", binop=tir.Sub):
+    """Function calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : tvm.te.Tensor
+        Input 1-D tensor.
+
+    out_dtype : str
+        The output tensor data type.
+
+    binop: function, optional
+        A binary associative op to use for calculating difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+
+    Returns
+    -------
+    output : tvm.te.Tensor
+        1-D tensor storing the adjacent difference of the input tensor. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+    """
+    data_buf = tir.decl_buffer(data.shape, data.dtype, "sorted_data_buf", data_alignment=8)
+    output_buf = tir.decl_buffer(data.shape, out_dtype, "output_buf", data_alignment=8)
+    return te.extern(
+        [data.shape],
+        [data],
+        lambda ins, outs: _calc_adjacent_diff_ir(ins[0], outs[0], binop=binop),
+        dtype=[out_dtype],
+        in_buffers=[data_buf],
+        out_buffers=[output_buf],
+        name="_calc_adjacent_diff",
+        tag="_calc_adjacent_diff_gpu",
+    )
+
+
+@hybrid.script
+def _calc_num_unique(inc_scan):
+    """Helper function to get the number of unique elements fron inc_scan tensor"""
+    output = output_tensor((1,), "int32")
+    for i in bind("threadIdx.x", 1):
+        output[i] = inc_scan[inc_scan.shape[0] - 1] + int32(1)
+    return output
+
+
+def _calc_unique_ir(
+    data, argsorted_indices, inc_scan, index_converter, unique_elements, indices, counts
+):
+    """Low level IR to calculate unique elements, inverse indices, and counts (optional) of
+    unique elements of a 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    argsorted_indices : Buffer
+        A buffer that stores the argsorted indices of the input data.
+
+    inc_scan : Buffer
+        A buffer that stores the inclusive scan of the binary tir.NE adjacent difference
+        of the sorted data.
+
+    index_converter (optional) : Buffer
+        An optional index converter that transforms the unique element index
+        such that new_idx = index_converter[old_idx].
+
+    unique_elements : Buffer
+        A buffer that stores the unique elements.
+
+    indices : Buffer
+        A buffer that stores the index of each input data element in the unique element array.
+
+    counts (optional) : Buffer
+        A buffer that stores the count of each unique element.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    unique_elements_ptr = ib.buffer_ptr(unique_elements)
+    indices_ptr = ib.buffer_ptr(indices)
+
+    index_converter_ptr = None
+    if isinstance(index_converter, tir.Buffer):
+        index_converter_ptr = ib.buffer_ptr(index_converter)
+
+    if isinstance(counts, tir.Buffer):
+        counts_ptr = ib.buffer_ptr(counts)
+        arange_ptr = ib.allocate(counts_ptr.dtype, counts.shape, name="arange_buf", scope="global")
+
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+
+    # calculate unique elements and inverse indices
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            data_idx = argsorted_indices_ptr[tid]
+            unique_idx = (
+                inc_scan_ptr[tid]
+                if not index_converter_ptr
+                else index_converter_ptr[inc_scan_ptr[tid]]
+            )
+            indices_ptr[data_idx] = unique_idx
+            with ib.if_scope(tid == 0):
+                unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+
+    # if we need to return counts
+    if isinstance(counts, tir.Buffer):
+        num_unique = inc_scan_ptr[inc_scan.shape[0] - 1] + 1
+        num_elements = data.shape[0]
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < batch_size):
+                with ib.if_scope(tid == 0):
+                    arange_ptr[num_unique - 1] = num_elements
+                with ib.else_scope():
+                    with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                        arange_ptr[inc_scan_ptr[tid] - 1] = tid
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < num_unique):
+                unique_idx = tid if not index_converter_ptr else index_converter_ptr[tid]
+                with ib.if_scope(tid == 0):
+                    counts_ptr[unique_idx] = arange_ptr[tid]
+                with ib.else_scope():

Review comment:
       @ymwangg I made a lot of noise above, but after some thought I realized that we probably cannot remove `arange`: we cannot tell which previous index in `arange` corresponds to the previous unique element without materializing the `arange` array and doing a global sync. An in-place adjacent diff obviously doesn't work without some synchronization.
   
       So maybe we rename `arange` to something like `unique_seq_indices` and keep everything else as is; would that be good?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] masahi commented on pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-782513594


   Looks good :+1: GPU is not supported, right?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] codeislife99 commented on pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
codeislife99 commented on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-780303605


   Hey @masahi, can your example be extended to provide `counts` as well? https://www.tensorflow.org/api_docs/python/tf/unique_with_counts


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] ymwangg commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
ymwangg commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r583102978



##########
File path: python/tvm/relay/frontend/tensorflow.py
##########
@@ -2324,6 +2324,39 @@ def _impl(inputs, attr, params, mod):
     return _impl
 
 
+def _unique(return_counts=True):
+    def _impl(inputs, attr, params, mod):
+        assert len(inputs) == 1
+        data = inputs[0]
+        if return_counts:
+            [unique, indices, num_uniq, counts] = _op.unique(
+                data, is_sorted=False, return_counts=True
+            )
+            unique_sliced = _op.strided_slice(unique, begin=[0], end=num_uniq, slice_mode="size")
+            counts_sliced = _op.strided_slice(counts, begin=[0], end=num_uniq, slice_mode="size")
+            return _expr.TupleWrapper(
+                _expr.Tuple([unique_sliced, indices, counts_sliced]),
+                3,
+            )
+        else:
+            [unique, indices, num_uniq] = _op.unique(data, is_sorted=False, return_counts=False)
+            unique_sliced = _op.strided_slice(unique, begin=[0], end=num_uniq, slice_mode="size")
+            return _expr.TupleWrapper(
+                _expr.Tuple([unique_sliced, indices]),
+                2,
+            )
+
+    return _impl
+
+
+def _unique_with_counts():
+    def _impl(inputs, attr, params, mod):

Review comment:
       removed, thanks for the catch.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] ymwangg commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
ymwangg commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582554234



##########
File path: python/tvm/topi/cuda/unique.py
##########
@@ -0,0 +1,394 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name
+"""Unique operator"""
+import tvm
+from tvm import te, tir
+from ...te import hybrid
+from .scan import cumsum
+from .sort import sort, argsort
+from ..utils import ceil_div
+
+
+def _calc_adjacent_diff_ir(data, output, binop=tir.Sub):
+    """Low level IR to calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    output: Buffer
+        A buffer to store adjacent difference, of the same shape as data. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+
+    binop: function, optional
+        A binary associative op to use for calculating adjacent difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    output_ptr = ib.buffer_ptr(output)
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            with ib.if_scope(tid == 0):
+                output_ptr[tid] = 0
+            with ib.else_scope():
+                output_ptr[tid] = tir.Cast(output.dtype, binop(data_ptr[tid], data_ptr[tid - 1]))
+    return ib.get()
+
+
+def _calc_adjacent_diff(data, out_dtype="int32", binop=tir.Sub):
+    """Function calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : tvm.te.Tensor
+        Input 1-D tensor.
+
+    out_dtype : str
+        The output tensor data type.
+
+    binop: function, optional
+        A binary associative op to use for calculating difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+
+    Returns
+    -------
+    output : tvm.te.Tensor
+        1-D tensor storing the adjacent difference of the input tensor. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+    """
+    data_buf = tir.decl_buffer(data.shape, data.dtype, "sorted_data_buf", data_alignment=8)
+    output_buf = tir.decl_buffer(data.shape, out_dtype, "output_buf", data_alignment=8)
+    return te.extern(
+        [data.shape],
+        [data],
+        lambda ins, outs: _calc_adjacent_diff_ir(ins[0], outs[0], binop=binop),
+        dtype=[out_dtype],
+        in_buffers=[data_buf],
+        out_buffers=[output_buf],
+        name="_calc_adjacent_diff",
+        tag="_calc_adjacent_diff_gpu",
+    )
+
+
+@hybrid.script
+def _calc_num_unique(inc_scan):
+    """Helper function to get the number of unique elements fron inc_scan tensor"""
+    output = output_tensor((1,), "int32")
+    for i in bind("threadIdx.x", 1):
+        output[i] = inc_scan[inc_scan.shape[0] - 1] + int32(1)
+    return output
+
+
+def _calc_unique_ir(
+    data, argsorted_indices, inc_scan, index_converter, unique_elements, indices, counts
+):
+    """Low level IR to calculate unique elements, inverse indices, and counts (optional) of
+    unique elements of 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    argsorted_indices : Buffer
+        A buffer that stores the argsorted indices of the input data.
+
+    inc_scan : Buffer
+        A buffer that stores the inclusive scan of the binary tir.NE adjacent difference
+        of the sorted data.
+
+    index_converter (optional) : Buffer
+        An optional index converter that transforms the unique element index
+        such that new_idx = index_converter[old_idx].
+
+    unique_elements : Buffer
+        A buffer that stores the unique elements.
+
+    indices : Buffer
+        A buffer that stores the index of each input data element in the unique element array.
+
+    counts (optional) : Buffer
+        A buffer that stores the count of each unique element.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    unique_elements_ptr = ib.buffer_ptr(unique_elements)
+    indices_ptr = ib.buffer_ptr(indices)
+
+    index_converter_ptr = None
+    if isinstance(index_converter, tir.Buffer):
+        index_converter_ptr = ib.buffer_ptr(index_converter)
+
+    if isinstance(counts, tir.Buffer):
+        counts_ptr = ib.buffer_ptr(counts)
+        arange_ptr = ib.allocate(counts_ptr.dtype, counts.shape, name="arange_buf", scope="global")
+
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+
+    # calculate unique elements and inverse indices
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            data_idx = argsorted_indices_ptr[tid]
+            unique_idx = (
+                inc_scan_ptr[tid]
+                if not index_converter_ptr
+                else index_converter_ptr[inc_scan_ptr[tid]]
+            )
+            indices_ptr[data_idx] = unique_idx
+            with ib.if_scope(tid == 0):
+                unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+
+    # if need to return counts
+    if isinstance(counts, tir.Buffer):
+        num_unique = inc_scan_ptr[inc_scan.shape[0] - 1] + 1
+        num_elements = data.shape[0]
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < batch_size):
+                with ib.if_scope(tid == 0):
+                    arange_ptr[num_unique - 1] = num_elements
+                with ib.else_scope():
+                    with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                        arange_ptr[inc_scan_ptr[tid] - 1] = tid
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < num_unique):
+                unique_idx = tid if not index_converter_ptr else index_converter_ptr[tid]
+                with ib.if_scope(tid == 0):
+                    counts_ptr[unique_idx] = arange_ptr[tid]
+                with ib.else_scope():

Review comment:
       Sorry, the name may be confusing. `arange_ptr` actually stores the cumulative counts of the unique elements, and it works like this:
   Step 1: inc_scan   = [0, 0, 1, 1, 2, 2, 2, 2, 2, 3]
           tids       = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
   Step 2: arange_ptr = [2, 4, 9, 10]
   Step 3: counts_ptr = [2, 2, 5, 1]
   
   Maybe we can calculate `counts` first and use `indices_ptr` as a temp buffer to replace `arange_ptr`?
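   
   In NumPy terms, those steps look roughly like this (an illustrative sketch, not code from the PR; the names mirror the buffers above):
   ```python
   import numpy as np
   
   # Step 1: inclusive scan of the "new element?" flags over the sorted data.
   inc_scan = np.array([0, 0, 1, 1, 2, 2, 2, 2, 2, 3])
   num_unique = inc_scan[-1] + 1      # 4
   num_elements = inc_scan.shape[0]   # 10
   
   # Step 2: arange stores, for each unique element, one past its last position.
   arange = np.empty(num_unique, dtype=np.int64)
   arange[num_unique - 1] = num_elements
   for tid in range(1, num_elements):
       if inc_scan[tid] != inc_scan[tid - 1]:
           arange[inc_scan[tid] - 1] = tid
   print(arange)  # [ 2  4  9 10]
   
   # Step 3: the adjacent difference of arange yields the counts.
   print(np.diff(arange, prepend=0))  # [2 2 5 1]
   ```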







[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r581457533



##########
File path: python/tvm/relay/op/transform.py
##########
@@ -1463,3 +1463,48 @@ def cumsum(data, axis=None, dtype=None, exclusive=None):
         -> [1, 1, 2, 2, 3, 4, 4]
     """
     return _make.cumsum(data, axis, dtype, exclusive)
+
+
+def unique(data, is_sorted=True, return_counts=False):
+    """
+    Find the unique elements of a tensor
+    Parameters
+    ----------
+    data : relay.Expr
+        A 1-D tensor of integers
+    is_sorted : bool
+        Whether to sort the unique elements in ascending order before returning as output
+    return_counts : bool
+        Whether to return the array with count of each unique element
+    Returns
+    -------
+    output : relay.Expr
+        A 1-D tensor containing the unique elements of the input data tensor
+    indices : relay.Expr
+        A 1-D tensor containing the index of each data element in the output tensor
+    num_unique : relay.Expr
+        A 0-D tensor containing the number of unique elements in the input data tensor
+    counts (optional) : relay.Expr
+        A 1-D tensor containing the count of each unique element in the output
+    Examples
+    --------

Review comment:
       Please mention that the elements of the outputs beyond `num_unique` are undefined.
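   
   For instance, the docstring could illustrate it along these lines (a sketch; the exact wording is up to you):
   ```
   [output, indices, num_unique] = unique([4, 5, 1, 2, 3, 3, 4, 5])
   output      =  [4, 5, 1, 2, 3, ?, ?, ?]  # elements beyond num_unique are undefined
   indices     =  [0, 1, 2, 3, 4, 4, 0, 1]
   num_unique  =  [5]
   ```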







[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582601606



##########
File path: python/tvm/topi/cuda/unique.py
##########
@@ -0,0 +1,394 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name
+"""Unique operator"""
+import tvm
+from tvm import te, tir
+from ...te import hybrid
+from .scan import cumsum
+from .sort import sort, argsort
+from ..utils import ceil_div
+
+
+def _calc_adjacent_diff_ir(data, output, binop=tir.Sub):
+    """Low level IR to calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    output: Buffer
+        A buffer to store adjacent difference, of the same shape as data. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+
+    binop: function, optional
+        A binary associative op to use for calculating adjacent difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    output_ptr = ib.buffer_ptr(output)
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            with ib.if_scope(tid == 0):
+                output_ptr[tid] = 0
+            with ib.else_scope():
+                output_ptr[tid] = tir.Cast(output.dtype, binop(data_ptr[tid], data_ptr[tid - 1]))
+    return ib.get()
+
+
+def _calc_adjacent_diff(data, out_dtype="int32", binop=tir.Sub):
+    """Function calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : tvm.te.Tensor
+        Input 1-D tensor.
+
+    out_dtype : str
+        The output tensor data type.
+
+    binop: function, optional
+        A binary associative op to use for calculating difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+
+    Returns
+    -------
+    output : tvm.te.Tensor
+        1-D tensor storing the adjacent difference of the input tensor. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+    """
+    data_buf = tir.decl_buffer(data.shape, data.dtype, "sorted_data_buf", data_alignment=8)
+    output_buf = tir.decl_buffer(data.shape, out_dtype, "output_buf", data_alignment=8)
+    return te.extern(
+        [data.shape],
+        [data],
+        lambda ins, outs: _calc_adjacent_diff_ir(ins[0], outs[0], binop=binop),
+        dtype=[out_dtype],
+        in_buffers=[data_buf],
+        out_buffers=[output_buf],
+        name="_calc_adjacent_diff",
+        tag="_calc_adjacent_diff_gpu",
+    )
+
+
+@hybrid.script
+def _calc_num_unique(inc_scan):
+    """Helper function to get the number of unique elements fron inc_scan tensor"""
+    output = output_tensor((1,), "int32")
+    for i in bind("threadIdx.x", 1):
+        output[i] = inc_scan[inc_scan.shape[0] - 1] + int32(1)
+    return output
+
+
+def _calc_unique_ir(
+    data, argsorted_indices, inc_scan, index_converter, unique_elements, indices, counts
+):
+    """Low level IR to calculate unique elements, inverse indices, and counts (optional) of
+    unique elements of 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    argsorted_indices : Buffer
+        A buffer that stores the argsorted indices of the input data.
+
+    inc_scan : Buffer
+        A buffer that stores the inclusive scan of the binary tir.NE adjacent difference
+        of the sorted data.
+
+    index_converter (optional) : Buffer
+        An optional index converter that transforms the unique element index
+        such that new_idx = index_converter[old_idx].
+
+    unique_elements : Buffer
+        A buffer that stores the unique elements.
+
+    indices : Buffer
+        A buffer that stores the index of each input data element in the unique element array.
+
+    counts (optional) : Buffer
+        A buffer that stores the count of each unique element.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    unique_elements_ptr = ib.buffer_ptr(unique_elements)
+    indices_ptr = ib.buffer_ptr(indices)
+
+    index_converter_ptr = None
+    if isinstance(index_converter, tir.Buffer):
+        index_converter_ptr = ib.buffer_ptr(index_converter)
+
+    if isinstance(counts, tir.Buffer):
+        counts_ptr = ib.buffer_ptr(counts)
+        arange_ptr = ib.allocate(counts_ptr.dtype, counts.shape, name="arange_buf", scope="global")
+
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+
+    # calculate unique elements and inverse indices
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            data_idx = argsorted_indices_ptr[tid]
+            unique_idx = (
+                inc_scan_ptr[tid]
+                if not index_converter_ptr
+                else index_converter_ptr[inc_scan_ptr[tid]]
+            )
+            indices_ptr[data_idx] = unique_idx
+            with ib.if_scope(tid == 0):
+                unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+
+    # if need to return counts
+    if isinstance(counts, tir.Buffer):
+        num_unique = inc_scan_ptr[inc_scan.shape[0] - 1] + 1
+        num_elements = data.shape[0]
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < batch_size):
+                with ib.if_scope(tid == 0):
+                    arange_ptr[num_unique - 1] = num_elements
+                with ib.else_scope():
+                    with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                        arange_ptr[inc_scan_ptr[tid] - 1] = tid
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < num_unique):
+                unique_idx = tid if not index_converter_ptr else index_converter_ptr[tid]
+                with ib.if_scope(tid == 0):
+                    counts_ptr[unique_idx] = arange_ptr[tid]
+                with ib.else_scope():

Review comment:
       We can also consider filling in the `counts` array just like you fill in `arange`, and doing an in-place `_calc_adjacent_diff` on it directly.
   
   Anyway, as long as we remove the `arange` array, I'm happy. Can you try putting some thought into this? If it is not possible, that is totally fine, of course.
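   
   In NumPy terms, the suggestion would look roughly like this (an illustrative sketch: the boundary positions go straight into `counts`, and the adjacent difference is done on it in place, with no temporary buffer):
   ```python
   import numpy as np
   
   inc_scan = np.array([0, 0, 1, 1, 2, 2, 2, 2, 2, 3])
   num_unique = inc_scan[-1] + 1
   
   # Write the boundary positions directly into counts, no arange buffer.
   counts = np.empty(num_unique, dtype=np.int64)
   counts[num_unique - 1] = inc_scan.shape[0]
   for tid in range(1, inc_scan.shape[0]):
       if inc_scan[tid] != inc_scan[tid - 1]:
           counts[inc_scan[tid] - 1] = tid
   
   # In-place adjacent difference turns the boundary positions into counts.
   counts[1:] -= counts[:-1].copy()
   print(counts)  # [2 2 5 1]
   ```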







[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582513196



##########
File path: python/tvm/topi/unique.py
##########
@@ -0,0 +1,181 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name, no-else-return
+"""Unique operator"""
+from ..te import hybrid
+from .cumsum import cumsum
+from .sort import sort, argsort
+
+
+@hybrid.script
+def _calc_adjacent_diff(data):
+    output = output_tensor(data.shape, "int32")
+    output[0] = int32(0)
+    for i in parallel(1, data.shape[0]):
+        output[i] = int32(1) if data[i] != data[i - 1] else int32(0)
+    return output
+
+
+@hybrid.script
+def _calc_num_unique(data):
+    output = output_tensor((1,), "int32")
+    output[0] = data[data.shape[0] - 1] + int32(1)
+    return output
+
+
+@hybrid.script
+def _calc_unique_sorted(data, argsorted_indices, inc_scan):
+    unique_elements = output_tensor(data.shape, data.dtype)
+    indices = output_tensor(data.shape, "int32")
+    for i in parallel(data.shape[0]):
+        indices[argsorted_indices[i]] = inc_scan[i]
+        if i == 0 or inc_scan[i] != inc_scan[i - 1]:
+            unique_elements[inc_scan[i]] = data[argsorted_indices[i]]
+    return unique_elements, indices
+
+
+@hybrid.script
+def _calc_unique_sorted_with_counts(data, argsorted_indices, inc_scan):
+    unique_elements = output_tensor(data.shape, data.dtype)
+    indices = output_tensor(data.shape, "int32")
+    counts = output_tensor(data.shape, "int32")
+    for i in parallel(data.shape[0]):
+        counts[i] = int32(0)
+    for i in parallel(data.shape[0]):
+        indices[argsorted_indices[i]] = inc_scan[i]
+        if i == 0 or inc_scan[i] != inc_scan[i - 1]:
+            unique_elements[inc_scan[i]] = data[argsorted_indices[i]]
+    for i in range(data.shape[0]):
+        counts[inc_scan[i]] += int32(1)
+    return unique_elements, indices, counts
+
+
+@hybrid.script
+def _calc_first_occurence(argsorted_indices, inc_scan):
+    first_occurence = output_tensor(argsorted_indices.shape, "int32")
+    for i in parallel(argsorted_indices.shape[0]):
+        first_occurence[i] = argsorted_indices.shape[0]
+    for i in parallel(argsorted_indices.shape[0]):
+        if i == 0 or inc_scan[i] != inc_scan[i - 1]:
+            first_occurence[inc_scan[i]] = argsorted_indices[i]
+    return first_occurence
+
+
+@hybrid.script
+def _calc_unique_unsorted(data, argsorted_indices, inc_scan, index_converter):
+    unique_elements = output_tensor(data.shape, data.dtype)
+    indices = output_tensor(data.shape, "int32")
+    for i in parallel(data.shape[0]):
+        new_unique_idx = index_converter[inc_scan[i]]
+        new_data_idx = argsorted_indices[i]
+        indices[new_data_idx] = new_unique_idx
+        if i == 0 or inc_scan[i] != inc_scan[i - 1]:
+            unique_elements[new_unique_idx] = data[new_data_idx]
+    return unique_elements, indices
+
+
+@hybrid.script
+def _calc_unique_unsorted_with_counts(data, argsorted_indices, inc_scan, index_converter):
+    unique_elements = output_tensor(data.shape, data.dtype)
+    indices = output_tensor(data.shape, "int32")
+    counts = output_tensor(data.shape, "int32")
+    for i in parallel(data.shape[0]):
+        counts[i] = int32(0)
+    for i in parallel(data.shape[0]):
+        new_unique_idx = index_converter[inc_scan[i]]
+        new_data_idx = argsorted_indices[i]
+        indices[new_data_idx] = new_unique_idx
+        if i == 0 or inc_scan[i] != inc_scan[i - 1]:
+            unique_elements[new_unique_idx] = data[new_data_idx]
+    for i in range(data.shape[0]):
+        idx = index_converter[inc_scan[i]]
+        counts[idx] += int32(1)
+    return unique_elements, indices, counts
+

Review comment:
       It looks much cleaner now, very nice. Thanks.







[GitHub] [tvm] ymwangg commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
ymwangg commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582554234



##########
File path: python/tvm/topi/cuda/unique.py
##########
@@ -0,0 +1,394 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name
+"""Unique operator"""
+import tvm
+from tvm import te, tir
+from ...te import hybrid
+from .scan import cumsum
+from .sort import sort, argsort
+from ..utils import ceil_div
+
+
+def _calc_adjacent_diff_ir(data, output, binop=tir.Sub):
+    """Low level IR to calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    output: Buffer
+        A buffer to store adjacent difference, of the same shape as data. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+
+    binop: function, optional
+        A binary associative op to use for calculating adjacent difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    output_ptr = ib.buffer_ptr(output)
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            with ib.if_scope(tid == 0):
+                output_ptr[tid] = 0
+            with ib.else_scope():
+                output_ptr[tid] = tir.Cast(output.dtype, binop(data_ptr[tid], data_ptr[tid - 1]))
+    return ib.get()
+
+
+def _calc_adjacent_diff(data, out_dtype="int32", binop=tir.Sub):
+    """Function calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : tvm.te.Tensor
+        Input 1-D tensor.
+
+    out_dtype : str
+        The output tensor data type.
+
+    binop: function, optional
+        A binary associative op to use for calculating difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+
+    Returns
+    -------
+    output : tvm.te.Tensor
+        1-D tensor storing the adjacent difference of the input tensor. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+    """
+    data_buf = tir.decl_buffer(data.shape, data.dtype, "sorted_data_buf", data_alignment=8)
+    output_buf = tir.decl_buffer(data.shape, out_dtype, "output_buf", data_alignment=8)
+    return te.extern(
+        [data.shape],
+        [data],
+        lambda ins, outs: _calc_adjacent_diff_ir(ins[0], outs[0], binop=binop),
+        dtype=[out_dtype],
+        in_buffers=[data_buf],
+        out_buffers=[output_buf],
+        name="_calc_adjacent_diff",
+        tag="_calc_adjacent_diff_gpu",
+    )
+
+
+@hybrid.script
+def _calc_num_unique(inc_scan):
+    """Helper function to get the number of unique elements fron inc_scan tensor"""
+    output = output_tensor((1,), "int32")
+    for i in bind("threadIdx.x", 1):
+        output[i] = inc_scan[inc_scan.shape[0] - 1] + int32(1)
+    return output
+
+
+def _calc_unique_ir(
+    data, argsorted_indices, inc_scan, index_converter, unique_elements, indices, counts
+):
+    """Low level IR to calculate unique elements, inverse indices, and counts (optional) of
+    unique elements of 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    argsorted_indices : Buffer
+        A buffer that stores the argsorted indices of the input data.
+
+    inc_scan : Buffer
+        A buffer that stores the inclusive scan of the binary tir.NE adjacent difference
+        of the sorted data.
+
+    index_converter (optional) : Buffer
+        An optional index converter that transforms the unique element index
+        such that new_idx = index_converter[old_idx].
+
+    unique_elements : Buffer
+        A buffer that stores the unique elements.
+
+    indices : Buffer
+        A buffer that stores the index of each input data element in the unique element array.
+
+    counts (optional) : Buffer
+        A buffer that stores the count of each unique element.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    unique_elements_ptr = ib.buffer_ptr(unique_elements)
+    indices_ptr = ib.buffer_ptr(indices)
+
+    index_converter_ptr = None
+    if isinstance(index_converter, tir.Buffer):
+        index_converter_ptr = ib.buffer_ptr(index_converter)
+
+    if isinstance(counts, tir.Buffer):
+        counts_ptr = ib.buffer_ptr(counts)
+        arange_ptr = ib.allocate(counts_ptr.dtype, counts.shape, name="arange_buf", scope="global")
+
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+
+    # calculate unique elements and inverse indices
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            data_idx = argsorted_indices_ptr[tid]
+            unique_idx = (
+                inc_scan_ptr[tid]
+                if not index_converter_ptr
+                else index_converter_ptr[inc_scan_ptr[tid]]
+            )
+            indices_ptr[data_idx] = unique_idx
+            with ib.if_scope(tid == 0):
+                unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+
+    # if need to return counts
+    if isinstance(counts, tir.Buffer):
+        num_unique = inc_scan_ptr[inc_scan.shape[0] - 1] + 1
+        num_elements = data.shape[0]
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < batch_size):
+                with ib.if_scope(tid == 0):
+                    arange_ptr[num_unique - 1] = num_elements
+                with ib.else_scope():
+                    with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                        arange_ptr[inc_scan_ptr[tid] - 1] = tid
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < num_unique):
+                unique_idx = tid if not index_converter_ptr else index_converter_ptr[tid]
+                with ib.if_scope(tid == 0):
+                    counts_ptr[unique_idx] = arange_ptr[tid]
+                with ib.else_scope():

Review comment:
       Sorry, the name may be confusing. `arange_ptr` actually stores the cumulative counts of the unique elements, and it works like this:
   Step 1: inc_scan   = [0, 0, 1, 1, 2, 2, 2, 2, 2, 3]
           tids       = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
   Step 2: arange_ptr = [2, 4, 9, 10]
   Step 3: counts_ptr = [2, 2, 5, 1]
   
   Maybe we can calculate `counts` first and use `indices_ptr` to replace `arange_ptr`?







[GitHub] [tvm] masahi edited a comment on pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi edited a comment on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-784615965


   @ymwangg @codeislife99 I found a neat trick PyTorch uses for `count`. https://github.com/pytorch/pytorch/blob/22a34bcf4e5eaa348f0117c414c3dd760ec64b13/aten/src/ATen/native/cuda/Unique.cu#L60-L68
   
   Basically, after you get the exclusive scan, instead of copying from the original input, you copy from an array [0, 1, 2, ...]. This will give you something like [0, 2, 5], and doing an adjacent difference on it directly gives the counts. Does this make sense? It should be much faster than atomics.
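   
   In NumPy terms, the trick looks roughly like this (an illustrative sketch reusing the sorted input from the earlier example):
   ```python
   import numpy as np
   
   sorted_inp = np.array([0, 0, 0, 4, 5, 5, 6, 6, 6, 6, 6, 7, 8, 8, 9])
   non_zero = np.concatenate([[1], np.diff(sorted_inp)]) != 0
   ex_scan = np.cumsum(non_zero) - non_zero
   num_unique = ex_scan[-1] + 1
   
   # Instead of copying data values, scatter the positions [0, 1, 2, ...] of
   # the first occurrences, and append the total length at the end.
   pos = np.empty(num_unique + 1, dtype=np.int64)
   pos[-1] = sorted_inp.shape[0]
   for i in range(sorted_inp.shape[0]):
       if non_zero[i]:
           pos[ex_scan[i]] = i
   print(pos)           # [ 0  3  4  6 11 12 14 15]
   
   # The adjacent difference of the scattered positions is exactly the counts.
   print(np.diff(pos))  # [3 1 2 5 1 2 1]
   ```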





[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r581458201



##########
File path: python/tvm/topi/unique.py
##########
@@ -0,0 +1,181 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name, no-else-return
+"""Unique operator"""
+from ..te import hybrid
+from .cumsum import cumsum
+from .sort import sort, argsort
+
+
+@hybrid.script
+def _calc_adjacent_diff(data):
+    output = output_tensor(data.shape, "int32")
+    output[0] = int32(0)
+    for i in parallel(1, data.shape[0]):
+        output[i] = int32(1) if data[i] != data[i - 1] else int32(0)
+    return output
+
+
+@hybrid.script
+def _calc_num_unique(data):
+    output = output_tensor((1,), "int32")
+    output[0] = data[data.shape[0] - 1] + int32(1)
+    return output
+
+
+@hybrid.script
+def _calc_unique_sorted(data, argsorted_indices, inc_scan):
+    unique_elements = output_tensor(data.shape, data.dtype)
+    indices = output_tensor(data.shape, "int32")
+    for i in parallel(data.shape[0]):
+        indices[argsorted_indices[i]] = inc_scan[i]
+        if i == 0 or inc_scan[i] != inc_scan[i - 1]:
+            unique_elements[inc_scan[i]] = data[argsorted_indices[i]]
+    return unique_elements, indices
+
+
+@hybrid.script
+def _calc_unique_sorted_with_counts(data, argsorted_indices, inc_scan):
+    unique_elements = output_tensor(data.shape, data.dtype)
+    indices = output_tensor(data.shape, "int32")
+    counts = output_tensor(data.shape, "int32")
+    for i in parallel(data.shape[0]):
+        counts[i] = int32(0)
+    for i in parallel(data.shape[0]):
+        indices[argsorted_indices[i]] = inc_scan[i]
+        if i == 0 or inc_scan[i] != inc_scan[i - 1]:
+            unique_elements[inc_scan[i]] = data[argsorted_indices[i]]
+    for i in range(data.shape[0]):
+        counts[inc_scan[i]] += int32(1)
+    return unique_elements, indices, counts
+
+
+@hybrid.script
+def _calc_first_occurence(argsorted_indices, inc_scan):
+    first_occurence = output_tensor(argsorted_indices.shape, "int32")
+    for i in parallel(argsorted_indices.shape[0]):
+        first_occurence[i] = argsorted_indices.shape[0]
+    for i in parallel(argsorted_indices.shape[0]):
+        if i == 0 or inc_scan[i] != inc_scan[i - 1]:
+            first_occurence[inc_scan[i]] = argsorted_indices[i]
+    return first_occurence
+
+
+@hybrid.script
+def _calc_unique_unsorted(data, argsorted_indices, inc_scan, index_converter):
+    unique_elements = output_tensor(data.shape, data.dtype)
+    indices = output_tensor(data.shape, "int32")
+    for i in parallel(data.shape[0]):
+        new_unique_idx = index_converter[inc_scan[i]]
+        new_data_idx = argsorted_indices[i]
+        indices[new_data_idx] = new_unique_idx
+        if i == 0 or inc_scan[i] != inc_scan[i - 1]:
+            unique_elements[new_unique_idx] = data[new_data_idx]
+    return unique_elements, indices
+
+
+@hybrid.script
+def _calc_unique_unsorted_with_counts(data, argsorted_indices, inc_scan, index_converter):
+    unique_elements = output_tensor(data.shape, data.dtype)
+    indices = output_tensor(data.shape, "int32")
+    counts = output_tensor(data.shape, "int32")
+    for i in parallel(data.shape[0]):
+        counts[i] = int32(0)
+    for i in parallel(data.shape[0]):
+        new_unique_idx = index_converter[inc_scan[i]]
+        new_data_idx = argsorted_indices[i]
+        indices[new_data_idx] = new_unique_idx
+        if i == 0 or inc_scan[i] != inc_scan[i - 1]:
+            unique_elements[new_unique_idx] = data[new_data_idx]
+    for i in range(data.shape[0]):
+        idx = index_converter[inc_scan[i]]
+        counts[idx] += int32(1)
+    return unique_elements, indices, counts
+

Review comment:
       I'm definitely seeing duplicated code here, but that's one reason I don't like hybrid script: it doesn't have any abstraction like functions.







[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582488750



##########
File path: python/tvm/relay/frontend/pytorch.py
##########
@@ -2164,6 +2164,24 @@ def is_floating_point(self, inputs, input_types):
         is_float = input_type in ["float32", "float64", "float16", "bfloat16"]
         return _expr.const(is_float)
 
+    def unique(self, inputs, input_types):
+        assert len(inputs) == 4
+        [data, is_sorted, return_inverse, return_counts] = inputs

Review comment:
       Ah sorry, you are right. So in TorchScript `unique` always returns indices, regardless of the `return_inverse` option? The Python one https://pytorch.org/docs/stable/generated/torch.unique.html doesn't return indices by default.
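   
   For reference, this is the Python-level behavior being contrasted (a quick check with stock PyTorch, outside this PR):
   ```python
   import torch
   
   x = torch.tensor([9, 3, 9, 5])
   
   # By default the Python API returns only the unique values.
   print(torch.unique(x))  # tensor([3, 5, 9])
   
   # With return_inverse=True it also returns the inverse indices.
   vals, inverse = torch.unique(x, return_inverse=True)
   print(inverse)          # tensor([2, 0, 2, 1])
   ```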







[GitHub] [tvm] masahi edited a comment on pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi edited a comment on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-780284547


   It can be a lot simpler than that. Unique is basically sort + adjacent difference + exclusive scan. If you don't understand that statement, the following example should help. We have exclusive scan for CPU (`cumsum` op with `exclusive=True`) and GPU (see https://github.com/apache/tvm/pull/7303).
   
   ```
   import numpy as np
   
   
   def exclusive_scan(arr):
       return np.cumsum(arr) - arr
   
   
   inp = np.random.randint(0, 10, size=(15,))
   argsort_indices = np.argsort(inp)
   sorted_inp = np.array([inp[i] for i in argsort_indices])
   print("sorted input:", sorted_inp)
   
   adj_diff = np.concatenate([[sorted_inp[0]],  np.diff(sorted_inp)])
   print("adjacent difference:", adj_diff)
   
   non_zero = adj_diff != 0
   non_zero[0] = True  # the first element is always selected
   ex_scan = exclusive_scan(non_zero)
   print("exclusive scan:", ex_scan)
   
   unique = np.zeros(inp.shape[0], dtype=np.int)
   
   for i in range(inp.shape[0]):
       if non_zero[i] != 0:
           unique[ex_scan[i]] = inp[argsort_indices[i]]
   
   print("num unique element:", ex_scan[-1] + 1)
   print("unique:", unique)
   ```
   
   Output:
   ```
   sorted input: [0 0 0 4 5 5 6 6 6 6 6 7 8 8 9]
   adjacent difference: [0 0 0 4 1 0 1 0 0 0 0 1 1 0 1]
   exclusive scan: [0 1 1 1 2 3 3 4 4 4 4 4 5 6 6]
   num unique element: 7
   unique: [0 4 5 6 7 8 9 0 0 0 0 0 0 0 0]
   ```





[GitHub] [tvm] ymwangg commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
ymwangg commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r574881936



##########
File path: src/relay/op/algorithm/unique.cc
##########
@@ -0,0 +1,147 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*!
+ * \file unique.cc
+ * \brief The unique operator
+ */
+#include <dlpack/dlpack.h>
+#include <tvm/relay/attrs/algorithm.h>
+#include <tvm/relay/op.h>
+#include <tvm/relay/op_attr_types.h>
+#include <tvm/runtime/data_type.h>
+
+namespace tvm {
+namespace relay {
+
+bool UniqueRel(const Array<Type>& types, int num_inputs, const Attrs& attrs,
+               const TypeReporter& reporter) {
+  // types: [data, result]
+  ICHECK_EQ(types.size(), 2) << "Unique: expect 2 types but " << types.size() << " provided";
+  ICHECK_EQ(num_inputs, 1) << "Unique: expect 1 inputs but " << num_inputs << " provided";
+  auto data = types[0].as<TensorTypeNode>();
+  if (data == nullptr) {
+    ICHECK(types[0].as<IncompleteTypeNode>())
+        << "Unique: expect input type to be TensorType but get " << types[0];
+    return false;
+  }
+  std::vector<Type> fields;
+  fields.push_back(TensorType(data->shape, data->dtype));

Review comment:
       The main challenge is the shape function. This operator requires data structures like a set to efficiently calculate the number of unique elements, and I don't think that's supported in hybrid script. Even if we could do it and get the shape function, it looks weird that both the shape function and the compute function do the same computation.
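   
   In plain Python the difficulty is easy to state (a sketch): the output shape depends on the data values themselves, not just on the input shape:
   ```python
   # What a shape function for unique would have to compute.
   def unique_out_shape(data):
       return (len(set(data)),)
   
   print(unique_out_shape([4, 5, 1, 2, 3, 3, 4, 5]))  # (5,)
   ```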







[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582490524



##########
File path: python/tvm/relay/frontend/tensorflow.py
##########
@@ -2324,6 +2324,35 @@ def _impl(inputs, attr, params, mod):
     return _impl
 
 
+def _unique():
+    def _impl(inputs, attr, params, mod):
+        assert len(inputs) == 1
+        data = inputs[0]
+        [unique, indices, num_uniq] = _op.unique(data, is_sorted=False, return_counts=False)
+        unique_sliced = _op.strided_slice(unique, begin=[0], end=num_uniq, slice_mode="size")
+        return _expr.TupleWrapper(
+            _expr.Tuple([unique_sliced, indices]),
+            2,
+        )
+
+    return _impl
+
+
+def _unique_with_counts():
+    def _impl(inputs, attr, params, mod):
+        assert len(inputs) == 1
+        data = inputs[0]
+        [unique, indices, num_uniq, counts] = _op.unique(data, is_sorted=False, return_counts=True)
+        unique_sliced = _op.strided_slice(unique, begin=[0], end=num_uniq, slice_mode="size")
+        counts_sliced = _op.strided_slice(counts, begin=[0], end=num_uniq, slice_mode="size")
+        return _expr.TupleWrapper(
+            _expr.Tuple([unique_sliced, indices, counts_sliced]),
+            3,
+        )
+
+    return _impl
+

Review comment:
       Probably we can share the implementation between the two converters: just pass `return_counts` to `_unique()` and `_unique_with_counts()`.
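   
   I.e., something along these lines in the convert map (a hypothetical sketch, assuming `_unique` takes the flag):
   ```python
   # One converter parameterized by return_counts (names are illustrative).
   _convert_map = {
       # ...
       "Unique": _unique(return_counts=False),
       "UniqueWithCounts": _unique(return_counts=True),
       # ...
   }
   ```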







[GitHub] [tvm] masahi commented on pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-777181989


   Thanks, I was planning to work on unique next week, happy to collaborate.
   
   I can add a TIR unique impl for both CPU and GPU later.





[GitHub] [tvm] masahi merged pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi merged pull request #7441:
URL: https://github.com/apache/tvm/pull/7441


   





[GitHub] [tvm] masahi edited a comment on pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi edited a comment on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-780284547


   It can be a lot simpler than that. Unique is basically sort + adjacent difference + exclusive scan. If you don't understand that statement, the following example should help. We have exclusive scan for CPU (`cumsum` op with `exclusive=True`), and GPU (see https://github.com/apache/tvm/pull/7303).
   
   If we implement unique this way, the same code runs on both CPU and GPU.
   ```
   import numpy as np
   
   
   def exclusive_scan(arr):
       return np.cumsum(arr) - arr
   
   
   inp = np.random.randint(0, 10, size=(15,))
   argsort_indices = np.argsort(inp)
   sorted_inp = np.array([inp[i] for i in argsort_indices])
   print("sorted input:", sorted_inp)
   
   adj_diff = np.concatenate([[1],  np.diff(sorted_inp)])
   print("adjacent difference:", adj_diff)
   
   non_zero = adj_diff != 0
   ex_scan = exclusive_scan(non_zero)
   print("exclusive scan:", ex_scan)
   
   unique = np.zeros(inp.shape[0], dtype=np.int)
   
   for i in range(inp.shape[0]):
       if non_zero[i] != 0:
           unique[ex_scan[i]] = inp[argsort_indices[i]]
   
   print("num unique element:", ex_scan[-1] + 1)
   print("unique:", unique)
   ```
   
   Output:
   ```
   sorted input: [0 0 0 4 5 5 6 6 6 6 6 7 8 8 9]
   adjacent difference: [1 0 0 4 1 0 1 0 0 0 0 1 1 0 1]
   exclusive scan: [0 1 1 1 2 3 3 4 4 4 4 4 5 6 6]
   num unique element: 7
   unique: [0 4 5 6 7 8 9 0 0 0 0 0 0 0 0]
   ```





[GitHub] [tvm] ymwangg commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
ymwangg commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582455040



##########
File path: python/tvm/relay/frontend/pytorch.py
##########
@@ -2164,6 +2164,24 @@ def is_floating_point(self, inputs, input_types):
         is_float = input_type in ["float32", "float64", "float16", "bfloat16"]
         return _expr.const(is_float)
 
+    def unique(self, inputs, input_types):
+        assert len(inputs) == 4
+        [data, is_sorted, return_inverse, return_counts] = inputs

Review comment:
       What do you mean? The implementation calculates inverse indices all the time.







[GitHub] [tvm] ymwangg commented on pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
ymwangg commented on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-780387355


   @masahi Thanks for the explanation; it is very helpful!
   It looks like the main thing we need to do is implement a `topi.adjacent_difference` op similar to `thrust::adjacent_difference`. Then in the frontend we can do something like:
   ```python
   sorted_data = relay.sort(data)
   argsort_indices = relay.argsort(data)
   adj_diff = relay.adjacent_difference(sorted_data, first_value=0, binop="not_equal")
   ex_scan = relay.cumsum(adj_diff, exclusive=True)
   inverse_indices = relay.scatter(data, argsort_indices, ex_scan)
   unique = relay.scatter(data, ex_scan, sorted_data)
   unique_sliced = relay.strided_slice(unique, [0], relay.take(ex_scan,[-1]), slice_mode="size")
   return unique_sliced, inverse_indices
   ```
   I saw PyTorch uses `thrust::unique` to get the `unique` array. I think we can use `relay.scatter` to do the same thing.
   
   To support counting, it looks like we need to implement a `topi.unique_by_key` op similar to `thrust::unique_by_key`. I think maybe we can do it in a different PR and focus on `adjacent_difference` in this PR.





[GitHub] [tvm] ymwangg commented on pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
ymwangg commented on pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#issuecomment-784501849


   @masahi I added the GPU version and it's ready for review.





[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582513922



##########
File path: python/tvm/topi/cuda/unique.py
##########
@@ -0,0 +1,384 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name, no-else-return
+"""Unique operator"""
+from tvm import te, tir
+import tvm
+
+from ...te import hybrid
+from .scan import cumsum
+from .sort import sort, argsort
+from ..utils import ceil_div
+
+
+def _calc_adjacent_diff_ir(data, adjacent_diff):
+    ib = tvm.tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    adjacent_diff_ptr = ib.buffer_ptr(adjacent_diff)
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            with ib.if_scope(tid == 0):
+                adjacent_diff_ptr[tid] = 0
+            with ib.else_scope():
+                with ib.if_scope(data_ptr[tid] != data_ptr[tid - 1]):
+                    adjacent_diff_ptr[tid] = 1
+                with ib.else_scope():
+                    adjacent_diff_ptr[tid] = 0
+    return ib.get()
+
+
+@hybrid.script
+def _calc_num_unique(data):
+    output = output_tensor((1,), "int32")
+    for i in bind("threadIdx.x", 1):
+        output[i] = data[data.shape[0] - 1] + int32(1)
+    return output
+
+
+def _calc_unique_sorted_ir(data, argsorted_indices, inc_scan, unique_elements, indices):
+    ib = tvm.tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    unique_elements_ptr = ib.buffer_ptr(unique_elements)
+    indices_ptr = ib.buffer_ptr(indices)
+
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            indices_ptr[argsorted_indices_ptr[tid]] = inc_scan_ptr[tid]
+            with ib.if_scope(tid == 0):
+                unique_elements_ptr[inc_scan_ptr[tid]] = data_ptr[argsorted_indices_ptr[tid]]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    unique_elements_ptr[inc_scan_ptr[tid]] = data_ptr[argsorted_indices_ptr[tid]]
+    return ib.get()
+
+
+def _calc_counts_sorted_ir(inc_scan, counts):
+    ib = tvm.tir.ir_builder.create()
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    counts_ptr = ib.buffer_ptr(counts)
+
+    batch_size = inc_scan.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            counts_ptr[tid] = 0
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        atomic_add_return = ib.allocate(counts.dtype, (1,), name="atomic_add_return", scope="local")
+        with ib.if_scope(tid < batch_size):
+            index = inc_scan_ptr[tid]
+            atomic_add_return[0] = tvm.tir.call_intrin(
+                counts.dtype,
+                "tir.atomic_add",

Review comment:
       Looks great




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] masahi commented on a change in pull request #7441: [Frontend][Tensorflow] Add unique operator

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7441:
URL: https://github.com/apache/tvm/pull/7441#discussion_r582587901



##########
File path: python/tvm/topi/cuda/unique.py
##########
@@ -0,0 +1,394 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name
+"""Unique operator"""
+import tvm
+from tvm import te, tir
+from ...te import hybrid
+from .scan import cumsum
+from .sort import sort, argsort
+from ..utils import ceil_div
+
+
+def _calc_adjacent_diff_ir(data, output, binop=tir.Sub):
+    """Low level IR to calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    output: Buffer
+        A buffer to store adjacent difference, of the same shape as data. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+
+    binop: function, optional
+        A binary op to use for calculating the adjacent difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    output_ptr = ib.buffer_ptr(output)
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            with ib.if_scope(tid == 0):
+                output_ptr[tid] = 0
+            with ib.else_scope():
+                output_ptr[tid] = tir.Cast(output.dtype, binop(data_ptr[tid], data_ptr[tid - 1]))
+    return ib.get()
+
+
+def _calc_adjacent_diff(data, out_dtype="int32", binop=tir.Sub):
+    """Function calculate adjacent difference in an 1-D array.
+
+    Parameters
+    ----------
+    data : tvm.te.Tensor
+        Input 1-D tensor.
+
+    out_dtype : str
+        The output tensor data type.
+
+    binop: function, optional
+        A binary op to use for calculating the difference. The function takes two
+        TIR expressions and produces a new TIR expression. By default it uses tvm.tir.Sub to
+        compute the adjacent difference.
+
+    Returns
+    -------
+    output : tvm.te.Tensor
+        1-D tensor storing the adjacent difference of the input tensor. The adjacent difference
+        is defined as: output[0] = 0, output[i] = binop(data[i], data[i-1])
+        where i > 0 and i < len(data).
+    """
+    data_buf = tir.decl_buffer(data.shape, data.dtype, "sorted_data_buf", data_alignment=8)
+    output_buf = tir.decl_buffer(data.shape, out_dtype, "output_buf", data_alignment=8)
+    return te.extern(
+        [data.shape],
+        [data],
+        lambda ins, outs: _calc_adjacent_diff_ir(ins[0], outs[0], binop=binop),
+        dtype=[out_dtype],
+        in_buffers=[data_buf],
+        out_buffers=[output_buf],
+        name="_calc_adjacent_diff",
+        tag="_calc_adjacent_diff_gpu",
+    )
+
+
+@hybrid.script
+def _calc_num_unique(inc_scan):
+    """Helper function to get the number of unique elements fron inc_scan tensor"""
+    output = output_tensor((1,), "int32")
+    for i in bind("threadIdx.x", 1):
+        output[i] = inc_scan[inc_scan.shape[0] - 1] + int32(1)
+    return output
+
+
+def _calc_unique_ir(
+    data, argsorted_indices, inc_scan, index_converter, unique_elements, indices, counts
+):
+    """Low level IR to calculate unique elements, inverse indices, and counts (optional) of
+    unique elements of 1-D array.
+
+    Parameters
+    ----------
+    data : Buffer
+        Input 1-D Buffer.
+
+    argsorted_indices : Buffer
+        A buffer that stores the argsorted indices of the input data.
+
+    inc_scan : Buffer
+        A buffer that stores the inclusive scan of the binary tir.NE adjacent difference
+        of the sorted data.
+
+    index_converter (optional) : Buffer
+        An optional index converter that transforms the unique element index
+        such that new_idx = index_converter[old_idx].
+
+    unique_elements : Buffer
+        A buffer that stores the unique elements.
+
+    indices : Buffer
+        A buffer that stores the index of each input data element in the unique element array.
+
+    counts (optional) : Buffer
+        A buffer that stores the count of each unique element.
+    """
+    ib = tir.ir_builder.create()
+    data_ptr = ib.buffer_ptr(data)
+    argsorted_indices_ptr = ib.buffer_ptr(argsorted_indices)
+    inc_scan_ptr = ib.buffer_ptr(inc_scan)
+    unique_elements_ptr = ib.buffer_ptr(unique_elements)
+    indices_ptr = ib.buffer_ptr(indices)
+
+    index_converter_ptr = None
+    if isinstance(index_converter, tir.Buffer):
+        index_converter_ptr = ib.buffer_ptr(index_converter)
+
+    if isinstance(counts, tir.Buffer):
+        counts_ptr = ib.buffer_ptr(counts)
+        arange_ptr = ib.allocate(counts_ptr.dtype, counts.shape, name="arange_buf", scope="global")
+
+    batch_size = data.shape[0]
+    max_threads = tir.min(batch_size, tvm.target.Target.current(allow_none=False).max_num_threads)
+
+    # calculate unique elements and inverse indices
+    with ib.new_scope():
+        nthread_tx = max_threads
+        nthread_bx = ceil_div(batch_size, max_threads)
+        tx = te.thread_axis("threadIdx.x")
+        bx = te.thread_axis("blockIdx.x")
+        ib.scope_attr(tx, "thread_extent", nthread_tx)
+        ib.scope_attr(bx, "thread_extent", nthread_bx)
+        tid = bx * max_threads + tx
+        with ib.if_scope(tid < batch_size):
+            data_idx = argsorted_indices_ptr[tid]
+            unique_idx = (
+                inc_scan_ptr[tid]
+                if not index_converter_ptr
+                else index_converter_ptr[inc_scan_ptr[tid]]
+            )
+            indices_ptr[data_idx] = unique_idx
+            with ib.if_scope(tid == 0):
+                unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+            with ib.else_scope():
+                with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                    unique_elements_ptr[unique_idx] = data_ptr[data_idx]
+
+    # if need to return counts
+    if isinstance(counts, tir.Buffer):
+        num_unique = inc_scan_ptr[inc_scan.shape[0] - 1] + 1
+        num_elements = data.shape[0]
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < batch_size):
+                with ib.if_scope(tid == 0):
+                    arange_ptr[num_unique - 1] = num_elements
+                with ib.else_scope():
+                    with ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1]):
+                        arange_ptr[inc_scan_ptr[tid] - 1] = tid
+        with ib.new_scope():
+            nthread_tx = max_threads
+            nthread_bx = ceil_div(batch_size, max_threads)
+            tx = te.thread_axis("threadIdx.x")
+            bx = te.thread_axis("blockIdx.x")
+            ib.scope_attr(tx, "thread_extent", nthread_tx)
+            ib.scope_attr(bx, "thread_extent", nthread_bx)
+            tid = bx * max_threads + tx
+            with ib.if_scope(tid < num_unique):
+                unique_idx = tid if not index_converter_ptr else index_converter_ptr[tid]
+                with ib.if_scope(tid == 0):
+                    counts_ptr[unique_idx] = arange_ptr[tid]
+                with ib.else_scope():

Review comment:
       I don't get why you need to fill `arange_ptr` with accumulated counts, or why you need a temp buffer at all. If you move the condition `ib.if_scope(inc_scan_ptr[tid] != inc_scan_ptr[tid - 1])` from L200 here, can't you inline the difference computation without materializing `arange_ptr`? I guess you would be writing to index `inc_scan_ptr[tid] - 1` instead of `unique_idx`.
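       For reference, a serial NumPy sketch of how I read the two kernels (ignoring `index_converter`):
       ```python
       import numpy as np

       def counts_two_pass(inc_scan):
           n = inc_scan.shape[0]
           num_unique = inc_scan[-1] + 1
           # kernel 1: record the cumulative element count at each segment boundary
           arange = np.empty(num_unique, dtype=np.int64)
           arange[num_unique - 1] = n                  # the tid == 0 branch
           for tid in range(1, n):
               if inc_scan[tid] != inc_scan[tid - 1]:  # a new segment starts at tid
                   arange[inc_scan[tid] - 1] = tid
           # kernel 2: adjacent difference of the boundary positions gives the counts
           counts = np.empty_like(arange)
           counts[0] = arange[0]
           counts[1:] = arange[1:] - arange[:-1]
           return counts

       print(counts_two_pass(np.array([0, 1, 1, 2, 2, 2])))  # [1 2 3]
       ```
       The question is whether kernel 2's difference can be fused into kernel 1 so that `arange` never needs to be materialized.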




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org