Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/08/09 07:00:04 UTC

[GitHub] ciyongch commented on issue #12085: Accelerate the performance of topk for CPU side

URL: https://github.com/apache/incubator-mxnet/pull/12085#issuecomment-411657593
 
 
   @asmushetzel thanks for your comments. 
   1) Actually, I ran some additional tests with different configurations when calling the topk op, which may help show how much improvement this enhancement can achieve. Here is more performance data comparing the optimized version against the out-of-the-box version. The numbers in the **speedup** columns are speedup factors (higher is better). All cases show good gains, especially when `axis=3` (equivalent to -1, since the input is a 4-D ndarray).
   
   
   
   |          |       | ret_type=value | ret_type=indices | ret_type=both | ret_type=mask |
   | -------- | ----- | -------------- | ---------------- | ------------- | ------------- |
   | **axis** | **k** | **speedup**    | **speedup**      | **speedup**   | **speedup**   |
   | 0        | 1     | 2.52           | 2.60             | 2.49          | 1.00          |
   | 0        | 3     | 2.54           | 2.61             | 2.63          | 1.04          |
   | 0        | 5     | 2.58           | 2.54             | 2.59          | 1.16          |
   | 0        | 10    | 2.52           | 2.58             | 2.58          | 1.09          |
   | 1        | 1     | 3.72           | 3.69             | 3.73          | 1.04          |
   | 1        | 3     | 3.75           | 3.77             | 3.66          | 1.10          |
   | 2        | 1     | 2.40           | 2.35             | 2.30          | 1.11          |
   | 2        | 3     | 2.34           | 2.31             | 2.24          | 1.10          |
   | 2        | 5     | 2.29           | 2.30             | 2.24          | 1.19          |
   | 2        | 10    | 2.33           | 2.33             | 2.34          | 1.09          |
   | 2        | 100   | 2.36           | 2.43             | 2.40          | 1.10          |
   | 2        | 500   | 2.51           | 2.49             | 2.43          | 1.10          |
   | 2        | 1000  | 2.44           | 2.37             | 2.34          | 1.04          |
   | 3        | 1     | 67.29          | 70.23            | 61.32         | 5.22          |
   | 3        | 3     | 64.53          | 64.05            | 61.82         | 5.32          |
   | 3        | 5     | 61.06          | 61.72            | 52.28         | 4.85          |
   | 3        | 10    | 43.06          | 50.45            | 49.08         | 4.28          |
   | 3        | 100   | 21.83          | 22.04            | 21.10         | 2.44          |
   | 3        | 500   | 10.36          | 10.35            | 10.30         | 1.55          |
   | 3        | 1000  | 8.54           | 9.10             | 8.08          | 1.35          |
   
   2) Regarding the code changes, I made two minor changes and kept the main logic the same as the current version.
      
         - apply the modulus only to `ret_indices` instead of the full `indices` after calling `TopkSort()`, which avoids a lot of redundant computation when `k` is smaller than the number of elements. So I changed `indices = F<mshadow_op::mod>(indices, element_num);` to `ret_indices = F<mshadow_op::mod>(ret_indices, element_num);`. This is why the results for `ret_type=[value, indices, both]` are better than for `ret_type=mask`.
         - in the case of `axis=-1`, there is no need to `transpose` the source (input) data; only a flatten-to-1D operation is required. `FlatTo1D` is much more efficient than `reshape` because it only changes the shape and does not move any data. I noticed that in the current CPU `TopkSort()` function, the `work` tensor is only used to store a temporary result, whose content is then copied back to `sorted_data`. If we pass the flattened data to `TopkSort()` via the `work` tensor and copy the top `k` elements from that flattened data directly into `sorted_data`, there is no need to keep an additional temporary tensor and perform a second copy. For the case `axis!=-1`, both a transpose and a reshape are required to convert a 3-D tensor into a 1-D tensor. That is why the results for `axis=-1` are much better than the other cases.
   
   Hope this answers your questions. I can add some comments in the code if there are no other concerns.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services