You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/26 05:31:51 UTC

[GitHub] [arrow] cyb70289 opened a new pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels

cyb70289 opened a new pull request #8524:
URL: https://github.com/apache/arrow/pull/8524


   For sorting kernel, this patch treats NaN as the largest floating point
   number and moves them to the end of array, before Nulls.
   For partition_nth kernel, this patch ignores NaN (treats them as Nulls).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#issuecomment-724745103


   Looks github does something wrong with "Allow edits and access to secrets by maintainers". It was always checked by default. But now it's unchecked. Checking it, refreshing the page, and it's unchecked automatically.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#issuecomment-724403787


   > Is "sort_indices" actually stable? The [documentation](https://arrow.apache.org/docs/cpp/compute.html#sorts-and-partitions) says "non-stable".
   
   In current code, "sort_indices" is stable, it uses "std::stable_sort". "partition_nth_indices" is not stable, it uses "std::nth_element".
   [numpy.argsort](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html) has a parameter to select sorting algorithm, maybe arrow can also implement unstable quicksort (std::sort) and provide a kernel option to select.
   
   ```
   # comparison of numpy stable and unstable sort
   # quicksort gives better performance
   
   In [1]: import numpy as np
   
   In [2]: a = np.random.randint(0, 100, 12345678)
   
   In [3]: time i1 = np.argsort(a, kind='quicksort')
   CPU times: user 909 ms, sys: 4.67 ms, total: 914 ms
   Wall time: 913 ms
   
   In [4]: time i2 = np.argsort(a, kind='stable')
   CPU times: user 998 ms, sys: 20.8 ms, total: 1.02 s
   Wall time: 1.02 s
   
   # stable and quick sort gives different result
   In [5]: (i1 == i2).all()
   Out[5]: False
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#discussion_r519903359



##########
File path: cpp/src/arrow/compute/kernels/vector_sort_test.cc
##########
@@ -121,6 +132,10 @@ TYPED_TEST(TestNthToIndicesForReal, Real) {
   this->AssertNthToIndicesJson("[null, 1, 3.3, null, 2, 5.3]", 2);
   this->AssertNthToIndicesJson("[null, 1, 3.3, null, 2, 5.3]", 5);
   this->AssertNthToIndicesJson("[null, 1, 3.3, null, 2, 5.3]", 6);
+
+  this->AssertNthToIndicesJson("[NaN, 2, NaN, 3, 1]", 0);
+  this->AssertNthToIndicesJson("[NaN, 2, NaN, 3, 1]", 1);
+  this->AssertNthToIndicesJson("[NaN, 2, NaN, 3, 1]", 2);

Review comment:
       How about a mix of nulls and nans? I would expect the result to be deterministic (i.e. nulls logically come after nans).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on a change in pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on a change in pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#discussion_r520353233



##########
File path: cpp/src/arrow/compute/kernels/vector_sort_test.cc
##########
@@ -243,25 +258,23 @@ TYPED_TEST(TestSortToIndicesKernelForReal, SortReal) {
   this->AssertSortToIndices("[]", "[]");
 
   this->AssertSortToIndices("[3.4, 2.6, 6.3]", "[1, 0, 2]");
-
   this->AssertSortToIndices("[1.1, 2.4, 3.5, 4.3, 5.1, 6.8, 7.3]", "[0,1,2,3,4,5,6]");
-
   this->AssertSortToIndices("[7, 6, 5, 4, 3, 2, 1]", "[6,5,4,3,2,1,0]");
-
   this->AssertSortToIndices("[10.4, 12, 4.2, 50, 50.3, 32, 11]", "[2,0,6,1,5,3,4]");
 
   this->AssertSortToIndices("[null, 1, 3.3, null, 2, 5.3]", "[1,4,2,5,0,3]");
+
+  this->AssertSortToIndices("[3, 4, NaN, 1, 2, null]", "[3,4,0,1,2,5]");
+  this->AssertSortToIndices("[NaN, 2, NaN, 3, 1]", "[4,1,3,0,2]");
+  this->AssertSortToIndices("[null, NaN, NaN, null]", "[1,2,0,3]");

Review comment:
       Sort is stable in current implementation.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#issuecomment-716342762


   CI failure is about LLVM version when building gandiva


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#issuecomment-724773476


   Ah, weird. That may be the explanation...


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#issuecomment-724738890


   New PR is #8623.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#issuecomment-724090735


   Is "sort_indices" actually stable? The [documentation](https://arrow.apache.org/docs/cpp/compute.html#sorts-and-partitions) says "non-stable".


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #8524:
URL: https://github.com/apache/arrow/pull/8524


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#issuecomment-716315945


   https://issues.apache.org/jira/browse/ARROW-10345


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#discussion_r519904982



##########
File path: cpp/src/arrow/compute/kernels/vector_sort_test.cc
##########
@@ -243,25 +258,23 @@ TYPED_TEST(TestSortToIndicesKernelForReal, SortReal) {
   this->AssertSortToIndices("[]", "[]");
 
   this->AssertSortToIndices("[3.4, 2.6, 6.3]", "[1, 0, 2]");
-
   this->AssertSortToIndices("[1.1, 2.4, 3.5, 4.3, 5.1, 6.8, 7.3]", "[0,1,2,3,4,5,6]");
-
   this->AssertSortToIndices("[7, 6, 5, 4, 3, 2, 1]", "[6,5,4,3,2,1,0]");
-
   this->AssertSortToIndices("[10.4, 12, 4.2, 50, 50.3, 32, 11]", "[2,0,6,1,5,3,4]");
 
   this->AssertSortToIndices("[null, 1, 3.3, null, 2, 5.3]", "[1,4,2,5,0,3]");
+
+  this->AssertSortToIndices("[3, 4, NaN, 1, 2, null]", "[3,4,0,1,2,5]");
+  this->AssertSortToIndices("[NaN, 2, NaN, 3, 1]", "[4,1,3,0,2]");
+  this->AssertSortToIndices("[null, NaN, NaN, null]", "[1,2,0,3]");

Review comment:
       Since the sort is non-stable, the result could also be `[2,1,0,3]`, etc., no?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 removed a comment on pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels

Posted by GitBox <gi...@apache.org>.
cyb70289 removed a comment on pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#issuecomment-716342762


   CI failure is about LLVM version when building gandiva


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on a change in pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on a change in pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#discussion_r520352989



##########
File path: cpp/src/arrow/compute/kernels/vector_sort_test.cc
##########
@@ -121,6 +132,10 @@ TYPED_TEST(TestNthToIndicesForReal, Real) {
   this->AssertNthToIndicesJson("[null, 1, 3.3, null, 2, 5.3]", 2);
   this->AssertNthToIndicesJson("[null, 1, 3.3, null, 2, 5.3]", 5);
   this->AssertNthToIndicesJson("[null, 1, 3.3, null, 2, 5.3]", 6);
+
+  this->AssertNthToIndicesJson("[NaN, 2, NaN, 3, 1]", 0);
+  this->AssertNthToIndicesJson("[NaN, 2, NaN, 3, 1]", 1);
+  this->AssertNthToIndicesJson("[NaN, 2, NaN, 3, 1]", 2);

Review comment:
       Done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#issuecomment-724091219


   Also, you'll need to update the docs :-)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#issuecomment-724735164


   I closed because Github didn't pick up the changes I pushed. I'll open a new PR (sorry).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org