You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/26 05:31:51 UTC
[GitHub] [arrow] cyb70289 opened a new pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels
cyb70289 opened a new pull request #8524:
URL: https://github.com/apache/arrow/pull/8524
For sorting kernel, this patch treats NaN as the largest floating point
number and moves them to the end of array, before Nulls.
For partition_nth kernel, this patch ignores NaN (treats them as Nulls).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] cyb70289 commented on pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels
Posted by GitBox <gi...@apache.org>.
cyb70289 commented on pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#issuecomment-724745103
Looks github does something wrong with "Allow edits and access to secrets by maintainers". It was always checked by default. But now it's unchecked. Checking it, refreshing the page, and it's unchecked automatically.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] cyb70289 commented on pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels
Posted by GitBox <gi...@apache.org>.
cyb70289 commented on pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#issuecomment-724403787
> Is "sort_indices" actually stable? The [documentation](https://arrow.apache.org/docs/cpp/compute.html#sorts-and-partitions) says "non-stable".
In current code, "sort_indices" is stable, it uses "std::stable_sort". "partition_nth_indices" is not stable, it uses "std::nth_element".
[numpy.argsort](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html) has a parameter to select sorting algorithm, maybe arrow can also implement unstable quicksort (std::sort) and provide a kernel option to select.
```
# comparison of numpy stable and unstable sort
# quicksort gives better performance
In [1]: import numpy as np
In [2]: a = np.random.randint(0, 100, 12345678)
In [3]: time i1 = np.argsort(a, kind='quicksort')
CPU times: user 909 ms, sys: 4.67 ms, total: 914 ms
Wall time: 913 ms
In [4]: time i2 = np.argsort(a, kind='stable')
CPU times: user 998 ms, sys: 20.8 ms, total: 1.02 s
Wall time: 1.02 s
# stable and quick sort gives different result
In [5]: (i1 == i2).all()
Out[5]: False
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels
Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#discussion_r519903359
##########
File path: cpp/src/arrow/compute/kernels/vector_sort_test.cc
##########
@@ -121,6 +132,10 @@ TYPED_TEST(TestNthToIndicesForReal, Real) {
this->AssertNthToIndicesJson("[null, 1, 3.3, null, 2, 5.3]", 2);
this->AssertNthToIndicesJson("[null, 1, 3.3, null, 2, 5.3]", 5);
this->AssertNthToIndicesJson("[null, 1, 3.3, null, 2, 5.3]", 6);
+
+ this->AssertNthToIndicesJson("[NaN, 2, NaN, 3, 1]", 0);
+ this->AssertNthToIndicesJson("[NaN, 2, NaN, 3, 1]", 1);
+ this->AssertNthToIndicesJson("[NaN, 2, NaN, 3, 1]", 2);
Review comment:
How about a mix of nulls and nans? I would expect the result to be deterministic (i.e. nulls logically come after nans).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] cyb70289 commented on a change in pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels
Posted by GitBox <gi...@apache.org>.
cyb70289 commented on a change in pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#discussion_r520353233
##########
File path: cpp/src/arrow/compute/kernels/vector_sort_test.cc
##########
@@ -243,25 +258,23 @@ TYPED_TEST(TestSortToIndicesKernelForReal, SortReal) {
this->AssertSortToIndices("[]", "[]");
this->AssertSortToIndices("[3.4, 2.6, 6.3]", "[1, 0, 2]");
-
this->AssertSortToIndices("[1.1, 2.4, 3.5, 4.3, 5.1, 6.8, 7.3]", "[0,1,2,3,4,5,6]");
-
this->AssertSortToIndices("[7, 6, 5, 4, 3, 2, 1]", "[6,5,4,3,2,1,0]");
-
this->AssertSortToIndices("[10.4, 12, 4.2, 50, 50.3, 32, 11]", "[2,0,6,1,5,3,4]");
this->AssertSortToIndices("[null, 1, 3.3, null, 2, 5.3]", "[1,4,2,5,0,3]");
+
+ this->AssertSortToIndices("[3, 4, NaN, 1, 2, null]", "[3,4,0,1,2,5]");
+ this->AssertSortToIndices("[NaN, 2, NaN, 3, 1]", "[4,1,3,0,2]");
+ this->AssertSortToIndices("[null, NaN, NaN, null]", "[1,2,0,3]");
Review comment:
Sort is stable in current implementation.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] cyb70289 commented on pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels
Posted by GitBox <gi...@apache.org>.
cyb70289 commented on pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#issuecomment-716342762
CI failure is about LLVM version when building gandiva
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels
Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#issuecomment-724773476
Ah, weird. That may be the explanation...
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels
Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#issuecomment-724738890
New PR is #8623.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels
Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#issuecomment-724090735
Is "sort_indices" actually stable? The [documentation](https://arrow.apache.org/docs/cpp/compute.html#sorts-and-partitions) says "non-stable".
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou closed pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels
Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #8524:
URL: https://github.com/apache/arrow/pull/8524
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#issuecomment-716315945
https://issues.apache.org/jira/browse/ARROW-10345
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels
Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#discussion_r519904982
##########
File path: cpp/src/arrow/compute/kernels/vector_sort_test.cc
##########
@@ -243,25 +258,23 @@ TYPED_TEST(TestSortToIndicesKernelForReal, SortReal) {
this->AssertSortToIndices("[]", "[]");
this->AssertSortToIndices("[3.4, 2.6, 6.3]", "[1, 0, 2]");
-
this->AssertSortToIndices("[1.1, 2.4, 3.5, 4.3, 5.1, 6.8, 7.3]", "[0,1,2,3,4,5,6]");
-
this->AssertSortToIndices("[7, 6, 5, 4, 3, 2, 1]", "[6,5,4,3,2,1,0]");
-
this->AssertSortToIndices("[10.4, 12, 4.2, 50, 50.3, 32, 11]", "[2,0,6,1,5,3,4]");
this->AssertSortToIndices("[null, 1, 3.3, null, 2, 5.3]", "[1,4,2,5,0,3]");
+
+ this->AssertSortToIndices("[3, 4, NaN, 1, 2, null]", "[3,4,0,1,2,5]");
+ this->AssertSortToIndices("[NaN, 2, NaN, 3, 1]", "[4,1,3,0,2]");
+ this->AssertSortToIndices("[null, NaN, NaN, null]", "[1,2,0,3]");
Review comment:
Since the sort is non-stable, the result could also be `[2,1,0,3]`, etc., no?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] cyb70289 removed a comment on pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels
Posted by GitBox <gi...@apache.org>.
cyb70289 removed a comment on pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#issuecomment-716342762
CI failure is about LLVM version when building gandiva
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] cyb70289 commented on a change in pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels
Posted by GitBox <gi...@apache.org>.
cyb70289 commented on a change in pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#discussion_r520352989
##########
File path: cpp/src/arrow/compute/kernels/vector_sort_test.cc
##########
@@ -121,6 +132,10 @@ TYPED_TEST(TestNthToIndicesForReal, Real) {
this->AssertNthToIndicesJson("[null, 1, 3.3, null, 2, 5.3]", 2);
this->AssertNthToIndicesJson("[null, 1, 3.3, null, 2, 5.3]", 5);
this->AssertNthToIndicesJson("[null, 1, 3.3, null, 2, 5.3]", 6);
+
+ this->AssertNthToIndicesJson("[NaN, 2, NaN, 3, 1]", 0);
+ this->AssertNthToIndicesJson("[NaN, 2, NaN, 3, 1]", 1);
+ this->AssertNthToIndicesJson("[NaN, 2, NaN, 3, 1]", 2);
Review comment:
Done
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels
Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#issuecomment-724091219
Also, you'll need to update the docs :-)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels
Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#issuecomment-724735164
I closed because Github didn't pick up the changes I pushed. I'll open a new PR (sorry).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org