Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/08/13 12:45:06 UTC
[jira] [Created] (ARROW-9723) [C++] Expected behaviour of "mode" kernel with NaNs ?
Joris Van den Bossche created ARROW-9723:
--------------------------------------------
Summary: [C++] Expected behaviour of "mode" kernel with NaNs ?
Key: ARROW-9723
URL: https://issues.apache.org/jira/browse/ARROW-9723
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Joris Van den Bossche
ARROW-9638 added a "mode" kernel to arrow::compute. There was some remaining discussion on how NaNs should be handled.
The merged PR added the behaviour of "skipping" NaNs (similar to how it skips nulls). For example:
{code}
[NaN, NaN, 1] -> mode:1, count:1
[null, null, 1] -> mode:1, count:1
[null, null, null] -> null
[NaN, NaN, NaN] -> null
{code}
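To make the table above concrete, here is a minimal pure-Python sketch of the "skip nulls and NaNs" semantics. It is not the Arrow C++ kernel: the helper name {{mode_skip_nan}} is hypothetical, nulls are modelled as {{None}}, and the tie-breaking rule (smallest value wins) is an assumption made only to keep the sketch deterministic.

```python
import math
from collections import Counter


def mode_skip_nan(values):
    """Hypothetical helper: mode that ignores None (null) and NaN.

    Returns (mode, count), or None if nothing is left after skipping.
    """
    # Drop nulls (None) and NaNs before counting.
    kept = [v for v in values if v is not None and not math.isnan(v)]
    if not kept:
        return None
    counts = Counter(kept)
    best = max(counts.values())
    # Assumption for this sketch: break ties by taking the smallest value.
    value = min(v for v, c in counts.items() if c == best)
    return value, best


nan = float("nan")
print(mode_skip_nan([nan, nan, 1]))      # NaNs skipped
print(mode_skip_nan([None, None, 1]))    # nulls skipped
print(mode_skip_nan([None, None, None])) # nothing left
print(mode_skip_nan([nan, nan, nan]))    # nothing left
```

Under these assumptions the four calls reproduce the four rows of the table: the first two give a mode of 1 with count 1, and the last two give no result.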
However, {{scipy.stats}}, for example, does not skip NaNs and would return {{mode:NaN, count:1}} for the last line above (NaNs do not compare equal to each other, so each NaN is counted separately, giving a count of 1).
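The reasoning behind a count of 1 can be illustrated with a small sketch that counts occurrences purely by {{==}} comparison. This is not how {{scipy.stats}} is implemented; the helper name {{mode_by_equality}} is hypothetical and only demonstrates why NaNs that never compare equal each end up in their own group.

```python
import math


def mode_by_equality(values):
    """Hypothetical helper: mode where two entries match only if a == b."""
    best_value, best_count = None, 0
    for i, candidate in enumerate(values):
        # Each entry counts itself once, plus every *other* entry equal to it.
        count = 1 + sum(
            1 for j, v in enumerate(values) if j != i and v == candidate
        )
        if count > best_count:
            best_value, best_count = candidate, count
    return best_value, best_count


nan = float("nan")
value, count = mode_by_equality([nan, nan, nan])
# NaN != NaN, so no NaN ever matches another one: every group has size 1.
print(math.isnan(value), count)
print(mode_by_equality([1.0, 1.0, 2.0]))
```

With ordinary values the helper behaves like a normal mode; with an all-NaN input it reports NaN with a count of 1, which is the result described above.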
Also, in other aggregations like {{sum}} we skip nulls but not NaNs (so {{sum([NaN, NaN, 1])}} would be NaN).
On the other hand, as [~apitrou] argued in the PR, for {{sum}} it's more straightforward and informative to propagate the NaN to the result (at least it indicates there are NaNs in the data), while for {{mode}} the count of 1 can also be surprising/misleading.