
Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/08/13 12:45:06 UTC

[jira] [Created] (ARROW-9723) [C++] Expected behaviour of "mode" kernel with NaNs ?

Joris Van den Bossche created ARROW-9723:
--------------------------------------------

             Summary: [C++] Expected behaviour of "mode" kernel with NaNs ?
                 Key: ARROW-9723
                 URL: https://issues.apache.org/jira/browse/ARROW-9723
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Joris Van den Bossche


ARROW-9638 added a "mode" kernel to arrow::compute. There was some remaining discussion on how NaNs should be handled.

The merged PR made the kernel skip NaNs, in the same way it skips nulls. For example:

{code}
[NaN, NaN, 1] -> mode:1, count:1
[null, null, 1] -> mode:1, count:1
[null, null, null] -> null
[NaN, NaN, NaN] -> null
{code}
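
To make the table above concrete, here is a minimal pure-Python sketch of the "skip nulls and NaNs" semantics. It is not the Arrow C++ implementation: the helper name {{mode_skipping_nan}} is hypothetical, {{None}} stands in for null, and the tie-breaking rule (smallest value wins) is an assumption of the sketch.

```python
import math
from collections import Counter


def mode_skipping_nan(values):
    """Return (mode, count), ignoring None and NaN; return None if nothing is left.

    Hypothetical helper for illustration only, not Arrow's kernel.
    """
    kept = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    if not kept:
        # All inputs were null or NaN, so there is no mode to report.
        return None
    counts = Counter(kept)
    # Highest count wins; ties go to the smallest value (an assumption here).
    value, count = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return value, count


nan = float("nan")
print(mode_skipping_nan([nan, nan, 1]))      # (1, 1)
print(mode_skipping_nan([None, None, 1]))    # (1, 1)
print(mode_skipping_nan([None, None, None])) # None
print(mode_skipping_nan([nan, nan, nan]))    # None
```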

However, {{scipy.stats}}, for example, does not skip NaNs: for the last line above it would return {{mode:NaN, count:1}} (NaNs are not equal to each other, so each NaN is counted separately, giving a count of 1).
Also, in other aggregations like {{sum}} we skip nulls but not NaNs (so {{sum([NaN, NaN, 1])}} would be NaN).
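
The two non-skipping behaviours described here can be illustrated with a small pure-Python sketch. It is not scipy's or Arrow's code: {{count_by_equality}} is a hypothetical helper that groups values using {{==}} only, which is enough to show why each NaN ends up with a count of 1 and why a sum over data containing NaN is itself NaN.

```python
import math

nan = float("nan")


def count_by_equality(values):
    """Group values using == only, so every NaN forms its own group.

    Hypothetical helper for illustration; returns a list of [value, count].
    """
    groups = []
    for v in values:
        for group in groups:
            if group[0] == v:  # NaN == NaN is False, so NaNs never merge
                group[1] += 1
                break
        else:
            groups.append([v, 1])
    return groups


# Three NaNs give three separate groups, each with a count of 1.
groups = count_by_equality([nan, nan, nan])
print(len(groups), [count for _, count in groups])

# A sum that skips nothing propagates NaN into the result.
total = sum([nan, nan, 1.0])
print(math.isnan(total))
```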

On the other hand, as [~apitrou] argued in the PR, for {{sum}} it is more straightforward and informative to propagate the NaN to the result (at least it indicates that there are NaNs in the data), while for {{mode}} a result of NaN with a count of 1 can be surprising or misleading.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)