You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "jorisvandenbossche (via GitHub)" <gi...@apache.org> on 2023/02/15 00:32:45 UTC

[GitHub] [arrow] jorisvandenbossche commented on issue #34173: [C++][Python] Allow pyarrow.compute.mode to include null count

jorisvandenbossche commented on issue #34173:
URL: https://github.com/apache/arrow/issues/34173#issuecomment-1430583319

   I have to say that the current `skip_null` keyword is a bit strange, and it also doesn't do what is documented:
   
   ```
   >>> pc.mode([1, 2, 2, None, None], 2, skip_nulls=True) 
   <pyarrow.lib.StructArray object at 0x7f64ea3ab940>
   -- is_valid: all not null
   -- child 0 type: int64
     [
       2,
       1
     ]
   -- child 1 type: int64
     [
       2,
       1
     ]
   
   >>> pc.mode([1, 2, 2, None, None], 2, skip_nulls=False)
   <pyarrow.lib.StructArray object at 0x7f64ea4c3e20>
   -- is_valid: all not null
   -- child 0 type: int64
     []
   -- child 1 type: int64
     []
   ```
   
   So if there are nulls and you specify to not skip them, the result becomes _empty_, while the docs say that if any input value is null, the output becomes null for `skip_nulls=False` (but null and empty is not the same).
   
   I am not fully sure what "mode" should do with nulls. On the one hand, the mode kernel is related to "value_counts", and this kernel will also count nulls. On the other hand, it's more seen as a (numerical) aggregation kernel, and those all skip (by default) or just propagate nulls (I don't think there is another aggregation kernel that really "does" something with the nulls).
   
   cc @cyb70289 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org