Posted to github@arrow.apache.org by "jorisvandenbossche (via GitHub)" <gi...@apache.org> on 2023/02/16 15:21:36 UTC

[GitHub] [arrow] jorisvandenbossche commented on issue #34162: [Python] `is_null(nan_is_null=True)` does not work with only NaN's

jorisvandenbossche commented on issue #34162:
URL: https://github.com/apache/arrow/issues/34162#issuecomment-1433256003

   @Fokko I am trying to reproduce this with just pyarrow, but so far without success.
   
   First, just checking whether plain filtering on in-memory data works (which it does):
   
   ```
   >>> import numpy as np
   >>> import pyarrow as pa
   >>> import pyarrow.compute as pc
   >>> table = pa.table({"idx": [1, 2, 3], "col_numeric": [np.nan, None, 1]})
   >>> table
   pyarrow.Table
   idx: int64
   col_numeric: double
   ----
   idx: [[1,2,3]]
   col_numeric: [[nan,null,1]]
   
   >>> table.filter(pc.field('col_numeric').is_null(nan_is_null=True) & ~pc.field('col_numeric').is_null())
   pyarrow.Table
   idx: int64
   col_numeric: double
   ----
   idx: [[1]]
   col_numeric: [[nan]]
   ```
   
   Then, if I write this table to Parquet and make sure each row ends up in its own row group (still in a single file), reading it with a filter should use the row group statistics for pruning as a first step. This also seems to work:
   
   ```
   >>> import pyarrow.parquet as pq
   >>> pq.write_table(table, "test_filter_nan.parquet", row_group_size=1)
   >>> meta = pq.read_metadata("test_filter_nan.parquet")
   >>> meta.num_row_groups
   3
   >>> pq.read_table("test_filter_nan.parquet", filters=pc.field('col_numeric').is_null(nan_is_null=True) & ~pc.field('col_numeric').is_null())
   pyarrow.Table
   idx: int64
   col_numeric: double
   ----
   idx: [[1]]
   col_numeric: [[nan]]
   ```
   
   Now, maybe this depends on how the Parquet file was written. When written with pyarrow as above, the row groups holding the NaN and null values don't have statistics set (and so they can never be skipped by predicate pushdown row group filtering).  
   @Fokko Your files were created with Spark, I assume? Would it be possible to share those 3 small parquet files from your example above?
   
   
   

