Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/11 13:10:31 UTC

[GitHub] [arrow] deamontg opened a new issue #9160: How to filter parquet column with None using Python?

deamontg opened a new issue #9160:
URL: https://github.com/apache/arrow/issues/9160


   I'm trying to read a Parquet file using pyarrow's `read_table()`, and I would like to keep only the rows where a column is None. I've tried something like the following:
   
   ```
   import pyarrow as pa
   import pyarrow.parquet as pq
   table = pa.Table.from_arrays([[None, None, 'a', 'b', 'a', 'c']], names=['column'])
   pq.write_table(table, 'data.parquet')
   table = pq.read_table('data.parquet', filters=[[('column', '=', None)]])
   ```
   
   This example does not work, as the read-in table contains no records. How can I properly filter a column for None values when reading a table?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on issue #9160: How to filter parquet column with None using Python?

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on issue #9160:
URL: https://github.com/apache/arrow/issues/9160#issuecomment-758654770


   The problem is that a null is not equal to itself, so you can't filter nulls with an `==` equality check. 
   
   For the new dataset API, we are working on more powerful filter expressions, and you can already achieve this:
   
   ```
   In [21]: import pyarrow.dataset as ds
   
   In [22]: pq.read_table('data.parquet', filters=~ds.field("column").is_valid()).to_pandas()
   Out[22]: 
     column
   0   None
   1   None
   ``` 
   
   We should probably also add an `is_null()` method to make this case a bit more straightforward. 
   
   
   ---
   
   General note: we prefer the user mailing list for such questions, see https://arrow.apache.org/community/





[GitHub] [arrow] jorisvandenbossche closed issue #9160: How to filter parquet column with None using Python?

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche closed issue #9160:
URL: https://github.com/apache/arrow/issues/9160


   





[GitHub] [arrow] deamontg commented on issue #9160: How to filter parquet column with None using Python?

Posted by GitBox <gi...@apache.org>.
deamontg commented on issue #9160:
URL: https://github.com/apache/arrow/issues/9160#issuecomment-760210006


   Doing it this way works! Thanks for your help.

