Posted to user@arrow.apache.org by Fabrice Lefloch <fl...@payps.fr> on 2021/08/02 13:00:58 UTC

[Python] Upgrading from 3.0.0 to latest version unable to use filter is_valid

Hello,

Previously, with pyarrow 3.0.0, I filtered on null values in a column in read_table this way:
pq.read_table('myparquetFile.parquet', filters=~ds.field("my_field").is_valid())
This worked fine, but after upgrading to pyarrow 4.0.0 I now receive an error:
"ValueError: An Expression cannot be evaluated to python True or False. If you are using the 'and', 'or' or 'not' operators, use '&', '|' or '~' instead."
I tried to use is_null() instead of is_valid() but with no luck either.

Is there some other way to apply this filter?

Thank you.

Re: [Python] Upgrading from 3.0.0 to latest version unable to use filter is_valid

Posted by Fabrice Lefloch <fl...@payps.fr>.
OK, I get it!
Indeed, it is better to do it separately (loading the file into a dataset and then applying the filters).

Thank you for your answer :)


> On 2 August 2021 at 20:37, Weston Pace <we...@gmail.com> wrote:
> 
> Hmm, it seems you managed to find a bit of an (I think) unintended use case :).
> 
> The docs for pyarrow.parquet.read_table describe the "filters" parameter as:
> 
>> Each tuple has format: (key, op, value) and compares the key with the value. The
>> supported op are: = or ==, !=, <, >, <=, >=, in and not in. If the op is in or not in,
>> the value must be a collection such as a list, a set or a tuple.
>> 
>> Examples:
>> 
>> ('x', '=', 0)
>> ('y', 'in', ['a', 'b', 'c'])
>> ('z', 'not in', {'a','b'})
> 
> On the other hand, the filter you describe, "~ds.field('my_field').is_valid()",
> is one of the new pyarrow.dataset expression-based filters.
> 
> pyarrow.parquet.read_table has been slowly migrating over to use the new dataset
> scanning (controlled by use_legacy_dataset). It seems in 3.0.0 we must have taken
> whatever filters argument was given and passed it directly as a filter. In 4.0.0
> we try and take a list of the previously described tuples and convert them to
> dataset filters.
> 
> So the easiest fix is probably to just use the new datasets API directly:
> 
> TL;DR:
> 
>    import pyarrow.dataset as ds
>
>    my_dataset = ds.dataset('myparquetFile.parquet')
>    table = my_dataset.to_table(filter=~ds.field('my_field').is_valid())


Re: [Python] Upgrading from 3.0.0 to latest version unable to use filter is_valid

Posted by Weston Pace <we...@gmail.com>.
Hmm, it seems you managed to find a bit of an (I think) unintended use case :).

The docs for pyarrow.parquet.read_table describe the "filters" parameter as:

> Each tuple has format: (key, op, value) and compares the key with the value. The
> supported op are: = or ==, !=, <, >, <=, >=, in and not in. If the op is in or not in,
> the value must be a collection such as a list, a set or a tuple.
>
> Examples:
>
> ('x', '=', 0)
> ('y', 'in', ['a', 'b', 'c'])
> ('z', 'not in', {'a','b'})
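
To make the tuple format concrete, here is a minimal pure-Python sketch of how a list of (key, op, value) tuples could be checked against a single row. The names OPS and matches are hypothetical and exist only for illustration; this is not how pyarrow implements the conversion.

```python
import operator

# Hypothetical mapping from the documented op strings to Python callables.
OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
    "in": lambda value, collection: value in collection,
    "not in": lambda value, collection: value not in collection,
}


def matches(row, filters):
    """Return True if the row (a dict) satisfies every (key, op, value) tuple."""
    return all(OPS[op](row[key], value) for key, op, value in filters)


# Made-up rows, using the column names from the documentation examples.
rows = [
    {"x": 0, "y": "a", "z": "c"},
    {"x": 1, "y": "a", "z": "c"},
    {"x": 0, "y": "d", "z": "c"},
    {"x": 0, "y": "b", "z": "a"},
]
filters = [("x", "=", 0), ("y", "in", ["a", "b", "c"]), ("z", "not in", {"a", "b"})]
selected = [row for row in rows if matches(row, filters)]
```

Note that none of the ops listed in the documentation expresses a null check, which is what is_valid() provides on the expression side.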

On the other hand, the filter you describe, "~ds.field('my_field').is_valid()",
is one of the new pyarrow.dataset expression-based filters.

pyarrow.parquet.read_table has been slowly migrating over to use the new dataset
scanning (controlled by use_legacy_dataset). It seems in 3.0.0 we must have taken
whatever filters argument was given and passed it directly as a filter. In 4.0.0
we try and take a list of the previously described tuples and convert them to
dataset filters.

So the easiest fix is probably to just use the new datasets API directly:

TL;DR:

    import pyarrow.dataset as ds

    my_dataset = ds.dataset('myparquetFile.parquet')
    table = my_dataset.to_table(filter=~ds.field('my_field').is_valid())
