Posted to user@arrow.apache.org by Abraham Elmahrek <ab...@apache.org> on 2019/05/21 17:04:59 UTC

ParquetDataset Filters Question

Folks

Does anyone know how to do the following with filters for ParquetDataset
(DNF): A ⋀ B ⋀ (C ⋁ D)?

I've tried the following without luck:

dataset = pq.ParquetDataset("<>", filesystem=s3fs.S3FileSystem(), filters=[
    ("col", ">=", "<>"),
    ("col", "<=", "<>"),
    [[("col", "=", "<>")], [("col", "=", "<>")]]
])


Where A = ("col", ">=", "<>"), B = ("col", "<=", "<>"), C = ("col", "=",
"<>"), and D = ("col", "=", "<>").

In the above example, I get the following error:

>   File
> "/opt/miniconda/envs/flatiron-cron/lib/python3.6/site-packages/pyarrow-0.13.0-py3.6-linux-x86_64.egg/pyarrow/parquet.py",
> line 961, in __init__
>     filters = _check_filters(filters)
>   File
> "/opt/miniconda/envs/flatiron-cron/lib/python3.6/site-packages/pyarrow-0.13.0-py3.6-linux-x86_64.egg/pyarrow/parquet.py",
> line 93, in _check_filters
>     for col, op, val in conjunction:
> ValueError: not enough values to unpack (expected 3, got 2)


Abe
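
The traceback above comes from _check_filters unpacking each predicate as a
3-tuple; the nested two-element list in the mixed filter structure is what
trips it. A minimal sketch reproducing the same ValueError outside pyarrow
(standalone illustration, not pyarrow's actual code path):

```python
# Mixing the "list of tuples" syntax with a nested "list of lists of
# tuples" puts a 2-element list among the 3-tuples. Unpacking it into
# (col, op, val) then raises the same ValueError as in the traceback.
conjunction = [
    ("col", ">=", "x"),
    ("col", "<=", "y"),
    [[("col", "=", "a")], [("col", "=", "b")]],  # 2 elements, not a 3-tuple
]
try:
    for col, op, val in conjunction:
        pass
except ValueError as e:
    print(e)  # not enough values to unpack (expected 3, got 2)
```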

Re: ParquetDataset Filters Question

Posted by Abraham Elmahrek <ab...@elmahrek.com>.
Thanks guys. That makes sense.

On Thu, May 23, 2019 at 4:06 AM Uwe L. Korn <uw...@xhochy.com> wrote:

> Hello Abe,
>
> I think the problem lies in the fact that you mix the two syntaxes. We
> support either a "list of tuples" or a "list of lists of tuples". Furthermore,
> the correct DNF for your filter would be (A ⋀ B ⋀ C) ⋁ (A ⋀ B ⋀ D), so you
> should use
>
> filters = [[("col", ">=", "<A>"),  ("col", "<=", "<B>"), ("col", "=",
> "<C>")],  [("col", ">=", "<A>"),  ("col", "<=", "<B>"), ("col", "=",
> "<D>")]]
>
> Uwe
>
> On Wed, May 22, 2019, at 9:12 PM, Wes McKinney wrote:
> > hi Abe -- you may have to open a JIRA about documentation improvement
> > and/or bug fix for this. I don't know off-hand. Copying the dev@ list
> >
> > - Wes
> >
> > On Tue, May 21, 2019 at 12:05 PM Abraham Elmahrek <ab...@apache.org>
> wrote:
> > >
> > > Folks
> > >
> > > Does anyone know how to do the following with filters for
> ParquetDataset (DNF): A ⋀ B ⋀ (C ⋁ D)?
> > >
> > > I've tried the following without luck:
> > >
> > >> dataset = pq.ParquetDataset("<>", filesystem=s3fs.S3FileSystem(),
> filters=[
> > >>     ("col", ">=", "<>"),
> > >>     ("col", "<=", "<>"),
> > >>     [[("col", "=", "<>")], [("col", "=", "<>")]]
> > >> ])
> > >
> > >
> > > Where A = ("col", ">=", "<>"), B = ("col", "<=", "<>"), C = ("col",
> "=", "<>"), and D = ("col", "=", "<>").
> > >
> > > In the above example, I get the following error:
> > >>
> > >>   File
> "/opt/miniconda/envs/flatiron-cron/lib/python3.6/site-packages/pyarrow-0.13.0-py3.6-linux-x86_64.egg/pyarrow/parquet.py",
> line 961, in __init__
> > >>     filters = _check_filters(filters)
> > >>   File
> "/opt/miniconda/envs/flatiron-cron/lib/python3.6/site-packages/pyarrow-0.13.0-py3.6-linux-x86_64.egg/pyarrow/parquet.py",
> line 93, in _check_filters
> > >>     for col, op, val in conjunction:
> > >> ValueError: not enough values to unpack (expected 3, got 2)
> > >
> > >
> > > Abe
> >
>

Re: ParquetDataset Filters Question

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello Abe,

I think the problem lies in the fact that you mix the two syntaxes. We support either a "list of tuples" or a "list of lists of tuples". Furthermore, the correct DNF for your filter would be (A ⋀ B ⋀ C) ⋁ (A ⋀ B ⋀ D), so you should use

filters = [
    [("col", ">=", "<A>"), ("col", "<=", "<B>"), ("col", "=", "<C>")],
    [("col", ">=", "<A>"), ("col", "<=", "<B>"), ("col", "=", "<D>")],
]
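
The distribution from A ⋀ B ⋀ (C ⋁ D) to this "list of lists of tuples" form
can be sketched in plain Python. The helper name `to_dnf` is hypothetical,
not part of the pyarrow API:

```python
# Expand A AND B AND (C OR D) into pyarrow's DNF filter form:
# (A AND B AND C) OR (A AND B AND D), i.e. one conjunction
# (list of 3-tuples) per alternative. Hypothetical helper.

def to_dnf(common, alternatives):
    """Prefix each alternative predicate with the shared predicates,
    yielding one conjunction per alternative."""
    return [common + [alt] for alt in alternatives]

A = ("col", ">=", "<A>")
B = ("col", "<=", "<B>")
C = ("col", "=", "<C>")
D = ("col", "=", "<D>")

filters = to_dnf([A, B], [C, D])
# filters == [[A, B, C], [A, B, D]], suitable for
# pq.ParquetDataset(..., filters=filters)
```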

Uwe
 

On Wed, May 22, 2019, at 9:12 PM, Wes McKinney wrote:
> hi Abe -- you may have to open a JIRA about documentation improvement
> and/or bug fix for this. I don't know off-hand. Copying the dev@ list
> 
> - Wes
> 
> On Tue, May 21, 2019 at 12:05 PM Abraham Elmahrek <ab...@apache.org> wrote:
> >
> > Folks
> >
> > Does anyone know how to do the following with filters for ParquetDataset (DNF): A ⋀ B ⋀ (C ⋁ D)?
> >
> > I've tried the following without luck:
> >
> >> dataset = pq.ParquetDataset("<>", filesystem=s3fs.S3FileSystem(), filters=[
> >>     ("col", ">=", "<>"),
> >>     ("col", "<=", "<>"),
> >>     [[("col", "=", "<>")], [("col", "=", "<>")]]
> >> ])
> >
> >
> > Where A = ("col", ">=", "<>"), B = ("col", "<=", "<>"), C = ("col", "=", "<>"), and D = ("col", "=", "<>").
> >
> > In the above example, I get the following error:
> >>
> >>   File "/opt/miniconda/envs/flatiron-cron/lib/python3.6/site-packages/pyarrow-0.13.0-py3.6-linux-x86_64.egg/pyarrow/parquet.py", line 961, in __init__
> >>     filters = _check_filters(filters)
> >>   File "/opt/miniconda/envs/flatiron-cron/lib/python3.6/site-packages/pyarrow-0.13.0-py3.6-linux-x86_64.egg/pyarrow/parquet.py", line 93, in _check_filters
> >>     for col, op, val in conjunction:
> >> ValueError: not enough values to unpack (expected 3, got 2)
> >
> >
> > Abe
>

Re: ParquetDataset Filters Question

Posted by Wes McKinney <we...@gmail.com>.
hi Abe -- you may have to open a JIRA about documentation improvement
and/or bug fix for this. I don't know off-hand. Copying the dev@ list

- Wes

On Tue, May 21, 2019 at 12:05 PM Abraham Elmahrek <ab...@apache.org> wrote:
>
> Folks
>
> Does anyone know how to do the following with filters for ParquetDataset (DNF): A ⋀ B ⋀ (C ⋁ D)?
>
> I've tried the following without luck:
>
>> dataset = pq.ParquetDataset("<>", filesystem=s3fs.S3FileSystem(), filters=[
>>     ("col", ">=", "<>"),
>>     ("col", "<=", "<>"),
>>     [[("col", "=", "<>")], [("col", "=", "<>")]]
>> ])
>
>
> Where A = ("col", ">=", "<>"), B = ("col", "<=", "<>"), C = ("col", "=", "<>"), and D = ("col", "=", "<>").
>
> In the above example, I get the following error:
>>
>>   File "/opt/miniconda/envs/flatiron-cron/lib/python3.6/site-packages/pyarrow-0.13.0-py3.6-linux-x86_64.egg/pyarrow/parquet.py", line 961, in __init__
>>     filters = _check_filters(filters)
>>   File "/opt/miniconda/envs/flatiron-cron/lib/python3.6/site-packages/pyarrow-0.13.0-py3.6-linux-x86_64.egg/pyarrow/parquet.py", line 93, in _check_filters
>>     for col, op, val in conjunction:
>> ValueError: not enough values to unpack (expected 3, got 2)
>
>
> Abe
