You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/09/25 12:34:00 UTC
[jira] [Updated] (ARROW-10091) [C++][Dataset] Support isin filter for row group (statistics-based) filtering

     [ https://issues.apache.org/jira/browse/ARROW-10091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-10091:
------------------------------------------
    Description: 
Currently the {{isin}} filter works for partition-based filtering, but not for row group (statistics)-based filtering. 

Of course, for a partition-based expression like {{name == 'a'}} is it easier to check this, as for statistics (min/max) expressions like {{(name > 'a') & (name < 'a')}} we would need to check for the special case of min and max being equal (I think this is the only case we can say something with certainly for an "isin" expression?)

Code example:

{code:python}
>>> table = pa.table(\{"name": np.repeat(["a", "b", "c", "d"], 5), "value": np.arange(20)\})
>>> pq.write_table(table, "test_filter_string.parquet", row_group_size=5)
>>> dataset = ds.dataset("test_filter_string.parquet")# get the single file fragment (dataset consists of one file)>>> fragment = list(dataset.get_fragments())[0]
>>> fragment.ensure_complete_metadata()

# check that we do have statistics for our row groups
>>> fragment.row_groups[0].statistics
{'name': {'min': 'a', 'max': 'a'}, 'value': {'min': 0, 'max': 4}}

# I created the file such that there are 4 row groups (each with a unique value in the name column)
>>> fragment.split_by_row_group()
[<pyarrow._dataset.ParquetFileFragment at 0x7ff783939810>,
 <pyarrow._dataset.ParquetFileFragment at 0x7ff783728cd8>,
 <pyarrow._dataset.ParquetFileFragment at 0x7ff78376c9c0>,
 <pyarrow._dataset.ParquetFileFragment at 0x7ff7835efd68>]

# simple equality filter works as expected -> only single row group left
>>> filter = ds.field("name") == "a"
>>> fragment.split_by_row_group(filter)
[<pyarrow._dataset.ParquetFileFragment at 0x7ff783662738>]

# isin filter does not work
>>> filter = ds.field("name").isin(["a", "b"])
>>> fragment.split_by_row_group(filter)
[<pyarrow._dataset.ParquetFileFragment at 0x7ff7837f46f0>,
 <pyarrow._dataset.ParquetFileFragment at 0x7ff783627a98>,
 <pyarrow._dataset.ParquetFileFragment at 0x7ff783581b70>,
 <pyarrow._dataset.ParquetFileFragment at 0x7ff7835fb780>]
{python}

  was:
Currently the {{isin}} filter works for partition-based filtering, but not for row group (statistics)-based filtering. 

Of course, for a partition-based expression like {{name == 'a'}} is it easier to check this, as for statistics (min/max) expressions like {{(name > 'a') & (name < 'a')}} we would need to check for the special case of min and max being equal (I think this is the only case we can say something with certainly for an "isin" expression?)

Code example:

{code:python}
>>> table = pa.table({"name": np.repeat(["a", "b", "c", "d"], 5), "value": np.arange(20)})
>>> pq.write_table(table, "test_filter_string.parquet", row_group_size=5)
>>> dataset = ds.dataset("test_filter_string.parquet")# get the single file fragment (dataset consists of one file)>>> fragment = list(dataset.get_fragments())[0]
>>> fragment.ensure_complete_metadata()

# check that we do have statistics for our row groups
>>> fragment.row_groups[0].statistics
{'name': {'min': 'a', 'max': 'a'}, 'value': {'min': 0, 'max': 4}}

# I created the file such that there are 4 row groups (each with a unique value in the name column)
>>> fragment.split_by_row_group()
[<pyarrow._dataset.ParquetFileFragment at 0x7ff783939810>,
 <pyarrow._dataset.ParquetFileFragment at 0x7ff783728cd8>,
 <pyarrow._dataset.ParquetFileFragment at 0x7ff78376c9c0>,
 <pyarrow._dataset.ParquetFileFragment at 0x7ff7835efd68>]

# simple equality filter works as expected -> only single row group left
>>> filter = ds.field("name") == "a"
>>> fragment.split_by_row_group(filter)
[<pyarrow._dataset.ParquetFileFragment at 0x7ff783662738>]

# isin filter does not work
>>> filter = ds.field("name").isin(["a", "b"])
>>> fragment.split_by_row_group(filter)
[<pyarrow._dataset.ParquetFileFragment at 0x7ff7837f46f0>,
 <pyarrow._dataset.ParquetFileFragment at 0x7ff783627a98>,
 <pyarrow._dataset.ParquetFileFragment at 0x7ff783581b70>,
 <pyarrow._dataset.ParquetFileFragment at 0x7ff7835fb780>]
{python}


> [C++][Dataset] Support isin filter for row group (statistics-based) filtering
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-10091
>                 URL: https://issues.apache.org/jira/browse/ARROW-10091
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset
>
> Currently the {{isin}} filter works for partition-based filtering, but not for row group (statistics)-based filtering. 
> Of course, for a partition-based expression like {{name == 'a'}} is it easier to check this, as for statistics (min/max) expressions like {{(name > 'a') & (name < 'a')}} we would need to check for the special case of min and max being equal (I think this is the only case we can say something with certainly for an "isin" expression?)
> Code example:
> {code:python}
> >>> table = pa.table(\{"name": np.repeat(["a", "b", "c", "d"], 5), "value": np.arange(20)\})
> >>> pq.write_table(table, "test_filter_string.parquet", row_group_size=5)
> >>> dataset = ds.dataset("test_filter_string.parquet")# get the single file fragment (dataset consists of one file)>>> fragment = list(dataset.get_fragments())[0]
> >>> fragment.ensure_complete_metadata()
> # check that we do have statistics for our row groups
> >>> fragment.row_groups[0].statistics
> {'name': {'min': 'a', 'max': 'a'}, 'value': {'min': 0, 'max': 4}}
> # I created the file such that there are 4 row groups (each with a unique value in the name column)
> >>> fragment.split_by_row_group()
> [<pyarrow._dataset.ParquetFileFragment at 0x7ff783939810>,
>  <pyarrow._dataset.ParquetFileFragment at 0x7ff783728cd8>,
>  <pyarrow._dataset.ParquetFileFragment at 0x7ff78376c9c0>,
>  <pyarrow._dataset.ParquetFileFragment at 0x7ff7835efd68>]
> # simple equality filter works as expected -> only single row group left
> >>> filter = ds.field("name") == "a"
> >>> fragment.split_by_row_group(filter)
> [<pyarrow._dataset.ParquetFileFragment at 0x7ff783662738>]
> # isin filter does not work
> >>> filter = ds.field("name").isin(["a", "b"])
> >>> fragment.split_by_row_group(filter)
> [<pyarrow._dataset.ParquetFileFragment at 0x7ff7837f46f0>,
>  <pyarrow._dataset.ParquetFileFragment at 0x7ff783627a98>,
>  <pyarrow._dataset.ParquetFileFragment at 0x7ff783581b70>,
>  <pyarrow._dataset.ParquetFileFragment at 0x7ff7835fb780>]
> {python}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)