Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2022/01/12 11:24:00 UTC

[jira] [Updated] (ARROW-15310) [C++][Python][Dataset] Detect (and warn?) when DirectoryPartitioning is parsing an actually hive-style file path?

     [ https://issues.apache.org/jira/browse/ARROW-15310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-15310:
------------------------------------------
    Description: 
With our current {{dataset(..)}} API, when you have a hive-style partitioned dataset it's relatively easy to mess up the partitioning specification and get confusing results.

For example, if you specify the partitioning field names with {{partitioning=[...]}} (which is not needed for hive style, since the field names are inferred from the paths), we actually assume you want directory partitioning. This DirectoryPartitioning will then parse the hive-style file paths and take the full "key=value" string as the data value for the field.
Doing a filter afterwards can then produce a confusing empty result (because "value" doesn't match "key=value").

I am wondering if we can't relatively cheaply detect this case and, e.g., give an informative warning to the user.

Basically what happens is this:

{code:python}
>>> part = ds.DirectoryPartitioning(pa.schema([("part", "string")]))
>>> part.parse("part=a")
<pyarrow.dataset.Expression (part == "part=a")>
{code}
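
For comparison, a {{HivePartitioning}} with the same schema parses out only the value, so the same path should give something like:

{code:python}
>>> hive_part = ds.HivePartitioning(pa.schema([("part", "string")]))
>>> hive_part.parse("part=a")
<pyarrow.dataset.Expression (part == "a")>
{code}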

If the parsed value is a string that contains an "=" (and, in this case, also contains the field name), that is, I think, a clear sign that in the large majority of cases the user is doing something wrong.

I am not fully sure where and at what stage the check could be done, though. Doing it for every path in the dataset might be too costly.
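
To make the idea concrete, here is a minimal Python sketch of such a heuristic; the {{warn_if_hive_like}} helper is hypothetical (not an existing pyarrow API), and checking only the first discovered path would keep the cost negligible:

{code:python}
import warnings

# Hypothetical helper: given a field name and the value that
# DirectoryPartitioning parsed out of a path segment, warn if the value
# still looks like an unparsed hive-style "key=value" pair.
def warn_if_hive_like(field_name, parsed_value):
    if isinstance(parsed_value, str) and parsed_value.startswith(field_name + "="):
        warnings.warn(
            f"DirectoryPartitioning parsed {parsed_value!r} as the value for "
            f"field {field_name!r}; this looks like a hive-style path segment. "
            f"Did you mean partitioning='hive'?"
        )

# Checking a single sample, e.g. the first discovered path segment:
warn_if_hive_like("part", "part=a")  # -> UserWarning
{code}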


----

Illustrative code example:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

import pathlib

# construct a small dataset with one hive-style partitioning level

basedir = pathlib.Path(".") / "dataset_wrong_partitioning"
basedir.mkdir(exist_ok=True)

(basedir / "part=a").mkdir(exist_ok=True)
(basedir / "part=b").mkdir(exist_ok=True)

table1 = pa.table({'a': [1, 2, 3], 'b': [1, 2, 3]})
pq.write_table(table1, basedir / "part=a" / "data.parquet")

table2 = pa.table({'a': [4, 5, 6], 'b': [1, 2, 3]})
pq.write_table(table2, basedir / "part=b" / "data.parquet")
{code}

Reading it as is (not specifying a partitioning, so defaulting to no partitioning) will at least give an error about the missing field:

{code:python}
>>> dataset = ds.dataset(basedir)
>>> dataset.to_table(filter=ds.field("part") == "a")
...
ArrowInvalid: No match for FieldRef.Name(part) in a: int64
{code}

But specifying the partitioning field name (which currently gets silently interpreted as directory partitioning) gives a confusing empty result:

{code:python}
>>> dataset = ds.dataset(basedir, partitioning=["part"])
>>> dataset.to_table(filter=ds.field("part") == "a")
pyarrow.Table
a: int64
b: int64
part: string
----
a: []
b: []
part: []
{code}

This filter doesn't work because the values in the "part" column are not "a" but "part=a":

{code:python}
>>> dataset.to_table().to_pandas()
   a  b    part
0  1  1  part=a
1  2  2  part=a
2  3  3  part=a
3  4  1  part=b
4  5  2  part=b
5  6  3  part=b
{code}
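
For completeness, and assuming the dataset created above: explicitly requesting hive partitioning (with {{partitioning="hive"}}) parses out only the values, so the filter matches the intended rows. The expected output would be something like:

{code:python}
>>> dataset = ds.dataset(basedir, partitioning="hive")
>>> dataset.to_table(filter=ds.field("part") == "a").to_pandas()
   a  b part
0  1  1    a
1  2  2    a
2  3  3    a
{code}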



> [C++][Python][Dataset] Detect (and warn?) when DirectoryPartitioning is parsing an actually hive-style file path?
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-15310
>                 URL: https://issues.apache.org/jira/browse/ARROW-15310
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset



--
This message was sent by Atlassian Jira
(v8.20.1#820001)