
Posted to jira@arrow.apache.org by "Ben Kietzman (Jira)" <ji...@apache.org> on 2020/06/16 12:45:00 UTC

[jira] [Assigned] (ARROW-8613) [C++][Dataset] Raise error for unparsable partition value

     [ https://issues.apache.org/jira/browse/ARROW-8613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ben Kietzman reassigned ARROW-8613:
-----------------------------------

    Assignee: Ben Kietzman

> [C++][Dataset] Raise error for unparsable partition value
> ---------------------------------------------------------
>
>                 Key: ARROW-8613
>                 URL: https://issues.apache.org/jira/browse/ARROW-8613
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Assignee: Ben Kietzman
>            Priority: Major
>              Labels: dataset
>             Fix For: 2.0.0
>
>
> Currently, when you specify a partitioning schema and one of the partition field values cannot be parsed according to the specified type, you silently get null values for that partition field.
> Python example:
> {code:python}
> import pathlib
> import pyarrow as pa
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
> # Write a small dataset partitioned on a string column whose values are not valid integers
> path = pathlib.Path(".") / "dataset_partition_schema_errors"
> path.mkdir(exist_ok=True)
> table = pa.table({"part": ["1_2", "1_2", "3_4", "3_4"], "values": range(4)})
> pq.write_to_dataset(table, str(path), partition_cols=["part"])
> {code}
> {code:java}
> In [17]: ds.dataset(path, partitioning="hive").to_table().to_pandas() 
> Out[17]: 
>    values part
> 0       0  1_2
> 1       1  1_2
> 2       2  3_4
> 3       3  3_4
> In [18]: partitioning = ds.partitioning(pa.schema([("part", pa.int64())]), flavor="hive")                                                                                                                          
> In [19]: ds.dataset(path, partitioning=partitioning).to_table().to_pandas()   
> Out[19]: 
>    values  part
> 0       0   NaN
> 1       1   NaN
> 2       2   NaN
> 3       3   NaN
> {code}
> Silently ignoring such a parse error doesn't seem like the best default to me (since partition keys are quite essential). I think raising an error might be better?
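
For illustration only, the behaviour requested above can be sketched in plain Python. This is not Arrow's implementation: `strict_int` and `parse_hive_segments` are hypothetical helpers invented for this sketch, and they assume hive-style `key=value` path segments. The point is that a value such as "1_2" that does not match the declared type produces an error instead of a null.

```python
import re
from typing import Callable, Dict


def strict_int(raw: str) -> int:
    """Parse a base-10 integer, rejecting anything else.

    Python's built-in int() accepts "1_2" (it returns 12), so the
    string is checked against a strict pattern first.
    """
    if re.fullmatch(r"[+-]?[0-9]+", raw) is None:
        raise ValueError(f"not a base-10 integer: {raw!r}")
    return int(raw)


def parse_hive_segments(
    path: str, parsers: Dict[str, Callable[[str], object]]
) -> Dict[str, object]:
    """Extract declared partition fields from a hive-style path.

    Hypothetical helper: segments look like "key=value". A value that
    cannot be parsed raises ValueError rather than becoming null.
    """
    result: Dict[str, object] = {}
    for segment in path.split("/"):
        key, sep, raw = segment.partition("=")
        if not sep or key not in parsers:
            continue  # not a segment for a declared partition field
        try:
            result[key] = parsers[key](raw)
        except ValueError as exc:
            raise ValueError(
                f"could not parse partition value {raw!r} for field {key!r}"
            ) from exc
    return result
```

With this sketch, `parse_hive_segments("part=12/data.parquet", {"part": strict_int})` yields a parsed integer for `part`, while the path `part=1_2/data.parquet` raises a `ValueError` naming the field and the offending value.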



--
This message was sent by Atlassian Jira
(v8.3.4#803005)