Posted to dev@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/04/28 10:41:00 UTC

[jira] [Created] (ARROW-8613) [C++][Dataset] Raise error for unparsable partition value

Joris Van den Bossche created ARROW-8613:
--------------------------------------------

             Summary: [C++][Dataset] Raise error for unparsable partition value
                 Key: ARROW-8613
                 URL: https://issues.apache.org/jira/browse/ARROW-8613
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Joris Van den Bossche
             Fix For: 1.0.0


Currently, when you specify a partitioning schema and one of the partition field values cannot be parsed according to the specified type, you silently get null values for that partition field.

Python example:
{code:python}
import pathlib
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

path = pathlib.Path(".") / "dataset_partition_schema_errors"
path.mkdir(exist_ok=True)

# The "part" values are strings that cannot be parsed as integers.
table = pa.table({"part": ["1_2", "1_2", "3_4", "3_4"], "values": range(4)})
pq.write_to_dataset(table, str(path), partition_cols=["part"])
{code}
{code:python}
In [17]: ds.dataset(path, partitioning="hive").to_table().to_pandas() 
Out[17]: 
   values part
0       0  1_2
1       1  1_2
2       2  3_4
3       3  3_4

In [18]: partitioning = ds.partitioning(pa.schema([("part", pa.int64())]), flavor="hive")                                                                                                                          

In [19]: ds.dataset(path, partitioning=partitioning).to_table().to_pandas()   
Out[19]: 
   values  part
0       0   NaN
1       1   NaN
2       2   NaN
3       3   NaN
{code}

Silently ignoring such a parse error doesn't seem like the best default to me, since partition keys are quite essential. I think raising an error might be better?
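
To illustrate the behaviour I have in mind, here is a minimal pure-Python sketch. It is not Arrow's implementation and does not use pyarrow at all: it only walks a hive-style directory tree and raises as soon as a `field=value` directory name cannot be parsed as an integer. The names `strict_int` and `check_hive_partition_values` are hypothetical, made up for this illustration. It also assumes partition values are not URL-encoded. Note that Python's built-in `int()` accepts "1_2" (as 12), so the sketch uses a stricter check.

```python
import pathlib
import re

# Plain ASCII decimal integers only. Python's int() would accept "1_2"
# as 12, which is not what a strict partition parser should do.
_INT_RE = re.compile(r"[+-]?[0-9]+")


def strict_int(text):
    """Parse `text` as an integer, rejecting anything but sign and digits."""
    if _INT_RE.fullmatch(text) is None:
        raise ValueError(f"not a valid integer: {text!r}")
    return int(text)


def check_hive_partition_values(root, field, parse=strict_int):
    """Raise ValueError if a `field=value` directory under `root` holds a
    value that `parse` rejects. Hypothetical helper, for illustration only.
    """
    prefix = field + "="
    for entry in sorted(pathlib.Path(root).rglob("*")):
        if not (entry.is_dir() and entry.name.startswith(prefix)):
            continue
        raw = entry.name[len(prefix):]
        try:
            parse(raw)
        except ValueError as exc:
            # Fail loudly instead of silently producing nulls.
            raise ValueError(
                f"cannot parse partition value {raw!r} "
                f"for field {field!r} in {entry}"
            ) from exc
```

With the dataset written above, `check_hive_partition_values(path, "part")` would raise on the first `part=1_2` or `part=3_4` directory it visits, rather than letting the values come back as null.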



--
This message was sent by Atlassian Jira
(v8.3.4#803005)