Posted to dev@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/04/28 10:41:00 UTC
[jira] [Created] (ARROW-8613) [C++][Dataset] Raise error for unparsable partition value
Joris Van den Bossche created ARROW-8613:
--------------------------------------------
Summary: [C++][Dataset] Raise error for unparsable partition value
Key: ARROW-8613
URL: https://issues.apache.org/jira/browse/ARROW-8613
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Joris Van den Bossche
Fix For: 1.0.0
Currently, when you specify a partitioning schema and one of the partition field values cannot be parsed as the specified type, you silently get null values for that partition field.
Python example:
{code:python}
import pathlib
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
# Directory that will hold the hive-partitioned dataset
path = pathlib.Path(".") / "dataset_partition_schema_errors"
path.mkdir(exist_ok=True)
# The "part" values are strings that are not valid integers
table = pa.table({"part": ["1_2", "1_2", "3_4", "3_4"], "values": range(4)})
pq.write_to_dataset(table, str(path), partition_cols=["part"])
{code}
{code:python}
In [17]: ds.dataset(path, partitioning="hive").to_table().to_pandas()
Out[17]:
   values part
0       0  1_2
1       1  1_2
2       2  3_4
3       3  3_4
In [18]: partitioning = ds.partitioning(pa.schema([("part", pa.int64())]), flavor="hive")
In [19]: ds.dataset(path, partitioning=partitioning).to_table().to_pandas()
Out[19]:
   values  part
0       0   NaN
1       1   NaN
2       2   NaN
3       3   NaN
{code}
Silently ignoring such a parse error doesn't seem the best default to me (since partition keys are quite essential). I think raising an error might be better?
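To illustrate the behaviour I have in mind, here is a minimal pure-Python sketch of strict partition value parsing. It is not how the C++ partitioning code works, and the helper names (`parse_int_strict`, `parse_hive_segment`) are hypothetical, made up for this example. It only shows the idea that a hive-style `field=value` segment should either parse as the declared type or raise an error.

```python
import re

# Plain ASCII integer: an optional sign followed by digits only.
# Python's built-in int() is too lenient here: int("1_2") returns 12.
_INT_PATTERN = re.compile(r"[+-]?[0-9]+")


def parse_int_strict(raw):
    """Hypothetical helper: convert raw to int, accepting only sign and digits."""
    if _INT_PATTERN.fullmatch(raw) is None:
        raise ValueError(f"not an integer: {raw!r}")
    return int(raw)


def parse_hive_segment(segment, field, parser):
    """Hypothetical helper: parse one 'field=value' directory name, raising on failure."""
    key, sep, raw = segment.partition("=")
    if sep != "=" or key != field:
        raise ValueError(f"expected '{field}=<value>', got {segment!r}")
    try:
        return parser(raw)
    except ValueError as exc:
        # Surface the failure instead of silently falling back to null
        raise ValueError(
            f"cannot parse partition value {raw!r} for field {field!r}"
        ) from exc


print(parse_hive_segment("part=12", "part", parse_int_strict))

try:
    parse_hive_segment("part=1_2", "part", parse_int_strict)
except ValueError as err:
    print(err)
```

With the dataset above, a directory such as `part=1_2` would then fail loudly when the schema declares `part` as int64, rather than producing a column of nulls.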