Posted to jira@arrow.apache.org by "Ben Kietzman (Jira)" <ji...@apache.org> on 2020/06/16 16:58:02 UTC
[jira] [Resolved] (ARROW-8613) [C++][Dataset] Raise error for unparsable partition value
[ https://issues.apache.org/jira/browse/ARROW-8613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ben Kietzman resolved ARROW-8613.
---------------------------------
Resolution: Fixed
Issue resolved by pull request 7440
[https://github.com/apache/arrow/pull/7440]
> [C++][Dataset] Raise error for unparsable partition value
> ---------------------------------------------------------
>
> Key: ARROW-8613
> URL: https://issues.apache.org/jira/browse/ARROW-8613
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Joris Van den Bossche
> Assignee: Ben Kietzman
> Priority: Major
> Labels: dataset, pull-request-available
> Fix For: 1.0.0
>
> Time Spent: 40m
> Remaining Estimate: 0h
>
> Currently, when you specify a partitioning schema but one of the partition field values cannot be parsed according to the specified type, you silently get null values for that partition field.
> Python example:
> {code:python}
> import pathlib
> import pyarrow as pa
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
> path = pathlib.Path(".") / "dataset_partition_schema_errors"
> path.mkdir(exist_ok=True)
> table = pa.table({"part": ["1_2", "1_2", "3_4", "3_4"], "values": range(4)})
> pq.write_to_dataset(table, str(path), partition_cols=["part"])
> {code}
> {code:java}
> In [17]: ds.dataset(path, partitioning="hive").to_table().to_pandas()
> Out[17]:
>    values part
> 0       0  1_2
> 1       1  1_2
> 2       2  3_4
> 3       3  3_4
> In [18]: partitioning = ds.partitioning(pa.schema([("part", pa.int64())]), flavor="hive")
> In [19]: ds.dataset(path, partitioning=partitioning).to_table().to_pandas()
> Out[19]:
>    values  part
> 0       0   NaN
> 1       1   NaN
> 2       2   NaN
> 3       3   NaN
> {code}
> Silently ignoring such a parse error doesn't seem the best default to me (since partition keys are quite essential). I think raising an error might be better?
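For illustration only, the strict behaviour suggested in the description can be sketched in plain Python. This is not the Arrow C++ implementation from pull request 7440; `parse_int_partition` is a hypothetical helper that parses a single hive-style path segment such as `part=3` and raises instead of yielding a null. Python's built-in `int()` would accept `"1_2"` as 12, so the sketch validates the digits explicitly.

```python
import re
from typing import Tuple

# Hypothetical helper, NOT part of pyarrow: strictly parse one hive-style
# path segment ("field=value") whose value is expected to be an int64.
_INT_RE = re.compile(r"[+-]?[0-9]+")


def parse_int_partition(segment: str) -> Tuple[str, int]:
    field, sep, raw = segment.partition("=")
    if not sep:
        raise ValueError(f"not a hive-style segment: {segment!r}")
    # int() alone would parse "1_2" as 12, so reject anything but plain digits.
    if not _INT_RE.fullmatch(raw):
        raise ValueError(
            f"cannot parse {raw!r} as int64 for partition field {field!r}"
        )
    value = int(raw)
    # Reject values outside the signed 64-bit range.
    if not -(2**63) <= value < 2**63:
        raise ValueError(
            f"{raw!r} is out of int64 range for partition field {field!r}"
        )
    return field, value
```

With the directory names from the example above, `parse_int_partition("part=1_2")` raises `ValueError` rather than producing a null, while a segment such as `part=3` parses to `("part", 3)`.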
--
This message was sent by Atlassian Jira
(v8.3.4#803005)