Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/01/21 18:41:00 UTC
[jira] [Updated] (ARROW-15406) [Python] Change the default read partitioning flavor to hive
[ https://issues.apache.org/jira/browse/ARROW-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weston Pace updated ARROW-15406:
--------------------------------
Description:
Currently, datasets are read with no partitioning by default. So given the dataset:
/foo=1/part0.parquet
/foo=2/part0.parquet
it will not detect the "foo" partition. Changing the default to hive should be harmless in most cases (it could only be a problem if a user had x=y in a directory name that wasn't intended to be a partition).
This may put us at odds with the default partitioning for writes (ARROW-15407), but specifying "partitioning=hive" on a directory-partitioned dataset is no worse than specifying "partitioning=None" on one, which is what we do today.
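For illustration only, here is a minimal standard-library sketch of how hive-style detection reads key=value directory names from the paths above, including the edge case where an "=" in a directory name was not meant as a partition. The helper name hive_partition_keys is hypothetical and is not part of pyarrow. It keeps values as plain strings and makes no attempt to mirror Arrow's actual implementation. In pyarrow itself, the opt-in today is to pass partitioning="hive" to pyarrow.dataset.dataset().

```python
from pathlib import PurePosixPath


def hive_partition_keys(path):
    """Return {key: value} for each 'key=value' directory segment in path.

    Hypothetical helper for illustration; not a pyarrow API.
    """
    keys = {}
    # Drop the last component (the file name): only directories can be partitions.
    for segment in PurePosixPath(path).parts[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            keys[key] = value
    return keys


# The two files from the example dataset each yield a "foo" partition value.
for p in ["/foo=1/part0.parquet", "/foo=2/part0.parquet"]:
    print(p, hive_partition_keys(p))

# Edge case: a directory that happens to contain "=" is also treated as a
# partition, even if the user never intended it to be one.
print(hive_partition_keys("/x=y/part0.parquet"))

# A plain directory name yields no partition keys at all.
print(hive_partition_keys("/data/part0.parquet"))
```

With no partitioning, the same directory names would simply be ignored, which is why "foo" is not detected under the current default.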
was:
Currently the default for reading datasets is to do no partitioning. So given the dataset:
/foo=1/part0.parquet
/foo=2/part0.parquet
it will not detect the "foo" partition. Changing the default to hive should be harmless in most cases (the only way it could be a problem is if a user had x=y in their directory name and it wasn't intended to be a partition).
This may put us at odds with the default partitioning for writes (I'm opening a separate JIRA for that) but specifying "partitioning=hive" on a directory partitioned dataset is no worse than specifying "partitioning=None" on a directory partitioned dataset which is what we do today.
> [Python] Change the default read partitioning flavor to hive
> ------------------------------------------------------------
>
> Key: ARROW-15406
> URL: https://issues.apache.org/jira/browse/ARROW-15406
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Weston Pace
> Priority: Major
--
This message was sent by Atlassian Jira
(v8.20.1#820001)