Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/11/08 19:16:00 UTC

[jira] [Commented] (ARROW-10485) [R] open_dataset(): specifying partition when hive_style =TRUE fails silently

    [ https://issues.apache.org/jira/browse/ARROW-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440691#comment-17440691 ] 

Weston Pace commented on ARROW-10485:
-------------------------------------

Performance-wise, it wouldn't be a difficult check, though I think it would have to be a warning rather than an error, and we don't have a mechanism for communicating warnings from C++. It can't be an error because, while odd, it is technically valid for a user to have an "=" character in a directory partitioning scheme.

Would the following be more intuitive: add a hive_style argument to open_dataset and, if it is set to true (the default), ignore partitioning if it is a character vector or, if it is a schema, construct a HivePartitioning from it and send that in?
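
A rough sketch of that proposal in R, not a definitive implementation. The names resolve_partitioning_mode() and open_dataset_sketch() are hypothetical and not part of arrow, and the hive_style argument on the opening side is the proposed addition, not an existing one. The sketch assumes that arrow::hive_partition() with no arguments autodetects hive-style keys and that HivePartitioning$create() accepts a schema.

{code:r}
# Hypothetical helper (not part of arrow): decide how `partitioning`
# should be treated under the proposed `hive_style` argument.
resolve_partitioning_mode <- function(partitioning, hive_style = TRUE) {
  if (!isTRUE(hive_style)) {
    return("as_given")          # hive_style = FALSE: pass partitioning through
  }
  if (inherits(partitioning, "Schema")) {
    return("hive_from_schema")  # schema: build a HivePartitioning from it
  }
  if (is.null(partitioning) || is.character(partitioning)) {
    return("hive_autodetect")   # character vector: ignore it, autodetect keys
  }
  "as_given"                    # already a Partitioning or factory object
}

# Hypothetical wrapper (not part of arrow) around arrow::open_dataset().
open_dataset_sketch <- function(path, partitioning = NULL,
                                hive_style = TRUE, ...) {
  mode <- resolve_partitioning_mode(partitioning, hive_style)
  part <- switch(mode,
    hive_autodetect  = arrow::hive_partition(),
    hive_from_schema = arrow::HivePartitioning$create(partitioning),
    as_given         = partitioning
  )
  arrow::open_dataset(path, partitioning = part, ...)
}
{code}

With something like this, open_dataset_sketch("mtcarstest", partitioning = "cyl") would take the same path as calling it with no partitioning argument at all, which is the case the issue reports as working.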

> [R] open_dataset(): specifying partition when hive_style =TRUE fails silently
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-10485
>                 URL: https://issues.apache.org/jira/browse/ARROW-10485
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 2.0.0
>         Environment: MacOS Catalina 10.15.7 (19H2), R 4.01, arrow R package v2.0.0
>            Reporter: John Sheffield
>            Assignee: Ben Kietzman
>            Priority: Minor
>
> When writing a dataset with hive_style = TRUE, now the default, that dataset has to be opened without an explicit definition of the partitions to work as expected. Even if the correct partition is specified, any query to the dataset on the partition field returns 0 rows.
>  
> From my eyes as a user, I'd want this to error out specifically (not just warn), probably when first calling open_dataset().
> {code:r}
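> library(dplyr)  # provides %>% and collect()
> data("mtcars")
> # Write a hive-style dataset partitioned on cyl (directories like cyl=4/)
> arrow::write_dataset(
>     dataset = mtcars, path = "mtcarstest", partitioning = "cyl",
>     format = "parquet", hive_style = TRUE)
> # mtc1 names the partition explicitly; mtc2 relies on autodetection
> mtc1 <- arrow::open_dataset("mtcarstest", partitioning = "cyl")
> mtc2 <- arrow::open_dataset("mtcarstest")
> # Explicit partition: the filter on cyl returns 0 rows
> mtc1 %>%
>      dplyr::filter(cyl == 4) %>%
>      collect()
> # No explicit partition: works as expected
> mtc2 %>%
>      dplyr::filter(cyl == 4) %>%
>      collect()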
>  {code}


