Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2022/07/02 14:05:00 UTC
[jira] [Closed] (ARROW-16720) [R] Cannot read datasets partitioned by columns starting with dots
[ https://issues.apache.org/jira/browse/ARROW-16720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neal Richardson closed ARROW-16720.
-----------------------------------
Fix Version/s: 9.0.0
Resolution: Fixed
> [R] Cannot read datasets partitioned by columns starting with dots
> ------------------------------------------------------------------
>
> Key: ARROW-16720
> URL: https://issues.apache.org/jira/browse/ARROW-16720
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 8.0.0
> Environment: #> - Session info ---------------------------------------------------------------
> #> setting value
> #> version R version 4.1.1 (2021-08-10)
> #> os Windows 10 x64 (build 19044)
> #> system x86_64, mingw32
> #> ui RTerm
> #> language (EN)
> #> collate English_Switzerland.1252
> #> ctype C
> #> tz Europe/Berlin
> #> date 2022-06-02
> #>
> #> - Packages -------------------------------------------------------------------
> #> package * version date (UTC) lib source
> #> backports 1.4.1 2021-12-13 [1] CRAN (R 4.1.2)
> #> cli 3.2.0 2022-02-14 [1] CRAN (R 4.1.3)
> #> crayon 1.5.0 2022-02-14 [1] CRAN (R 4.1.1)
> #> digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.2)
> #> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
> #> evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.1)
> #> fansi 1.0.2 2022-01-14 [1] CRAN (R 4.1.2)
> #> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.1)
> #> fs 1.5.2 2021-12-08 [1] CRAN (R 4.1.2)
> #> glue 1.6.1 2022-01-22 [1] CRAN (R 4.1.2)
> #> highr 0.9 2021-04-16 [1] CRAN (R 4.1.1)
> #> htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.1)
> #> knitr 1.37 2021-12-16 [1] CRAN (R 4.1.2)
> #> lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.1)
> #> magrittr 2.0.2 2022-01-26 [1] CRAN (R 4.1.2)
> #> pillar 1.7.0 2022-02-01 [1] CRAN (R 4.1.2)
> #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.1)
> #> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2)
> #> R.cache 0.15.0 2021-04-30 [1] CRAN (R 4.1.1)
> #> R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.1.1)
> #> R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.1.1)
> #> R.utils 2.11.0 2021-09-26 [1] CRAN (R 4.1.1)
> #> reprex 2.0.1 2021-08-05 [1] CRAN (R 4.1.1)
> #> rlang 1.0.2 2022-03-04 [1] CRAN (R 4.1.3)
> #> rmarkdown 2.11 2021-09-14 [1] CRAN (R 4.1.0)
> #> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0)
> #> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.2)
> #> stringi 1.7.6 2021-11-29 [1] CRAN (R 4.1.2)
> #> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.1)
> #> styler 1.6.2 2021-09-23 [1] CRAN (R 4.1.1)
> #> tibble 3.1.6 2021-11-07 [1] CRAN (R 4.1.2)
> #> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.2)
> #> vctrs 0.4.1 2022-04-13 [1] CRAN (R 4.1.1)
> #> withr 2.5.0 2022-03-03 [1] CRAN (R 4.1.3)
> #> xfun 0.29 2021-12-14 [1] CRAN (R 4.1.2)
> #> yaml 2.2.2 2022-01-25 [1] CRAN (R 4.1.2)
> Reporter: Lorenzo Gaborini
> Priority: Minor
> Fix For: 9.0.0
>
>
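> As in the title: open_dataset() finds no files when the dataset is partitioned by a column whose name starts with a dot (e.g. ".cyl").
> This might be because files and directories whose names start with a dot are treated as hidden.
> There are no issues if the dot appears elsewhere in the column name.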
> Reprex:
> {code:r}
> library(dplyr)
> library(arrow)
> packageVersion("arrow")
> #> [1] '8.0.0'
>
> # Baseline: partition by "cyl" (no leading dot)
> path_arrow_tmp <- tempfile()
> mtcars %>%
>   dplyr::group_by(cyl) %>%
>   arrow::write_dataset(
>     path = path_arrow_tmp
>   )
> base::list.files(path_arrow_tmp, recursive = TRUE, all.files = TRUE)
> #> [1] "cyl=4/part-0.parquet" "cyl=6/part-0.parquet" "cyl=8/part-0.parquet"
> mtcars_load <- path_arrow_tmp %>%
>   arrow::open_dataset() %>%
>   dplyr::collect()
> setequal(mtcars$mpg, mtcars_load$mpg)
> #> [1] TRUE
>
> # Change grouping by ".cyl"
> path_arrow_tmp_grp <- tempfile()
> mtcars %>%
>   dplyr::mutate(.cyl = cyl) %>%
>   dplyr::group_by(.cyl) %>%
>   arrow::write_dataset(
>     path = path_arrow_tmp_grp
>   )
> # the files are there
> base::list.files(path_arrow_tmp_grp, recursive = TRUE, all.files = TRUE)
> #> [1] ".cyl=4/part-0.parquet" ".cyl=6/part-0.parquet" ".cyl=8/part-0.parquet"
>
> # 0 files detected
> path_arrow_tmp_grp %>%
>   arrow::open_dataset()
> #> FileSystemDataset with 0 Parquet files
>
> # Specify partitioning manually
> # still no files
> path_arrow_tmp_grp %>%
>   arrow::open_dataset(
>     partitioning = ".cyl",
>     hive_style = TRUE
>   )
> #> FileSystemDataset with 0 Parquet files
> #> .cyl: int32
> {code}
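
The reporter's guess (dot-prefixed paths being treated as hidden) can be illustrated with a small base R sketch. This is only a model of the suspected behaviour, not Arrow's actual discovery code: the helper `is_ignored()` and the prefix set `c(".", "_")` are assumptions made for illustration. It shows why a filter that skips any path segment starting with a dot would drop ".cyl=4/..." but keep "c.yl=4/...", matching the observation that a dot elsewhere in the name causes no problem.

```r
# Hypothetical helper, NOT part of the arrow package: returns TRUE when any
# segment of a path (directory or file name) starts with an ignored prefix.
# The default prefixes are an assumption made for this illustration.
is_ignored <- function(path, prefixes = c(".", "_")) {
  segments <- strsplit(path, "/", fixed = TRUE)[[1]]
  any(vapply(
    prefixes,
    function(p) any(startsWith(segments, p)),
    logical(1)
  ))
}

files <- c(
  "cyl=4/part-0.parquet",   # plain partition directory
  ".cyl=4/part-0.parquet",  # partition directory starting with a dot
  "c.yl=4/part-0.parquet"   # dot elsewhere in the directory name
)

# Keep only the files that such a prefix-based filter would discover
discovered <- files[!vapply(files, is_ignored, logical(1))]
print(discovered)
```

Under this model only the ".cyl=4" path is filtered out, which is consistent with the "0 Parquet files" result in the reprex above.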
--
This message was sent by Atlassian Jira
(v8.20.10#820010)