Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2022/06/02 13:18:00 UTC

[jira] [Commented] (ARROW-16720) [R] Cannot read datasets partitioned by columns starting with dots

    [ https://issues.apache.org/jira/browse/ARROW-16720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545470#comment-17545470 ] 

Neal Richardson commented on ARROW-16720:
-----------------------------------------

By default, dataset file discovery ignores files and directories whose names start with {{.}} or {{_}}. A recent, not-yet-released change (ARROW-15280) lets you override this by providing {{factory_options}} (example [here|https://github.com/apache/arrow/pull/13171/files#diff-79100695986bbd6a63704fe9f238ce3ae9a39ddd093b7f6b213d4a722309d20aR1147-R1153]). Could you try [installing a nightly build of the package|https://arrow.apache.org/docs/r/#installing-a-development-version] and see if you can read your dataset by providing that option?
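
For illustration, here is a minimal sketch of what that could look like, assuming an arrow build that includes ARROW-15280 and that {{factory_options}} accepts a {{selector_ignore_prefixes}} entry as documented for {{open_dataset()}}. Passing only {{"_"}} is an assumption made for this example: it keeps underscore-prefixed paths ignored while no longer skipping dot-prefixed ones. The variable names are illustrative and not taken from the report.

```r
library(arrow)
library(dplyr)

# Write a dataset partitioned by a dot-prefixed column, as in the reprex.
path <- tempfile()
mtcars %>%
  mutate(.cyl = cyl) %>%
  group_by(.cyl) %>%
  write_dataset(path)

# Assumption: this option is only available in builds that include
# ARROW-15280. Restrict the ignored prefixes to "_" so that the
# ".cyl=<value>" directories are no longer skipped during discovery.
ds <- open_dataset(
  path,
  factory_options = list(selector_ignore_prefixes = "_")
)

result <- collect(ds)
nrow(result)
```

If discovery now picks up the partition directories, {{nrow(result)}} should match {{nrow(mtcars)}}.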

> [R] Cannot read datasets partitioned by columns starting with dots
> ------------------------------------------------------------------
>
>                 Key: ARROW-16720
>                 URL: https://issues.apache.org/jira/browse/ARROW-16720
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 8.0.0
>         Environment: #> - Session info ---------------------------------------------------------------
> #>  setting  value
> #>  version  R version 4.1.1 (2021-08-10)
> #>  os       Windows 10 x64 (build 19044)
> #>  system   x86_64, mingw32
> #>  ui       RTerm
> #>  language (EN)
> #>  collate  English_Switzerland.1252
> #>  ctype    C
> #>  tz       Europe/Berlin
> #>  date     2022-06-02
> #> 
> #> - Packages -------------------------------------------------------------------
> #>  package     * version date (UTC) lib source
> #>  backports     1.4.1   2021-12-13 [1] CRAN (R 4.1.2)
> #>  cli           3.2.0   2022-02-14 [1] CRAN (R 4.1.3)
> #>  crayon        1.5.0   2022-02-14 [1] CRAN (R 4.1.1)
> #>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.1.2)
> #>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
> #>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.1.1)
> #>  fansi         1.0.2   2022-01-14 [1] CRAN (R 4.1.2)
> #>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.1)
> #>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.1.2)
> #>  glue          1.6.1   2022-01-22 [1] CRAN (R 4.1.2)
> #>  highr         0.9     2021-04-16 [1] CRAN (R 4.1.1)
> #>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.1)
> #>  knitr         1.37    2021-12-16 [1] CRAN (R 4.1.2)
> #>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.1)
> #>  magrittr      2.0.2   2022-01-26 [1] CRAN (R 4.1.2)
> #>  pillar        1.7.0   2022-02-01 [1] CRAN (R 4.1.2)
> #>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.1)
> #>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.0.2)
> #>  R.cache       0.15.0  2021-04-30 [1] CRAN (R 4.1.1)
> #>  R.methodsS3   1.8.1   2020-08-26 [1] CRAN (R 4.1.1)
> #>  R.oo          1.24.0  2020-08-26 [1] CRAN (R 4.1.1)
> #>  R.utils       2.11.0  2021-09-26 [1] CRAN (R 4.1.1)
> #>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.1.1)
> #>  rlang         1.0.2   2022-03-04 [1] CRAN (R 4.1.3)
> #>  rmarkdown     2.11    2021-09-14 [1] CRAN (R 4.1.0)
> #>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.0)
> #>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
> #>  stringi       1.7.6   2021-11-29 [1] CRAN (R 4.1.2)
> #>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.1.1)
> #>  styler        1.6.2   2021-09-23 [1] CRAN (R 4.1.1)
> #>  tibble        3.1.6   2021-11-07 [1] CRAN (R 4.1.2)
> #>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.2)
> #>  vctrs         0.4.1   2022-04-13 [1] CRAN (R 4.1.1)
> #>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.1.3)
> #>  xfun          0.29    2021-12-14 [1] CRAN (R 4.1.2)
> #>  yaml          2.2.2   2022-01-25 [1] CRAN (R 4.1.2)
>            Reporter: Lorenzo Gaborini
>            Priority: Minor
>
> As in the title.  
> It might be due to the fact that files starting with dots are hidden.
> No issues if the dot appears elsewhere.
> Reprex:
> {code:r}
> library(dplyr)
> library(arrow)
> packageVersion("arrow")
> #> [1] '8.0.0'
> path_arrow_tmp <- tempfile()
> mtcars %>% 
>    dplyr::group_by(cyl) %>% 
>    arrow::write_dataset(
>       path = path_arrow_tmp
>    )
> base::list.files(path_arrow_tmp, recursive = TRUE, all.files = TRUE)
> #> [1] "cyl=4/part-0.parquet" "cyl=6/part-0.parquet" "cyl=8/part-0.parquet"
> mtcars_load <- path_arrow_tmp %>% 
>    arrow::open_dataset() %>% 
>    dplyr::collect()
> setequal(mtcars$mpg, mtcars_load$mpg)
> #> [1] TRUE
> # Change grouping by ".cyl"
> path_arrow_tmp_grp <- tempfile()
> mtcars %>% 
>    dplyr::mutate(.cyl = cyl) %>% 
>    dplyr::group_by(.cyl) %>% 
>    arrow::write_dataset(
>       path = path_arrow_tmp_grp
>    )
> # the files are there
> base::list.files(path_arrow_tmp_grp, recursive = TRUE, all.files = TRUE)
> #> [1] ".cyl=4/part-0.parquet" ".cyl=6/part-0.parquet" ".cyl=8/part-0.parquet"
> # 0 files detected
> path_arrow_tmp_grp %>% 
>    arrow::open_dataset()
> #> FileSystemDataset with 0 Parquet files
> # Specify partitioning manually
> # still no files
> path_arrow_tmp_grp %>% 
>    arrow::open_dataset(
>       partitioning = ".cyl",
>       hive_style = TRUE
>    )
> #> FileSystemDataset with 0 Parquet files
> #> .cyl: int32
> {code}


