Posted to jira@arrow.apache.org by "Nicola Crane (Jira)" <ji...@apache.org> on 2022/03/09 09:49:00 UTC

[jira] [Created] (ARROW-15880) [C++] Can't open partitioned dataset if the root directory has "=" in its name

Nicola Crane created ARROW-15880:
------------------------------------

             Summary: [C++] Can't open partitioned dataset if the root directory has "=" in its name
                 Key: ARROW-15880
                 URL: https://issues.apache.org/jira/browse/ARROW-15880
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Nicola Crane


I'm not sure whether this is a bug or "just how Hive-style partitioning works", but if I try to open a dataset whose root directory has an "=" in its name, I have to include that directory in my partitioning to open it successfully.

This has tripped up users who saved a single directory from a partitioned dataset somewhere else and then tried to open that directory as a dataset in its own right.

{code:r}
library(arrow)
td <- tempfile()
dir.create(td)
# directory with equals sign in name
subdir <- file.path(td, "foo=bar")
dir.create(subdir)
write_dataset(mtcars, subdir, partitioning = "am")
list.files(td, recursive = TRUE)
#> [1] "foo=bar/am=0/part-0.parquet" "foo=bar/am=1/part-0.parquet"
# doesn't work: "foo" from the root directory name is detected as a partition
open_dataset(subdir, partitioning = "am")
#> Error:
#> ! "partitioning" does not match the detected Hive-style partitions: c("foo", "am")
#> ℹ Omit "partitioning" to use the Hive partitions
#> ℹ Set `hive_style = FALSE` to override what was detected
#> ℹ Or, to rename partition columns, call `select()` or `rename()` after opening the dataset
# works once "foo" is listed in the partitioning as well
open_dataset(subdir, partitioning = c("foo", "am"))
#> FileSystemDataset with 2 Parquet files
#> mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> gear: double
#> carb: double
#> foo: string
#> am: int32
#> 
#> See $metadata for additional Schema metadata
{code}

Compare this with the same example where the folder is named "foobar" instead of "foo=bar".

{code:r}
library(arrow)
td <- tempfile()
dir.create(td)
subdir <- file.path(td, "foobar")
dir.create(subdir)
write_dataset(mtcars, subdir, partitioning = "am")
list.files(td, recursive = TRUE)
#> [1] "foobar/am=0/part-0.parquet" "foobar/am=1/part-0.parquet"
# works
open_dataset(subdir, partitioning = "am")
#> FileSystemDataset with 2 Parquet files
#> mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> gear: double
#> carb: double
#> am: int32
#> 
#> See $metadata for additional Schema metadata
{code}
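
The difference between the two runs is consistent with the "key=value" segment of the root directory name being picked up during Hive-style detection, as the error message (which lists c("foo", "am")) suggests. The base-R sketch below only illustrates that reading. The helper hive_keys() is hypothetical, is not part of arrow, and is not a description of how the C++ detection is actually implemented; it simply collects the keys of every "key=value" directory segment in a path.

{code:r}
# Hypothetical helper (not an arrow function): return the keys of all
# "key=value" directory segments found in a file path.
hive_keys <- function(path) {
  # Split the directory part of the path into its segments
  segments <- strsplit(dirname(path), "/", fixed = TRUE)[[1]]
  # Keep only the segments that look like Hive-style "key=value" pairs
  hive_segments <- segments[grepl("=", segments, fixed = TRUE)]
  # Drop everything from the first "=" onwards to leave just the key
  unique(sub("=.*$", "", hive_segments))
}

# Path that still includes the root directory name
hive_keys("foo=bar/am=0/part-0.parquet")
#> [1] "foo" "am"

# Path taken relative to the dataset root
hive_keys("am=0/part-0.parquet")
#> [1] "am"
{code}

Under this assumption, only the second form would match partitioning = "am", which is the behaviour a user opening "foo=bar" as the dataset root would expect.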




--
This message was sent by Atlassian Jira
(v8.20.1#820001)