Posted to jira@arrow.apache.org by "Dewey Dunnington (Jira)" <ji...@apache.org> on 2022/04/04 12:50:00 UTC
[jira] [Comment Edited] (ARROW-15260) [R] open_dataset - add file_name as column
[ https://issues.apache.org/jira/browse/ARROW-15260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516796#comment-17516796 ]
Dewey Dunnington edited comment on ARROW-15260 at 4/4/22 12:49 PM:
-------------------------------------------------------------------
(JIRA does wild things with italics here, so I'm sticking it in noformat...)
{noformat}
It looks like we can access {{__filename}} from the {{Scanner}} too, although what we can do with it is pretty limited. Note that in R you will have to use backticks in something like dplyr (e.g., {{`__filename`}}), because variable names in R can't start with {{_}}. In the dplyr interface we make a pretty strong assumption that the schema names are the available names in the dataset... maybe the best way would be to add a binding like {{dataset_filename()}} that inserts the correct field reference (although C++ gives us errors if we try to insert a field reference to {{__filename}} in an {{Expression}}).
{noformat}
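To illustrate the backtick point without needing a dataset, here is a minimal base R sketch. The data frame and the file names in it are hypothetical stand-ins for a collected result, not output from arrow.

```r
# Hypothetical stand-in for a collected result: a data frame with a
# column named "__filename" (check.names = FALSE keeps the name as-is).
df <- data.frame(
  `__filename` = c("cyl=4/part-0.parquet", "cyl=8/part-0.parquet"),
  mpg = c(22.8, 18.7),
  check.names = FALSE
)

# Writing df$__filename without backticks does not parse, because an R
# name cannot start with "_". Backticks (or a string inside [[ ]]) make
# the non-syntactic name usable:
df$`__filename`
df[["__filename"]]

# The same quoting is needed inside a dplyr-style expression, e.g.
# filter(df, grepl("cyl=8", `__filename`)). The base R equivalent:
cyl8_rows <- df[grepl("cyl=8", df$`__filename`, fixed = TRUE), ]
```

The same quoting rule applies to any column whose name is not a syntactic R name, so it is not specific to {{__filename}}.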
{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
tf <- tempfile()
write_dataset(mtcars, tf, partitioning = "cyl")
ds <- open_dataset(tf)
# works!
scanner <- Scanner$create(
  open_dataset(tf),
  projection = c("__filename", names(ds))
)
as_tibble(scanner$ToTable())
#> # A tibble: 32 × 12
#> `__filename` mpg disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 /private/var/fol… 22.8 108 93 3.85 2.32 18.6 1 1 4 1
#> 2 /private/var/fol… 24.4 147. 62 3.69 3.19 20 1 0 4 2
#> 3 /private/var/fol… 22.8 141. 95 3.92 3.15 22.9 1 0 4 2
#> 4 /private/var/fol… 32.4 78.7 66 4.08 2.2 19.5 1 1 4 1
#> 5 /private/var/fol… 30.4 75.7 52 4.93 1.62 18.5 1 1 4 2
#> 6 /private/var/fol… 33.9 71.1 65 4.22 1.84 19.9 1 1 4 1
#> 7 /private/var/fol… 21.5 120. 97 3.7 2.46 20.0 1 0 3 1
#> 8 /private/var/fol… 27.3 79 66 4.08 1.94 18.9 1 1 4 1
#> 9 /private/var/fol… 26 120. 91 4.43 2.14 16.7 0 1 5 2
#> 10 /private/var/fol… 30.4 95.1 113 3.77 1.51 16.9 1 1 5 2
#> # … with 22 more rows, and 1 more variable: cyl <int>
# seems that we still can't use __filename in a filter expr
Scanner$create(
  open_dataset(tf),
  projection = c("__filename", names(ds)),
  filter = Expression$create(
    "match_substring",
    Expression$field_ref("__filename"),
    options = list(pattern = "cyl=8")
  )
)
#> Error: Invalid: No match for FieldRef.Name(__filename) in mpg: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#> cyl: int32
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/type.h:1717 CheckNonEmpty(matches, root)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/dataset/scanner.cc:782 ref.FindOne(*scan_options_->dataset_schema)
{code}
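Until the filter case works, one possible workaround is to project {{__filename}} as in the first call above and filter after the data is in R. This is only a sketch: it assumes a build where that projection behaves as shown above, it reuses only the calls from this comment plus dplyr's filter() and base grepl(), and it reads the whole dataset into memory rather than pushing the filter into the scan.

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

tf <- tempfile()
write_dataset(mtcars, tf, partitioning = "cyl")
ds <- open_dataset(tf)

# Project `__filename` alongside the schema columns (assumption: the
# projection works in this build, as in the first call above).
scanner <- Scanner$create(
  ds,
  projection = c("__filename", names(ds))
)
with_files <- as_tibble(scanner$ToTable())

# Filter in R instead of in the scan; note the backticks around the name.
cyl8 <- with_files %>%
  filter(grepl("cyl=8", `__filename`, fixed = TRUE))
```

For this particular dataset the same rows could be selected by filtering on the partition column {{cyl}}; the file-name filter is only useful when the path carries information that is not already a column.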
was (Author: paleolimbot):
(Previous revision: identical to the comment above, except that the noformat markers were written as {{noformat}}.)
> [R] open_dataset - add file_name as column
> ------------------------------------------
>
> Key: ARROW-15260
> URL: https://issues.apache.org/jira/browse/ARROW-15260
> Project: Apache Arrow
> Issue Type: New Feature
> Components: R
> Reporter: Martin du Toit
> Priority: Minor
>
> Hi. Is it possible to add the file_name as a column to a dataset?
> {code:r}
> ds <- open_dataset(.....)
> list_of_files <- ds$files
> {code}
> This works, but I need the file_name as a column.
> Thanks
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)