Posted to jira@arrow.apache.org by "Dewey Dunnington (Jira)" <ji...@apache.org> on 2022/04/04 12:50:00 UTC

[jira] [Comment Edited] (ARROW-15260) [R] open_dataset - add file_name as column

    [ https://issues.apache.org/jira/browse/ARROW-15260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516796#comment-17516796 ] 

Dewey Dunnington edited comment on ARROW-15260 at 4/4/22 12:49 PM:
-------------------------------------------------------------------

(JIRA does wild things with italics here, so I'm sticking it in noformat...)

{noformat}

It looks like we can access {{__filename}} from the {{Scanner}} too, although what we can do with it is pretty limited. Note that in R you will have to use backticks in something like dplyr (e.g., {{`__filename`}}), because variable names in R can't start with {{_}}. In the dplyr interface we make a pretty strong assumption that the schema names are the only names available in the dataset, so maybe the best way would be to add a binding like {{dataset_filename()}} that inserts the correct field reference (although C++ currently gives us an error if we try to insert a field reference to {{__filename}} in an {{Expression}}).

{noformat}
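To illustrate the backtick point without needing a dataset on disk, here is a minimal base-R sketch. The data frame `scanned` and its file paths are hypothetical stand-ins for a scanned table, not output from Arrow.

```r
# Hypothetical two-row stand-in for a scanned dataset; the paths are invented.
scanned <- data.frame(
  `__filename` = c("/tmp/ds/cyl=4/part-0.parquet", "/tmp/ds/cyl=8/part-0.parquet"),
  mpg = c(22.8, 18.7),
  check.names = FALSE,       # keep the non-syntactic column name as-is
  stringsAsFactors = FALSE   # keep paths as character on older R versions too
)

# make.names() shows that R has to alter the name to make it syntactic.
syntactic_name <- make.names("__filename")

# Backticks let the name be used wherever R expects a symbol.
files_via_dollar <- scanned$`__filename`
files_via_with <- with(scanned, `__filename`)

# String indexing needs no backticks because the name is plain data here.
files_via_string <- scanned[["__filename"]]
```

All three accessors return the same character vector; the backticks only matter where the name is parsed as code, which is the case inside dplyr verbs.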

{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

tf <- tempfile()
write_dataset(mtcars, tf, partitioning = "cyl")
ds <- open_dataset(tf)

# Projecting `__filename` alongside the regular columns works:
scanner <- Scanner$create(
  ds,
  projection = c("__filename", names(ds))
)

as_tibble(scanner$ToTable())
#> # A tibble: 32 × 12
#>    `__filename`        mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <chr>             <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 /private/var/fol…  22.8 108      93  3.85  2.32  18.6     1     1     4     1
#>  2 /private/var/fol…  24.4 147.     62  3.69  3.19  20       1     0     4     2
#>  3 /private/var/fol…  22.8 141.     95  3.92  3.15  22.9     1     0     4     2
#>  4 /private/var/fol…  32.4  78.7    66  4.08  2.2   19.5     1     1     4     1
#>  5 /private/var/fol…  30.4  75.7    52  4.93  1.62  18.5     1     1     4     2
#>  6 /private/var/fol…  33.9  71.1    65  4.22  1.84  19.9     1     1     4     1
#>  7 /private/var/fol…  21.5 120.     97  3.7   2.46  20.0     1     0     3     1
#>  8 /private/var/fol…  27.3  79      66  4.08  1.94  18.9     1     1     4     1
#>  9 /private/var/fol…  26   120.     91  4.43  2.14  16.7     0     1     5     2
#> 10 /private/var/fol…  30.4  95.1   113  3.77  1.51  16.9     1     1     5     2
#> # … with 22 more rows, and 1 more variable: cyl <int>

# It seems we still can't reference `__filename` in a filter expression:
Scanner$create(
  ds,
  projection = c("__filename", names(ds)),
  filter = Expression$create(
    "match_substring",
    Expression$field_ref("__filename"),
    options = list(pattern = "cyl=8")
  )
)
#> Error: Invalid: No match for FieldRef.Name(__filename) in mpg: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#> cyl: int32
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/type.h:1717  CheckNonEmpty(matches, root)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/dataset/scanner.cc:782  ref.FindOne(*scan_options_->dataset_schema)
{code}
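Until the filter expression accepts {{__filename}}, one possible workaround is to project the column, collect the result, and apply the substring test in R. The sketch below assumes the scanned table has already been converted to a data frame; `collected` and its paths are hypothetical stand-ins so the example runs without Arrow.

```r
# Hypothetical stand-in for a collected scan result; the paths are invented.
collected <- data.frame(
  `__filename` = c(
    "/tmp/ds/cyl=4/part-0.parquet",
    "/tmp/ds/cyl=8/part-0.parquet",
    "/tmp/ds/cyl=8/part-0.parquet"
  ),
  mpg = c(22.8, 18.7, 14.3),
  check.names = FALSE,       # keep the `__filename` column name unchanged
  stringsAsFactors = FALSE   # keep paths as character on older R versions too
)

# Literal substring test on the file path, mirroring the "cyl=8" pattern
# passed to match_substring in the failing filter.
is_cyl8 <- grepl("cyl=8", collected[["__filename"]], fixed = TRUE)

# Keep only the rows that came from files under the cyl=8 partition.
cyl8_rows <- collected[is_cyl8, , drop = FALSE]
```

The trade-off is that every row is read before the filter runs, so nothing is skipped at scan time.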



was (Author: paleolimbot):
(The previous version had the same text and code as above; the only difference is that the noformat markers were written as {{noformat}}.)


> [R] open_dataset - add file_name as column
> ------------------------------------------
>
>                 Key: ARROW-15260
>                 URL: https://issues.apache.org/jira/browse/ARROW-15260
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: R
>            Reporter: Martin du Toit
>            Priority: Minor
>
> Hi. Is it possible to add the file_name as a column to a dataset?
> {code:r}
> ds <- open_dataset(.....)
> list_of_files <- ds$files
> {code}
> This works, but I need the file_name as a column.
> Thanks
>  


