Posted to jira@arrow.apache.org by "Dewey Dunnington (Jira)" <ji...@apache.org> on 2022/10/26 14:11:00 UTC
[jira] [Resolved] (ARROW-17187) [R] Improve lazy ALTREP implementation for String

     [ https://issues.apache.org/jira/browse/ARROW-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dewey Dunnington resolved ARROW-17187.
--------------------------------------
    Resolution: Fixed

Issue resolved by pull request 14271
[https://github.com/apache/arrow/pull/14271]

> [R] Improve lazy ALTREP implementation for String
> -------------------------------------------------
>
>                 Key: ARROW-17187
>                 URL: https://issues.apache.org/jira/browse/ARROW-17187
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Dewey Dunnington
>            Assignee: Dewey Dunnington
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 11.0.0
>
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> ARROW-16578 noted the high cost of looping through an ALTREP character vector created by the arrow R package. The temporary workaround is to materialize the whole vector as soon as the first element is requested. This is much faster than our initial implementation but is probably unnecessary, given that other ALTREP character implementations do not appear to have this issue:
> (The timings below were taken before merging ARROW-16578, which reduces the 5-second operation below to 0.05 seconds.)
> {code:R}
> library(arrow, warn.conflicts = FALSE)
> #> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
> df1 <- tibble::tibble(x=as.character(floor(runif(1000000) * 20)))
> write_parquet(df1,"/tmp/test.parquet")
> df2 <- read_parquet("/tmp/test.parquet")
> system.time(unique(df1$x))
> #>    user  system elapsed 
> #>   0.022   0.001   0.023
> system.time(unique(df2$x))
> #>    user  system elapsed 
> #>   4.529   0.680   5.226
> # the slowness is almost certainly not due to ALTREP itself
> # but probably has something to do with our implementation
> tf <- tempfile()
> readr::write_csv(df1, tf)
> df3 <- vroom::vroom(tf, delim = ",", altrep = TRUE)
> #> Rows: 1000000 Columns: 1
> #> ── Column specification ────────────────────────────────────────────────────────
> #> Delimiter: ","
> #> dbl (1): x
> #> 
> #> ℹ Use `spec()` to retrieve the full column specification for this data.
> #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
> .Internal(inspect(df3$x))
> #> @2d2042048 14 REALSXP g0c0 [REF(65535)] vroom_dbl (len=1000000, materialized=F)
> system.time(unique(df3$x))
> #>    user  system elapsed 
> #>   0.127   0.001   0.128
> .Internal(inspect(df3$x))
> #> @2d2042048 14 REALSXP g1c0 [MARK,REF(65535)] vroom_dbl (len=1000000, materialized=F)
> {code}


