Posted to jira@arrow.apache.org by "Zsolt Kegyes-Brassai (Jira)" <ji...@apache.org> on 2021/01/29 17:38:00 UTC
[jira] [Commented] (ARROW-11413) [R] dplyr filter is not working for datasets

    [ https://issues.apache.org/jira/browse/ARROW-11413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17275215#comment-17275215 ] 

Zsolt Kegyes-Brassai commented on ARROW-11413:
----------------------------------------------

Thank you. 

I have a laptop with an i7 processor, 32 GB of RAM, and an SSD, running Windows 10 (build 17763). My R environment is quite up to date.

 
{code:java}
> RStudio.Version()$version
[1] ‘1.3.1093’

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.3     purrr_0.3.4     readr_1.4.0     tidyr_1.1.2    
 [7] tibble_3.0.5    ggplot2_3.3.3   tidyverse_1.3.0 arrow_3.0.0    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6        cellranger_1.1.0  pillar_1.4.7      compiler_4.0.3    dbplyr_2.0.0     
 [6] lobstr_1.1.1      tools_4.0.3       bit_4.0.4         lubridate_1.7.9.2 jsonlite_1.7.2   
[11] lifecycle_0.2.0   gtable_0.3.0      pkgconfig_2.0.3   rlang_0.4.10      reprex_1.0.0     
[16] cli_2.2.0         DBI_1.1.1         rstudioapi_0.13   haven_2.3.1       withr_2.4.1      
[21] xml2_1.3.2        httr_1.4.2        fs_1.5.0          generics_0.1.0    vctrs_0.3.6      
[26] tictoc_1.0        hms_1.0.0         bit64_4.0.5       grid_4.0.3        tidyselect_1.1.0 
[31] glue_1.4.2        R6_2.5.0          fansi_0.4.2       readxl_1.3.1      modelr_0.1.8     
[36] magrittr_2.0.1    backports_1.2.0   scales_1.1.1      ellipsis_0.3.1    rvest_0.3.6      
[41] assertthat_0.2.1  colorspace_2.0-0  stringi_1.5.3     munsell_0.5.0     broom_0.7.3      
[46] crayon_1.3.4 
{code}
I can load these two columns quite fast without the filter command, but when I add the filter, nothing happens (the session just hangs).
{code:java}
> tictoc::tic()
> a <- ds %>% 
+   select(year, total_amount) %>% collect()
> tictoc::toc()
1.22 sec elapsed
> nrow(a)
[1] 216301124
> lobstr::obj_size(a)
2,595,614,480 B
{code}
 

> [R] dplyr filter is not working for datasets 
> ---------------------------------------------
>
>                 Key: ARROW-11413
>                 URL: https://issues.apache.org/jira/browse/ARROW-11413
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 2.0.0, 3.0.0
>         Environment: i7, windows 10 laptop
>            Reporter: Zsolt Kegyes-Brassai
>            Priority: Minor
>
> I was trying to reproduce the [vignette|https://arrow.apache.org/docs/r/articles/dataset.html] on datasets and dplyr on a Windows 10 machine. I downloaded the data for two consecutive years (2017 and 2018) to my laptop.
> The filter works only for the variables used for partitioning. When I filter on any other variable (such as total_amount), the R/RStudio session hangs: there is no error message and, more interestingly, no detectable CPU load or disk usage (in Task Manager) for many minutes.
> I experienced the same issue with both arrow 2.0.0 and 3.0.0 (I updated my R packages just this morning). Before that, I had already tried reinstalling the arrow 2.0.0 package.
> Did I misunderstand something in the vignette? Is there an OS limitation?
>  
> {code:java}
> // 
> > library(arrow)
> 
> Attaching package: 'arrow'
> 
> The following object is masked from 'package:utils':
> 
>     timestamp
> 
> > library(tidyverse)
> -- Attaching packages ---------------------------------------------------------------------- tidyverse 1.3.0 --
> v ggplot2 3.3.3     v purrr   0.3.4
> v tibble  3.0.5     v dplyr   1.0.3
> v tidyr   1.1.2     v stringr 1.4.0
> v readr   1.4.0     v forcats 0.5.1
> -- Conflicts ------------------------------------------------------------------------- tidyverse_conflicts() --
> x dplyr::filter() masks stats::filter()
> x dplyr::lag()    masks stats::lag()
> > arrow_available()
> [1] TRUE
> > arrow_info()
> Arrow package version: 3.0.0
> 
> Capabilities:
>                
> s3         TRUE
> snappy     TRUE
> gzip       TRUE
> brotli    FALSE
> zstd       TRUE
> lz4        TRUE
> lz4_frame  TRUE
> lzo       FALSE
> bz2       FALSE
> jemalloc  FALSE
> mimalloc   TRUE
> 
> Memory:
>                   
> Allocator mimalloc
> Current    0 bytes
> Max        0 bytes
> > 
> > ds <- open_dataset(taxidir, partitioning = c("year", "month"))
> > ds
> FileSystemDataset with 24 Parquet files
> vendor_id: string
> pickup_at: timestamp[us]
> dropoff_at: timestamp[us]
> passenger_count: int8
> trip_distance: float
> rate_code_id: string
> store_and_fwd_flag: string
> pickup_location_id: int32
> dropoff_location_id: int32
> payment_type: string
> fare_amount: float
> extra: float
> mta_tax: float
> tip_amount: float
> tolls_amount: float
> improvement_surcharge: float
> total_amount: float
> year: int32
> month: int32
> 
> See $metadata for additional Schema metadata
> > 
> > a <- ds %>% 
> +   select(year, total_amount) %>% collect()
> > 
> > b <- ds %>% 
> +   filter(year == 2018) %>% 
> +   select(year, total_amount) %>% collect()
> > 
> > c <- ds %>% 
> +   filter(total_amount > 100) %>% 
> +   select(year, total_amount) %>% collect()
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)