Posted to jira@arrow.apache.org by "Ian Cook (Jira)" <ji...@apache.org> on 2021/02/18 05:15:00 UTC

[jira] [Commented] (ARROW-11413) [R] dplyr filter is not working for datasets

    [ https://issues.apache.org/jira/browse/ARROW-11413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17286277#comment-17286277 ] 

Ian Cook commented on ARROW-11413:
----------------------------------

[~kbzsl] Sometimes problems like this happen because of thread deadlocks. Could you try configuring Arrow to use only one virtual CPU core, and see if that resolves the hanging behavior?

You can see how many virtual CPU cores Arrow is able to use by running:
{code:java}
arrow::cpu_count()
{code}
and you can configure Arrow to use only one virtual CPU core by running:
{code:java}
arrow::set_cpu_count(1)
{code}
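
If it helps, the check above can be wrapped so that the previous thread count is restored afterwards. This is only a sketch: with_single_arrow_thread() is a hypothetical helper, not part of the arrow package, and it assumes cpu_count() reports the value most recently set with set_cpu_count(). The commented-out call reuses the ds object and the total_amount filter from the report quoted below.

```r
# Hypothetical helper (not part of arrow): evaluate `expr` while Arrow's
# CPU thread pool is limited to one thread, then restore the previous setting.
with_single_arrow_thread <- function(expr) {
  old <- arrow::cpu_count()                      # remember the current setting
  on.exit(arrow::set_cpu_count(old), add = TRUE) # restore it even on error
  arrow::set_cpu_count(1)
  force(expr)  # `expr` is a lazy promise, so it is evaluated only here
}

# Usage, assuming `ds` was opened with open_dataset() as in the report
# and dplyr is attached:
# res <- with_single_arrow_thread(
#   ds %>% filter(total_amount > 100) %>% select(year, total_amount) %>% collect()
# )
```

If the query returns with one thread but hangs with the default setting, that would point toward a threading problem rather than a problem with the filter itself.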
 

> [R] dplyr filter is not working for datasets 
> ---------------------------------------------
>
>                 Key: ARROW-11413
>                 URL: https://issues.apache.org/jira/browse/ARROW-11413
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 2.0.0, 3.0.0
>         Environment: i7, windows 10 laptop
>            Reporter: Zsolt Kegyes-Brassai
>            Priority: Minor
>
> I was trying to reproduce the [vignette|https://arrow.apache.org/docs/r/articles/dataset.html] on datasets and dplyr on a Windows 10 machine. I downloaded the data for two consecutive years (2017 and 2018) to my laptop.
> The filter works only for the variables used for partitioning. When I filter on any other variable (such as total_amount), the R/RStudio session hangs: there is no error message and, more interestingly, no detectable CPU load or disk usage (in Task Manager) for many minutes.
> I experienced the same issue with both arrow 2.0.0 and 3.0.0 (I updated my R packages just this morning). Before that, I had already tried reinstalling the arrow 2.0.0 package.
> Did I misunderstand something in the vignette? Is there any OS limitation?
>  
> {code:java}
> > library(arrow)
>
> Attaching package: 'arrow'
>
> The following object is masked from 'package:utils':
>
>     timestamp
>
> > library(tidyverse)
> -- Attaching packages ---------------------------------------------------------------------- tidyverse 1.3.0 --
> v ggplot2 3.3.3     v purrr   0.3.4
> v tibble  3.0.5     v dplyr   1.0.3
> v tidyr   1.1.2     v stringr 1.4.0
> v readr   1.4.0     v forcats 0.5.1
> -- Conflicts ------------------------------------------------------------------------- tidyverse_conflicts() --
> x dplyr::filter() masks stats::filter()
> x dplyr::lag()    masks stats::lag()
> > arrow_available()
> [1] TRUE
> > arrow_info()
> Arrow package version: 3.0.0
>
> Capabilities:
>                
> s3         TRUE
> snappy     TRUE
> gzip       TRUE
> brotli    FALSE
> zstd       TRUE
> lz4        TRUE
> lz4_frame  TRUE
> lzo       FALSE
> bz2       FALSE
> jemalloc  FALSE
> mimalloc   TRUE
>
> Memory:
>                   
> Allocator mimalloc
> Current    0 bytes
> Max        0 bytes
> >
> > ds <- open_dataset(taxidir, partitioning = c("year", "month"))
> > ds
> FileSystemDataset with 24 Parquet files
> vendor_id: string
> pickup_at: timestamp[us]
> dropoff_at: timestamp[us]
> passenger_count: int8
> trip_distance: float
> rate_code_id: string
> store_and_fwd_flag: string
> pickup_location_id: int32
> dropoff_location_id: int32
> payment_type: string
> fare_amount: float
> extra: float
> mta_tax: float
> tip_amount: float
> tolls_amount: float
> improvement_surcharge: float
> total_amount: float
> year: int32
> month: int32
>
> See $metadata for additional Schema metadata
> > 
> > a <- ds %>% 
> +   select(year, total_amount) %>% collect()
> > 
> > b <- ds %>% 
> +   filter(year == 2018) %>% 
> +   select(year, total_amount) %>% collect()
> > 
> > c <- ds %>% 
> +   filter(total_amount > 100) %>% 
> +   select(year, total_amount) %>% collect()
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)