Posted to issues@arrow.apache.org by "Zsolt Kegyes-Brassai (Jira)" <ji...@apache.org> on 2021/01/28 08:29:00 UTC

[jira] [Created] (ARROW-11413) dplyr filter is not working for datasets

Zsolt Kegyes-Brassai created ARROW-11413:
--------------------------------------------

             Summary: dplyr filter is not working for datasets 
                 Key: ARROW-11413
                 URL: https://issues.apache.org/jira/browse/ARROW-11413
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 3.0.0, 2.0.0
         Environment: i7, windows 10 laptop
            Reporter: Zsolt Kegyes-Brassai


I was trying to recreate the [vignette|https://arrow.apache.org/docs/r/articles/dataset.html] on datasets and dplyr on a win10 machine. I downloaded the data for 2 consecutive years (2017, 2018) to my laptop.

The filter works only for variables used for partitioning. When I filter on any other variable (such as total_amount), the R/RStudio session hangs: there is no error message and, more interestingly, no detectable CPU load or disk usage (according to Task Manager) for many minutes.

I experienced the same issue with both arrow 2.0.0 and 3.0.0 (I updated my R packages just this morning). Before that, I had already tried reinstalling the arrow 2.0.0 package.

Did I misunderstand something in the vignette? Is there any OS limitation?

 
{code:java}
> library(arrow)

Attaching package: 'arrow'

The following object is masked from 'package:utils':

    timestamp

> library(tidyverse)
-- Attaching packages ---------------------------------------------------------------------- tidyverse 1.3.0 --
v ggplot2 3.3.3     v purrr   0.3.4
v tibble  3.0.5     v dplyr   1.0.3
v tidyr   1.1.2     v stringr 1.4.0
v readr   1.4.0     v forcats 0.5.1
-- Conflicts ------------------------------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
> arrow_available()
[1] TRUE
> arrow_info()
Arrow package version: 3.0.0

Capabilities:
               
s3         TRUE
snappy     TRUE
gzip       TRUE
brotli    FALSE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2       FALSE
jemalloc  FALSE
mimalloc   TRUE

Memory:
                  
Allocator mimalloc
Current    0 bytes
Max        0 bytes
> 
> ds <- open_dataset(taxidir, partitioning = c("year", "month"))
> ds
FileSystemDataset with 24 Parquet files
vendor_id: string
pickup_at: timestamp[us]
dropoff_at: timestamp[us]
passenger_count: int8
trip_distance: float
rate_code_id: string
store_and_fwd_flag: string
pickup_location_id: int32
dropoff_location_id: int32
payment_type: string
fare_amount: float
extra: float
mta_tax: float
tip_amount: float
tolls_amount: float
improvement_surcharge: float
total_amount: float
year: int32
month: int32

See $metadata for additional Schema metadata
> 
> a <- ds %>% 
+   select(year, total_amount) %>% collect()
> 
> b <- ds %>% 
+   filter(year == 2018) %>% 
+   select(year, total_amount) %>% collect()
> 
> c <- ds %>% 
+   filter(total_amount > 100) %>% 
+   select(year, total_amount) %>% collect()
{code}
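
To help separate a data-size problem from a filter problem, the same kind of query can be tried on a tiny synthetic dataset. This is only a sketch and not part of the vignette: the toy data frame, the temporary directory and the names df, tmp, ds_small and res are made up for illustration. It assumes that write_dataset() writes Hive-style directories (year=2017/month=1) and that open_dataset() detects them on its own.

{code}
library(arrow)
library(dplyr)

# Hypothetical toy data: two years, one month each, a few fares.
df <- data.frame(
  year = rep(c(2017L, 2018L), each = 3),
  month = 1L,
  total_amount = c(5, 150, 20, 300, 8, 120)
)

# Write it as a partitioned Parquet dataset into a temporary directory.
tmp <- tempfile()
write_dataset(df, tmp, partitioning = c("year", "month"))

# Open the dataset; partition columns come from the directory names.
ds_small <- open_dataset(tmp)

# Same shape of query as the hanging one: filter on a non-partition column.
res <- ds_small %>%
  filter(total_amount > 100) %>%
  select(year, total_amount) %>%
  collect()
{code}

If this small query returns, the hang is more likely related to the size or layout of the downloaded files than to the filter itself. If it also hangs, the toy example could serve as a compact reproduction.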
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)