Posted to jira@arrow.apache.org by "Jonathan Keane (Jira)" <ji...@apache.org> on 2021/01/29 16:23:00 UTC
[jira] [Updated] (ARROW-11413) [R] dplyr filter is not working for datasets
[ https://issues.apache.org/jira/browse/ARROW-11413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Keane updated ARROW-11413:
-----------------------------------
Summary: [R] dplyr filter is not working for datasets (was: dplyr filter is not working for datasets )
> [R] dplyr filter is not working for datasets
> ---------------------------------------------
>
> Key: ARROW-11413
> URL: https://issues.apache.org/jira/browse/ARROW-11413
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 2.0.0, 3.0.0
> Environment: i7, windows 10 laptop
> Reporter: Zsolt Kegyes-Brassai
> Priority: Minor
>
> I was trying to reproduce the [vignette|https://arrow.apache.org/docs/r/articles/dataset.html] on datasets and dplyr on a Windows 10 machine. I downloaded the data for two consecutive years (2017 and 2018) to my laptop.
> The filter works only for the variables used for partitioning. When I filter on any other variable (such as total_amount), the R/RStudio session hangs: there is no error message and, more interestingly, no detectable CPU load or disk usage (in Task Manager) for many minutes.
> I experienced the same issue with both arrow 2.0.0 and 3.0.0 (I updated my R packages this morning). Before that, I had already tried reinstalling the arrow 2.0.0 package.
> Did I misunderstand something in the vignette? Is there an OS limitation?
>
> {code:java}
> > library(arrow)
> Attaching package: 'arrow'
> The following object is masked from 'package:utils':
>     timestamp
> > library(tidyverse)
> -- Attaching packages ---------------------------------------------------------------------- tidyverse 1.3.0 --
> v ggplot2 3.3.3 v purrr 0.3.4
> v tibble 3.0.5 v dplyr 1.0.3
> v tidyr 1.1.2 v stringr 1.4.0
> v readr 1.4.0 v forcats 0.5.1
> -- Conflicts ------------------------------------------------------------------------- tidyverse_conflicts() --
> x dplyr::filter() masks stats::filter()
> x dplyr::lag() masks stats::lag()
> > arrow_available()
> [1] TRUE
> > arrow_info()
> Arrow package version: 3.0.0
>
> Capabilities:
>
> s3 TRUE
> snappy TRUE
> gzip TRUE
> brotli FALSE
> zstd TRUE
> lz4 TRUE
> lz4_frame TRUE
> lzo FALSE
> bz2 FALSE
> jemalloc FALSE
> mimalloc TRUE
>
> Memory:
>
> Allocator mimalloc
> Current 0 bytes
> Max 0 bytes
> >
> > ds <- open_dataset(taxidir, partitioning = c("year", "month"))
> > ds
> FileSystemDataset with 24 Parquet files
> vendor_id: string
> pickup_at: timestamp[us]
> dropoff_at: timestamp[us]
> passenger_count: int8
> trip_distance: float
> rate_code_id: string
> store_and_fwd_flag: string
> pickup_location_id: int32
> dropoff_location_id: int32
> payment_type: string
> fare_amount: float
> extra: float
> mta_tax: float
> tip_amount: float
> tolls_amount: float
> improvement_surcharge: float
> total_amount: float
> year: int32
> month: int32
>
> See $metadata for additional Schema metadata
> >
> > a <- ds %>%
> + select(year, total_amount) %>% collect()
> >
> > b <- ds %>%
> + filter(year == 2018) %>%
> + select(year, total_amount) %>% collect()
> >
> > c <- ds %>%
> + filter(total_amount > 100) %>%
> + select(year, total_amount) %>% collect()
> {code}
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)