Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/11/04 18:43:00 UTC

[jira] [Updated] (ARROW-14583) [R][C++] Crash when summarizing after filtering to no rows

     [ https://issues.apache.org/jira/browse/ARROW-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weston Pace updated ARROW-14583:
--------------------------------
    Labels: query-engine  (was: )

> [R][C++] Crash when summarizing after filtering to no rows
> ----------------------------------------------------------
>
>                 Key: ARROW-14583
>                 URL: https://issues.apache.org/jira/browse/ARROW-14583
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>    Affects Versions: 6.0.0
>         Environment: I am using a Windows 10 machine, R 4.1.0, up-to-date R packages, and the latest RStudio IDE.
>            Reporter: Zsolt Kegyes-Brassai
>            Priority: Major
>              Labels: query-engine
>
> I was trying the new features introduced in the latest {{arrow (6.0.2)}} package, based on examples from the “New Directions for Apache Arrow” talk.
> The RStudio IDE crashed and the R session was aborted.
> Looking closely, I found that I had downloaded only 2 years of data (2018 and 2019), so after the first filter ({{year == 2015}}) no data remains to be processed further.
> After some debugging (replacing the {{collect()}} call), it turns out that {{summarize()}} is the function causing the crash.
>  
> {code:java}
> as_dataset <- open_dataset("c:/Rproj_learn/nyc-taxi/", 
>                                 partitioning = c("year", "month")) %>%
>   filter(total_amount > 100 & year == 2015) %>%
>   select(tip_amount, total_amount, passenger_count) %>%
>   mutate(tip_pct = tip_amount / total_amount * 100) %>%
>   group_by(passenger_count) %>%
>   summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
>   filter(n > 5000) %>%
>   arrange(desc(avg_tip_pct)) %>%
>   collect(){code}
>  
> I would expect to get an error message (without crashing the IDE) that can be handled in code.
> An alternative result would be an empty data.frame, as happens when the parquet file is read in as a data.frame. I simulated this situation by setting a high {{total_amount}} threshold when filtering. Note: when using an Arrow table, an error message is generated.
>  
> {code:java}
> library(tidyverse)
> #> Warning: package 'tibble' was built under R version 4.1.1
> #> Warning: package 'tidyr' was built under R version 4.1.1
> #> Warning: package 'readr' was built under R version 4.1.1
> library(arrow)
> #> Warning: package 'arrow' was built under R version 4.1.1
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #>     timestamp
> read_parquet("c:/Rproj_learn/nyc-taxi/2018/01/data.parquet", 
>              as_data_frame = FALSE) %>%
>   # filter(total_amount > 100) %>%
>   filter(total_amount > 1e10) %>%
>   select(tip_amount, total_amount, passenger_count) %>%
>   mutate(tip_pct = tip_amount / total_amount * 100) %>%
>   group_by(passenger_count) %>%
>   summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
>   filter(n > 500) %>%
>   arrange(desc(avg_tip_pct)) %>%
>   collect()
> #> Error: Invalid: Must pass at least one array
> read_parquet("c:/Rproj_learn/nyc-taxi/2018/01/data.parquet", 
>              as_data_frame = TRUE) %>%
>   # filter(total_amount > 100) %>%
>   filter(total_amount > 1e10) %>%
>   select(tip_amount, total_amount, passenger_count) %>%
>   mutate(tip_pct = tip_amount / total_amount * 100) %>%
>   group_by(passenger_count) %>%
>   summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
>   filter(n > 500) %>%
>   arrange(desc(avg_tip_pct)) %>%
>   collect()
> #> # A tibble: 0 x 3
> #> # ... with 3 variables: passenger_count <int>, avg_tip_pct <dbl>, n <int>
> {code}
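
The description above asks for an error "which can be handled in code", or an empty data.frame. A minimal base R sketch of what such handling could look like is given here, assuming the failure surfaces as an ordinary R error, as in the Arrow Table case ("Invalid: Must pass at least one array"). The names collect_or_empty, failing_query and empty_summary are hypothetical and not part of arrow or dplyr, and the failing query is a stand-in rather than a real Arrow pipeline. This approach cannot intercept the session-aborting crash reported for the dataset case, because a crashed R process never reaches the error handler.

```r
# Hypothetical helper (not part of arrow or dplyr): run a pipeline supplied as
# a function and fall back to a caller-provided empty result if it errors.
collect_or_empty <- function(run_query, empty_result) {
  tryCatch(
    run_query(),
    error = function(e) {
      # Report the original message as a warning instead of failing.
      warning("Query failed: ", conditionMessage(e), call. = FALSE)
      empty_result
    }
  )
}

# Stand-in for a query that raises the error quoted in the Arrow Table case.
failing_query <- function() stop("Invalid: Must pass at least one array")

# Zero-row data.frame with the columns the summary would have produced.
empty_summary <- data.frame(
  passenger_count = integer(),
  avg_tip_pct = double(),
  n = integer()
)

result <- suppressWarnings(collect_or_empty(failing_query, empty_summary))
```

In real use, run_query would wrap the full pipeline ending in collect(), so that a filter matching no rows yields the zero-row fallback rather than stopping the script.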



--
This message was sent by Atlassian Jira
(v8.3.4#803005)