Posted to jira@arrow.apache.org by "Nicola Crane (Jira)" <ji...@apache.org> on 2021/11/07 18:25:00 UTC
[jira] [Comment Edited] (ARROW-14583) [R][C++] Crash when summarizing after filtering to no rows on partitioned data

    [ https://issues.apache.org/jira/browse/ARROW-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440036#comment-17440036 ] 

Nicola Crane edited comment on ARROW-14583 at 11/7/21, 6:24 PM:
----------------------------------------------------------------

I've since been playing around in R and found that, even without filtering, I get a crash when doing group_by + summarise on partitioned data. For example, the code below segfaults:

 
{code:r}
library(arrow)
library(dplyr)

write_dataset(group_by(iris, Species), "iris_data")

open_dataset("iris_data") %>%
  group_by(Species) %>%
  summarise(mean(Sepal.Length)) %>%
  collect() {code}
Further experimentation shows that it only segfaults if you group by the same variable the data was partitioned by when it was saved - I guess it's still the 0 batches thing happening there.
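
A possible workaround sketch, under the assumption that the dataset fits in memory: calling collect() before group_by() moves the grouped aggregation from the Arrow query engine into plain dplyr. This has not been verified against the affected version, and the column name {{mean_sepal_length}} is only illustrative.

```r
library(arrow)
library(dplyr)

# Write iris partitioned by Species, as in the example above.
td <- tempfile()
dir.create(td)
write_dataset(group_by(iris, Species), td)

# Assumption: the data fits in memory. collect() materialises the dataset
# as a data frame first, so group_by() + summarise() run in dplyr and not
# in the Arrow query engine.
result <- open_dataset(td) %>%
  collect() %>%
  group_by(Species) %>%
  summarise(mean_sepal_length = mean(Sepal.Length))

print(result)
```

This gives up the benefit of pushing the aggregation down to Arrow, so it is only a stopgap for small data.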


was (Author: thisisnic):
I've since been playing around in R and found that even without filtering, I get a crash when doing group_by + summarise on partitioned data, e.g. I get a segfault from the below code

 
{code:java}
library(arrow)
library(dplyr)

write_dataset(group_by(iris, Species), "iris_data")

open_dataset("iris_data") %>%
  group_by(Species) %>%
  summarise(mean(Sepal.Length)) %>%
  collect() {code}

> [R][C++] Crash when summarizing after filtering to no rows on partitioned data
> ------------------------------------------------------------------------------
>
>                 Key: ARROW-14583
>                 URL: https://issues.apache.org/jira/browse/ARROW-14583
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>    Affects Versions: 6.0.0
>         Environment: I am using a windows 10 machine, R 4.1.0, up to date R packages, and latest RStudio IDE.
>            Reporter: Zsolt Kegyes-Brassai
>            Assignee: David Li
>            Priority: Major
>              Labels: pull-request-available, query-engine
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Original issue report is below; here's an even more minimal example:
> {code:r}
> library(arrow)
> library(dplyr)
> td <- tempfile()
> dir.create(td)
> # if the data is not partitioned, this won't segfault
> # write_dataset(iris, td) - swap this in and it won't segfault
> write_dataset(group_by(iris, Species), td)
> open_dataset(td) %>%
>   filter(Species == "tulip") %>%
>   group_by(Sepal.Length) %>%
>   summarise(n = n()) %>%
>   collect()
> {code}
> ----
> I was trying the new features introduced in the latest {{arrow (6.0.2)}} package, based on examples from the “New Directions for Apache Arrow” talk.
> The RStudio IDE was crashing and the R session was aborted.
> Looking more closely, I found that I had downloaded only two years of data (2018 & 2019), so after the first filter ({{year == 2015}}) no data remains to be processed.
> After some debugging (replacing the collect() call), it turns out that {{summarize()}} is the function causing the crash.
>  
> {code:r}
> as_dataset <- open_dataset("c:/Rproj_learn/nyc-taxi/", 
>                                 partitioning = c("year", "month")) %>%
>   filter(total_amount > 100 & year == 2015) %>%
>   select(tip_amount, total_amount, passenger_count) %>%
>   mutate(tip_pct = tip_amount / total_amount * 100) %>%
>   group_by(passenger_count) %>%
>   summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
>   filter(n > 5000) %>%
>   arrange(desc(avg_tip_pct)) %>%
>   collect(){code}
>  
> I would expect to get an error message (without crashing the IDE), which can be handled in code.
> An alternative result would be an empty data.frame, as happens when the parquet file is read in as a data.frame. I simulated this situation by setting a very high {{total_amount}} threshold when filtering. Note: when using an Arrow table, an error message is generated instead.
>  
> {code:r}
>  library(tidyverse)
> #> Warning: package 'tibble' was built under R version 4.1.1
> #> Warning: package 'tidyr' was built under R version 4.1.1
> #> Warning: package 'readr' was built under R version 4.1.1
> library(arrow)
> #> Warning: package 'arrow' was built under R version 4.1.1
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #>     timestamp
> read_parquet("c:/Rproj_learn/nyc-taxi/2018/01/data.parquet", 
>              as_data_frame = FALSE) %>%
>   # filter(total_amount > 100) %>%
>   filter(total_amount > 1e10) %>%
>   select(tip_amount, total_amount, passenger_count) %>%
>   mutate(tip_pct = tip_amount / total_amount * 100) %>%
>   group_by(passenger_count) %>%
>   summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
>   filter(n > 500) %>%
>   arrange(desc(avg_tip_pct)) %>%
>   collect()
> #> Error: Invalid: Must pass at least one array
> read_parquet("c:/Rproj_learn/nyc-taxi/2018/01/data.parquet", 
>              as_data_frame = TRUE) %>%
>   # filter(total_amount > 100) %>%
>   filter(total_amount > 1e10) %>%
>   select(tip_amount, total_amount, passenger_count) %>%
>   mutate(tip_pct = tip_amount / total_amount * 100) %>%
>   group_by(passenger_count) %>%
>   summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
>   filter(n > 500) %>%
>   arrange(desc(avg_tip_pct)) %>%
>   collect()
> #> # A tibble: 0 x 3
> #> # ... with 3 variables: passenger_count <int>, avg_tip_pct <dbl>, n <int>
> {code}
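
For comparison, a minimal sketch of the empty-result behaviour the reporter describes, using plain dplyr on the in-memory iris data frame (no Arrow involved). The variable name {{empty_summary}} is illustrative; the filter mirrors the {{Species == "tulip"}} condition from the minimal example.

```r
library(dplyr)

# Filter to zero rows, then group and summarise entirely in dplyr.
empty_summary <- iris %>%
  filter(Species == "tulip") %>%
  group_by(Sepal.Length) %>%
  summarise(n = n())

# The result is a zero-row tibble, so calling code can branch on the
# row count without needing to catch an error.
if (nrow(empty_summary) == 0) {
  message("No rows matched the filter")
}
```

A zero-row result with the expected columns is the outcome that downstream code can handle most simply.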


