Posted to jira@arrow.apache.org by "Nicola Crane (Jira)" <ji...@apache.org> on 2021/11/07 18:25:00 UTC
[jira] [Comment Edited] (ARROW-14583) [R][C++] Crash when
summarizing after filtering to no rows on partitioned data
[ https://issues.apache.org/jira/browse/ARROW-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440036#comment-17440036 ]
Nicola Crane edited comment on ARROW-14583 at 11/7/21, 6:24 PM:
----------------------------------------------------------------
I've since been playing around in R and found that even without filtering, I get a crash when doing group_by + summarise on partitioned data, e.g. I get a segfault from the below code
{code:r}
library(arrow)
library(dplyr)

write_dataset(group_by(iris, Species), "iris_data")

open_dataset("iris_data") %>%
  group_by(Species) %>%
  summarise(mean(Sepal.Length)) %>%
  collect()
{code}
Further experimentation shows that it only segfaults if you group by the same variable that the data was partitioned by when it was saved - I guess it's still the 0 batches thing happening there.
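For contrast, a hedged sketch (same toy dataset assumed): grouping by a column other than the partition key does not appear to trigger the segfault, which supports the partition-key theory:
{code:r}
library(arrow)
library(dplyr)

write_dataset(group_by(iris, Species), "iris_data")

# Grouping by a non-partition column completes without crashing:
open_dataset("iris_data") %>%
  group_by(Sepal.Width) %>%
  summarise(mean(Sepal.Length)) %>%
  collect()
{code}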
> [R][C++] Crash when summarizing after filtering to no rows on partitioned data
> ------------------------------------------------------------------------------
>
> Key: ARROW-14583
> URL: https://issues.apache.org/jira/browse/ARROW-14583
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, R
> Affects Versions: 6.0.0
> Environment: I am using a windows 10 machine, R 4.1.0, up to date R packages, and latest RStudio IDE.
> Reporter: Zsolt Kegyes-Brassai
> Assignee: David Li
> Priority: Major
> Labels: pull-request-available, query-engine
> Time Spent: 1h
> Remaining Estimate: 0h
>
> Original issue report is below; here's an even more minimal example:
> {code:r}
> library(arrow)
> library(dplyr)
>
> td <- tempfile()
> dir.create(td)
>
> # If there is no partitioning in the data, this won't segfault:
> # write_dataset(iris, td)  # swap this in and it won't segfault
> write_dataset(group_by(iris, Species), td)
>
> open_dataset(td) %>%
>   filter(Species == "tulip") %>%
>   group_by(Sepal.Length) %>%
>   summarise(n = n()) %>%
>   collect()
> {code}
> ----
> I was trying the new features introduced in the latest {{arrow}} package (6.0.2), based on examples from the “New Directions for Apache Arrow” talk.
> The RStudio IDE was crashing and the R session was aborted.
> Looking closer, I found that I had downloaded only two years of data (2018 & 2019), so after the first filter ({{year == 2015}}) no data remained to be processed further.
> After some debugging (replacing the {{collect()}} call), it turned out that {{summarize()}} is the function causing the crash.
>
> {code:r}
> as_dataset <- open_dataset("c:/Rproj_learn/nyc-taxi/",
>                            partitioning = c("year", "month")) %>%
>   filter(total_amount > 100 & year == 2015) %>%
>   select(tip_amount, total_amount, passenger_count) %>%
>   mutate(tip_pct = tip_amount / total_amount * 100) %>%
>   group_by(passenger_count) %>%
>   summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
>   filter(n > 5000) %>%
>   arrange(desc(avg_tip_pct)) %>%
>   collect()
> {code}
>
> I would expect to get an error message (without crashing the IDE) that can be handled in code.
> An alternative result would be an empty data.frame, as happens when the parquet file is read in as a data.frame. I simulated this situation by setting a high {{total_amount}} value when filtering. Note: when using an Arrow table, an error message is generated instead.
>
> {code:r}
> library(tidyverse)
> #> Warning: package 'tibble' was built under R version 4.1.1
> #> Warning: package 'tidyr' was built under R version 4.1.1
> #> Warning: package 'readr' was built under R version 4.1.1
> library(arrow)
> #> Warning: package 'arrow' was built under R version 4.1.1
> #>
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #>
> #>     timestamp
>
> read_parquet("c:/Rproj_learn/nyc-taxi/2018/01/data.parquet",
>              as_data_frame = FALSE) %>%
>   # filter(total_amount > 100) %>%
>   filter(total_amount > 1e10) %>%
>   select(tip_amount, total_amount, passenger_count) %>%
>   mutate(tip_pct = tip_amount / total_amount * 100) %>%
>   group_by(passenger_count) %>%
>   summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
>   filter(n > 500) %>%
>   arrange(desc(avg_tip_pct)) %>%
>   collect()
> #> Error: Invalid: Must pass at least one array
>
> read_parquet("c:/Rproj_learn/nyc-taxi/2018/01/data.parquet",
>              as_data_frame = TRUE) %>%
>   # filter(total_amount > 100) %>%
>   filter(total_amount > 1e10) %>%
>   select(tip_amount, total_amount, passenger_count) %>%
>   mutate(tip_pct = tip_amount / total_amount * 100) %>%
>   group_by(passenger_count) %>%
>   summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
>   filter(n > 500) %>%
>   arrange(desc(avg_tip_pct)) %>%
>   collect()
> #> # A tibble: 0 x 3
> #> # ... with 3 variables: passenger_count <int>, avg_tip_pct <dbl>, n <int>
> {code}
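> If {{summarise()}} raised a catchable R error instead of segfaulting, the failure could be handled in user code. A hedged sketch (reusing the minimal iris example above; the empty-data.frame fallback is just one possible choice):
> {code:r}
> library(arrow)
> library(dplyr)
>
> td <- tempfile()
> dir.create(td)
> write_dataset(group_by(iris, Species), td)
>
> result <- tryCatch(
>   open_dataset(td) %>%
>     filter(Species == "tulip") %>%   # matches no rows
>     group_by(Sepal.Length) %>%
>     summarise(n = n()) %>%
>     collect(),
>   error = function(e) data.frame()   # fall back to an empty data.frame
> )
> {code}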
--
This message was sent by Atlassian Jira
(v8.20.1#820001)