You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "David Li (Jira)" <ji...@apache.org> on 2021/11/05 14:11:00 UTC
[jira] [Commented] (ARROW-14583) [R][C++] Crash when summarizing after filtering to no rows on partitioned data

    [ https://issues.apache.org/jira/browse/ARROW-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17439291#comment-17439291 ] 

David Li commented on ARROW-14583:
----------------------------------

The crash is because if GroupByNode gets 0 batches, it never initializes its internal state. The "Invalid" error is because Take() on a ChunkedArray assumes there must be 1 or more than 1 chunk, and fails to account for 0 chunks (and so as Weston guessed, is trying to concatenate 0 chunks).

> [R][C++] Crash when summarizing after filtering to no rows on partitioned data
> ------------------------------------------------------------------------------
>
>                 Key: ARROW-14583
>                 URL: https://issues.apache.org/jira/browse/ARROW-14583
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>    Affects Versions: 6.0.0
>         Environment: I am using a windows 10 machine, R 4.1.0, up to date R packages, and latest RStudio IDE.
>            Reporter: Zsolt Kegyes-Brassai
>            Assignee: David Li
>            Priority: Major
>              Labels: query-engine
>
> Original issue report is below; here's an even more minimal example:
> {code:r}
> library(arrow)
> library(dplyr)
> td <- tempfile()
> dir.create(td)
> # if there is no partitioning in data data, this won't segfault
> # write_dataset(iris, td) - swap this in and won't segfault
> write_dataset(group_by(iris, Species), td)
> open_dataset(td) %>%
>   filter(Species == "tulip") %>%
>   group_by(Sepal.Length) %>%
>   summarise(n = n()) %>%
>   collect()
> {code}
> ----
> I was trying the new features introduced in latest {{arrow (6.0.2)}} package based on examples from the “New Directions for Apache Arrow” talk.
> The RStudio IDE was crashing and the R session was aborted.
> Looking closely I found that I downloaded only 2 years of data (2018 & 2019) and after the first filter ({{year == 2015}}) no data remains to be processed further.
> After some debugging, by replacing the collect() function, it turns out that the {{summarize()}} is the one which function is causing the crash.
>  
> {code:java}
> as_dataset <- open_dataset("c:/Rproj_learn/nyc-taxi/", 
>                                 partitioning = c("year", "month")) %>%
>   filter(total_amount > 100 & year == 2015) %>%
>   select(tip_amount, total_amount, passenger_count) %>%
>   mutate(tip_pct = tip_amount / total_amount * 100) %>%
>   group_by(passenger_count) %>%
>   summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
>   filter(n > 5000) %>%
>   arrange(desc(avg_tip_pct)) %>%
>   collect(){code}
>  
> I would expect to get an error message (without crashing the IDE), which can be handled in code.
> Another alternative result would be an empty data.frame, like in case when the parquet file was read in as a data.frame. I simulated this situation by setting a high {{total_amount}} value when filtering. Note: when using an Arrow table an error message is generated.
>  
> {code:java}
>  library(tidyverse)
> #> Warning: package 'tibble' was built under R version 4.1.1
> #> Warning: package 'tidyr' was built under R version 4.1.1
> #> Warning: package 'readr' was built under R version 4.1.1
> library(arrow)
> #> Warning: package 'arrow' was built under R version 4.1.1
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #>     timestamp
> read_parquet("c:/Rproj_learn/nyc-taxi/2018/01/data.parquet", 
>              as_data_frame = FALSE) %>%
>   # filter(total_amount > 100) %>%
>   filter(total_amount > 1e10) %>%
>   select(tip_amount, total_amount, passenger_count) %>%
>   mutate(tip_pct = tip_amount / total_amount * 100) %>%
>   group_by(passenger_count) %>%
>   summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
>   filter(n > 500) %>%
>   arrange(desc(avg_tip_pct)) %>%
>   collect()
> #> Error: Invalid: Must pass at least one array
> read_parquet("c:/Rproj_learn/nyc-taxi/2018/01/data.parquet", 
>              as_data_frame = TRUE) %>%
>   # filter(total_amount > 100) %>%
>   filter(total_amount > 1e10) %>%
>   select(tip_amount, total_amount, passenger_count) %>%
>   mutate(tip_pct = tip_amount / total_amount * 100) %>%
>   group_by(passenger_count) %>%
>   summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
>   filter(n > 500) %>%
>   arrange(desc(avg_tip_pct)) %>%
>   collect()
> #> # A tibble: 0 x 3
> #> # ... with 3 variables: passenger_count <int>, avg_tip_pct <dbl>, n <int>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)