Posted to issues@arrow.apache.org by "Zsolt Kegyes-Brassai (Jira)" <ji...@apache.org> on 2021/11/04 07:06:00 UTC

[jira] [Created] (ARROW-14583) RStudio IDE crash

Zsolt Kegyes-Brassai created ARROW-14583:
--------------------------------------------

             Summary: RStudio IDE crash
                 Key: ARROW-14583
                 URL: https://issues.apache.org/jira/browse/ARROW-14583
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 6.0.0
         Environment: I am using a Windows 10 machine, R 4.1.0, up-to-date R packages, and the latest RStudio IDE.
            Reporter: Zsolt Kegyes-Brassai


I was trying the new features introduced in the latest {{arrow}} package (6.0.2), based on examples from the “New Directions for Apache Arrow” talk.

The RStudio IDE crashed and the R session was aborted.

Looking closely, I found that I had downloaded only two years of data (2018 and 2019), so after the first filter ({{year == 2015}}) no data remains to be processed further.

After some debugging, by replacing the {{collect()}} function, it turns out that {{summarize()}} is the function causing the crash.

 
{code:java}
as_dataset <- open_dataset("c:/Rproj_learn/nyc-taxi/", 
                                partitioning = c("year", "month")) %>%
  filter(total_amount > 100 & year == 2015) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = tip_amount / total_amount * 100) %>%
  group_by(passenger_count) %>%
  summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
  filter(n > 5000) %>%
  arrange(desc(avg_tip_pct)) %>%
  collect()
{code}
 

I would expect to get an error message (without crashing the IDE), which can be handled in code.
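
If the query raised an ordinary R error instead of aborting the session, it could be handled roughly as follows. This is only a minimal sketch in base R: {{collect_or_empty()}} and {{failing_query()}} are hypothetical names, not part of {{arrow}}, and the {{stop()}} call merely simulates the failure using the message reported for the Arrow table case. It cannot help with the crash itself, because a crashed session never reaches the handler.

{code:java}
# Hypothetical helper (not part of arrow): run a query and fall back to an
# empty data.frame if the query raises a regular R error.
collect_or_empty <- function(run_query) {
  tryCatch(
    run_query(),
    error = function(e) {
      message("Query failed: ", conditionMessage(e))
      data.frame()
    }
  )
}

# Stand-in for a pipeline that fails on empty input (assumption: the real
# pipeline would signal an R error like this one instead of crashing).
failing_query <- function() stop("Invalid: Must pass at least one array")

result <- collect_or_empty(failing_query)
nrow(result)
{code}

In real use, the function passed in would wrap the whole dplyr pipeline ending in {{collect()}}.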

An alternative result would be an empty data.frame, as happens when the parquet file is read in as a data.frame. I simulated this situation by setting a high {{total_amount}} threshold when filtering. Note: when using an Arrow table, an error message is generated instead.

 
{code:java}
library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.1.1
#> Warning: package 'tidyr' was built under R version 4.1.1
#> Warning: package 'readr' was built under R version 4.1.1
library(arrow)
#> Warning: package 'arrow' was built under R version 4.1.1
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

read_parquet("c:/Rproj_learn/nyc-taxi/2018/01/data.parquet", 
             as_data_frame = FALSE) %>%
  # filter(total_amount > 100) %>%
  filter(total_amount > 1e10) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = tip_amount / total_amount * 100) %>%
  group_by(passenger_count) %>%
  summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
  filter(n > 500) %>%
  arrange(desc(avg_tip_pct)) %>%
  collect()

#> Error: Invalid: Must pass at least one array


read_parquet("c:/Rproj_learn/nyc-taxi/2018/01/data.parquet", 
             as_data_frame = TRUE) %>%
  # filter(total_amount > 100) %>%
  filter(total_amount > 1e10) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = tip_amount / total_amount * 100) %>%
  group_by(passenger_count) %>%
  summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
  filter(n > 500) %>%
  arrange(desc(avg_tip_pct)) %>%
  collect()

#> # A tibble: 0 x 3
#> # ... with 3 variables: passenger_count <int>, avg_tip_pct <dbl>, n <int>
{code}


