Posted to jira@arrow.apache.org by "Nic Crane (Jira)" <ji...@apache.org> on 2021/07/09 13:37:00 UTC

[jira] [Comment Edited] (ARROW-13293) [R] open_dataset followed by collect hangs (while compute works)

    [ https://issues.apache.org/jira/browse/ARROW-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378070#comment-17378070 ] 

Nic Crane edited comment on ARROW-13293 at 7/9/21, 1:36 PM:
------------------------------------------------------------

Thanks for reporting this. I wonder whether this is similar to another issue we've seen before. Could you please try turning off multithreading with the code below and see whether that resolves the issue?
{code:java}
 options(arrow.use_threads = FALSE){code}
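
To make the suggestion above easier to try, here is a minimal sketch of checking the option against a dataset query. It is an illustration only: the temporary directory and the use of mtcars are stand-ins (assumptions, not part of the report), and it does not reproduce the hang itself.

```r
library(dplyr)
library(arrow)

# Stand-in data (assumption): write mtcars as a small Hive-partitioned
# Parquet dataset into a temporary directory.
tmp <- tempfile("arrow_threads_demo_")
write_dataset(mtcars, tmp, format = "parquet", partitioning = "cyl")

# Turn Arrow's multithreading off and remember the previous setting.
old <- options(arrow.use_threads = FALSE)

# Same shape as the reported call: open_dataset() %>% filter() %>% collect()
result <- open_dataset(tmp) %>%
  filter(cyl == 4, mpg > 30) %>%
  collect()

# Restore the previous setting once the check is done.
options(old)

print(dim(result))
```

To run the same check on the reported data, the path and filter from the reprex (`"data/lucas_harmonised/1_table/parquet_hive/"` with `nuts1 == "BE2", year == 2018`) would replace the stand-ins.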


was (Author: thisisnic):
Thanks for reporting this.  I'm wondering if this is similar to another issue.  Please could you try turning off multithreading using the code below, and seeing if that resolves the issue?
{code:java}
 options(arrow.use_threads = FALSE){code}

> [R] open_dataset followed by collect hangs (while compute works)
> ----------------------------------------------------------------
>
>                 Key: ARROW-13293
>                 URL: https://issues.apache.org/jira/browse/ARROW-13293
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 4.0.1
>         Environment: Windows 10 (see also session info included in reprex)
>            Reporter: Hans Van Calster
>            Priority: Minor
>
> I tried to make a reproducible example using the iris dataset, but everything works as expected with that dataset. The issue might therefore be specific to the dataset I am using (which contains over 100 columns). The example below illustrates the issue.
> The parquet data used in the example can be downloaded from [this link|https://drive.google.com/file/d/1MHaq3KqlheqrNm8dk71we74n_ip9hMqJ/view?usp=sharing].
>  
> The issue I see is the following:
>  
>  * Calling open_dataset() %>% filter() %>% collect() hangs on my machine, whereas I would expect a 1,646 x 116 tibble to be returned very quickly.
>  * The two alternative calls seem to work as expected: one uses read_parquet() on the specific parquet file within the Dataset on which I filter, and the other uses compute() instead of collect().
>  
> ``` r
> library(dplyr)
> #> 
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #> 
> #> filter, lag
> #> The following objects are masked from 'package:base':
> #> 
> #> intersect, setdiff, setequal, union
> library(arrow)
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #> timestamp
> read_parquet("data/lucas_harmonised/1_table/parquet_hive/year=2018/part-4.parquet") %>%
>  filter(nuts1 == "BE2")
> #> # A tibble: 1,646 x 116
> #> id point_id nuts0 nuts1 nuts2 nuts3 th_lat th_long office_pi ex_ante
> #> <int> <int> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> 
> #> 1 199451 39803106 BE BE2 BE22 BE221 51.0 5.14 1 0 
> #> 2 220669 39623116 BE BE2 BE21 BE213 51.0 4.88 1 0 
> #> 3 215557 39483154 BE BE2 BE21 BE211 51.4 4.64 1 0 
> #> 4 223579 40303122 BE BE2 BE22 BE222 51.1 5.84 1 0 
> #> 5 331079 39783134 BE BE2 BE21 BE213 51.2 5.09 0 0 
> #> 6 225417 39403150 BE BE2 BE21 BE211 51.3 4.53 1 0 
> #> 7 3340 38863118 BE BE2 BE23 BE234 51.0 3.79 1 0 
> #> 8 137361 38143132 BE BE2 BE25 BE258 51.1 2.75 1 0 
> #> 9 221861 38343148 BE BE2 BE25 BE255 51.2 3.02 1 0 
> #> 10 787 39523148 BE BE2 BE21 BE211 51.3 4.70 1 0 
> #> # ... with 1,636 more rows, and 106 more variables: survey_date <chr>,
> #> # car_latitude <dbl>, car_ew <chr>, car_longitude <dbl>, gps_proj <chr>,
> #> # gps_prec <int>, gps_altitude <int>, gps_lat <dbl>, gps_ew <chr>,
> #> # gps_long <dbl>, obs_dist <dbl>, obs_direct <chr>, obs_type <chr>,
> #> # obs_radius <chr>, letter_group <chr>, lc1 <chr>, lc1_label <chr>,
> #> # lc1_spec <chr>, lc1_spec_label <chr>, lc1_perc <chr>, lc2 <chr>,
> #> # lc2_label <chr>, lc2_spec <chr>, lc2_spec_label <chr>, lc2_perc <chr>,
> #> # lu1 <chr>, lu1_label <chr>, lu1_type <chr>, lu1_type_label <chr>,
> #> # lu1_perc <chr>, lu2 <chr>, lu2_label <chr>, lu2_type <chr>,
> #> # lu2_type_label <chr>, lu2_perc <chr>, parcel_area_ha <chr>,
> #> # tree_height_maturity <chr>, tree_height_survey <chr>, feature_width <chr>,
> #> # lm_stone_walls <chr>, crop_residues <chr>, lm_grass_margins <chr>,
> #> # grazing <chr>, special_status <chr>, lc_lu_special_remark <chr>,
> #> # cprn_cando <chr>, cprn_lc <chr>, cprn_lc_label <chr>, cprn_lc1n <int>,
> #> # cprnc_lc1e <int>, cprnc_lc1s <int>, cprnc_lc1w <int>,
> #> # cprn_lc1n_brdth <int>, cprn_lc1e_brdth <int>, cprn_lc1s_brdth <int>,
> #> # cprn_lc1w_brdth <int>, cprn_lc1n_next <chr>, cprn_lc1s_next <chr>,
> #> # cprn_lc1e_next <chr>, cprn_lc1w_next <chr>, cprn_urban <chr>,
> #> # cprn_impervious_perc <int>, inspire_plcc1 <int>, inspire_plcc2 <int>,
> #> # inspire_plcc3 <int>, inspire_plcc4 <int>, inspire_plcc5 <int>,
> #> # inspire_plcc6 <int>, inspire_plcc7 <int>, inspire_plcc8 <int>,
> #> # eunis_complex <chr>, grassland_sample <chr>, grass_cando <chr>, wm <chr>,
> #> # wm_source <chr>, wm_type <chr>, wm_delivery <chr>, erosion_cando <chr>,
> #> # soil_stones_perc <chr>, bio_sample <chr>, soil_bio_taken <chr>,
> #> # bulk0_10_sample <chr>, soil_blk_0_10_taken <chr>, bulk10_20_sample <chr>,
> #> # soil_blk_10_20_taken <chr>, bulk20_30_sample <chr>,
> #> # soil_blk_20_30_taken <chr>, standard_sample <chr>, soil_std_taken <chr>,
> #> # organic_sample <chr>, soil_org_depth_cando <chr>, soil_taken <chr>,
> #> # soil_crop <chr>, photo_point <chr>, photo_north <chr>, photo_south <chr>,
> #> # photo_east <chr>, photo_west <chr>, transect <chr>, revisit <int>, ...
> open_dataset("data/lucas_harmonised/1_table/parquet_hive/") %>%
>  filter(nuts1 == "BE2", year == 2018) %>%
>  compute() 
> #> Table
> #> 1646 rows x 117 columns
> #> $id <int64>
> #> $point_id <int64>
> #> $nuts0 <string>
> #> $nuts1 <string>
> #> $nuts2 <string>
> #> $nuts3 <string>
> #> $th_lat <double>
> #> $th_long <double>
> #> $office_pi <string>
> #> $ex_ante <string>
> #> $survey_date <string>
> #> $car_latitude <double>
> #> $car_ew <string>
> #> $car_longitude <double>
> #> $gps_proj <string>
> #> $gps_prec <int64>
> #> $gps_altitude <int64>
> #> $gps_lat <double>
> #> $gps_ew <string>
> #> $gps_long <double>
> #> $obs_dist <double>
> #> $obs_direct <string>
> #> $obs_type <string>
> #> $obs_radius <string>
> #> $letter_group <string>
> #> $lc1 <string>
> #> $lc1_label <string>
> #> $lc1_spec <string>
> #> $lc1_spec_label <string>
> #> $lc1_perc <string>
> #> $lc2 <string>
> #> $lc2_label <string>
> #> $lc2_spec <string>
> #> $lc2_spec_label <string>
> #> $lc2_perc <string>
> #> $lu1 <string>
> #> $lu1_label <string>
> #> $lu1_type <string>
> #> $lu1_type_label <string>
> #> $lu1_perc <string>
> #> $lu2 <string>
> #> $lu2_label <string>
> #> $lu2_type <string>
> #> $lu2_type_label <string>
> #> $lu2_perc <string>
> #> $parcel_area_ha <string>
> #> $tree_height_maturity <string>
> #> $tree_height_survey <string>
> #> $feature_width <string>
> #> $lm_stone_walls <string>
> #> $crop_residues <string>
> #> $lm_grass_margins <string>
> #> $grazing <string>
> #> $special_status <string>
> #> $lc_lu_special_remark <string>
> #> $cprn_cando <string>
> #> $cprn_lc <string>
> #> $cprn_lc_label <string>
> #> $cprn_lc1n <int64>
> #> $cprnc_lc1e <int64>
> #> $cprnc_lc1s <int64>
> #> $cprnc_lc1w <int64>
> #> $cprn_lc1n_brdth <int64>
> #> $cprn_lc1e_brdth <int64>
> #> $cprn_lc1s_brdth <int64>
> #> $cprn_lc1w_brdth <int64>
> #> $cprn_lc1n_next <string>
> #> $cprn_lc1s_next <string>
> #> $cprn_lc1e_next <string>
> #> $cprn_lc1w_next <string>
> #> $cprn_urban <string>
> #> $cprn_impervious_perc <int64>
> #> $inspire_plcc1 <int64>
> #> $inspire_plcc2 <int64>
> #> $inspire_plcc3 <int64>
> #> $inspire_plcc4 <int64>
> #> $inspire_plcc5 <int64>
> #> $inspire_plcc6 <int64>
> #> $inspire_plcc7 <int64>
> #> $inspire_plcc8 <int64>
> #> $eunis_complex <string>
> #> $grassland_sample <string>
> #> $grass_cando <string>
> #> $wm <string>
> #> $wm_source <string>
> #> $wm_type <string>
> #> $wm_delivery <string>
> #> $erosion_cando <string>
> #> $soil_stones_perc <string>
> #> $bio_sample <string>
> #> $soil_bio_taken <string>
> #> $bulk0_10_sample <string>
> #> $soil_blk_0_10_taken <string>
> #> $bulk10_20_sample <string>
> #> $soil_blk_10_20_taken <string>
> #> $bulk20_30_sample <string>
> #> $soil_blk_20_30_taken <string>
> #> $standard_sample <string>
> #> $soil_std_taken <string>
> #> $organic_sample <string>
> #> $soil_org_depth_cando <string>
> #> $soil_taken <string>
> #> $soil_crop <string>
> #> $photo_point <string>
> #> $photo_north <string>
> #> $photo_south <string>
> #> $photo_east <string>
> #> $photo_west <string>
> #> $transect <string>
> #> $revisit <int64>
> #> $th_gps_dist <double>
> #> $file_path_gisco_north <string>
> #> $file_path_gisco_south <string>
> #> $file_path_gisco_east <string>
> #> $file_path_gisco_west <string>
> #> $file_path_gisco_point <string>
> #> $year <int32>
> #open_dataset("data/lucas_harmonised/1_table/parquet_hive/") %>%
> # filter(nuts1 == "BE2", year == 2018) %>%
> # collect()
> # not run: this will hang
> ```
> <sup>Created on 2021-07-09 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)</sup>
> <details style="margin-bottom:10px;">
> <summary>
> Session info
> </summary>
> ``` r
> sessioninfo::session_info()
> #> - Session info ---------------------------------------------------------------
> #> setting value 
> #> version R version 4.1.0 (2021-05-18)
> #> os Windows 10 x64 
> #> system x86_64, mingw32 
> #> ui RTerm 
> #> language (EN) 
> #> collate Dutch_Belgium.1252 
> #> ctype Dutch_Belgium.1252 
> #> tz Europe/Paris 
> #> date 2021-07-09 
> #> 
> #> - Packages -------------------------------------------------------------------
> #> package * version date lib source 
> #> arrow * 4.0.1 2021-05-28 [1] CRAN (R 4.1.0)
> #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
> #> bit 4.0.4 2020-08-04 [1] CRAN (R 4.1.0)
> #> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.1.0)
> #> cli 2.5.0 2021-04-26 [1] CRAN (R 4.0.5)
> #> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.1.0)
> #> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.1.0)
> #> digest 0.6.27 2020-10-24 [1] CRAN (R 4.1.0)
> #> dplyr * 1.0.7 2021-06-18 [1] CRAN (R 4.0.5)
> #> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
> #> evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.0)
> #> fansi 0.5.0 2021-05-25 [1] CRAN (R 4.0.5)
> #> fs 1.5.0 2020-07-31 [1] CRAN (R 4.1.0)
> #> generics 0.1.0 2020-10-31 [1] CRAN (R 4.1.0)
> #> glue 1.4.2 2020-08-27 [1] CRAN (R 4.1.0)
> #> highr 0.9 2021-04-16 [1] CRAN (R 4.1.0)
> #> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.1.0)
> #> knitr 1.33 2021-04-24 [1] CRAN (R 4.1.0)
> #> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.1.0)
> #> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.0)
> #> pillar 1.6.1 2021-05-16 [1] CRAN (R 4.1.0)
> #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
> #> ps 1.6.0 2021-02-28 [1] CRAN (R 4.1.0)
> #> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
> #> R6 2.5.0 2020-10-28 [1] CRAN (R 4.1.0)
> #> reprex 2.0.0 2021-04-02 [1] CRAN (R 4.1.0)
> #> rlang 0.4.11 2021-04-30 [1] CRAN (R 4.1.0)
> #> rmarkdown 2.9 2021-06-15 [1] CRAN (R 4.0.5)
> #> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0)
> #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.1.0)
> #> stringi 1.6.2 2021-05-17 [1] CRAN (R 4.0.5)
> #> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.0)
> #> tibble 3.1.2 2021-05-16 [1] CRAN (R 4.1.0)
> #> tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.0)
> #> utf8 1.2.1 2021-03-12 [1] CRAN (R 4.1.0)
> #> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0)
> #> withr 2.4.2 2021-04-18 [1] CRAN (R 4.1.0)
> #> xfun 0.24 2021-06-15 [1] CRAN (R 4.0.5)
> #> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.1.0)
> #> 
> #> [1] C:/R/library
> #> [2] C:/R/R-4.1.0/library
> ```
> </details>


