Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2021/09/29 18:45:00 UTC
[jira] [Resolved] (ARROW-13293) [R] open_dataset followed by collect hangs (while compute works)

     [ https://issues.apache.org/jira/browse/ARROW-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson resolved ARROW-13293.
-------------------------------------
    Fix Version/s: 6.0.0
         Assignee: Neal Richardson
       Resolution: Fixed

We believe that this has been resolved in ARROW-8379. If you still experience this with version 6.0.0 or greater (after it is released in mid-October), please open a new issue.
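
As a minimal sketch (not part of the original resolution note), the following shows one way to check whether an installed arrow version is at or past the fix version before opening a new issue. The helper name `version_at_least` is an assumption introduced here for illustration; it is not an arrow function.

```r
# Hypothetical helper (not part of arrow): is `installed` at least `minimum`?
version_at_least <- function(installed, minimum = "6.0.0") {
  package_version(installed) >= package_version(minimum)
}

# The report was filed against arrow 4.0.1, which predates the fix version.
version_at_least("4.0.1")

# Check the locally installed arrow package, if there is one.
if (requireNamespace("arrow", quietly = TRUE)) {
  version_at_least(as.character(utils::packageVersion("arrow")))
}
```

If the check returns TRUE and the hang still occurs, that would be the situation in which a new issue is requested.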

> [R] open_dataset followed by collect hangs (while compute works)
> ----------------------------------------------------------------
>
>                 Key: ARROW-13293
>                 URL: https://issues.apache.org/jira/browse/ARROW-13293
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 4.0.1
>         Environment: Windows 10 (see also session info included in reprex)
>            Reporter: Hans Van Calster
>            Assignee: Neal Richardson
>            Priority: Minor
>             Fix For: 6.0.0
>
>
> I tried to build a reproducible example with the iris dataset, but everything works as expected there, so the issue may be specific to the dataset I am using (which has over 100 columns). The example below illustrates the issue.
> The Parquet data used in the example can be downloaded from [this link|https://drive.google.com/file/d/1MHaq3KqlheqrNm8dk71we74n_ip9hMqJ/view?usp=sharing].
>  
> The issue I see is the following:
>  
>  * Calling open_dataset() %>% filter() %>% collect() hangs on my machine, although I would expect it to return a 1,646 x 116 tibble almost immediately.
>  * The two alternatives work as expected: calling read_parquet() on the specific Parquet file within the Dataset that the filter selects, and using compute() instead of collect().
>  
> ``` r
> library(dplyr)
> #> 
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #> 
> #>     filter, lag
> #> The following objects are masked from 'package:base':
> #> 
> #>     intersect, setdiff, setequal, union
> library(arrow)
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #>     timestamp
> read_parquet("data/lucas_harmonised/1_table/parquet_hive/year=2018/part-4.parquet") %>%
>   filter(nuts1 == "BE2")
> #> # A tibble: 1,646 x 116
> #>        id point_id nuts0 nuts1 nuts2 nuts3 th_lat th_long office_pi ex_ante
> #>     <int>    <int> <chr> <chr> <chr> <chr>  <dbl>   <dbl> <chr>     <chr>
> #>  1 199451 39803106 BE    BE2   BE22  BE221   51.0    5.14 1         0
> #>  2 220669 39623116 BE    BE2   BE21  BE213   51.0    4.88 1         0
> #>  3 215557 39483154 BE    BE2   BE21  BE211   51.4    4.64 1         0
> #>  4 223579 40303122 BE    BE2   BE22  BE222   51.1    5.84 1         0
> #>  5 331079 39783134 BE    BE2   BE21  BE213   51.2    5.09 0         0
> #>  6 225417 39403150 BE    BE2   BE21  BE211   51.3    4.53 1         0
> #>  7   3340 38863118 BE    BE2   BE23  BE234   51.0    3.79 1         0
> #>  8 137361 38143132 BE    BE2   BE25  BE258   51.1    2.75 1         0
> #>  9 221861 38343148 BE    BE2   BE25  BE255   51.2    3.02 1         0
> #> 10    787 39523148 BE    BE2   BE21  BE211   51.3    4.70 1         0
> #> # ... with 1,636 more rows, and 106 more variables: survey_date <chr>,
> #> # car_latitude <dbl>, car_ew <chr>, car_longitude <dbl>, gps_proj <chr>,
> #> # gps_prec <int>, gps_altitude <int>, gps_lat <dbl>, gps_ew <chr>,
> #> # gps_long <dbl>, obs_dist <dbl>, obs_direct <chr>, obs_type <chr>,
> #> # obs_radius <chr>, letter_group <chr>, lc1 <chr>, lc1_label <chr>,
> #> # lc1_spec <chr>, lc1_spec_label <chr>, lc1_perc <chr>, lc2 <chr>,
> #> # lc2_label <chr>, lc2_spec <chr>, lc2_spec_label <chr>, lc2_perc <chr>,
> #> # lu1 <chr>, lu1_label <chr>, lu1_type <chr>, lu1_type_label <chr>,
> #> # lu1_perc <chr>, lu2 <chr>, lu2_label <chr>, lu2_type <chr>,
> #> # lu2_type_label <chr>, lu2_perc <chr>, parcel_area_ha <chr>,
> #> # tree_height_maturity <chr>, tree_height_survey <chr>, feature_width <chr>,
> #> # lm_stone_walls <chr>, crop_residues <chr>, lm_grass_margins <chr>,
> #> # grazing <chr>, special_status <chr>, lc_lu_special_remark <chr>,
> #> # cprn_cando <chr>, cprn_lc <chr>, cprn_lc_label <chr>, cprn_lc1n <int>,
> #> # cprnc_lc1e <int>, cprnc_lc1s <int>, cprnc_lc1w <int>,
> #> # cprn_lc1n_brdth <int>, cprn_lc1e_brdth <int>, cprn_lc1s_brdth <int>,
> #> # cprn_lc1w_brdth <int>, cprn_lc1n_next <chr>, cprn_lc1s_next <chr>,
> #> # cprn_lc1e_next <chr>, cprn_lc1w_next <chr>, cprn_urban <chr>,
> #> # cprn_impervious_perc <int>, inspire_plcc1 <int>, inspire_plcc2 <int>,
> #> # inspire_plcc3 <int>, inspire_plcc4 <int>, inspire_plcc5 <int>,
> #> # inspire_plcc6 <int>, inspire_plcc7 <int>, inspire_plcc8 <int>,
> #> # eunis_complex <chr>, grassland_sample <chr>, grass_cando <chr>, wm <chr>,
> #> # wm_source <chr>, wm_type <chr>, wm_delivery <chr>, erosion_cando <chr>,
> #> # soil_stones_perc <chr>, bio_sample <chr>, soil_bio_taken <chr>,
> #> # bulk0_10_sample <chr>, soil_blk_0_10_taken <chr>, bulk10_20_sample <chr>,
> #> # soil_blk_10_20_taken <chr>, bulk20_30_sample <chr>,
> #> # soil_blk_20_30_taken <chr>, standard_sample <chr>, soil_std_taken <chr>,
> #> # organic_sample <chr>, soil_org_depth_cando <chr>, soil_taken <chr>,
> #> # soil_crop <chr>, photo_point <chr>, photo_north <chr>, photo_south <chr>,
> #> # photo_east <chr>, photo_west <chr>, transect <chr>, revisit <int>, ...
> open_dataset("data/lucas_harmonised/1_table/parquet_hive/") %>%
>   filter(nuts1 == "BE2", year == 2018) %>%
>   compute()
> #> Table
> #> 1646 rows x 117 columns
> #> $id <int64>
> #> $point_id <int64>
> #> $nuts0 <string>
> #> $nuts1 <string>
> #> $nuts2 <string>
> #> $nuts3 <string>
> #> $th_lat <double>
> #> $th_long <double>
> #> $office_pi <string>
> #> $ex_ante <string>
> #> $survey_date <string>
> #> $car_latitude <double>
> #> $car_ew <string>
> #> $car_longitude <double>
> #> $gps_proj <string>
> #> $gps_prec <int64>
> #> $gps_altitude <int64>
> #> $gps_lat <double>
> #> $gps_ew <string>
> #> $gps_long <double>
> #> $obs_dist <double>
> #> $obs_direct <string>
> #> $obs_type <string>
> #> $obs_radius <string>
> #> $letter_group <string>
> #> $lc1 <string>
> #> $lc1_label <string>
> #> $lc1_spec <string>
> #> $lc1_spec_label <string>
> #> $lc1_perc <string>
> #> $lc2 <string>
> #> $lc2_label <string>
> #> $lc2_spec <string>
> #> $lc2_spec_label <string>
> #> $lc2_perc <string>
> #> $lu1 <string>
> #> $lu1_label <string>
> #> $lu1_type <string>
> #> $lu1_type_label <string>
> #> $lu1_perc <string>
> #> $lu2 <string>
> #> $lu2_label <string>
> #> $lu2_type <string>
> #> $lu2_type_label <string>
> #> $lu2_perc <string>
> #> $parcel_area_ha <string>
> #> $tree_height_maturity <string>
> #> $tree_height_survey <string>
> #> $feature_width <string>
> #> $lm_stone_walls <string>
> #> $crop_residues <string>
> #> $lm_grass_margins <string>
> #> $grazing <string>
> #> $special_status <string>
> #> $lc_lu_special_remark <string>
> #> $cprn_cando <string>
> #> $cprn_lc <string>
> #> $cprn_lc_label <string>
> #> $cprn_lc1n <int64>
> #> $cprnc_lc1e <int64>
> #> $cprnc_lc1s <int64>
> #> $cprnc_lc1w <int64>
> #> $cprn_lc1n_brdth <int64>
> #> $cprn_lc1e_brdth <int64>
> #> $cprn_lc1s_brdth <int64>
> #> $cprn_lc1w_brdth <int64>
> #> $cprn_lc1n_next <string>
> #> $cprn_lc1s_next <string>
> #> $cprn_lc1e_next <string>
> #> $cprn_lc1w_next <string>
> #> $cprn_urban <string>
> #> $cprn_impervious_perc <int64>
> #> $inspire_plcc1 <int64>
> #> $inspire_plcc2 <int64>
> #> $inspire_plcc3 <int64>
> #> $inspire_plcc4 <int64>
> #> $inspire_plcc5 <int64>
> #> $inspire_plcc6 <int64>
> #> $inspire_plcc7 <int64>
> #> $inspire_plcc8 <int64>
> #> $eunis_complex <string>
> #> $grassland_sample <string>
> #> $grass_cando <string>
> #> $wm <string>
> #> $wm_source <string>
> #> $wm_type <string>
> #> $wm_delivery <string>
> #> $erosion_cando <string>
> #> $soil_stones_perc <string>
> #> $bio_sample <string>
> #> $soil_bio_taken <string>
> #> $bulk0_10_sample <string>
> #> $soil_blk_0_10_taken <string>
> #> $bulk10_20_sample <string>
> #> $soil_blk_10_20_taken <string>
> #> $bulk20_30_sample <string>
> #> $soil_blk_20_30_taken <string>
> #> $standard_sample <string>
> #> $soil_std_taken <string>
> #> $organic_sample <string>
> #> $soil_org_depth_cando <string>
> #> $soil_taken <string>
> #> $soil_crop <string>
> #> $photo_point <string>
> #> $photo_north <string>
> #> $photo_south <string>
> #> $photo_east <string>
> #> $photo_west <string>
> #> $transect <string>
> #> $revisit <int64>
> #> $th_gps_dist <double>
> #> $file_path_gisco_north <string>
> #> $file_path_gisco_south <string>
> #> $file_path_gisco_east <string>
> #> $file_path_gisco_west <string>
> #> $file_path_gisco_point <string>
> #> $year <int32>
> # Not run: the same query with collect() instead of compute() hangs
> # open_dataset("data/lucas_harmonised/1_table/parquet_hive/") %>%
> #   filter(nuts1 == "BE2", year == 2018) %>%
> #   collect()
> ```
> <sup>Created on 2021-07-09 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)</sup>
> <details style="margin-bottom:10px;">
> <summary>
> Session info
> </summary>
> ``` r
> sessioninfo::session_info()
> #> - Session info ---------------------------------------------------------------
> #>  setting  value
> #>  version  R version 4.1.0 (2021-05-18)
> #>  os       Windows 10 x64
> #>  system   x86_64, mingw32
> #>  ui       RTerm
> #>  language (EN)
> #>  collate  Dutch_Belgium.1252
> #>  ctype    Dutch_Belgium.1252
> #>  tz       Europe/Paris
> #>  date     2021-07-09
> #>
> #> - Packages -------------------------------------------------------------------
> #>  package     * version date       lib source
> #>  arrow       * 4.0.1   2021-05-28 [1] CRAN (R 4.1.0)
> #>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.1.0)
> #>  bit           4.0.4   2020-08-04 [1] CRAN (R 4.1.0)
> #>  bit64         4.0.5   2020-08-30 [1] CRAN (R 4.1.0)
> #>  cli           2.5.0   2021-04-26 [1] CRAN (R 4.0.5)
> #>  crayon        1.4.1   2021-02-08 [1] CRAN (R 4.1.0)
> #>  DBI           1.1.1   2021-01-15 [1] CRAN (R 4.1.0)
> #>  digest        0.6.27  2020-10-24 [1] CRAN (R 4.1.0)
> #>  dplyr       * 1.0.7   2021-06-18 [1] CRAN (R 4.0.5)
> #>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
> #>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.1.0)
> #>  fansi         0.5.0   2021-05-25 [1] CRAN (R 4.0.5)
> #>  fs            1.5.0   2020-07-31 [1] CRAN (R 4.1.0)
> #>  generics      0.1.0   2020-10-31 [1] CRAN (R 4.1.0)
> #>  glue          1.4.2   2020-08-27 [1] CRAN (R 4.1.0)
> #>  highr         0.9     2021-04-16 [1] CRAN (R 4.1.0)
> #>  htmltools     0.5.1.1 2021-01-22 [1] CRAN (R 4.1.0)
> #>  knitr         1.33    2021-04-24 [1] CRAN (R 4.1.0)
> #>  lifecycle     1.0.0   2021-02-15 [1] CRAN (R 4.1.0)
> #>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.1.0)
> #>  pillar        1.6.1   2021-05-16 [1] CRAN (R 4.1.0)
> #>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
> #>  ps            1.6.0   2021-02-28 [1] CRAN (R 4.1.0)
> #>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
> #>  R6            2.5.0   2020-10-28 [1] CRAN (R 4.1.0)
> #>  reprex        2.0.0   2021-04-02 [1] CRAN (R 4.1.0)
> #>  rlang         0.4.11  2021-04-30 [1] CRAN (R 4.1.0)
> #>  rmarkdown     2.9     2021-06-15 [1] CRAN (R 4.0.5)
> #>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.0)
> #>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.1.0)
> #>  stringi       1.6.2   2021-05-17 [1] CRAN (R 4.0.5)
> #>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.1.0)
> #>  tibble        3.1.2   2021-05-16 [1] CRAN (R 4.1.0)
> #>  tidyselect    1.1.1   2021-04-30 [1] CRAN (R 4.1.0)
> #>  utf8          1.2.1   2021-03-12 [1] CRAN (R 4.1.0)
> #>  vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.1.0)
> #>  withr         2.4.2   2021-04-18 [1] CRAN (R 4.1.0)
> #>  xfun          0.24    2021-06-15 [1] CRAN (R 4.0.5)
> #>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.1.0)
> #> 
> #> [1] C:/R/library
> #> [2] C:/R/R-4.1.0/library
> ```
> </details>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)