Posted to jira@arrow.apache.org by "Mauricio 'Pachá' Vargas Sepúlveda (Jira)" <ji...@apache.org> on 2021/06/24 17:42:00 UTC

[jira] [Updated] (ARROW-13169) [R] group_by + write_dataset skips some countries with UN COMTRADE / BACI datasets

     [ https://issues.apache.org/jira/browse/ARROW-13169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mauricio 'Pachá' Vargas Sepúlveda updated ARROW-13169:
------------------------------------------------------
    Description: 
A bit of context: the data for this example contains all world exports in 1995. It contains 212 countries, but when it is saved as Parquet, only 66 countries are actually written.

{code:r}
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

url <- "https://ams3.digitaloceanspaces.com/uncomtrade/baci_hs92_1995.rds"
rds <- "baci_hs92_1995.rds"

if (!file.exists(rds)) try(download.file(url, rds))

d <- readRDS("baci_hs92_1995.rds")

rds_has_usa <- any(grepl("usa", unique(d$reporter_iso)))
rds_has_usa
#> [1] TRUE

dir <- "parquet/baci_hs92"

d %>% 
  group_by(year, reporter_iso) %>% 
  write_dataset(dir, hive_style = F)

parquet_has_usa <- any(grepl("usa", list.files(paste0(dir, "/1995"))))
parquet_has_usa
#> [1] FALSE
{code}

_Created on 2021-06-24 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)_
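
The check above only tests for "usa". A minimal sketch of a broader check, assuming the non-hive layout used in the reprex (one directory per reporter under the year directory), is to compare the expected reporters against the directories actually on disk. The helper `missing_partitions()` and the simulated directory `demo_dir` are hypothetical names introduced here for illustration; they are not part of arrow.

```r
# Hypothetical helper (not part of arrow): return the expected partition
# values that have no matching directory directly under `path`.
missing_partitions <- function(expected, path) {
  written <- basename(list.dirs(path, full.names = TRUE, recursive = FALSE))
  setdiff(unique(as.character(expected)), written)
}

# Self-contained demonstration on a simulated, partially written dataset:
# only "chl" and "arg" get a directory, so "usa" should be reported missing.
demo_dir <- file.path(tempdir(), "baci_demo", "1995")
dir.create(demo_dir, recursive = TRUE, showWarnings = FALSE)
for (iso in c("chl", "arg")) {
  dir.create(file.path(demo_dir, iso), showWarnings = FALSE)
}

missing_partitions(c("chl", "arg", "usa", "usa"), demo_dir)
```

With the objects from the reprex, the equivalent call would be `missing_partitions(d$reporter_iso, file.path(dir, "1995"))`, which should list every reporter that has no output directory.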


  was:
``` r
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

url <- "https://ams3.digitaloceanspaces.com/uncomtrade/baci_hs92_1995.rds"
rds <- "baci_hs92_1995.rds"

if (!file.exists(rds)) try(download.file(url, rds))

d <- readRDS("baci_hs92_1995.rds")

rds_has_usa <- any(grepl("usa", unique(d$reporter_iso)))
rds_has_usa
#> [1] TRUE

dir <- "parquet/baci_hs92"

d %>% 
  group_by(year, reporter_iso) %>% 
  write_dataset(dir, hive_style = F)

parquet_has_usa <- any(grepl("usa", list.files(paste0(dir, "/1995"))))
parquet_has_usa
#> [1] FALSE
```

<sup>Created on 2021-06-24 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)</sup>



> [R] group_by + write_dataset skips some countries with UN COMTRADE / BACI datasets
> ----------------------------------------------------------------------------------
>
>                 Key: ARROW-13169
>                 URL: https://issues.apache.org/jira/browse/ARROW-13169
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 4.0.1
>            Reporter: Mauricio 'Pachá' Vargas Sepúlveda
>            Priority: Major
>             Fix For: 5.0.0
>
>
> A bit of context: the data for this example contains all world exports in 1995. It contains 212 countries, but when it is saved as Parquet, only 66 countries are actually written.
> {code:r}
> library(arrow)
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #>     timestamp
> library(dplyr)
> #> 
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #> 
> #>     filter, lag
> #> The following objects are masked from 'package:base':
> #> 
> #>     intersect, setdiff, setequal, union
> url <- "https://ams3.digitaloceanspaces.com/uncomtrade/baci_hs92_1995.rds"
> rds <- "baci_hs92_1995.rds"
> if (!file.exists(rds)) try(download.file(url, rds))
> d <- readRDS("baci_hs92_1995.rds")
> rds_has_usa <- any(grepl("usa", unique(d$reporter_iso)))
> rds_has_usa
> #> [1] TRUE
> dir <- "parquet/baci_hs92"
> d %>% 
>   group_by(year, reporter_iso) %>% 
>   write_dataset(dir, hive_style = F)
> parquet_has_usa <- any(grepl("usa", list.files(paste0(dir, "/1995"))))
> parquet_has_usa
> #> [1] FALSE
> {code}
> _Created on 2021-06-24 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)_



--
This message was sent by Atlassian Jira
(v8.3.4#803005)