Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/07/01 23:55:00 UTC

[jira] [Commented] (ARROW-13169) [R] [C++] sorted partition keys can cause issues

    [ https://issues.apache.org/jira/browse/ARROW-13169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17373105#comment-17373105 ] 

Weston Pace commented on ARROW-13169:
-------------------------------------

Looks like this is a bug in the dataset partitioning / grouper.  The grouper appears to be persisting some kind of static state between instantiations, but I can't find anything.  I tried a couple of unit tests to reproduce but haven't found what the trick is; I can only seem to trigger it by running the above R script.  If I comment out GrouperFastImpl so it always uses GrouperImpl then everything works correctly.  [~bkietz] [~michalno] can one of you take a look?  If not I can keep digging, but I figured I'd check if you had any ideas first.

> [R] [C++] sorted partition keys can cause issues
> ------------------------------------------------
>
>                 Key: ARROW-13169
>                 URL: https://issues.apache.org/jira/browse/ARROW-13169
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>            Reporter: Mauricio 'Pachá' Vargas Sepúlveda
>            Assignee: Nic Crane
>            Priority: Blocker
>             Fix For: 5.0.0
>
>         Attachments: screenshot-1.png
>
>
> _This is a regression after 4.0.1 so is not a live-bug in a release version of arrow_
> When a partition key happens to be ordered on a large dataset (>= 1e7 rows), the partitions are not written faithfully. 
> If the partition key isn't ordered, or the dataset has fewer than 1e7 rows, the partitions appear to be correct (though we should check that the values in the other columns still match when we test this).
> {code:r}
> library(arrow)
> dir <- "./1M_records"
> n_row <- 1e6
> df <- data.frame(foo = runif(n_row))
> df$let <- sort(sample(letters, n_row, replace = TRUE))
> write_dataset(df, dir, partitioning = "let")
> # this should be 26, corresponding to the number of letters (and is)
> length(list.files(dir))
> #> [1] 26
> dir <- "./10M_records_not_sorted"
> n_row <- 1e7
> df <- data.frame(foo = runif(n_row))
> df$let <- sample(letters, n_row, replace = TRUE)
> write_dataset(df, dir, partitioning = "let")
> # this should be 26, corresponding to the number of letters (and is!)
> length(list.files(dir))
> #> [1] 26
> dir <- "./10M_records"
> n_row <- 1e7
> df <- data.frame(foo = runif(n_row))
> df$let <- sort(sample(letters, n_row, replace = TRUE))
> write_dataset(df, dir, partitioning = "let")
> # this should be 26, corresponding to the number of letters (but is not)
> length(list.files(dir))
> #> [1] 3
> # the letters that were retained:
> list.files(dir)
> #> [1] "let=a" "let=b" "let=c"
> # Oddly(?) all of the rows are here, they have just been reshuffled into one of the letters retained
> nrow(open_dataset(dir))
> #> [1] 10000000
> {code}
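> The value check suggested above could be sketched roughly as follows (this is a hypothetical addition, not part of the original reprex; it assumes {{df}} and {{dir}} from the examples above). Since row order within a partition is not guaranteed, it compares per-letter row counts and sums rather than rows positionally:
> {code:r}
> library(arrow)
> library(dplyr)
> 
> # Read the written dataset back into memory
> roundtrip <- open_dataset(dir) %>% collect()
> 
> # Summarise both the original and the round-tripped data per partition key
> orig_summary <- df %>% group_by(let) %>% summarise(n = n(), s = sum(foo)) %>% arrange(let)
> rt_summary <- roundtrip %>% group_by(let) %>% summarise(n = n(), s = sum(foo)) %>% arrange(let)
> 
> # These should agree if the partitions were written faithfully
> all.equal(as.data.frame(orig_summary), as.data.frame(rt_summary))
> {code}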
> h1. Original report for context:
> A bit of context: the data for this example contains all world exports in 1995. It contains 212 countries, but when saving it as Parquet, only 66 countries are actually recorded. The verification I included checks whether the USA (one of the countries with the best reporter quality index) is present in the data.
> {code:r}
> library(arrow)
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #>     timestamp
> library(dplyr)
> #> 
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #> 
> #>     filter, lag
> #> The following objects are masked from 'package:base':
> #> 
> #>     intersect, setdiff, setequal, union
> url <- "https://ams3.digitaloceanspaces.com/uncomtrade/baci_hs92_1995.rds"
> rds <- "baci_hs92_1995.rds"
> if (!file.exists(rds)) try(download.file(url, rds))
> d <- readRDS("baci_hs92_1995.rds")
> rds_has_usa <- any(grepl("usa", unique(d$reporter_iso)))
> rds_has_usa
> #> [1] TRUE
> dir <- "parquet/baci_hs92"
> d %>% 
>   group_by(year, reporter_iso) %>% 
>   write_dataset(dir, hive_style = F)
> parquet_has_usa <- any(grepl("usa", list.files(paste0(dir, "/1995"))))
> parquet_has_usa
> #> [1] FALSE
> {code}
> _Created on 2021-06-24 by the reprex package (https://reprex.tidyverse.org) (v2.0.0)_



--
This message was sent by Atlassian Jira
(v8.3.4#803005)