Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2021/02/25 16:22:00 UTC
[jira] [Resolved] (ARROW-11756) [R] passing a partition as a schema leads to segfaults

     [ https://issues.apache.org/jira/browse/ARROW-11756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson resolved ARROW-11756.
-------------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 9566
[https://github.com/apache/arrow/pull/9566]

> [R] passing a partition as a schema leads to segfaults
> ------------------------------------------------------
>
>                 Key: ARROW-11756
>                 URL: https://issues.apache.org/jira/browse/ARROW-11756
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Jonathan Keane
>            Assignee: Mauricio 'Pachá' Vargas Sepúlveda
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> [The command to open a dataset in R|https://arrow.apache.org/docs/r/reference/open_dataset.html] accepts both a schema and a partitioning argument. If a partitioning object is accidentally passed as the schema, the dataset appears to have been read, but subsequent operations on it segfault.
> Although this is a user input error, we should validate that the schema argument is in fact a {{Schema}} object and raise an error if it is not, so that no one is confronted with a segfault later.
> {code:r}
> ### begin setup 
> # note: this exact code is called in test-dataset.R (lines 18-87), so when adding
> # the test to that file you don't need to copy this setup; you can use the code at
> # the bottom of this chunk in that test if you want.
> library(arrow) # provides write_parquet(), open_dataset(), hive_partition(), utf8(), uint8()
> library(dplyr)
> make_temp_dir <- function() {
>   path <- tempfile()
>   dir.create(path)
>   normalizePath(path, winslash = "/")
> }
> hive_dir <- make_temp_dir()
> first_date <- lubridate::ymd_hms("2015-04-29 03:12:39")
> df1 <- tibble(
>   int = 1:10,
>   dbl = as.numeric(1:10),
>   lgl = rep(c(TRUE, FALSE, NA, TRUE, FALSE), 2),
>   chr = letters[1:10],
>   fct = factor(LETTERS[1:10]),
>   ts = first_date + lubridate::days(1:10)
> )
> second_date <- lubridate::ymd_hms("2017-03-09 07:01:02")
> df2 <- tibble(
>   int = 101:110,
>   dbl = c(as.numeric(51:59), NaN),
>   lgl = rep(c(TRUE, FALSE, NA, TRUE, FALSE), 2),
>   chr = letters[10:1],
>   fct = factor(LETTERS[10:1]),
>   ts = second_date + lubridate::days(10:1)
> )
> dir.create(file.path(hive_dir, "subdir", "group=1", "other=xxx"), recursive = TRUE)
> dir.create(file.path(hive_dir, "subdir", "group=2", "other=yyy"), recursive = TRUE)
> write_parquet(df1, file.path(hive_dir, "subdir", "group=1", "other=xxx", "file1.parquet"))
> write_parquet(df2, file.path(hive_dir, "subdir", "group=2", "other=yyy", "file2.parquet"))
> ### end setup
> # This (the correct specification) works just fine
> ds <- open_dataset(hive_dir, partitioning = hive_partition(other = utf8(), group = uint8()))
> ds$schema
> # But if you aren't explicit with the argument names, it looks like everything works...
> ds <- open_dataset(hive_dir, hive_partition(other = utf8(), group = uint8()))
> # ...but the dataset is malformed and segfaults when you interact with it, for example:
> ds$schema
> {code}
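
The validation requested above could be sketched roughly as follows. This is a minimal illustration only, not the change made in the linked pull request: the helper name check_schema_arg is hypothetical, and the two stand-in objects merely imitate class vectors so that the sketch runs without arrow installed. The assumption is that a real schema inherits from the class "Schema" and a partitioning object does not.

```r
# Hypothetical helper (not part of the arrow package): reject anything passed
# as `schema` that is not a Schema object, instead of failing later.
check_schema_arg <- function(schema) {
  # NULL is allowed here and is taken to mean "no schema supplied".
  if (!is.null(schema) && !inherits(schema, "Schema")) {
    stop(
      "`schema` must be a Schema object, not an object of class ",
      paste(class(schema), collapse = "/"),
      call. = FALSE
    )
  }
  invisible(schema)
}

# Stand-in objects with illustrative class vectors (assumptions, not real
# arrow objects), so the sketch is self-contained.
fake_schema <- structure(list(), class = c("Schema", "ArrowObject"))
fake_partitioning <- structure(
  list(),
  class = c("HivePartitioning", "Partitioning", "ArrowObject")
)

check_schema_arg(NULL)        # passes silently
check_schema_arg(fake_schema) # passes silently

# A partitioning passed where a schema is expected now fails early with an
# R error; the error message is captured here rather than aborting the script.
result <- tryCatch(
  check_schema_arg(fake_partitioning),
  error = function(e) conditionMessage(e)
)
print(result)
```

With a check of this kind at the top of open_dataset(), the unnamed-argument call in the reproduction above would be expected to stop with an ordinary R error rather than returning a malformed dataset.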


