Posted to issues@arrow.apache.org by "xtimbeau (via GitHub)" <gi...@apache.org> on 2024/03/18 08:24:23 UTC

[I] Collect crashes on R when partitioning col is in parquet files and in subfolder names [arrow]

xtimbeau opened a new issue, #40624:
URL: https://github.com/apache/arrow/issues/40624

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Building datasets from folders, subfolders, and parquet files can make the R arrow package crash:
   
   Here is the reprex (run the three tests separately):
   
   ```r
   # test one: hive field in both parquet and folder name (crashes) --------
   library(tidyverse) 
   data <- tibble(
     x = 1:10,
     g = floor(0:9/5))
   if(fs::file_exists("/tmp/test")) fs::dir_delete("/tmp/test")
   # building subfolders
   walk(0:1, ~{
     dd <- data |> filter(g==.x)
     fn <- str_c("/tmp/test/g=", .x, "/part-0.parquet")
     fs::dir_create(str_c("/tmp/test/g=", .x))
     arrow::write_parquet(dd, fn)})
   arrow::open_dataset("/tmp/test/g=1") |> collect() # ok
   arrow::open_dataset("/tmp/test/g=0") |> collect() # ok
   arrow::open_dataset("/tmp/test/")  |> collect()  # crashes
   
   # test two: hive field in parquet but not in folder name (ok) --------
   library(tidyverse) 
   data <- tibble(
     x = 1:10,
     g = floor(0:9/5))
   if(fs::file_exists("/tmp/test")) fs::dir_delete("/tmp/test")
   walk(0:1, ~{
     dd <- data |> filter(g==.x)
     fn <- str_c("/tmp/test/", .x, "/part-0.parquet")
     fs::dir_create(str_c("/tmp/test/", .x))
     arrow::write_parquet(dd, fn)})
   arrow::open_dataset("/tmp/test/1") |> collect() # pass
   arrow::open_dataset("/tmp/test/0") |> collect() # pass
   arrow::open_dataset("/tmp/test/")  |> collect()  # pass
   
   # test three: hive field not in parquet but in folder name (ok) --------
   library(tidyverse) 
   data <- tibble(
     x = 1:10,
     g = floor(0:9/5))
   if(fs::file_exists("/tmp/test")) fs::dir_delete("/tmp/test")
   walk(0:1, ~{
     dd <- data |> filter(g==.x) |> select(-g)
     fn <- str_c("/tmp/test/g=", .x, "/part-0.parquet")
     fs::dir_create(str_c("/tmp/test/g=", .x))
     arrow::write_parquet(dd, fn)})
   arrow::open_dataset("/tmp/test/g=1") |> collect() # pass
   arrow::open_dataset("/tmp/test/g=0") |> collect() # pass
   arrow::open_dataset("/tmp/test/")  |> collect() # pass
   ```
   
   Using arrow 15.0.1, macOS 14.4, RStudio, R 4.3.3
   
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [R] Collect crashes on R when partitioning col is in parquet files and in subfolder names [arrow]

Posted by "amoeba (via GitHub)" <gi...@apache.org>.
amoeba commented on issue #40624:
URL: https://github.com/apache/arrow/issues/40624#issuecomment-2048650526

   Hey @xtimbeau, I can't reproduce this; instead I get:
   
   ```r
   > arrow::open_dataset("/tmp/test/")  |> collect()  # crashes
   Error in `arrow::open_dataset()`:
   ! Type error: Unable to merge: Field g has incompatible types: double vs int32
   ```
   
   I wonder if this is related to what's crashing your session in https://github.com/apache/arrow/issues/40627. Similar to that one, attaching a debugger as in https://arrow.apache.org/docs/r/articles/developers/debugging.html#running-r-code-with-the-c-debugger-attached would be the best next step here.
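   If the type conflict in that error message is indeed the trigger, here is a sketch of two possible workarounds (untested against the crashing setup, and assuming the standard arrow R API): either tell `open_dataset()` the type of the partition field explicitly so it matches the `double` column stored in the files, or write `g` as integer so both sources agree.
   
   ```r
   # A sketch, assuming the crash stems from the schema conflict above
   # (g is double in the files but inferred as int32 from the
   # "g=0"/"g=1" directory names).
   library(arrow)
   
   # Option 1: declare the partition field's type explicitly so it
   # matches the double column stored inside the parquet files.
   ds <- open_dataset(
     "/tmp/test/",
     partitioning = hive_partition(g = float64())
   )
   
   # Option 2: store g as integer in the first place, so the file
   # column and the int32 inferred from the path agree, e.g.
   # data |> dplyr::mutate(g = as.integer(g)) before writing.
   ```
   
   Option 1 only changes how the dataset is read, which makes it the less invasive experiment if the files cannot be rewritten.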

