Posted to issues@arrow.apache.org by "jeanetteclark (via GitHub)" <gi...@apache.org> on 2023/03/29 20:42:37 UTC

[GitHub] [arrow] jeanetteclark opened a new issue, #34780: Seg fault on `write_dataset` with partition and threading

jeanetteclark opened a new issue, #34780:
URL: https://github.com/apache/arrow/issues/34780

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Hi Arrow! Recently, with Arrow 11, R has been crashing completely when I run `write_dataset` on a dataset that previously worked fine. I haven't been able to reproduce the crash with dummy data, so my best reprex, below, uses my production dataset. If I turn threading off, everything works, but it takes almost 10 minutes, much longer than it used to, so I'd prefer to keep threading on if I can.
   
   ```
   library(arrow)
   
   dest <- tempfile()
   t <- getOption('timeout')
   options(timeout = 600)
   
   # 18 MB
   download.file("https://portal.edirepository.org/nis/dataviewer?packageid=edi.1075.1&entityid=926f4aa8484f185b69bc1827fa67d40c",
                 dest)
   
   load(dest) # ~2 GB uncompressed
   
   options(arrow.use_threads = TRUE)
   
   system.time(write_dataset(res_fish,
                 "test_data",
                 format = "parquet",
                 partitioning = "Taxa"))
   
   options(timeout = t)
   ```
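   
   For reference, the single-threaded workaround mentioned above might look like the sketch below. It uses a small made-up data frame (`dummy`, a hypothetical stand-in for `res_fish`) so that it runs without the download; `arrow.use_threads` is the same option set in the reprex.
   
   ```r
   library(arrow)
   
   # Hypothetical stand-in for `res_fish` (the real data is ~2 GB in memory).
   dummy <- data.frame(
     Taxa = rep(c("a", "b"), each = 5),
     value = 1:10
   )
   
   out_dir <- tempfile()
   
   # Turn Arrow's multithreading off for this write only, then restore
   # whatever value the option had before.
   old <- options(arrow.use_threads = FALSE)
   write_dataset(dummy, out_dir, format = "parquet", partitioning = "Taxa")
   options(old)
   
   # One hive-style directory per partition value, e.g. "Taxa=a".
   list.files(out_dir)
   ```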
   
   <details>
     <summary>Session Info</summary>
   
     ```
   R version 4.2.3 (2023-03-15)
   Platform: aarch64-apple-darwin20 (64-bit)
   Running under: macOS Monterey 12.6
   
   Matrix products: default
   LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
   
   locale:
   [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
   
   attached base packages:
   [1] stats     graphics  grDevices utils     datasets  methods   base     
   
   loaded via a namespace (and not attached):
    [1] arrow_11.0.0.3   assertthat_0.2.1 brio_1.1.3       rappdirs_0.3.3   R6_2.5.1         lifecycle_1.0.3 
    [7] magrittr_2.0.3   rlang_1.1.0      cli_3.6.1        rstudioapi_0.14  testthat_3.1.7   vctrs_0.6.1     
   [13] tools_4.2.3      bit64_4.0.5      glue_1.6.2       purrr_1.0.1      bit_4.0.5        compiler_4.2.3  
   [19] tidyselect_1.2.0 EDIutils_1.0.2  
     ```
   </details>
   
   The error I got by following the instructions from the very helpful debugging page is:
   
   ```
   > write_dataset(res_fish,
   +               "test_data",
   +               format = "parquet",
   +               partitioning = "Taxa")
   Process 1622 stopped
   * thread #14, stop reason = EXC_BAD_ACCESS (code=2, address=0x170493fe0)
       frame #0: 0x00000001b885a994 libsystem_malloc.dylib`nanov2_allocate_from_block + 8
   libsystem_malloc.dylib`nanov2_allocate_from_block:
   ->  0x1b885a994 <+8>:  stp    x28, x27, [sp, #0x20]
       0x1b885a998 <+12>: stp    x26, x25, [sp, #0x30]
       0x1b885a99c <+16>: stp    x24, x23, [sp, #0x40]
       0x1b885a9a0 <+20>: stp    x22, x21, [sp, #0x50]
   Target 0: (R) stopped.
   
   ```
   
   Thanks for any help, and sorry I haven't been able to put together an example that doesn't require downloading a bunch of data.
   
   Potentially related issues are #34211 and #34539.
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] paleolimbot commented on issue #34780: [R] Seg fault on `write_dataset` with partition and threading on MacOS

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #34780:
URL: https://github.com/apache/arrow/issues/34780#issuecomment-1492190892

   Thanks for the heads up! I tried again with the latest and greatest Arrow on the development branch and it still crashes.
   
   I wonder if this is just an out-of-memory condition manifesting in a number of different ways? The latest error I got was a SIGKILL.




[GitHub] [arrow] jeanetteclark commented on issue #34780: [R] Seg fault on `write_dataset` with partition and threading on MacOS

Posted by "jeanetteclark (via GitHub)" <gi...@apache.org>.
jeanetteclark commented on issue #34780:
URL: https://github.com/apache/arrow/issues/34780#issuecomment-1492201385

   Do you know why we wouldn't get an error if threading and/or partitioning is turned off? We have had great success with our package for a year+ ever since Arrow 8 was released, using the exact same dataset/methods, so I'm not sure why a memory error would be cropping up now.




[GitHub] [arrow] jeanetteclark commented on issue #34780: [R] Seg fault on `write_dataset` with partition and threading on MacOS

Posted by "jeanetteclark (via GitHub)" <gi...@apache.org>.
jeanetteclark commented on issue #34780:
URL: https://github.com/apache/arrow/issues/34780#issuecomment-1492123815

   Glad you were able to reproduce! I just tried the URL and it works for me, so maybe they had an outage? Let me know if you continue to have issues.




[GitHub] [arrow] paleolimbot commented on issue #34780: [R] Seg fault on `write_dataset` with partition and threading on MacOS

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #34780:
URL: https://github.com/apache/arrow/issues/34780#issuecomment-1491937333

   Thank you for reporting! I was able to replicate this twice on M1 (once with the segfault on `illegal hardware instruction` and once with the segfault you described here).
   
   When I went to test with the dev Arrow version that we're about to release, I got a 503 Service Unavailable while downloading the test file! (I will try again soon, or perhaps there's another way to provide the test file?)




[GitHub] [arrow] paleolimbot commented on issue #34780: [R] Seg fault on `write_dataset` with partition and threading on MacOS

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #34780:
URL: https://github.com/apache/arrow/issues/34780#issuecomment-1492255288

   Anecdotally, I do think it's hard to beat Arrow's `group_by` + `write_dataset` in terms of speed, but I also think it uses a lot of memory (some of that is undiagnosed because these problems are rather difficult to debug).
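   
   For readers unfamiliar with the idiom, `group_by` + `write_dataset` means letting the grouping columns of a grouped data frame define the partitions. A minimal sketch, assuming a small made-up data frame (`dummy` is hypothetical, not the data from this issue):
   
   ```r
   library(arrow)
   library(dplyr)
   
   # Hypothetical stand-in data; not the dataset from this issue.
   dummy <- data.frame(
     Taxa = rep(c("a", "b"), each = 5),
     value = 1:10
   )
   
   out_dir <- tempfile()
   
   # For a grouped data frame, write_dataset() uses the grouping
   # columns as the partitioning columns by default.
   dummy |>
     group_by(Taxa) |>
     write_dataset(out_dir, format = "parquet")
   
   list.files(out_dir)
   ```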




[GitHub] [arrow] paleolimbot commented on issue #34780: [R] Seg fault on `write_dataset` with partition and threading on MacOS

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #34780:
URL: https://github.com/apache/arrow/issues/34780#issuecomment-1492215539

   The query engine that the R bindings provide access to undergoes many changes over the course of most release cycles. Some of those changes are easy to connect to issues in the R bindings, but since I'm not directly involved in query engine development, I can't always spot which changes may have caused an error.
   
   You could try DuckDB as a workaround, which can scan an R data frame like this one (via `duckdb_register()`). You might need to do a little more hand holding (e.g., select the distinct values of `Taxa`, then loop over the unique taxa and write each to a Parquet file).
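   
   A rough sketch of that loop is below. It is only an illustration under assumptions: `dummy` is a hypothetical stand-in for `res_fish`, the `Taxa=<value>/part-0.parquet` layout is chosen here to mimic hive-style partitioning, and taxa names are assumed to be safe to use in file paths.
   
   ```r
   library(DBI)
   library(duckdb)
   
   # Hypothetical stand-in for `res_fish`.
   dummy <- data.frame(
     Taxa = rep(c("a", "b"), each = 5),
     value = 1:10
   )
   
   con <- dbConnect(duckdb::duckdb())
   # Expose the data frame to DuckDB as a virtual table, without copying it.
   duckdb::duckdb_register(con, "res_fish", dummy)
   
   out_dir <- tempfile()
   
   # Unique partition values (rows with a missing Taxa are skipped here).
   taxa <- dbGetQuery(
     con,
     "SELECT DISTINCT Taxa FROM res_fish WHERE Taxa IS NOT NULL"
   )$Taxa
   
   for (taxon in taxa) {
     part_dir <- file.path(out_dir, paste0("Taxa=", taxon))
     dir.create(part_dir, recursive = TRUE)
     out_file <- file.path(part_dir, "part-0.parquet")
   
     # The partition column is dropped from the file because its value
     # is already encoded in the directory name.
     sql <- paste0(
       "COPY (SELECT * EXCLUDE (Taxa) FROM res_fish WHERE Taxa = ",
       dbQuoteString(con, taxon),
       ") TO ", dbQuoteString(con, out_file), " (FORMAT PARQUET)"
     )
     dbExecute(con, sql)
   }
   
   dbDisconnect(con, shutdown = TRUE)
   
   list.files(out_dir, recursive = TRUE)
   ```
   
   Writing one partition per query keeps each write small, at the cost of scanning the data frame once per taxon.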




[GitHub] [arrow] jeanetteclark commented on issue #34780: [R] Seg fault on `write_dataset` with partition and threading on MacOS

Posted by "jeanetteclark (via GitHub)" <gi...@apache.org>.
jeanetteclark commented on issue #34780:
URL: https://github.com/apache/arrow/issues/34780#issuecomment-1492226344

   Yeah... we had to use DuckDB as a workaround in a previous release because some joins weren't working on Windows (fixed in Arrow 8). I'm not the most psyched about reintroducing it, but maybe I'll try. IIRC we got better performance without DuckDB, but I could be mistaken.




[GitHub] [arrow] snizovtsev commented on issue #34780: [R] Seg fault on `write_dataset` with partition and threading on MacOS

Posted by "snizovtsev (via GitHub)" <gi...@apache.org>.
snizovtsev commented on issue #34780:
URL: https://github.com/apache/arrow/issues/34780#issuecomment-1534661286

   This could be a duplicate of #34539, which was fixed in Arrow 12.

