Posted to issues@arrow.apache.org by GitBox <gi...@apache.org> on 2023/01/20 15:30:08 UTC

[GitHub] [arrow] ablack3 opened a new issue, #33807: Using dplyr::tally with an Arrow FileSystemDataset crashes R

ablack3 opened a new issue, #33807:
URL: https://github.com/apache/arrow/issues/33807

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   The following code snippet crashes R. I'm using arrow 10.0.1.
   
   ```
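   library(dplyr)  # provides tally() and the %>% pipe
   # Write the built-in `cars` data frame as a Feather dataset, then reopen it
   arrow::write_dataset(cars, here::here("cars.feather"), format = "feather")
   a <- arrow::open_dataset(here::here("cars.feather"), format = "feather")
   a %>% tally()  # this call crashes the R session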
   ```
   
   **Platform information**
   ```
   > sessionInfo()
   R version 4.2.2 (2022-10-31)
   Platform: x86_64-apple-darwin17.0 (64-bit)
   Running under: macOS Monterey 12.6
   
   Matrix products: default
   LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
   
   locale:
   [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
   
   attached base packages:
   [1] stats     graphics  grDevices utils     datasets  methods   base     
   
   other attached packages:
   [1] arrow_10.0.1   testthat_3.1.6
   
   loaded via a namespace (and not attached):
    [1] assertthat_0.2.1 brio_1.1.3       R6_2.5.1         lifecycle_1.0.3  magrittr_2.0.3   rlang_1.0.6     
    [7] cli_3.5.0        rstudioapi_0.14  vctrs_0.5.1      tools_4.2.2      bit64_4.0.5      glue_1.6.2      
   [13] purrr_1.0.0      bit_4.0.5        compiler_4.2.2   tidyselect_1.2.0
   ```
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ablack3 commented on issue #33807: Using dplyr::tally with an Arrow FileSystemDataset crashes R

Posted by GitBox <gi...@apache.org>.
ablack3 commented on issue #33807:
URL: https://github.com/apache/arrow/issues/33807#issuecomment-1398589729

   This might be a clue
   ```
    *** caught illegal operation ***
      address 0x13d7349a8, cause 'illegal opcode'
      
      Traceback:
       1: Array__GetScalar(Array$create(x, type = type), 0)
       2: Scalar$create(x)
       3: compute___expr__scalar(Scalar$create(x))
       4: Expression$scalar(1L)
       5: n()
       6: eval_tidy(expr, mask)
       7: doTryCatch(return(expr), name, parentenv, handler)
       8: tryCatchOne(expr, names, parentenv, handlers[[1L]])
       9: tryCatchList(expr, classes, parentenv, handlers)
      10: tryCatch(eval_tidy(expr, mask), error = function(e) {    msg <- conditionMessage(e)    if (getOption("arrow.debug", FALSE))         print(msg)    patterns <- .cache$i18ized_error_pattern    if (is.null(patterns)) {        patterns <- i18ize_error_messages()        .cache$i18ized_error_pattern <- patterns    }    if (grepl(patterns, msg)) {        stop(e)    }    out <- structure(msg, class = "try-error", condition = e)    if (grepl("not supported.*Arrow", msg) || getOption("arrow.debug",         FALSE)) {        class(out) <- c("arrow-try-error", class(out))    }    invisible(out)})
      11: arrow_eval(expr, mask)
      12: arrow_eval_or_stop(as_quosure(expr, ctx$quo_env), ctx$mask)
      13: summarize_eval(names(exprs)[i], exprs[[i]], ctx, length(.data$group_by_vars) >     0)
      14: do_arrow_summarize(.data, !!!exprs, .groups = .groups)
      15: doTryCatch(return(expr), name, parentenv, handler)
      16: tryCatchOne(expr, names, parentenv, handlers[[1L]])
      17: tryCatchList(expr, classes, parentenv, handlers)
      18: tryCatch(expr, error = function(e) {    call <- conditionCall(e)    if (!is.null(call)) {        if (identical(call[[1L]], quote(doTryCatch)))             call <- sys.call(-4L)        dcall <- deparse(call, nlines = 1L)        prefix <- paste("Error in", dcall, ": ")        LONG <- 75L        sm <- strsplit(conditionMessage(e), "\n")[[1L]]        w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")        if (is.na(w))             w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L],                 type = "b")        if (w > LONG)             prefix <- paste0(prefix, "\n  ")    }    else prefix <- "Error : "    msg <- paste0(prefix, conditionMessage(e), "\n")    .Internal(seterrmessage(msg[1L]))    if (!silent && isTRUE(getOption("show.error.messages"))) {        cat(msg, file = outFile)        .Internal(printDeferredWarnings())    }    invisible(structure(msg, class = "try-error", condition = e))})
      19: try(do_arrow_summarize(.data, !!!exprs, .groups = .groups), silent = TRUE)
      20: summarise.ArrowTabular(x, `:=`(!!name, n()))
      21: dplyr::summarize(x, `:=`(!!name, n()))
      22: tally.ArrowTabular(.)
      23: tally(.)
   
   ```


[GitHub] [arrow] thisisnic commented on issue #33807: Using dplyr::tally with an Arrow FileSystemDataset crashes R

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #33807:
URL: https://github.com/apache/arrow/issues/33807#issuecomment-1403903919

   Hi @ablack3, thanks for reporting this!  I haven't been able to reproduce this myself, though I am using Ubuntu 22.04 and not macOS.  You could get more verbose output by attaching the C++ debugger before running R via the instructions here: https://arrow.apache.org/docs/dev/r/articles/developers/debugging.html
   
   Can you show me the output of running `arrow::arrow_info()`?  There might be some clues there.


[GitHub] [arrow] paleolimbot commented on issue #33807: Using dplyr::tally with an Arrow FileSystemDataset crashes R

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #33807:
URL: https://github.com/apache/arrow/issues/33807#issuecomment-1433483523

   Thank you for this! I am guessing that whatever runtime detection mechanism we're using might not be working with Rosetta.
   
   Do we know if there's any way to force Arrow to pretend that SIMD doesn't exist at runtime?


[GitHub] [arrow] ablack3 commented on issue #33807: Using dplyr::tally with an Arrow FileSystemDataset crashes R

Posted by "ablack3 (via GitHub)" <gi...@apache.org>.
ablack3 commented on issue #33807:
URL: https://github.com/apache/arrow/issues/33807#issuecomment-1433289130

   Sorry for the delay.  Here is `arrow::arrow_info()`
   
   ``` r
   arrow::arrow_info()
   #> Arrow package version: 11.0.0.2
   #> 
   #> Capabilities:
   #>                
   #> dataset    TRUE
   #> substrait FALSE
   #> parquet    TRUE
   #> json       TRUE
   #> s3         TRUE
   #> gcs        TRUE
   #> utf8proc   TRUE
   #> re2        TRUE
   #> snappy     TRUE
   #> gzip       TRUE
   #> brotli     TRUE
   #> zstd       TRUE
   #> lz4        TRUE
   #> lz4_frame  TRUE
   #> lzo       FALSE
   #> bz2        TRUE
   #> jemalloc   TRUE
   #> mimalloc   TRUE
   #> 
   #> Memory:
   #>                   
   #> Allocator mimalloc
   #> Current    0 bytes
   #> Max        0 bytes
   #> 
   #> Runtime:
   #>                           
   #> SIMD Level          sse4_2
   #> Detected SIMD Level sse4_2
   #> 
   #> Build:
   #>                                     
   #> C++ Library Version           11.0.0
   #> C++ Compiler              AppleClang
   #> C++ Compiler Version 10.0.0.10001145
   ```
   
   <sup>Created on 2023-02-16 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
   
   Thanks for the debugging instructions @thisisnic. 
   
   
   
   
   > Out of curiosity, are you on M1 running a x86_64 version of R? (Or are you on an Intel-based Mac?)
   
   I am on an M1 running an x86_64 version of R. I used to run the ARM version of R but had issues with ODBC drivers not working on ARM, so I had to move my R installation to x86_64 via Rosetta.
   
   <img width="635" alt="image" src="https://user-images.githubusercontent.com/10227522/219414637-d8f4452c-e3f3-4806-8006-d8e0bc968830.png">
   
   


[GitHub] [arrow] jonkeane closed issue #33807: [R] Using dplyr::tally with an Arrow FileSystemDataset crashes R

Posted by "jonkeane (via GitHub)" <gi...@apache.org>.
jonkeane closed issue #33807: [R] Using dplyr::tally with an Arrow FileSystemDataset crashes R
URL: https://github.com/apache/arrow/issues/33807


[GitHub] [arrow] ablack3 commented on issue #33807: Using dplyr::tally with an Arrow FileSystemDataset crashes R

Posted by "ablack3 (via GitHub)" <gi...@apache.org>.
ablack3 commented on issue #33807:
URL: https://github.com/apache/arrow/issues/33807#issuecomment-1462896715

   This is still crashing R on my machine. I'm using arrow v11.0.0.2.
   ```
   Sys.setenv(ARROW_USER_SIMD_LEVEL="NONE")
   
   library(dplyr)
   arrow::write_dataset(cars, here::here("cars.feather"), format = "feather")
   a <- arrow::open_dataset(here::here("cars.feather"), format = "feather")
   a %>% tally()
   ```


[GitHub] [arrow] westonpace commented on issue #33807: Using dplyr::tally with an Arrow FileSystemDataset crashes R

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #33807:
URL: https://github.com/apache/arrow/issues/33807#issuecomment-1433954863

   > Do we know if there's any way to force Arrow to pretend that SIMD doesn't exist at runtime?
   
   You can try and set the environment variable `ARROW_USER_SIMD_LEVEL` to `NONE`.
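   
   A minimal sketch of how that could be tried from R. It assumes the variable needs to be in the environment before arrow is loaded, which is why it is set first; the `requireNamespace()` guard is only there so the snippet also runs where arrow is not installed. Another comment in this thread reports that the crash persisted with this variable set on arrow 11.0.0.2, so treat this as a diagnostic step rather than a fix.
   
   ```r
   # Set the variable before arrow is loaded (assumption: the value is read
   # when the C++ library first inspects the CPU).
   Sys.setenv(ARROW_USER_SIMD_LEVEL = "NONE")
   
   # Guarded so the snippet also runs on a machine without arrow installed.
   if (requireNamespace("arrow", quietly = TRUE)) {
     # The "Runtime" section of the printed output shows the SIMD level in use.
     print(arrow::arrow_info())
   }
   ```
   
   Comparing the "SIMD Level" and "Detected SIMD Level" rows in that output should show whether the override was picked up.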


[GitHub] [arrow] paleolimbot commented on issue #33807: Using dplyr::tally with an Arrow FileSystemDataset crashes R

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #33807:
URL: https://github.com/apache/arrow/issues/33807#issuecomment-1403952652

   We've had some problems with macOS and illegal opcodes (#28343, #14826 most recently)... it seems that the way we detect the SIMD level is sometimes not working for Intel macOS.
   
   Out of curiosity, are you on M1 running a x86_64 version of R? (Or are you on an Intel-based Mac?)


[GitHub] [arrow] ianmcook commented on issue #33807: Using dplyr::tally with an Arrow FileSystemDataset crashes R

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.
ianmcook commented on issue #33807:
URL: https://github.com/apache/arrow/issues/33807#issuecomment-1403963517

   It might be useful to see the output of `sysctl machdep.cpu` from your macOS terminal
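   
   If it is easier to capture that from the same R session, a small sketch follows. The helper name `cpu_report` is hypothetical; it only shells out to the `sysctl` binary and returns an empty vector on anything other than macOS.
   
   ```r
   # Hypothetical helper: return the lines printed by `sysctl machdep.cpu`,
   # or an empty character vector when not running on macOS.
   cpu_report <- function() {
     if (!identical(Sys.info()[["sysname"]], "Darwin")) {
       return(character(0))
     }
     # stdout = TRUE captures the output as a character vector, one line each.
     suppressWarnings(
       system2("sysctl", "machdep.cpu", stdout = TRUE, stderr = FALSE)
     )
   }
   
   writeLines(cpu_report())
   ```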


[GitHub] [arrow] jonkeane commented on issue #33807: [R] Using dplyr::tally with an Arrow FileSystemDataset crashes R

Posted by "jonkeane (via GitHub)" <gi...@apache.org>.
jonkeane commented on issue #33807:
URL: https://github.com/apache/arrow/issues/33807#issuecomment-1723713582

   We ran into something like this a few times at @thisisnic and @stephhazlitt's workshop. What happened was that some folks using Apple ARM-based machines were using R built for x86 (running under Rosetta emulation), and therefore received Arrow package binaries intended for x86, which will crash with illegal opcodes.
   
   R has had native ARM builds for a long time now (and there are native ARM builds for arrow which work well), so if people are using ARM-based Macs, we recommend installing native R and native arrow.
   
   I will also send a PR shortly that adds a detection + warning on package load for arrow if we detect this so that folks know that they should run native R and things will work fine.
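   
   As a rough illustration of what such a check could look like (a sketch, not code taken from the arrow package): macOS exposes the `sysctl.proc_translated` key, which reads `1` for a process translated by Rosetta. The function names `is_rosetta` and `running_under_rosetta` and the warning text are made up for this example.
   
   ```r
   # Pure decision step: `sysname` is the OS name and `translated` is the
   # captured output of `sysctl -n sysctl.proc_translated`.
   is_rosetta <- function(sysname, translated) {
     identical(tolower(sysname), "darwin") && identical(translated, "1")
   }
   
   # Probe the current session. Warnings and errors from the system call are
   # swallowed so that systems without this sysctl key simply yield FALSE.
   running_under_rosetta <- function() {
     sysname <- Sys.info()[["sysname"]]
     if (!identical(tolower(sysname), "darwin")) {
       return(FALSE)
     }
     translated <- tryCatch(
       suppressWarnings(
         system("sysctl -n sysctl.proc_translated",
                intern = TRUE, ignore.stderr = TRUE)
       ),
       error = function(e) character(0)
     )
     is_rosetta(sysname, translated)
   }
   
   if (running_under_rosetta()) {
     warning("R appears to be an x86_64 build running under Rosetta; ",
             "consider installing native ARM builds of R and arrow.")
   }
   ```
   
   Keeping the decision in `is_rosetta()` separate from the system call makes the logic easy to exercise on any platform, for example `is_rosetta("Darwin", "1")` versus `is_rosetta("Linux", "1")`.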

