You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/12 00:04:31 UTC

[GitHub] [arrow] nealrichardson opened a new pull request #9170: ARROW-11176: [R] Expose memory pool name and document setting it

nealrichardson opened a new pull request #9170:
URL: https://github.com/apache/arrow/pull/9170


   Also:
   
   * Exposes `SupportedBackendNames` from the C++ library (perhaps it should be renamed and include "memory" in it, was just following what the internal functions were calling it) so we can make that visible at higher layers
   * Adds a general `arrow_info()` function that includes memory pool settings, build-time feature availability (compression codecs, S3), etc., which could be useful for debugging purposes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #9170: ARROW-11176: [R] Expose memory pool name and document setting it

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #9170:
URL: https://github.com/apache/arrow/pull/9170#issuecomment-758371255


   @github-actions rebase


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #9170: ARROW-11176: [R] Expose memory pool name and document setting it

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #9170:
URL: https://github.com/apache/arrow/pull/9170#discussion_r556072024



##########
File path: r/R/arrow-package.R
##########
@@ -76,6 +76,39 @@ option_use_threads <- function() {
   !is_false(getOption("arrow.use_threads"))
 }
 
+#' Report information on the package's capabilities
+#'
+#' This function summarizes a number of build-time configurations and run-time
+#' settings for the Arrow package. It may be useful for diagnostics.
+#' @return A list including version information, boolean "capabilities", and
+#' statistics from Arrow's memory allocator.
+#' @export
+#' @importFrom utils packageVersion
+arrow_info <- function() {
+  opts <- options()
+  out <- list(
+    version = packageVersion("arrow"),
+    libarrow = arrow_available(),
+    options = opts[grep("^arrow.", names(opts))]

Review comment:
       Escape the dot (`\\.`) to make it match only literal dots

##########
File path: r/R/arrow-package.R
##########
@@ -76,6 +76,39 @@ option_use_threads <- function() {
   !is_false(getOption("arrow.use_threads"))
 }
 
+#' Report information on the package's capabilities
+#'
+#' This function summarizes a number of build-time configurations and run-time
+#' settings for the Arrow package. It may be useful for diagnostics.
+#' @return A list including version information, boolean "capabilities", and
+#' statistics from Arrow's memory allocator.
+#' @export
+#' @importFrom utils packageVersion
+arrow_info <- function() {
+  opts <- options()
+  out <- list(
+    version = packageVersion("arrow"),

Review comment:
       `packageVersion()` returns a special object with classes `package_version` and `numeric_version`. It's probably fine, but there's a chance it might print weird in some user environments. Maybe safer to wrap it in `as.character()`?

##########
File path: r/NEWS.md
##########
@@ -35,6 +35,8 @@
 * Large string types can now be written to Parquet files
 * The [pronouns `.data` and `.env`](https://rlang.r-lib.org/reference/tidyeval-data.html) are now fully supported in Arrow `dplyr` pipelines.
 * Option `arrow.skip_nul` (default `FALSE`, as in `base::scan()`) allows conversion of Arrow string (`utf8()`) type data containing embedded nul `\0` characters to R. If set to `TRUE`, nuls will be stripped and a warning is emitted if any are found.
+* `arrow_info()` for an overview of various run-time and build-time Arrow configurations, useful for debugging
+* Set environment variable `ARROW_DEFAULT_MEMORY_POOL` before loading the Arrow package to change memory allocators. Windows packages are built with "mimalloc"; most others have "jemalloc". These are used by default if they were built, and they're generally much faster than the system malloc, but sometimes it is useful to turn them off for debugging purposes. To disable them, set `ARROW_DEFAULT_MEMORY_POOL=system`.

Review comment:
       Elsewhere in `NEWS.md`, `jemalloc` and `mimalloc` appear in backticks




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #9170: ARROW-11176: [R] Expose memory pool name and document setting it

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #9170:
URL: https://github.com/apache/arrow/pull/9170#discussion_r555989724



##########
File path: cpp/src/arrow/memory_pool.cc
##########
@@ -87,6 +87,19 @@ const char* je_arrow_malloc_conf =
 #endif  // ARROW_JEMALLOC
 
 namespace arrow {
+
+std::vector<std::string> SupportedBackendNames() {

Review comment:
       I had tried but was failing on some namespace issues; can try again.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #9170: ARROW-11176: [R] Expose memory pool name and document setting it

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #9170:
URL: https://github.com/apache/arrow/pull/9170#discussion_r556108017



##########
File path: cpp/src/arrow/memory_pool.cc
##########
@@ -87,6 +87,19 @@ const char* je_arrow_malloc_conf =
 #endif  // ARROW_JEMALLOC
 
 namespace arrow {
+
+std::vector<std::string> SupportedBackendNames() {

Review comment:
       I fixed this and renamed it to `SupportedMemoryBackendNames`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9170: ARROW-11176: [R] Expose memory pool name and document setting it

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9170:
URL: https://github.com/apache/arrow/pull/9170#issuecomment-758313538


   https://issues.apache.org/jira/browse/ARROW-11176


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson closed pull request #9170: ARROW-11176: [R] Expose memory pool name and document setting it

Posted by GitBox <gi...@apache.org>.
nealrichardson closed pull request #9170:
URL: https://github.com/apache/arrow/pull/9170


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on pull request #9170: ARROW-11176: [R] Expose memory pool name and document setting it

Posted by GitBox <gi...@apache.org>.
ianmcook commented on pull request #9170:
URL: https://github.com/apache/arrow/pull/9170#issuecomment-759156358


   LGTM 👍 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #9170: ARROW-11176: [R] Expose memory pool name and document setting it

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #9170:
URL: https://github.com/apache/arrow/pull/9170#discussion_r555946108



##########
File path: cpp/src/arrow/memory_pool.cc
##########
@@ -87,6 +87,19 @@ const char* je_arrow_malloc_conf =
 #endif  // ARROW_JEMALLOC
 
 namespace arrow {
+
+std::vector<std::string> SupportedBackendNames() {

Review comment:
       It would be better to reuse `SupportedBackends` below. Otherwise we risk this function being out of sync.

##########
File path: r/R/arrow-package.R
##########
@@ -76,6 +76,39 @@ option_use_threads <- function() {
   !is_false(getOption("arrow.use_threads"))
 }
 
+#' Report information on the package's capabilities
+#'
+#' This function summarizes a number of build-time configurations and run-time
+#' settings for the Arrow package. It may be useful for diagnostics.
+#' @return A list including version information, boolean "capabilities", and
+#' statistics from Arrow's memory allocator.

Review comment:
       Does it make sense to include runtime statistics together with static configuration information?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #9170: ARROW-11176: [R] Expose memory pool name and document setting it

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #9170:
URL: https://github.com/apache/arrow/pull/9170#issuecomment-759128887


   I added a print method for arrow info. It looks something like this:
   
   ```
   > arrow_info()
   Arrow package version: 2.0.0.9000
   
   Capabilities:
                  
   s3         TRUE
   snappy     TRUE
   gzip       TRUE
   brotli     TRUE
   zstd       TRUE
   lz4        TRUE
   lz4_frame  TRUE
   lzo       FALSE
   bz2        TRUE
   jemalloc   TRUE
   mimalloc  FALSE
   
   Memory:
                             
   Allocator         jemalloc
   Current            0 bytes
   Max       -855217280 bytes
   
   ```
   
   The trouble is that when memory allocated is >2gb, it overflows an integer in the `int64_t` to `SEXP` conversion (https://github.com/r-lib/cpp11/issues/145). I'll see if I can work around that somehow.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #9170: ARROW-11176: [R] Expose memory pool name and document setting it

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #9170:
URL: https://github.com/apache/arrow/pull/9170#discussion_r555990658



##########
File path: r/R/arrow-package.R
##########
@@ -76,6 +76,39 @@ option_use_threads <- function() {
   !is_false(getOption("arrow.use_threads"))
 }
 
+#' Report information on the package's capabilities
+#'
+#' This function summarizes a number of build-time configurations and run-time
+#' settings for the Arrow package. It may be useful for diagnostics.
+#' @return A list including version information, boolean "capabilities", and
+#' statistics from Arrow's memory allocator.

Review comment:
       I don't think it makes sense to separate them; I expect that this is a function I'll ask bug reporters to run to give me all of the arrow debug information about their setup.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #9170: ARROW-11176: [R] Expose memory pool name and document setting it

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #9170:
URL: https://github.com/apache/arrow/pull/9170#issuecomment-758301990


   @pitrou your review requested for the C++ change; you don't have to have an opinion about the R code (but you may!)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #9170: ARROW-11176: [R] Expose memory pool name and document setting it

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #9170:
URL: https://github.com/apache/arrow/pull/9170#discussion_r556107770



##########
File path: r/R/arrow-package.R
##########
@@ -76,6 +76,39 @@ option_use_threads <- function() {
   !is_false(getOption("arrow.use_threads"))
 }
 
+#' Report information on the package's capabilities
+#'
+#' This function summarizes a number of build-time configurations and run-time
+#' settings for the Arrow package. It may be useful for diagnostics.
+#' @return A list including version information, boolean "capabilities", and
+#' statistics from Arrow's memory allocator.
+#' @export
+#' @importFrom utils packageVersion
+arrow_info <- function() {
+  opts <- options()
+  out <- list(
+    version = packageVersion("arrow"),

Review comment:
       I think the special object is more useful because you can do `>` etc. methods with it. And the first line in its print method is as.character() so I think we're ok there, unless there's a scenario I'm missing:
   
   ```r
   print.numeric_version <- function (x, ...) {
       y <- as.character(x)
       if (!length(y)) 
           writeLines(gettext("<0 elements>"))
       else if (any("quote" == names(list(...)))) 
           print(ifelse(is.na(y), NA_character_, sQuote(y)), ...)
       else print(ifelse(is.na(y), NA_character_, sQuote(y)), quote = FALSE, 
           ...)
       invisible(x)
   }
   ```
   
   What we really should do is add an S3 class to this object and make a print method for it, like `sessionInfo()`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #9170: ARROW-11176: [R] Expose memory pool name and document setting it

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #9170:
URL: https://github.com/apache/arrow/pull/9170#issuecomment-759130633


   Coercing to `double` works around it:
   
   ```
   Memory:
                     
   Allocator jemalloc
   Current    7.47 Gb
   Max        11.2 Gb
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org