You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/04 11:21:00 UTC

[GitHub] [arrow] westonpace opened a new pull request #10240: ARROW-9530: [C++] Add option to disable jemalloc background thread on Linux

westonpace opened a new pull request #10240:
URL: https://github.com/apache/arrow/pull/10240


   Currently Arrow builds that enable jemalloc also enable background threads.  There is no way to disable this without recompiling arrow.
   
   * MALLOC_CONF and /etc/malloc.conf are ignored as they are lower precedence than global variable initialization
   * Disabling dirty page purging via `jemalloc_set_decay_ms` does not terminate the background threads
   * Switching to a different allocator (e.g. python's `set_default_memory_pool`) does not terminate the background threads.
   
   This PR adds a capability to do so.  In a future PR it may be worth looking into some mechanism to prevent the threads from starting at all (e.g. migrate away from global variable initialization) but there is no good lifecycle hook for this and adding a run-once initial configuration step when the allocator is first accessed adds complexity and potential race cases.
   
   * As best I can tell this method should be thread safe but I can't find any good jemalloc documentation on this so I just left it vague.  I figure most users will be shutting down the background threads before they start using Arrow anyways.
   * I confirmed that shutting down the background threads did indeed shutdown the threads and that the valgrind potential leak goes away (FYI, the leak is a valgrind error as valgrind is not properly waiting for detached threads to shut down, more on that here: https://bugs.kde.org/show_bug.cgi?id=415141)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on pull request #10240: ARROW-9530: [C++] Add option to disable jemalloc background thread on Linux

Posted by GitBox <gi...@apache.org>.
westonpace commented on pull request #10240:
URL: https://github.com/apache/arrow/pull/10240#issuecomment-839002557


   Ok, a new proposal: a single public method `arrow_allocator_use_threads(bool)` that is not jemalloc-specific (it would be a no-op for mimalloc / system) that must be called before the first Arrow allocation.
   
   Then take `background_thread` out of the jemalloc conf string and set it at first allocation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10240: ARROW-9530: [C++] Add option to disable jemalloc background thread on Linux

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10240:
URL: https://github.com/apache/arrow/pull/10240#issuecomment-838998475


   I don't think we should set environment variables. I'm not sure how we can solve this issue reliably, frankly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace closed pull request #10240: ARROW-9530: [C++] Add option to disable jemalloc background thread on Linux

Posted by GitBox <gi...@apache.org>.
westonpace closed pull request #10240:
URL: https://github.com/apache/arrow/pull/10240


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on pull request #10240: ARROW-9530: [C++] Add option to disable jemalloc background thread on Linux

Posted by GitBox <gi...@apache.org>.
westonpace commented on pull request #10240:
URL: https://github.com/apache/arrow/pull/10240#issuecomment-832248718


   CI failures unrelated.  CC @jonkeane @pitrou 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10240: ARROW-9530: [C++] Add option to disable jemalloc background thread on Linux

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10240:
URL: https://github.com/apache/arrow/pull/10240#issuecomment-836749505


   @robambalu Is there a specific concern with that thread?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on pull request #10240: ARROW-9530: [C++] Add option to disable jemalloc background thread on Linux

Posted by GitBox <gi...@apache.org>.
westonpace commented on pull request #10240:
URL: https://github.com/apache/arrow/pull/10240#issuecomment-839018761


   That seems a reasonable approach.  I'll add a comment to the JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10240: ARROW-9530: [C++] Add option to disable jemalloc background thread on Linux

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10240:
URL: https://github.com/apache/arrow/pull/10240#issuecomment-839006171


   Honestly, I don't think we should try to find quirky ways to fix this. Perhaps we can simply expose a CMake option to disable our jemalloc customization. Then people who are reliably upset by background threads can compile Arrow with the customization disabled.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10240: ARROW-9530: [C++] Add option to disable jemalloc background thread on Linux

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10240:
URL: https://github.com/apache/arrow/pull/10240#issuecomment-836705979


   Is the only point of this to appease Valgrind or is there another use case?
   If it's only to appease Valgrind, then I would suggest people use a Valgrind suppressions file instead.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10240: ARROW-9530: [C++] Add option to disable jemalloc background thread on Linux

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10240:
URL: https://github.com/apache/arrow/pull/10240#issuecomment-838076933


   Unfortunately, jemalloc configuration seems a bit [complicated](http://jemalloc.net/jemalloc.3.html). Some values are "read-only" in that you can only initialize them at process startup using a `const char*`. The `je_arrow_malloc_conf` serves that purpose; we cannot initialize those values later by reading an environment variable.
   
   > The string specified via --with-malloc-conf, the string pointed to by the global variable malloc_conf, the “name” of the file referenced by the symbolic link named /etc/malloc.conf, and the value of the environment variable MALLOC_CONF, will be interpreted, in that order, from left to right as options. Note that malloc_conf may be read before main() is entered, so the declaration of malloc_conf should specify an initializer that contains the final value to be read by jemalloc. --with-malloc-conf and malloc_conf are compile-time mechanisms, whereas /etc/malloc.conf and MALLOC_CONF can be safely set any time prior to program invocation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] robambalu commented on pull request #10240: ARROW-9530: [C++] Add option to disable jemalloc background thread on Linux

Posted by GitBox <gi...@apache.org>.
robambalu commented on pull request #10240:
URL: https://github.com/apache/arrow/pull/10240#issuecomment-836741870


   yes, the 1 thread ideally will not be created if the process is not using arrow at all


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10240: ARROW-9530: [C++] Add option to disable jemalloc background thread on Linux

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10240:
URL: https://github.com/apache/arrow/pull/10240#issuecomment-831866996


   https://issues.apache.org/jira/browse/ARROW-9530


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] robambalu commented on pull request #10240: ARROW-9530: [C++] Add option to disable jemalloc background thread on Linux

Posted by GitBox <gi...@apache.org>.
robambalu commented on pull request #10240:
URL: https://github.com/apache/arrow/pull/10240#issuecomment-836755605


   Its mainly a nuisance, we try to keep tabs on every thread our processes creates and this one happens to be out of our control.
   @sdressler had some concerns as well ( https://issues.apache.org/jira/browse/ARROW-9530 )
   At this point its just a nuisance for me, not a show stopper.  
   If its possible to make this thread optional, great, if not I'm not gonna cry about it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10240: ARROW-9530: [C++] Add option to disable jemalloc background thread on Linux

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10240:
URL: https://github.com/apache/arrow/pull/10240#issuecomment-836735436


   This is what the [jemalloc doc](http://jemalloc.net/jemalloc.3.html) says:
   
   > Enable/disable internal background worker threads. When set to true, background threads are created on demand (the number of background threads will be no more than the number of CPUs or active arenas).
   
   So, if unused, it seems that almost one background thread should be created...
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10240: ARROW-9530: [C++] Add option to disable jemalloc background thread on Linux

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10240:
URL: https://github.com/apache/arrow/pull/10240#issuecomment-836749093


   My concern here is that this PR is adding 5 jemalloc-related APIs, which is overkill.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on pull request #10240: ARROW-9530: [C++] Add option to disable jemalloc background thread on Linux

Posted by GitBox <gi...@apache.org>.
westonpace commented on pull request #10240:
URL: https://github.com/apache/arrow/pull/10240#issuecomment-838994489


   Although I suppose it brings up vendoring concerns.  E.g. what if a user is trying to set MALLOC_CONF for their own app and not trying to influence Arrow (or alternatively, Arrow setting MALLOC_CONF messes up their app)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on pull request #10240: ARROW-9530: [C++] Add option to disable jemalloc background thread on Linux

Posted by GitBox <gi...@apache.org>.
westonpace commented on pull request #10240:
URL: https://github.com/apache/arrow/pull/10240#issuecomment-838990718


   Yes, it appears `oversize_threshold` and `junk` are read only.  However, from my understanding of the jemalloc docs, we can still set those after startup via the `MALLOC_CONF` environment variable as long as that variable is set before we make any call to the malloc library.
   
   > Once, when the first call is made to one of the memory allocation routines, the allocator initializes its internals based in part on various options that can be specified at compile- or run-time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on pull request #10240: ARROW-9530: [C++] Add option to disable jemalloc background thread on Linux

Posted by GitBox <gi...@apache.org>.
westonpace commented on pull request #10240:
URL: https://github.com/apache/arrow/pull/10240#issuecomment-837252021


   @pitrou Another option would be to defer creation of jemalloc pool until starting up, allowing the user to override the jemalloc configuration with their own configuration.  Exposes 0 new APIs but 1 new environment variable (i.e. `MALLOC_CONF`).
   
   That approach would also prevent the thread from ever being created (as opposed to this approach which allows it to be created only to immediately destroy it).
   
   However, I'm not sure what "starting up" would mean.  We could create it when the default memory pool is first asked for (if it doesn't exist).
   
   If that seems cleaner I could implement that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] robambalu commented on pull request #10240: ARROW-9530: [C++] Add option to disable jemalloc background thread on Linux

Posted by GitBox <gi...@apache.org>.
robambalu commented on pull request #10240:
URL: https://github.com/apache/arrow/pull/10240#issuecomment-836715977


   no, the valgrind issue is secondary, the main issue is that the background threads are spawned in any process that links in arrow ( even if its not being used ).  Others have also noted the threads are active ( I personally havent seen this )


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou edited a comment on pull request #10240: ARROW-9530: [C++] Add option to disable jemalloc background thread on Linux

Posted by GitBox <gi...@apache.org>.
pitrou edited a comment on pull request #10240:
URL: https://github.com/apache/arrow/pull/10240#issuecomment-839006171


   Honestly, I don't think we should try to find quirky ways to fix this. Perhaps we can simply expose a CMake option to disable our jemalloc customization. Then people who are really upset by background threads can compile Arrow with the customization disabled.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org