Posted to dev@arrow.apache.org by John Muehlhausen <jg...@jgm.org> on 2022/06/14 14:06:51 UTC

Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

Hello,

This comment is regarding installation with `apt` on ubuntu 18.04 ...
`libarrow-dev/bionic,now 8.0.0-1 amd64`

I'm a bit confused about the memory pool situation:

* I run with `ARROW_DEFAULT_MEMORY_POOL=system` and check that
`arrow::default_memory_pool()->backend_name() ==
arrow::system_memory_pool()->backend_name()`

* I then LD_PRELOAD a customized (*) mimalloc according to the directions
at the mimalloc git repo and things like `strm->Reset(INT32_MAX);` seem not
to be hitting it... I figured that is a big enough chunk to jostle it into
doing something... `BufferOutputStream::Create(INT32_MAX)` is also not
intercepted by mimalloc.  Is the "system" pool somehow going around the
typical allocation interfaces on linux?  I built my own .so and linked it
to the app and malloc() is getting intercepted.

* `arrow::mimalloc_memory_pool(&mmmp);` does return something... but
apparently not "my" mimalloc ... statically linked?

* what is going on in Arrow with constructor (pre-main()) allocations?
Some of this does hit my LD_PRELOADed mimalloc

* any way to get symbols for the apt-installed libs or would I need to
build from source to get backtrace with symbols? (for chasing down sources
of allocations)

* what is the C++ lib equivalent of the following from the Python code?  I
figure I could stop trying to understand the built-in/default allocators if
I could just replace them... but this may also intersect with my question
about constructors.  Maybe I'd have to make sure my constructor runs first
to perform the switch-a-roo before anything else tries to use the default
pool?

```
namespace py {

static std::mutex memory_pool_mutex;
static MemoryPool* default_python_pool = nullptr;

void set_default_memory_pool(MemoryPool* pool) {
  std::lock_guard<std::mutex> guard(memory_pool_mutex);
  default_python_pool = pool;
}
```


(*) the mimalloc customization: the main app has a weak reference that ends
up defined by the LD_PRELOAD mimalloc, where the function so-supplied
allows the app to install a function pointer (back to the main app) that
gets called (if defined) at various interesting points in mimalloc
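
The weak-reference arrangement described in (*) can be sketched roughly as follows. This is a minimal illustration assuming GCC or Clang on Linux; `demo_alloc_install_hook`, `alloc_event_fn`, and `on_alloc_event` are hypothetical names, not part of mimalloc or Arrow. When no preloaded library defines the symbol, the weak reference resolves to null and the app simply skips the installation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical hook API that a customized, LD_PRELOADed allocator might export.
// Declared weak so the app still links and runs when no such library is loaded.
using alloc_event_fn = void (*)(const char* event, std::size_t size);
extern "C" void demo_alloc_install_hook(alloc_event_fn fn) __attribute__((weak));

static std::size_t g_events_seen = 0;

// Callback that lives in the main app. It should not allocate, because it
// would be invoked from inside the allocator itself.
static void on_alloc_event(const char* /*event*/, std::size_t /*size*/) {
  ++g_events_seen;
}

// Returns true if the preloaded allocator was present and the hook was installed.
bool try_install_alloc_hook() {
  if (demo_alloc_install_hook == nullptr) {
    return false;  // symbol left undefined: no customized allocator was preloaded
  }
  demo_alloc_install_hook(&on_alloc_event);
  return true;
}
```

Run without LD_PRELOAD, `try_install_alloc_hook()` returns false and the callback is never invoked.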


Thanks,
John

Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

Posted by Sutou Kouhei <ko...@clear-code.com>.
Hi,

Could you try https://github.com/apache/arrow/pull/13373 ?
This will work even with -DARROW_JEMALLOC=ON, because the
system memory pool no longer overrides posix_memalign()
when jemalloc is enabled.

Thanks,
-- 
kou

In <20...@clear-code.com>
  "Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool" on Wed, 15 Jun 2022 08:38:54 +0900 (JST),
  Sutou Kouhei <ko...@clear-code.com> wrote:

> Hi,
> 
> I think that compiler builtins aren't related. Could you try
> only with -DARROW_JEMALLOC=OFF?
> 
> Thanks,
> --
> kou
> 
> In <CA...@mail.gmail.com>
>   "Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool" on Tue, 14 Jun 2022 18:32:00 -0500,
>   John Muehlhausen <jg...@jgm.org> wrote:
> 
>> Thanks for the reply.  I had disabled jemalloc
>> via ARROW_DEFAULT_MEMORY_POOL so that was not the issue.
>> 
>> The issue was (I think) that the arrow lib I was using was built with
>> compiler builtins (such as __builtin_posix_memalign) so that even the
>> system default allocator wasn't able to be intercepted.
>> 
>> One way to solve this is to build Arrow with -fno-builtin, but
>> unfortunately that disables a lot of builtins that a person may still
>> want.  Since allocation is a whole family of functions and not just a few,
>> it is somewhat difficult to determine which builtins to selectively
>> disallow.  It would be nice if some project (arrow? mimalloc?) made such
>> documentation for popular compilers that substitute builtins for allocation
>> routines.
>> 
>> I opened an issue on mimalloc for this documentation... or at least a
>> warning about builtins for those using the interception techniques such as
>> LD_PRELOAD.
>> 
>> -John
>> 
>> On Tue, Jun 14, 2022 at 3:40 PM Sutou Kouhei <ko...@clear-code.com> wrote:
>> 
>>> Hi,
>>>
>>> posix_memalign() in memory_pool.cc of libarrow-dev uses
>>> jemalloc's posix_memalign() (je_posix_memalign()). Because
>>> it's built with ARROW_JEMALLOC=ON (default) and
>>> JEMALLOC_MANGLE
>>>
>>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/memory_pool.cc#L53
>>> . So we can't use mimalloc with LD_PRELOAD.
>>>
>>> The comment for JEMALLOC_MANGLE in
>>> memory_pool.cc said "Needed to support jemalloc 3 and 4" but
>>> we bundle jemalloc 5.2.1 now. So we can remove JEMALLOC_MANGLE.
>>>
>>> Could you open an issue on Jira
>>> https://issues.apache.org/jira/browse/ARROW to add support
>>> for overriding system memory pool's allocator by LD_PRELOAD?
>>> (Do you want to work on this?)
>>>
>>>
>>> Thanks,
>>> --
>>> kou
>>>

Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

Posted by Sutou Kouhei <ko...@clear-code.com>.
Hi,

I think that compiler builtins aren't related. Could you try
only with -DARROW_JEMALLOC=OFF?

Thanks,
--
kou

In <CA...@mail.gmail.com>
  "Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool" on Tue, 14 Jun 2022 18:32:00 -0500,
  John Muehlhausen <jg...@jgm.org> wrote:

> Thanks for the reply.  I had disabled jemalloc
> via ARROW_DEFAULT_MEMORY_POOL so that was not the issue.
> 
> The issue was (I think) that the arrow lib I was using was built with
> compiler builtins (such as __builtin_posix_memalign) so that even the
> system default allocator wasn't able to be intercepted.
> 
> One way to solve this is to build Arrow with -fno-builtin, but
> unfortunately that disables a lot of builtins that a person may still
> want.  Since allocation is a whole family of functions and not just a few,
> it is somewhat difficult to determine which builtins to selectively
> disallow.  It would be nice if some project (arrow? mimalloc?) made such
> documentation for popular compilers that substitute builtins for allocation
> routines.
> 
> I opened an issue on mimalloc for this documentation... or at least a
> warning about builtins for those using the interception techniques such as
> LD_PRELOAD.
> 
> -John
> 
> On Tue, Jun 14, 2022 at 3:40 PM Sutou Kouhei <ko...@clear-code.com> wrote:
> 
>> Hi,
>>
>> posix_memalign() in memory_pool.cc of libarrow-dev uses
>> jemalloc's posix_memalign() (je_posix_memalign()). Because
>> it's built with ARROW_JEMALLOC=ON (default) and
>> JEMALLOC_MANGLE
>>
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/memory_pool.cc#L53
>> . So we can't use mimalloc with LD_PRELOAD.
>>
>> The comment for JEMALLOC_MANGLE in
>> memory_pool.cc said "Needed to support jemalloc 3 and 4" but
>> we bundle jemalloc 5.2.1 now. So we can remove JEMALLOC_MANGLE.
>>
>> Could you open an issue on Jira
>> https://issues.apache.org/jira/browse/ARROW to add support
>> for overriding system memory pool's allocator by LD_PRELOAD?
>> (Do you want to work on this?)
>>
>>
>> Thanks,
>> --
>> kou
>>

Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

Posted by John Muehlhausen <jg...@jgm.org>.
Thanks for the reply.  I had disabled jemalloc
via ARROW_DEFAULT_MEMORY_POOL so that was not the issue.

The issue was (I think) that the arrow lib I was using was built with
compiler builtins (such as __builtin_posix_memalign) so that even the
system default allocator wasn't able to be intercepted.

One way to solve this is to build Arrow with -fno-builtin, but
unfortunately that disables a lot of builtins that a person may still
want.  Since allocation is a whole family of functions and not just a few,
it is somewhat difficult to determine which builtins to selectively
disallow.  It would be nice if some project (arrow? mimalloc?) made such
documentation for popular compilers that substitute builtins for allocation
routines.

I opened an issue on mimalloc for this documentation... or at least a
warning about builtins for those using the interception techniques such as
LD_PRELOAD.

-John

On Tue, Jun 14, 2022 at 3:40 PM Sutou Kouhei <ko...@clear-code.com> wrote:

> Hi,
>
> posix_memalign() in memory_pool.cc of libarrow-dev uses
> jemalloc's posix_memalign() (je_posix_memalign()). Because
> it's built with ARROW_JEMALLOC=ON (default) and
> JEMALLOC_MANGLE
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/memory_pool.cc#L53
> . So we can't use mimalloc with LD_PRELOAD.
>
> The comment for JEMALLOC_MANGLE in
> memory_pool.cc said "Needed to support jemalloc 3 and 4" but
> we bundle jemalloc 5.2.1 now. So we can remove JEMALLOC_MANGLE.
>
> Could you open an issue on Jira
> https://issues.apache.org/jira/browse/ARROW to add support
> for overriding system memory pool's allocator by LD_PRELOAD?
> (Do you want to work on this?)
>
>
> Thanks,
> --
> kou
>

Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

Posted by Sutou Kouhei <ko...@clear-code.com>.
Hi,

posix_memalign() in memory_pool.cc of libarrow-dev uses
jemalloc's posix_memalign() (je_posix_memalign()) because
the package is built with ARROW_JEMALLOC=ON (the default) and
with JEMALLOC_MANGLE defined:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/memory_pool.cc#L53
So we can't use mimalloc with LD_PRELOAD.

The comment for JEMALLOC_MANGLE in
memory_pool.cc said "Needed to support jemalloc 3 and 4", but
we bundle jemalloc 5.2.1 now. So we can remove JEMALLOC_MANGLE.
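
The effect described above can be reproduced in miniature. The sketch below assumes a made-up prefixed function, `je_demo_posix_memalign`, standing in for the bundled allocator's entry point; it is not Arrow's or jemalloc's actual code. Once the macro is in effect, a call written as posix_memalign() is compiled as a call to the prefixed function, so the dynamic linker never gets a chance to route it to a preloaded allocator.

```cpp
#include <stdlib.h>

#include <cassert>
#include <cstddef>

static int g_private_calls = 0;

// Hypothetical stand-in for a bundled allocator's prefixed entry point.
// A real bundled allocator would serve the request from its own heap; this
// one forwards to libc only to keep the sketch self-contained.
int je_demo_posix_memalign(void** out, std::size_t alignment, std::size_t size) {
  ++g_private_calls;
  return posix_memalign(out, alignment, size);  // macro not defined yet: real libc call
}

// "Mangling": from here on, the public name is rewritten at compile time.
#define posix_memalign je_demo_posix_memalign

// Reads like a plain libc call, but the compiler sees je_demo_posix_memalign,
// a symbol that LD_PRELOAD interposition on "posix_memalign" does not cover.
int allocate_like_system_pool(void** out, std::size_t size) {
  return posix_memalign(out, 64, size);
}

#undef posix_memalign  // keep the rename local to this sketch
```

Calling `allocate_like_system_pool()` once increments `g_private_calls`, which shows that the call went to the prefixed function rather than directly to the public symbol.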

Could you open an issue on Jira
https://issues.apache.org/jira/browse/ARROW to add support
for overriding system memory pool's allocator by LD_PRELOAD?
(Do you want to work on this?)


Thanks,
-- 
kou


Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

Posted by John Muehlhausen <jg...@jgm.org>.
A minimal build using the following seems to have solved my problem.  The
various no-builtin params are guesswork based largely on alloc-override.c
from mimalloc.  It would be nice if someone documented somewhere how to
turn off classes of builtins for each popular compiler or if this received
compiler support (e.g. -fno-builtingroup-allocation)... turning off ALL
builtins seems too heavy-handed.

cmake -E env CFLAGS="-fno-builtin-malloc -fno-builtin-calloc
-fno-builtin-realloc -fno-builtin-free -fno-builtin-reallocf
-fno-builtin-malloc_size -fno-builtin-malloc_usable_size
-fno-builtin-valloc -fno-builtin-vfree -fno-builtin-malloc_good_size
-fno-builtin-posix_memalign -fno-builtin-aligned_alloc -fno-builtin-cfree
-fno-builtin-pvalloc -fno-builtin-reallocarray -fno-builtin-reallocarr
-fno-builtin-memalign -fno-builtin-_aligned_malloc
-fno-builtin-__libc_malloc -fno-builtin-__libc_calloc
-fno-builtin-__libc_realloc -fno-builtin-__libc_free
-fno-builtin-__libc_cfree -fno-builtin-__libc_valloc
-fno-builtin-__libc_pvalloc -fno-builtin-__libc_memalign
-fno-builtin-__posix_memalign -fno-builtin-operator_new
-fno-builtin-operator_delete" CXXFLAGS="-fno-builtin-malloc
-fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free
-fno-builtin-reallocf -fno-builtin-malloc_size
-fno-builtin-malloc_usable_size -fno-builtin-valloc -fno-builtin-vfree
-fno-builtin-malloc_good_size -fno-builtin-posix_memalign
-fno-builtin-aligned_alloc -fno-builtin-cfree -fno-builtin-pvalloc
-fno-builtin-reallocarray -fno-builtin-reallocarr -fno-builtin-memalign
-fno-builtin-_aligned_malloc -fno-builtin-__libc_malloc
-fno-builtin-__libc_calloc -fno-builtin-__libc_realloc
-fno-builtin-__libc_free -fno-builtin-__libc_cfree
-fno-builtin-__libc_valloc -fno-builtin-__libc_pvalloc
-fno-builtin-__libc_memalign -fno-builtin-__posix_memalign
-fno-builtin-operator_new -fno-builtin-operator_delete" cmake --preset
ninja-debug-minimal -DARROW_JEMALLOC=OFF -DARROW_MIMALLOC=OFF
-DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_INSTALL_PREFIX=/usr/local ..

On Tue, Jun 14, 2022 at 12:36 PM John Muehlhausen <jg...@jgm.org> wrote:

> My best guess at this moment is that the Arrow lib I'm using was built
> with a compiler that had something like __builtin_posix_memalign in effect
> ??
>
> I say this because deploying __builtin_malloc has the same deleterious
> effect on my own .so
>
> On Tue, Jun 14, 2022 at 10:53 AM John Muehlhausen <jg...@jgm.org> wrote:
>
>> I'm using ARROW_DEFAULT_MEMORY_POOL=system
>>
>> Based on a review of memory_pool.cc I expect this to become
>> posix_memalign calls on Linux
>>
>> When I call posix_memalign in a .so that I created and linked with my
>> app, using LD_PRELOAD=/usr/local/lib/libmimalloc.so to run the app, these
>> calls get forwarded to mi_posix_memalign (because I threw a printf in there
>> and re-built mimalloc)... note, I'm not talking about Arrow's built-in
>> mimalloc.
>>
>> Maybe Arrow's mimalloc is keeping the LD_PRELOAD of my custom mimalloc
>> from taking effect?  How is mimalloc included in Arrow?  When I
>> call arrow::mimalloc_memory_pool() I do get an Ok status, so it is in the
>> build I'm using from `apt`
>>
>> -John
>>
>> On Tue, Jun 14, 2022 at 10:37 AM Weston Pace <we...@gmail.com>
>> wrote:
>>
>>> Sorry, that should have said "when Arrow builds jemalloc".  Here is
>>> the command we send down (from ThirdPartyToolchain.cmake):
>>>
>>> ```
>>> JEMALLOC_CONFIGURE_COMMAND
>>> "--prefix=${JEMALLOC_PREFIX}"
>>> "--libdir=${JEMALLOC_LIB_DIR}"
>>> "--with-jemalloc-prefix=je_arrow_"
>>> "--with-private-namespace=je_arrow_private_"
>>> "--without-export"
>>> "--disable-shared"
>>> # Don't override operator new()
>>> "--disable-cxx"
>>> "--disable-libdl"
>>> # See https://github.com/jemalloc/jemalloc/issues/1237
>>> "--disable-initial-exec-tls"
>>> ${EP_LOG_OPTIONS})
>>> list(APPEND
>>> ```
>>>
>>> On Tue, Jun 14, 2022 at 5:35 AM Weston Pace <we...@gmail.com>
>>> wrote:
>>> >
>>> > I can try and give a more detailed answer later in the week but the
>>> > gist of it is that Arrow manages all "buffer allocations" with a
>>> > memory pool.  These are the allocations for the actual data in the
>>> > arrays.  These are the allocations that use the memory pool configured
>>> > by ARROW_DEFAULT_MEMORY_POOL.
>>> >
>>> > To avoid interfering with the user's allocations Arrow does not
>>> > configure the system allocator at all.  So when Arrow builds it alters
>>> > it slightly (using cmake variables I think) to be specific to Arrow.
>>> > This might make it a bit tricky to get debug symbols for jemalloc but
>>> > you could always build Arrow in debug mode and intercept the methods
>>> > in memory_pool.cc if your focus is tracking allocations.
>>> >
>>> > Arrow still uses the system allocator for all non-buffer allocations.
>>> > So, for example, when reading in a large IPC file, the majority of the
>>> > data will be allocated by Arrow's memory pool.  However, the schema,
>>> > and the wrapper array object itself will be allocated by the system
>>> > allocator.  This is probably why switching the system allocator to
>>> > jemalloc shows some, but not all, Arrow allocations happening there.
>>> >
>>> > On Tue, Jun 14, 2022 at 5:28 AM John Muehlhausen <jg...@jgm.org> wrote:
>>> > >
>>> > > A code review has demonstrated that Arrow uses posix_memalign ... I
>>> do
>>> > > believe mimalloc preload is "catching" this but I didn't tool it
>>> with my
>>> > > customization.  Still interested in any guidance on the other points
>>> > > raised, and sorry for some of this being noise.
>>> > >
>>> > > -John
>>> > >
>>>
>>
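
The split Weston Pace describes in the quoted text above, with array data going through a memory pool and everything else going through the general-purpose allocator, can be illustrated with a toy model. `ToyPool`, `ToyArray`, and `MakeToyArray` are invented for this sketch and are not Arrow's classes; the 64-byte alignment is likewise just an assumption for the example.

```cpp
#include <stdlib.h>

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>

// Toy stand-in for a memory pool; NOT Arrow's MemoryPool interface.
class ToyPool {
 public:
  // Returns 64-byte-aligned memory, or nullptr on failure.
  void* Allocate(std::size_t size) {
    void* out = nullptr;
    if (posix_memalign(&out, 64, size) != 0) return nullptr;
    bytes_allocated_ += size;
    return out;
  }
  // The caller passes the size back so the pool can keep its statistics.
  void Free(void* ptr, std::size_t size) {
    free(ptr);
    bytes_allocated_ -= size;
  }
  std::size_t bytes_allocated() const { return bytes_allocated_; }

 private:
  std::size_t bytes_allocated_ = 0;
};

// Toy array: only the data buffer comes from the pool.
struct ToyArray {
  std::string name;  // metadata, allocated by the general-purpose allocator
  void* data = nullptr;
  std::size_t size = 0;
};

std::unique_ptr<ToyArray> MakeToyArray(ToyPool* pool, std::string name,
                                       std::size_t size) {
  auto arr = std::make_unique<ToyArray>();  // wrapper object: operator new
  arr->name = std::move(name);
  arr->data = pool->Allocate(size);         // buffer: memory pool
  arr->size = arr->data ? size : 0;
  return arr;
}
```

In this model an interposed malloc would see the wrapper object and the name string, while the pool's statistics would see only the data buffer. That mirrors the "some, but not all" behaviour mentioned in the thread.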

Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

Posted by John Muehlhausen <jg...@jgm.org>.
My best guess at this moment is that the Arrow lib I'm using was built with
a compiler that had something like __builtin_posix_memalign in effect ??

I say this because deploying __builtin_malloc has the same deleterious
effect on my own .so

On Tue, Jun 14, 2022 at 10:53 AM John Muehlhausen <jg...@jgm.org> wrote:

> I'm using ARROW_DEFAULT_MEMORY_POOL=system
>
> Based on a review of memory_pool.cc I expect this to become posix_memalign
> calls on Linux
>
> When I call posix_memalign in a .so that I created and linked with my
> app, using LD_PRELOAD=/usr/local/lib/libmimalloc.so to run the app, these
> calls get forwarded to mi_posix_memalign (because I threw a printf in there
> and re-built mimalloc)... note, I'm not talking about Arrow's built-in
> mimalloc.
>
> Maybe Arrow's mimalloc is keeping the LD_PRELOAD of my custom mimalloc
> from taking effect?  How is mimalloc included in Arrow?  When I
> call arrow::mimalloc_memory_pool() I do get an Ok status, so it is in the
> build I'm using from `apt`
>
> -John
>

Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

Posted by John Muehlhausen <jg...@jgm.org>.
I'm using ARROW_DEFAULT_MEMORY_POOL=system.

Based on a review of memory_pool.cc, I expect this to become posix_memalign
calls on Linux.

When I call posix_memalign in a .so that I created and linked with my app,
and run the app with LD_PRELOAD=/usr/local/lib/libmimalloc.so, these calls
get forwarded to mi_posix_memalign (I know because I put a printf in there
and rebuilt mimalloc). Note that I'm not talking about Arrow's built-in
mimalloc.

Maybe Arrow's mimalloc is keeping the LD_PRELOAD of my custom mimalloc from
taking effect?  How is mimalloc included in Arrow?  When I call
arrow::mimalloc_memory_pool() I do get an OK status, so it is in the
build I'm using from `apt`.

-John
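
The experiment described above can be reproduced with a small probe compiled
into a separate shared object. This is only a minimal sketch, not Arrow's
code: `ProbeAlignedAlloc`, `ProbeFree` and `kProbeAlignment` are hypothetical
names, and the 64-byte alignment is an assumption made for illustration.
Running the app with and without LD_PRELOAD, with a printf placed in
mi_posix_memalign, should show whether the call is forwarded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <stdlib.h>  // posix_memalign and free (POSIX, not ISO C++)

// Assumed alignment for illustration; not taken from the Arrow sources.
constexpr std::size_t kProbeAlignment = 64;

// Hypothetical probe: request aligned memory through the libc entry point
// (posix_memalign) that an LD_PRELOADed allocator is expected to override.
// Returns nullptr on failure.
std::uint8_t* ProbeAlignedAlloc(std::size_t size) {
  assert(size > 0);
  void* out = nullptr;
  const int rc = posix_memalign(&out, kProbeAlignment, size);
  if (rc != 0) {
    std::fprintf(stderr, "posix_memalign(%zu) failed: %d\n", size, rc);
    return nullptr;
  }
  return static_cast<std::uint8_t*>(out);
}

// Memory obtained from posix_memalign is released with free().
void ProbeFree(std::uint8_t* p) { free(p); }
```

If the probe is intercepted but the same call made from inside libarrow is
not, the difference is more likely in how the two libraries resolve the
symbol than in the allocation size.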


Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

Posted by Weston Pace <we...@gmail.com>.
Sorry, that should have said "when Arrow builds jemalloc".  Here is
the command we send down (from ThirdPartyToolchain.cmake):

```
JEMALLOC_CONFIGURE_COMMAND
"--prefix=${JEMALLOC_PREFIX}"
"--libdir=${JEMALLOC_LIB_DIR}"
"--with-jemalloc-prefix=je_arrow_"
"--with-private-namespace=je_arrow_private_"
"--without-export"
"--disable-shared"
# Don't override operator new()
"--disable-cxx"
"--disable-libdl"
# See https://github.com/jemalloc/jemalloc/issues/1237
"--disable-initial-exec-tls"
${EP_LOG_OPTIONS})
list(APPEND
```


Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

Posted by Weston Pace <we...@gmail.com>.
I can try and give a more detailed answer later in the week but the
gist of it is that Arrow manages all "buffer allocations" with a
memory pool.  These are the allocations for the actual data in the
arrays.  These are the allocations that use the memory pool configured
by ARROW_DEFAULT_MEMORY_POOL.

To avoid interfering with the user's allocations Arrow does not
configure the system allocator at all.  So when Arrow builds it alters
it slightly (using cmake variables I think) to be specific to Arrow.
This might make it a bit tricky to get debug symbols for jemalloc but
you could always build Arrow in debug mode and intercept the methods
in memory_pool.cc if your focus is tracking allocations.

Arrow still uses the system allocator for all non-buffer allocations.
So, for example, when reading in a large IPC file, the majority of the
data will be allocated by Arrow's memory pool.  However, the schema,
and the wrapper array object itself will be allocated by the system
allocator.  This is probably why switching the system allocator to
jemalloc shows some, but not all, Arrow allocations happening there.
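
The split described above (buffer data goes through a pool, everything else
goes through the system allocator) can be illustrated with a small auditing
wrapper. This is a sketch under stated assumptions: `DemoPool`,
`DemoSystemPool` and `DemoAuditingPool` are hypothetical names invented here,
and the interface is deliberately simpler than Arrow's real
arrow::MemoryPool. Only allocations routed through the wrapper are counted;
objects created with plain `new` or `malloc` never reach it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign and free (POSIX)

// Hypothetical, simplified pool interface for illustration only.
class DemoPool {
 public:
  virtual ~DemoPool() = default;
  virtual bool Allocate(std::int64_t size, std::uint8_t** out) = 0;
  virtual void Free(std::uint8_t* buffer, std::int64_t size) = 0;
};

// Stand-in for a "system" pool: aligned allocation via posix_memalign.
// The 64-byte alignment is an assumption for this sketch.
class DemoSystemPool : public DemoPool {
 public:
  bool Allocate(std::int64_t size, std::uint8_t** out) override {
    assert(out != nullptr);
    void* p = nullptr;
    if (size <= 0 ||
        posix_memalign(&p, 64, static_cast<std::size_t>(size)) != 0) {
      return false;
    }
    *out = static_cast<std::uint8_t*>(p);
    return true;
  }
  void Free(std::uint8_t* buffer, std::int64_t /*size*/) override {
    free(buffer);
  }
};

// Wraps another pool and records what passes through it.
class DemoAuditingPool : public DemoPool {
 public:
  explicit DemoAuditingPool(DemoPool* wrapped) : wrapped_(wrapped) {}

  bool Allocate(std::int64_t size, std::uint8_t** out) override {
    if (!wrapped_->Allocate(size, out)) return false;
    ++num_allocations_;
    bytes_allocated_ += size;
    return true;
  }
  void Free(std::uint8_t* buffer, std::int64_t size) override {
    wrapped_->Free(buffer, size);
    bytes_allocated_ -= size;
  }

  std::int64_t num_allocations() const { return num_allocations_; }
  std::int64_t bytes_allocated() const { return bytes_allocated_; }

 private:
  DemoPool* wrapped_;
  std::int64_t num_allocations_ = 0;
  std::int64_t bytes_allocated_ = 0;
};
```

A wrapper like this only helps when it is passed explicitly to the code that
allocates buffers; it does not change which allocator serves the schema and
wrapper objects mentioned above.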


Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

Posted by John Muehlhausen <jg...@jgm.org>.
I take that back... the preload is not intercepting the memory_pool.cc
-> SystemAllocator -> AllocateAligned -> posix_memalign path (if indeed this
is the system allocator path), although it is intercepting posix_memalign
calls from a different .so.


Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

Posted by John Muehlhausen <jg...@jgm.org>.
A code review has demonstrated that Arrow uses posix_memalign ... I do
believe mimalloc preload is "catching" this but I didn't tool it with my
customization.  Still interested in any guidance on the other points
raised, and sorry for some of this being noise.

-John
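
One of the other points raised in the original message is whether a
replacement default pool must be installed before any static constructor
runs. A common C++ way to sidestep that ordering problem is to keep the
"current default" in function-local statics, which are initialized on first
use. The sketch below mirrors the shape of the Python-side
`set_default_memory_pool` snippet; `AppPool`, `demo::set_default_pool` and
`demo::default_pool` are hypothetical names, and nothing here is Arrow's API
or changes what Arrow itself treats as its default pool.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical placeholder type standing in for the application's pool.
struct AppPool {
  const char* name;
};

namespace demo {

// Function-local statics are initialized on first call (thread-safe since
// C++11), so they are ready even when first used from another translation
// unit's static constructor.
inline std::mutex& PoolMutex() {
  static std::mutex m;
  return m;
}

inline AppPool*& PoolSlot() {
  static AppPool builtin{"builtin"};
  static AppPool* current = &builtin;
  return current;
}

// Replace the default; callers that already fetched the old pointer keep it.
inline void set_default_pool(AppPool* pool) {
  assert(pool != nullptr);
  std::lock_guard<std::mutex> guard(PoolMutex());
  PoolSlot() = pool;
}

inline AppPool* default_pool() {
  std::lock_guard<std::mutex> guard(PoolMutex());
  return PoolSlot();
}

}  // namespace demo
```

With this shape there is no need to win a constructor-ordering race, but any
allocation made before `set_default_pool` is called still comes from the
built-in pool.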
