Posted to dev@arrow.apache.org by ZHOU Yuan <du...@gmail.com> on 2021/06/18 01:47:56 UTC

Question on releasing record batch

Hi Arrow developers,

I ran into a memory footprint issue after releasing record batches
manually. The logic of my program is roughly (sketched in pseudocode below):
0. read many record batches
1. process these batches
2. dump the intermediate results to disk
3. close the batches
4. run other operations
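In rough pseudocode (the helper names here are made up):

  for (std::shared_ptr<arrow::RecordBatch>& batch : batches) {
    auto result = Process(batch);  // stage 1 (hypothetical helper)
    DumpToDisk(result);            // stage 2 (hypothetical helper)
    batch.reset();                 // stage 3: close/release the batch
  }
  // stage 4: other operations; I expect the batch memory to be
  // released/reusable by this point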

I expected the memory footprint to drop after stage 3; however, it looks
like the memory is not released.
I then wrote a small test program to check the behavior. Running it under
GDB, the destructor of the RecordBatch is indeed called in
"input_batch.reset()", but the memory is not released until I kill the
whole program.

I understand the lifetime of a RecordBatch is controlled by the number of
owners of its shared_ptr, so it will be released eventually,
but are there any APIs or other ways to release it manually in the middle
of my program?
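For reference, my understanding is that dropping the last owner is itself
the "manual release", e.g.:

  // If this shared_ptr is the sole owner, reset() should drop the
  // arrays and their underlying buffers immediately.
  assert(input_batch.use_count() == 1);
  input_batch.reset();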

The test code snippet is below. Thanks!

=======
  #include <arrow/api.h>
  #include <arrow/ipc/json_simple.h>     // internal JSON-to-Array helper
  #include <arrow/testing/gtest_util.h>  // ASSERT_OK

  #include <cassert>
  #include <chrono>
  #include <memory>
  #include <string>
  #include <thread>
  #include <vector>

  using namespace arrow;

  // Body of a gtest TEST():
  auto f0 = field("f0", float64());
  auto f1 = field("f1", uint32());
  auto sch = schema({f0, f1});

  std::vector<std::string> input_data_string = {
      "[10, NaN, 4, 50, 52, 32, 11]",
      "[11, 13, 5, 51, null, 33, 12]"};

  // Prepare the input record batch: one JSON array per schema field.
  std::vector<std::shared_ptr<Array>> array_list;
  int64_t length = -1;
  int i = 0;
  for (const auto& data : input_data_string) {
    std::shared_ptr<Array> a0;
    ASSERT_OK(ipc::internal::json::ArrayFromJSON(sch->field(i++)->type(),
                                                 data.c_str(), &a0));
    if (length == -1) {
      length = a0->length();
    }
    assert(length == a0->length());  // all columns must have the same length
    array_list.push_back(a0);
  }

  auto input_batch = RecordBatch::Make(sch, length, std::move(array_list));

  input_batch.reset();  // the only owner is dropped: should be freed here?
  std::this_thread::sleep_for(std::chrono::seconds(20));
thanks, -yuan

Re: Question on releasing record batch

Posted by ZHOU Yuan <du...@gmail.com>.
Hi Weston, thank you for the input!

I was watching the memory usage with "top -p `pidof test`", and the
resident memory size was not reduced.

With the new counter I can see the memory is freed immediately on the
Arrow side, so this is related to my allocator.
I actually disabled jemalloc/mimalloc during the Arrow build, but didn't
realize the glibc allocator would have similar behavior.
I'll do more debugging on the allocator side then.
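One thing I plan to try, assuming the build really falls back to glibc
malloc, is asking the allocator to return freed pages to the OS:

  #include <malloc.h>  // glibc-specific

  input_batch.reset();  // drop the last reference first
  malloc_trim(0);       // ask glibc malloc to give freed pages back to the OS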

Thanks again!

thanks, -yuan


On Fri, Jun 18, 2021 at 10:21 AM Weston Pace <we...@gmail.com> wrote:

Re: Question on releasing record batch

Posted by Weston Pace <we...@gmail.com>.
The only owner of input_batch that I can see here is the shared_ptr
that you are resetting, so I would expect the memory to be freed.

How are you measuring memory usage? The dynamic allocators (mimalloc /
jemalloc) don't always release memory as soon as they possibly can.
Even malloc will sometimes be forced to hang onto memory due to
fragmentation issues, etc. Can you try measuring memory usage with
arrow::default_memory_pool()->bytes_allocated()?
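
For example, a quick (untested) check along these lines, reusing the
input_batch from your snippet:

  #include <arrow/api.h>
  #include <iostream>

  // Arrow's own accounting of live allocations from its default pool.
  std::cout << "before reset: "
            << arrow::default_memory_pool()->bytes_allocated() << " bytes\n";
  input_batch.reset();
  std::cout << "after reset:  "
            << arrow::default_memory_pool()->bytes_allocated() << " bytes\n";

If that number drops to (near) zero after the reset, then Arrow has freed
the buffers, and any memory the process still appears to hold is the
allocator caching pages rather than a leak.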

On Thu, Jun 17, 2021 at 3:48 PM ZHOU Yuan <du...@gmail.com> wrote: