Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/12/07 03:42:00 UTC

[jira] [Commented] (ARROW-14987) [C++]Memory leak while reading parquet file

    [ https://issues.apache.org/jira/browse/ARROW-14987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454354#comment-17454354 ] 

Weston Pace commented on ARROW-14987:
-------------------------------------

*TL;DR: A chunk_size of 3 is way too low.*

Thank you so much for the detailed reproduction.

# Some notes

First, I used 5 times the amount of data that you were working with.  This works out to 12.5MB of int64_t "data".

Second, you are not releasing the variable named "table" in your main method.  This holds on to 12.5MB of RAM.  I added table.reset() before the sleep to take care of this.
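
Concretely, the end of main() in the reproduction (quoted below) becomes something like this, with just that one change:

{code:c++}
int main(int argc, char** argv) {
  std::shared_ptr<arrow::Table> table = generate_table();
  write_parquet_file(*table);
  std::cout << "start " << std::endl;
  read_whole_file();
  std::cout << "end " << std::endl;
  table.reset();  // drop the last reference so the ~12.5MB table can actually be freed
  sleep(100);
}
{code}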

Third, a chunk size of 3 is pathologically small. This means parquet is going to have to write row group metadata after every 3 rows of data.  As a result, the parquet file, which only contains 12.5MB of real data, requires 169MB.  This means there is ~157MB of metadata.  A chunk size should, at a minimum, be in the tens of thousands, and often is in the millions.

*When I run this test I end up with nearly 1GB of memory usage!  Even given the erroneously large parquet file, this seems like way too much.*

# Figuring out Arrow memory pool usage

One helpful tool when determining how much RAM Arrow is using is to print out how many bytes Arrow thinks it is holding onto.  To do this you can add...

{noformat}
std::cout << arrow::default_memory_pool()->bytes_allocated() << " bytes_allocated" << std::endl;
{noformat}

Assuming you add the "table.reset()" call, this should print "0 bytes_allocated", which means that Arrow is not holding on to any memory.

The second common suspect is jemalloc.  Arrow uses jemalloc (or possibly mimalloc) internally in its memory pools, and these allocators sometimes over-allocate and sometimes hold onto memory for a little while.  However, this seems unlikely to be the problem here, because Arrow configures jemalloc by default to release over-allocated memory every second.
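
(As an aside, that decay interval is also tunable from C++; a minimal sketch, assuming Arrow was built with its bundled jemalloc, otherwise the call just returns an error status:)

{code:c++}
#include <arrow/memory_pool.h>
#include <iostream>

int main() {
  // Ask Arrow's internal jemalloc pool to return dirty pages immediately
  // instead of after the default 1 second decay.
  arrow::Status st = arrow::jemalloc_set_decay_ms(0);
  if (!st.ok()) {
    // e.g. Arrow was built with mimalloc or the system allocator instead
    std::cout << st.ToString() << std::endl;
  }
  return 0;
}
{code}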

To verify, I built an instrumented version of Arrow that prints stats for its internal jemalloc pool after 5 seconds of being idle.  I got:

{noformat}
Allocated: 29000, active: 45056, metadata: 6581448 (n_thp 0), resident: 6606848, mapped: 12627968, retained: 125259776
{noformat}

This means Arrow has 29KB of data actively allocated (this is curious, given bytes_allocated is 0, and worth investigation at a later date, but certainly not the culprit here).

That 29KB of active data spans 45.056KB of pages (this is what people refer to when they talk about fragmentation).  There is also 6.58MB of jemalloc metadata.  I'm pretty sure this is rather independent of the workload and not something to worry too much about.

Combined, this 45.056KB of data and 6.58MB of metadata occupy 6.61MB of RSS.  So far so good.

# Figuring out the rest of the memory usage

There is only one other place the remaining memory usage can be: the application's global system allocator.  To debug this further I built my test application with jemalloc (a different jemalloc instance than the one running inside Arrow).  This means Arrow's memory pool uses one instance of jemalloc and everything else uses my own instance.  Printing stats for my instance, I get:

{noformat}
Allocated: 257904, active: 569344, metadata: 15162288 (n_thp 0), resident: 950906880, mapped: 958836736, retained: 648630272
{noformat}

Now we have found our culprit.  There is about 258KB allocated, occupying 569KB worth of pages, plus 15MB of jemalloc metadata.  That part is pretty reasonable (this is memory used by shared pointers and various metadata objects).

_However, this ~15MB of data is occupying nearly 1GB of RSS!_
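
(For reference, the stats lines above come from the application's own jemalloc; a minimal sketch of how they can be queried via mallctl, assuming a plain non-prefixed jemalloc build is linked into the application:)

{code:c++}
#include <jemalloc/jemalloc.h>

#include <cstdint>
#include <cstdio>

void PrintJemallocStats() {
  // Writing to "epoch" refreshes jemalloc's cached statistics.
  uint64_t epoch = 1;
  size_t sz = sizeof(epoch);
  mallctl("epoch", &epoch, &sz, &epoch, sizeof(epoch));

  size_t allocated = 0, active = 0, metadata = 0, resident = 0, mapped = 0, retained = 0;
  sz = sizeof(size_t);
  mallctl("stats.allocated", &allocated, &sz, nullptr, 0);
  mallctl("stats.active", &active, &sz, nullptr, 0);
  mallctl("stats.metadata", &metadata, &sz, nullptr, 0);
  mallctl("stats.resident", &resident, &sz, nullptr, 0);
  mallctl("stats.mapped", &mapped, &sz, nullptr, 0);
  mallctl("stats.retained", &retained, &sz, nullptr, 0);

  std::printf("Allocated: %zu, active: %zu, metadata: %zu, resident: %zu, mapped: %zu, retained: %zu\n",
              allocated, active, metadata, resident, mapped, retained);
}
{code}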

To debug further, I used jemalloc's memory profiling to track where all of these allocations were happening.  It turns out most of them were in the parquet reader itself.  While the table being read will eventually be constructed in Arrow's memory pool, the parquet reader does not use the memory pool for the various allocations needed to operate the reader itself.
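
(The profiling itself requires nothing in the code; assuming a jemalloc build with --enable-prof, it can be switched on from the environment and the resulting dumps inspected with jeprof. Roughly, with the binary name being just a placeholder:)

{noformat}
MALLOC_CONF="prof:true,prof_final:true,lg_prof_sample:17" ./parquet-read-repro
jeprof --text ./parquet-read-repro jeprof.*.heap
{noformat}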

So, putting this all together into a hypothesis...

The chunk size of 3 means we have a ton of metadata.  This metadata gets allocated by the parquet reader in lots of very small allocations.  These allocations fragment badly: the system allocator ends up scattering them across a wide swath of RSS, resulting in a large amount of over-allocation.

# Fixes

## Fix 1: Use more jemalloc

Since my test was already using jemalloc, I can configure it the same way Arrow does: enable the background thread and set it to purge on a 1 second interval (the exact settings are sketched at the end of this section).  Now, running my test, after 5 seconds of inactivity I get the following from the global jemalloc:

{noformat}
Allocated: 246608, active: 544768, metadata: 15155760 (n_thp 0), resident: 15675392, mapped: 23613440, retained: 1382526976
{noformat}

We now see that the same ~15MB of data and jemalloc metadata is spread across only 15.6MB of RSS (very little fragmentation).  I can confirm this by looking at the RSS of the process, which reports 25MB (most of which is explained by the two jemalloc instances' metadata); this is a massive improvement over 1GB.
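
(For reference, the jemalloc tuning described above is roughly the following; a sketch, and the same string can instead be supplied at startup via the MALLOC_CONF environment variable:)

{code:c++}
#include <jemalloc/jemalloc.h>

// Picked up by jemalloc at process startup; roughly the same settings Arrow
// uses for its own internal jemalloc (a background purge thread plus a
// 1 second dirty/muzzy decay).
const char* malloc_conf =
    "background_thread:true,dirty_decay_ms:1000,muzzy_decay_ms:1000";
{code}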

## Fix 2: Use a sane chunk size

If I change the chunk size to 100,000, parquet no longer makes so many tiny allocations (and my program runs much faster).  I get the following stats for the global jemalloc instance:

{noformat}
Allocated: 1756168, active: 2027520, metadata: 4492600 (n_thp 0), resident: 6496256, mapped: 8318976, retained: 64557056
{noformat}

And I see only 18.5MB of RSS usage.
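
(Concretely, the chunk size is just the last argument to parquet::arrow::WriteTable in the reproduction quoted below; bumping it from 3 looks like this:)

{code:c++}
// chunk_size is the maximum number of rows per row group; it was 3 in the
// original reproduction.
PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(
    table, arrow::default_memory_pool(), outfile, /*chunk_size=*/100000));
{code}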

> [C++]Memory leak while reading parquet file
> -------------------------------------------
>
>                 Key: ARROW-14987
>                 URL: https://issues.apache.org/jira/browse/ARROW-14987
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 6.0.1
>            Reporter: Qingxiang Chen
>            Priority: Major
>
> When I used parquet to read data, I found that memory usage was still high after the function ended. I reproduced the problem in the example code shown below:
>  
> {code:c++}
> #include <arrow/api.h>
> #include <arrow/io/api.h>
> #include <parquet/arrow/reader.h>
> #include <parquet/arrow/writer.h>
> #include <parquet/exception.h>
> #include <unistd.h>
> #include <iostream>
> std::shared_ptr<arrow::Table> generate_table() {
>   arrow::Int64Builder i64builder;
>   for (int i = 0; i < 320000; i++) {
>     PARQUET_THROW_NOT_OK(i64builder.Append(i));
>   }
>   std::shared_ptr<arrow::Array> i64array;
>   PARQUET_THROW_NOT_OK(i64builder.Finish(&i64array));
>   std::shared_ptr<arrow::Schema> schema = arrow::schema(
>       {arrow::field("int", arrow::int64())});
>   return arrow::Table::Make(schema, {i64array});
> }
> void write_parquet_file(const arrow::Table& table) {
>   std::shared_ptr<arrow::io::FileOutputStream> outfile;
>   PARQUET_ASSIGN_OR_THROW(
>       outfile, arrow::io::FileOutputStream::Open("parquet-arrow-example.parquet"));
>   PARQUET_THROW_NOT_OK(
>       parquet::arrow::WriteTable(table, arrow::default_memory_pool(), outfile, 3));
> }
> void read_whole_file() {
>   std::cout << "Reading parquet-arrow-example.parquet at once" << std::endl;
>   std::shared_ptr<arrow::io::ReadableFile> infile;
>   PARQUET_ASSIGN_OR_THROW(infile,
>                           arrow::io::ReadableFile::Open("parquet-arrow-example.parquet",
>                                                         arrow::default_memory_pool()));
>   std::unique_ptr<parquet::arrow::FileReader> reader;
>   PARQUET_THROW_NOT_OK(
>       parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
>   std::shared_ptr<arrow::Table> table;
>   PARQUET_THROW_NOT_OK(reader->ReadTable(&table));
>   std::cout << "Loaded " << table->num_rows() << " rows in " << table->num_columns()
>             << " columns." << std::endl;
> }
> int main(int argc, char** argv) {
>   std::shared_ptr<arrow::Table> table = generate_table();
>   write_parquet_file(*table);
>   std::cout << "start " <<std::endl;
>   read_whole_file();
>   std::cout << "end " <<std::endl;
>   sleep(100);
> }
> {code}
> After read_whole_file() returns, during the sleep, memory usage is still more than 100MB and does not drop. When I increase the data volume by 5 times, memory usage is about 500MB, and it still does not drop.
> I want to know whether this data is being cached by the memory pool, or whether this is a memory leak. If there is no memory leak, how can I set the memory pool size or release the memory?


