Posted to issues@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/05/04 02:47:00 UTC

[jira] [Created] (ARROW-16451) [C++] ParquetFileFragment caches parquet file metadata and there is no way to disable this

Weston Pace created ARROW-16451:
-----------------------------------

             Summary: [C++] ParquetFileFragment caches parquet file metadata and there is no way to disable this
                 Key: ARROW-16451
                 URL: https://issues.apache.org/jira/browse/ARROW-16451
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace


While looking into ARROW-15081 we saw an unexpectedly large amount of memory in use, even though all of the results were being accumulated into a single 64 byte counter (e.g. {{SELECT COUNT(*) FROM table}}).

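It turns out the memory was being used by the parquet file metadata, which gets attached to the parquet file fragment and stays there for as long as the fragment is alive.  There is currently no way to prevent this and, in this case, it used quite a bit of RAM: there were 1100 files and each file had ~10MB of metadata, which comes to roughly 11GB in total.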

We should have an option for disabling this caching.  The caching should probably also be off by default.  Keeping the metadata around is useful if you are going to scan the same dataset again and again, but otherwise it is just wasted RAM.
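
To make the proposal concrete, here is a rough, self-contained sketch of the behavior being asked for.  It does not use the Arrow API: {{ToyMetadata}}, {{ToyFragment}}, {{ToyFragmentOptions}}, {{cache_metadata}} and {{RetainedAfterScan}} are all hypothetical names made up for illustration, and the real fix may look quite different.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for the parsed metadata of one parquet file.
struct ToyMetadata {
  std::vector<char> bytes;
};

// Hypothetical option (not an existing Arrow setting). Off by default,
// as suggested in this issue.
struct ToyFragmentOptions {
  bool cache_metadata = false;
};

// Toy model of a file fragment that may or may not keep its metadata.
class ToyFragment {
 public:
  ToyFragment(std::size_t metadata_size, ToyFragmentOptions options)
      : metadata_size_(metadata_size), options_(options) {}

  // Simulates one scan: load the metadata if it is not cached, use it,
  // and retain it only when caching was requested.
  void Scan() {
    std::shared_ptr<ToyMetadata> metadata = cached_;
    if (!metadata) {
      metadata = std::make_shared<ToyMetadata>();
      metadata->bytes.resize(metadata_size_);  // pretend this read the footer
      if (options_.cache_metadata) {
        cached_ = metadata;
      }
    }
    assert(metadata->bytes.size() == metadata_size_);
    // Without caching, the only reference is the local one, so the
    // metadata is freed when this function returns.
  }

  // Bytes of metadata still held by this fragment after scanning.
  std::size_t RetainedBytes() const {
    return cached_ ? cached_->bytes.size() : 0;
  }

 private:
  std::size_t metadata_size_;
  ToyFragmentOptions options_;
  std::shared_ptr<ToyMetadata> cached_;
};

// Scans num_files toy fragments once and reports how much metadata the
// fragments are still holding afterwards.
std::size_t RetainedAfterScan(std::size_t num_files, std::size_t metadata_size,
                              bool cache_metadata) {
  ToyFragmentOptions options;
  options.cache_metadata = cache_metadata;
  std::vector<ToyFragment> fragments(num_files,
                                     ToyFragment(metadata_size, options));
  std::size_t retained = 0;
  for (auto& fragment : fragments) {
    fragment.Scan();
    retained += fragment.RetainedBytes();
  }
  return retained;
}
```

In this toy model, calling {{RetainedAfterScan(1100, 1024, true)}} leaves every fragment holding its metadata, so the retained size grows with the number of files, while passing {{false}} leaves nothing behind once the scan is done.  The sizes here are scaled down; with ~10MB per file the cached case would be the ~11GB situation described above.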
