Posted to issues@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/05/04 02:47:00 UTC
[jira] [Created] (ARROW-16451) [C++] ParquetFileFragment caches parquet file metadata and there is no way to disable this
Weston Pace created ARROW-16451:
-----------------------------------
Summary: [C++] ParquetFileFragment caches parquet file metadata and there is no way to disable this
Key: ARROW-16451
URL: https://issues.apache.org/jira/browse/ARROW-16451
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Weston Pace
While investigating ARROW-15081 we saw an unexpectedly large amount of memory in use, even though all of the results were being accumulated into a single 64 byte counter (e.g. {{SELECT COUNT(*) FROM table}}).
The memory turned out to be the parquet metadata, which gets attached to each parquet file fragment. There is no way to prevent this and, in this case, it consumed a lot of RAM: there were 1100 files and each file had ~10MB of metadata, which adds up to roughly 11GB.
We should add an option to disable this caching, and it should probably be off by default. Caching the metadata can be useful if you are going to scan the same dataset again and again, but otherwise it is just wasted RAM.
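To make the proposal concrete, here is a minimal, self-contained sketch of what an opt-in metadata cache could look like. It does not use the Arrow API: {{FileMetadata}}, {{FragmentOptions}}, {{cache_metadata}}, {{Fragment}} and {{RetainedAfterScan}} are all hypothetical names invented for illustration, and the "metadata" is just a byte buffer of a chosen size.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for parsed parquet file metadata (not an Arrow type).
struct FileMetadata {
  std::vector<char> bytes;
};

// Hypothetical option, off by default as suggested in this issue.
struct FragmentOptions {
  bool cache_metadata = false;
};

// Hypothetical fragment that only keeps metadata when caching is enabled.
class Fragment {
 public:
  Fragment(std::string path, std::size_t metadata_size, FragmentOptions options)
      : path_(std::move(path)), metadata_size_(metadata_size), options_(options) {}

  const std::string& path() const { return path_; }

  // Returns the metadata, reusing the cached copy if there is one.
  // The fragment retains a reference only when cache_metadata is set.
  std::shared_ptr<const FileMetadata> GetMetadata() {
    if (cached_) return cached_;
    std::shared_ptr<const FileMetadata> loaded = Load();
    if (options_.cache_metadata) cached_ = loaded;
    return loaded;
  }

  // Bytes of metadata this fragment is still holding on to.
  std::size_t RetainedBytes() const {
    return cached_ ? cached_->bytes.size() : 0;
  }

 private:
  // Simulates reading the file footer by allocating a buffer.
  std::shared_ptr<const FileMetadata> Load() const {
    auto metadata = std::make_shared<FileMetadata>();
    metadata->bytes.resize(metadata_size_);
    return metadata;
  }

  std::string path_;
  std::size_t metadata_size_;
  FragmentOptions options_;
  std::shared_ptr<const FileMetadata> cached_;
};

// Simulates one scan over num_files fragments and reports how many bytes
// of metadata remain attached to the fragments afterwards.
std::size_t RetainedAfterScan(std::size_t num_files, std::size_t metadata_size,
                              bool cache_metadata) {
  FragmentOptions options;
  options.cache_metadata = cache_metadata;

  std::vector<Fragment> fragments;
  fragments.reserve(num_files);
  for (std::size_t i = 0; i < num_files; ++i) {
    fragments.emplace_back("file-" + std::to_string(i) + ".parquet",
                           metadata_size, options);
  }

  for (auto& fragment : fragments) {
    // The metadata is released at the end of each iteration unless cached.
    std::shared_ptr<const FileMetadata> metadata = fragment.GetMetadata();
    (void)metadata;
  }

  std::size_t retained = 0;
  for (const auto& fragment : fragments) retained += fragment.RetainedBytes();
  return retained;
}
```

In this sketch, {{RetainedAfterScan(1100, 1024, false)}} leaves nothing attached to the fragments after the scan, while passing {{true}} keeps every buffer alive for as long as the fragments exist. That matches the trade-off described above: retained memory grows with the number of files when caching is on, in exchange for skipping the reload on repeated scans.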
--
This message was sent by Atlassian Jira
(v8.20.7#820007)