Posted to dev@parquet.apache.org by "Yujiang Zhong (Jira)" <ji...@apache.org> on 2022/06/16 13:21:00 UTC
[jira] [Commented] (PARQUET-2160) Close decompression stream to free off-heap memory in time
[ https://issues.apache.org/jira/browse/PARQUET-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17555082#comment-17555082 ]
Yujiang Zhong commented on PARQUET-2160:
----------------------------------------
[~shangxinli] [~dongjoon] Can you please take a look at this?
> Close decompression stream to free off-heap memory in time
> ----------------------------------------------------------
>
> Key: PARQUET-2160
> URL: https://issues.apache.org/jira/browse/PARQUET-2160
> Project: Parquet
> Issue Type: Improvement
> Environment: Spark 3.1.2 + Iceberg 0.12 + Parquet 1.12.3 + zstd-jni 1.4.9.1 + glibc
> Reporter: Yujiang Zhong
> Priority: Major
>
> The decompression stream currently relies on JVM GC to be closed. When reading Parquet files compressed with zstd, I sometimes ran into OOM caused by high off-heap usage. I think the reason is that GC does not run in time, which leads to off-heap memory fragmentation. I had to set a lower MALLOC_TRIM_THRESHOLD_ to make glibc return memory to the system quickly. There is a thread (https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1650928750269869?thread_ts=1650927062.590789&cid=C025PH0G1D4) about this zstd Parquet issue in the Iceberg community Slack: some people had the same problem.
> I think we can close the decompression stream manually, as soon as it has been read, to solve this problem:
>
> InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
> decompressed = BytesInput.from(is, uncompressedSize);
> ->
> InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
> decompressed = BytesInput.copy(BytesInput.from(is, uncompressedSize));
> is.close();
>
> After I made this change to decompression, I found that off-heap memory usage was significantly reduced (same query on the same data).
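
The pattern proposed in the description (read everything eagerly, then close the stream right away) can be sketched outside Parquet. This is only an illustration under assumptions: java.util.zip.GZIPInputStream stands in for the zstd codec stream, and the class and method names (EagerDecompress, decompressAndClose, gzip) are hypothetical, not Parquet APIs. As far as I understand, BytesInput.from(is, size) reads lazily, which would be why the proposal copies the bytes before calling close().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-alone sketch; gzip is used only as a stand-in for zstd.
final class EagerDecompress {

    // Reads exactly uncompressedSize bytes into a heap array, then closes the
    // stream immediately so its native resources are released without waiting
    // for GC.
    static byte[] decompressAndClose(byte[] compressed, int uncompressedSize) throws IOException {
        byte[] out = new byte[uncompressedSize];
        // try-with-resources calls close() even if reading fails.
        try (InputStream is = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            int off = 0;
            while (off < uncompressedSize) {
                int n = is.read(out, off, uncompressedSize - off);
                if (n < 0) {
                    throw new EOFException("stream ended after " + off + " bytes");
                }
                off += n;
            }
        }
        return out;
    }

    // Helper that produces test input for the sketch.
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = "hello parquet".getBytes(StandardCharsets.UTF_8);
        byte[] back = decompressAndClose(gzip(raw), raw.length);
        System.out.println(new String(back, StandardCharsets.UTF_8));
    }
}
```

The trade-off is one extra on-heap copy of the decompressed page in exchange for releasing the decompressor's native buffers deterministically.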
--
This message was sent by Atlassian Jira
(v8.20.7#820007)