Posted to issues@spark.apache.org by "Fengyu Cao (Jira)" <ji...@apache.org> on 2022/08/30 08:43:00 UTC

[jira] [Commented] (SPARK-39763) Executor memory footprint substantially increases while reading zstd compressed parquet files

    [ https://issues.apache.org/jira/browse/SPARK-39763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597685#comment-17597685 ] 

Fengyu Cao commented on SPARK-39763:
------------------------------------

We had the same problem.

 

One of our datasets is 75 GB as zstd parquet (134 GB as snappy):
{code:python}
# 10 executors
# Executor Reqs: memoryOverhead: [amount: 3072] cores: [amount: 4] memory: [amount: 10240] offHeap: [amount: 4096] Task Reqs: cpus: [amount: 1.0]

# read with spark.sql.parquet.enableVectorizedReader=false
df = spark.read.parquet("dataset_zstd")
df.write.mode("overwrite").format("noop").save()
{code}
Tasks failed with OOM, but with the snappy dataset everything is fine.
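
For context, a rough sketch of submit-time configuration matching the executor requirements quoted above (the application name is a placeholder, and in practice executor sizing would normally be passed to spark-submit rather than set on an already-running session):
{code:python}
from pyspark.sql import SparkSession

# Sketch of the resource profile from the comment above; the values mirror the
# "Executor Reqs" line. Executor sizing must be set before the session starts.
spark = (
    SparkSession.builder
    .appName("zstd-parquet-oom-repro")  # placeholder name
    .config("spark.executor.instances", "10")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "10240m")
    .config("spark.executor.memoryOverhead", "3072m")
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "4096m")
    .config("spark.sql.parquet.enableVectorizedReader", "false")
    .getOrCreate()
)
{code}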

 

> Executor memory footprint substantially increases while reading zstd compressed parquet files
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-39763
>                 URL: https://issues.apache.org/jira/browse/SPARK-39763
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.2.0
>            Reporter: Yeachan Park
>            Priority: Minor
>
> Hi all,
>  
> While transitioning from the default snappy compression to zstd, we noticed a substantial increase in executor memory whilst *reading* and applying transformations on *zstd* compressed parquet files.
> The memory footprint increased threefold in some cases, compared to reading and applying the same transformations on a parquet file compressed with snappy.
> This behaviour only occurs when reading zstd compressed parquet files. Writing a zstd parquet file does not result in this behaviour.
> To reproduce:
>  # Set "spark.sql.parquet.compression.codec" to zstd
>  # Write some parquet files; compression will default to zstd after setting the option above
>  # Read the compressed zstd file and run some transformations. Compare the memory usage of the executor vs running the same transformation on a parquet file with snappy compression.
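
A minimal sketch of these reproduction steps (the source path, output paths, and the grouping column are placeholders; an existing SparkSession is assumed):
{code:python}
# Write the same data twice, once per codec; paths and the column are placeholders.
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
spark.read.parquet("source_data").write.mode("overwrite").parquet("dataset_zstd")

spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
spark.read.parquet("source_data").write.mode("overwrite").parquet("dataset_snappy")

# Run the same transformation on both copies and compare executor memory
# usage (e.g. in the Spark UI) between the two runs.
for path in ["dataset_zstd", "dataset_snappy"]:
    df = spark.read.parquet(path)
    df.groupBy("some_column").count().write.mode("overwrite").format("noop").save()
{code}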


