Posted to issues@spark.apache.org by "Lantao Jin (JIRA)" <ji...@apache.org> on 2019/05/05 09:48:00 UTC
[jira] [Created] (SPARK-27635) Prevent splitting into too many partitions smaller than the row group size in Parquet file format
Lantao Jin created SPARK-27635:
----------------------------------
Summary: Prevent splitting into too many partitions smaller than the row group size in Parquet file format
Key: SPARK-27635
URL: https://issues.apache.org/jira/browse/SPARK-27635
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.4.2, 3.0.0
Reporter: Lantao Jin
The scenario is submitting multiple jobs concurrently with Spark dynamic allocation enabled. The issue arises when determining the number of RDD partitions. When more CPU cores become available, Spark tries to split the RDD into more pieces. But since the files are stored in Parquet format, the Parquet row group is the basic unit for reading data, so splitting the RDD into pieces smaller than a row group serves no purpose.
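To illustrate why more cores lead to smaller splits, here is a sketch of the split-size heuristic Spark applies to file-based sources (it mirrors the formula in Spark's `FilePartition.maxSplitBytes`; the function and variable names below are illustrative, not the actual API):

```python
# Sketch of Spark's split-size heuristic for file-based data sources.
# Defaults mirror spark.sql.files.maxPartitionBytes (128 MiB) and
# spark.sql.files.openCostInBytes (4 MiB); names are illustrative.

def max_split_bytes(total_bytes, num_files, default_parallelism,
                    max_partition_bytes=128 * 1024 * 1024,
                    open_cost_in_bytes=4 * 1024 * 1024):
    """Compute the target split size for scanning a set of files."""
    # Each file is padded with an "open cost" so that scanning many tiny
    # files is not treated as free.
    padded_bytes = total_bytes + num_files * open_cost_in_bytes
    # More available cores -> smaller bytes_per_core -> smaller splits.
    # This is how dynamic allocation handing out many cores ends up
    # producing splits far below a typical 128 MiB Parquet row group.
    bytes_per_core = padded_bytes // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))

# Example: a 1 GiB Parquet file scanned with 2000 available cores yields
# 4 MiB splits, far smaller than a typical row group.
split = max_split_bytes(1 << 30, 1, 2000)
print(split)  # 4194304
```

With only a handful of cores the same file would be read in 128 MiB splits, which is why the problem only surfaces when dynamic allocation scales the cluster up.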
Jobs launch too many partitions and never complete.
Setting the default parallelism to a fixed number (for example, 200) fixes this.
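One way to apply that workaround is to pin the relevant settings at submit time. The values below are examples only and should be tuned for the workload; the configuration keys are standard Spark SQL settings, but the job name is a placeholder:

```shell
# Pin parallelism so that extra cores from dynamic allocation do not
# shrink file splits below the Parquet row-group size.
# Values shown are examples, not recommendations.
spark-submit \
  --conf spark.default.parallelism=200 \
  --conf spark.sql.files.maxPartitionBytes=134217728 \
  --conf spark.sql.files.openCostInBytes=4194304 \
  my_job.py
```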
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org