You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@spark.apache.org by gu...@apache.org on 2020/08/21 07:53:33 UTC
[spark] branch branch-3.0 updated: [SPARK-32674][DOC] Add
suggestion for parallel directory listing in tuning doc
This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.0 by this push:
new a5f4230 [SPARK-32674][DOC] Add suggestion for parallel directory listing in tuning doc
a5f4230 is described below
commit a5f4230f619673da882da01b5d6acd9ba20a79e3
Author: Chao Sun <su...@apache.org>
AuthorDate: Fri Aug 21 16:48:54 2020 +0900
[SPARK-32674][DOC] Add suggestion for parallel directory listing in tuning doc
### What changes were proposed in this pull request?
This adds some tuning guide for increasing parallelism of directory listing.
### Why are the changes needed?
Sometimes when job input has large number of directories, the listing can become a bottleneck. There are a few parameters to tune this. This adds some info to Spark tuning guide to make the knowledge better shared.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
Closes #29498 from sunchao/SPARK-32674.
Authored-by: Chao Sun <su...@apache.org>
Signed-off-by: HyukjinKwon <gu...@apache.org>
(cherry picked from commit bf221debd02b11003b092201d0326302196e4ba5)
Signed-off-by: HyukjinKwon <gu...@apache.org>
---
docs/sql-performance-tuning.md | 22 ++++++++++++++++++++++
docs/tuning.md | 11 +++++++++++
2 files changed, 33 insertions(+)
diff --git a/docs/sql-performance-tuning.md b/docs/sql-performance-tuning.md
index 5e6f049..5d8c3b6 100644
--- a/docs/sql-performance-tuning.md
+++ b/docs/sql-performance-tuning.md
@@ -114,6 +114,28 @@ that these options will be deprecated in future release as more optimizations ar
</td>
<td>1.1.0</td>
</tr>
+ <tr>
+ <td><code>spark.sql.sources.parallelPartitionDiscovery.threshold</code></td>
+ <td>32</td>
+ <td>
+ Configures the threshold to enable parallel listing for job input paths. If the number of
+ input paths is larger than this threshold, Spark will list the files by using Spark distributed job.
+ Otherwise, it will fallback to sequential listing. This configuration is only effective when
+ using file-based data sources such as Parquet, ORC and JSON.
+ </td>
+ <td>1.5.0</td>
+ </tr>
+ <tr>
+ <td><code>spark.sql.sources.parallelPartitionDiscovery.parallelism</code></td>
+ <td>10000</td>
+ <td>
+ Configures the maximum listing parallelism for job input paths. In case the number of input
+ paths is larger than this value, it will be throttled down to use this value. Same as above,
+ this configuration is only effective when using file-based data sources such as Parquet, ORC
+ and JSON.
+ </td>
+ <td>2.1.1</td>
+ </tr>
</table>
## Join Strategy Hints for SQL Queries
diff --git a/docs/tuning.md b/docs/tuning.md
index 8e29e5d..18d4a62 100644
--- a/docs/tuning.md
+++ b/docs/tuning.md
@@ -264,6 +264,17 @@ parent RDD's number of partitions. You can pass the level of parallelism as a se
or set the config property `spark.default.parallelism` to change the default.
In general, we recommend 2-3 tasks per CPU core in your cluster.
+## Parallel Listing on Input Paths
+
+Sometimes you may also need to increase directory listing parallelism when job input has large number of directories,
+otherwise the process could take a very long time, especially when against object store like S3.
+If your job works on RDD with Hadoop input formats (e.g., via `SparkContext.sequenceFile`), the parallelism is
+controlled via [`spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads`](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml) (currently default is 1).
+
+For Spark SQL with file-based data sources, you can tune `spark.sql.sources.parallelPartitionDiscovery.threshold` and
+`spark.sql.sources.parallelPartitionDiscovery.parallelism` to improve listing parallelism. Please
+refer to [Spark SQL performance tuning guide](sql-performance-tuning.html) for more details.
+
## Memory Usage of Reduce Tasks
Sometimes, you will get an OutOfMemoryError not because your RDDs don't fit in memory, but because the
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org