Posted to commits@spark.apache.org by sr...@apache.org on 2022/11/07 00:04:22 UTC
[spark] branch master updated: [MINOR][DOC] revisions for spark sql performance tuning to improve readability and grammar
This is an automated email from the ASF dual-hosted git repository.
srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new c4d159a368d [MINOR][DOC] revisions for spark sql performance tuning to improve readability and grammar
c4d159a368d is described below
commit c4d159a368d554a8567271dbfec8f291d1de70a5
Author: Dustin William Smith <du...@deliveryhero.com>
AuthorDate: Sun Nov 6 18:04:10 2022 -0600
[MINOR][DOC] revisions for spark sql performance tuning to improve readability and grammar
### What changes were proposed in this pull request?
I made some small grammar fixes related to dependent clauses followed by independent clauses, sentences starting with an introductory phrase, subject-verb agreement when "are" is present in the sentence, and other small changes to improve readability.
https://spark.apache.org/docs/latest/sql-performance-tuning.html
<img width="1065" alt="Screenshot 2022-11-04 at 15 24 17" src="https://user-images.githubusercontent.com/7563201/199998862-d9418bc1-2fcd-4eff-be8e-af412add6946.png">
### Why are the changes needed?
These changes improve the readability of the Spark documentation for new users and anyone brushing up.
### Does this PR introduce _any_ user-facing change?
Yes, these changes affect the Spark documentation.
### How was this patch tested?
No tests were created, as these changes were solely to Markdown documentation.
Closes #38510 from dwsmith1983/minor-doc-revisions.
Lead-authored-by: Dustin William Smith <du...@deliveryhero.com>
Co-authored-by: dustin <dw...@users.noreply.github.com>
Co-authored-by: Dustin Smith <Du...@gmail.com>
Signed-off-by: Sean Owen <sr...@gmail.com>
---
docs/sql-performance-tuning.md | 21 ++++++++++-----------
1 file changed, 10 insertions(+), 11 deletions(-)
diff --git a/docs/sql-performance-tuning.md b/docs/sql-performance-tuning.md
index d736ff8f83f..6ac39d90527 100644
--- a/docs/sql-performance-tuning.md
+++ b/docs/sql-performance-tuning.md
@@ -40,7 +40,7 @@ Configuration of in-memory caching can be done using the `setConf` method on `Sp
<td><code>spark.sql.inMemoryColumnarStorage.compressed</code></td>
<td>true</td>
<td>
- When set to true Spark SQL will automatically select a compression codec for each column based
+ When set to true, Spark SQL will automatically select a compression codec for each column based
on statistics of the data.
</td>
<td>1.0.1</td>
@@ -77,8 +77,8 @@ that these options will be deprecated in future release as more optimizations ar
<td><code>spark.sql.files.openCostInBytes</code></td>
<td>4194304 (4 MB)</td>
<td>
- The estimated cost to open a file, measured by the number of bytes could be scanned in the same
- time. This is used when putting multiple files into a partition. It is better to over-estimated,
+ The estimated cost to open a file, measured by the number of bytes that could be scanned in the same
+ time. This is used when putting multiple files into a partition. It is better to over-estimate,
then the partitions with small files will be faster than partitions with bigger files (which is
scheduled first). This configuration is effective only when using file-based sources such as Parquet,
JSON and ORC.
@@ -110,7 +110,7 @@ that these options will be deprecated in future release as more optimizations ar
<td>10485760 (10 MB)</td>
<td>
Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when
- performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently
+ performing a join. By setting this value to -1, broadcasting can be disabled. Note that currently
statistics are only supported for Hive Metastore tables where the command
<code>ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan</code> has been run.
</td>
@@ -140,8 +140,7 @@ that these options will be deprecated in future release as more optimizations ar
<td>10000</td>
<td>
Configures the maximum listing parallelism for job input paths. In case the number of input
- paths is larger than this value, it will be throttled down to use this value. Same as above,
- this configuration is only effective when using file-based data sources such as Parquet, ORC
+ paths is larger than this value, it will be throttled down to use this value. This configuration is only effective when using file-based data sources such as Parquet, ORC
and JSON.
</td>
<td>2.1.1</td>
@@ -215,8 +214,8 @@ For more details please refer to the documentation of [Join Hints](sql-ref-synta
## Coalesce Hints for SQL Queries
-Coalesce hints allows the Spark SQL users to control the number of output files just like the
-`coalesce`, `repartition` and `repartitionByRange` in Dataset API, they can be used for performance
+Coalesce hints allow Spark SQL users to control the number of output files just like
+`coalesce`, `repartition` and `repartitionByRange` in the Dataset API. They can be used for performance
tuning and reducing the number of output files. The "COALESCE" hint only has a partition number as a
parameter. The "REPARTITION" hint has a partition number, columns, or both/neither of them as parameters.
The "REPARTITION_BY_RANGE" hint must have column names and a partition number is optional. The "REBALANCE"
@@ -295,7 +294,7 @@ AQE converts sort-merge join to broadcast hash join when the runtime statistics
<td><code>spark.sql.adaptive.autoBroadcastJoinThreshold</code></td>
<td>(none)</td>
<td>
- Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. The default value is same with <code>spark.sql.autoBroadcastJoinThreshold</code>. Note that, this config is used only in adaptive framework.
+ Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1, broadcasting can be disabled. The default value is the same as <code>spark.sql.autoBroadcastJoinThreshold</code>. Note that this config is used only in the adaptive framework.
</td>
<td>3.2.0</td>
</tr>
@@ -309,7 +308,7 @@ AQE converts sort-merge join to shuffled hash join when all post shuffle partiti
<td><code>spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold</code></td>
<td>0</td>
<td>
- Configures the maximum size in bytes per partition that can be allowed to build local hash map. If this value is not smaller than <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code> and all the partition size are not larger than this config, join selection prefer to use shuffled hash join instead of sort merge join regardless of the value of <code>spark.sql.join.preferSortMergeJoin</code>.
+ Configures the maximum size in bytes per partition that can be allowed to build a local hash map. If this value is not smaller than <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code> and all the partition sizes are not larger than this config, join selection prefers to use shuffled hash join instead of sort merge join regardless of the value of <code>spark.sql.join.preferSortMergeJoin</code>.
</td>
<td>3.2.0</td>
</tr>
@@ -339,7 +338,7 @@ Data skew can severely downgrade the performance of join queries. This feature d
<td><code>spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes</code></td>
<td>256MB</td>
<td>
- A partition is considered as skewed if its size in bytes is larger than this threshold and also larger than <code>spark.sql.adaptive.skewJoin.skewedPartitionFactor</code> multiplying the median partition size. Ideally this config should be set larger than <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code>.
+ A partition is considered as skewed if its size in bytes is larger than this threshold and also larger than <code>spark.sql.adaptive.skewJoin.skewedPartitionFactor</code> multiplying the median partition size. Ideally, this config should be set larger than <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code>.
</td>
<td>3.0.0</td>
</tr>
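For readers following along, the coalesce and repartition hints covered by the edited section are written inline as SQL comments. A minimal sketch (the table name `t` is hypothetical):

```
-- COALESCE takes only a partition number; reduces output files without a full shuffle
SELECT /*+ COALESCE(2) */ * FROM t;

-- REPARTITION takes a partition number, columns, or both
SELECT /*+ REPARTITION(100, c) */ * FROM t;

-- REPARTITION_BY_RANGE requires column names; the partition number is optional
SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t;
```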
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org