Posted to commits@spark.apache.org by gu...@apache.org on 2020/04/07 12:57:50 UTC

[spark] 02/04: [SPARK-31295][DOC][FOLLOWUP] Supplement version for configuration appear in doc

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

commit 8af58ebd958813e9ff29d2f0d1b070d529ba1275
Author: beliefer <be...@163.com>
AuthorDate: Thu Apr 2 16:01:54 2020 +0900

    [SPARK-31295][DOC][FOLLOWUP] Supplement version for configuration appear in doc
    
    ### What changes were proposed in this pull request?
    This PR supplements the version information for configurations that appear in the docs.
    I sorted out the information shown below.
    
    **docs/sql-performance-tuning.md**
    Item name | Since version | JIRA ID | Commit ID | Note
    -- | -- | -- | -- | --
    spark.sql.inMemoryColumnarStorage.compressed | 1.0.1 | SPARK-2631 | 86534d0f5255362618c05a07b0171ec35c915822#diff-41ef65b9ef5b518f77e2a03559893f4d |  
    spark.sql.inMemoryColumnarStorage.batchSize | 1.1.1 | SPARK-2650 | 779d1eb26d0f031791e93c908d51a59c3b422a55#diff-41ef65b9ef5b518f77e2a03559893f4d |  
    spark.sql.files.maxPartitionBytes | 2.0.0 | SPARK-13664 | 17eec0a71ba8713c559d641e3f43a1be726b037c#diff-32bb9518401c0948c5ea19377b5069ab |  
    spark.sql.files.openCostInBytes | 2.0.0 | SPARK-14259 | 400b2f863ffaa01a34a8dae1541c61526fef908b#diff-32bb9518401c0948c5ea19377b5069ab |  
    spark.sql.broadcastTimeout | 1.3.0 | SPARK-4269 | fa66ef6c97e87c9255b67b03836a4ba50598ebae#diff-41ef65b9ef5b518f77e2a03559893f4d |  
    spark.sql.autoBroadcastJoinThreshold | 1.1.0 | SPARK-2393 | c7db274be79f448fda566208946cb50958ea9b1a#diff-41ef65b9ef5b518f77e2a03559893f4d |  
    spark.sql.shuffle.partitions | 1.1.0 | SPARK-1508 | 08ed9ad81397b71206c4dc903bfb94b6105691ed#diff-41ef65b9ef5b518f77e2a03559893f4d |  
    spark.sql.adaptive.coalescePartitions.enabled | 3.0.0 | SPARK-31037 | 46b7f1796bd0b96977ce9b473601033f397a3b18#diff-9a6b543db706f1a90f790783d6930a13 |  
    spark.sql.adaptive.coalescePartitions.minPartitionNum | 3.0.0 | SPARK-31037 | 46b7f1796bd0b96977ce9b473601033f397a3b18#diff-9a6b543db706f1a90f790783d6930a13 |  
    spark.sql.adaptive.coalescePartitions.initialPartitionNum | 3.0.0 | SPARK-31037 | 46b7f1796bd0b96977ce9b473601033f397a3b18#diff-9a6b543db706f1a90f790783d6930a13 |  
    spark.sql.adaptive.advisoryPartitionSizeInBytes | 3.0.0 | SPARK-31037 | 46b7f1796bd0b96977ce9b473601033f397a3b18#diff-9a6b543db706f1a90f790783d6930a13 |  
    spark.sql.adaptive.skewJoin.enabled | 3.0.0 | SPARK-31037 | 46b7f1796bd0b96977ce9b473601033f397a3b18#diff-9a6b543db706f1a90f790783d6930a13 |  
    spark.sql.adaptive.skewJoin.skewedPartitionFactor | 3.0.0 | SPARK-31037 | 46b7f1796bd0b96977ce9b473601033f397a3b18#diff-9a6b543db706f1a90f790783d6930a13 |  
    spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes | 3.0.0 | SPARK-31201 | 8d0800a0803d3c47938bddefa15328d654739bc5#diff-9a6b543db706f1a90f790783d6930a13 |  
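    
    For reference, the tuning page itself notes that these properties can be set with the `setConf` method on `SparkSession` or via a SQL `SET key=value` command. A minimal Scala sketch of that, assuming an already-running application (the app name and the values below are purely illustrative):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    // Hypothetical session; the application name is illustrative.
    val spark = SparkSession.builder()
      .appName("sql-performance-tuning-example")
      .getOrCreate()
    
    // Runtime SQL configurations from the table above, set per session.
    spark.conf.set("spark.sql.shuffle.partitions", 400)
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)
    
    // The same settings can be applied through SQL.
    spark.sql("SET spark.sql.adaptive.coalescePartitions.enabled=true")
    ```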
    
    **docs/sql-ref-ansi-compliance.md**
    Item name | Since version | JIRA ID | Commit ID | Note
    -- | -- | -- | -- | --
    spark.sql.ansi.enabled | 3.0.0 | SPARK-30125 | d9b30694122f8716d3acb448638ef1e2b96ebc7a#diff-9a6b543db706f1a90f790783d6930a13 |  
    spark.sql.storeAssignmentPolicy | 3.0.0 | SPARK-28730 | 895c90b582cc2b2667241f66d5b733852aeef9eb#diff-9a6b543db706f1a90f790783d6930a13 |
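    
    Similarly, a minimal Scala sketch of toggling the two ANSI-compliance properties above, assuming an active `SparkSession` named `spark` as in the sketch earlier and values chosen only for illustration:
    
    ```scala
    // `spark` is an active SparkSession, as in the previous sketch.
    spark.conf.set("spark.sql.ansi.enabled", true)             // disabled by default
    spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")  // the default policy
    
    // With spark.sql.ansi.enabled=true, an overflow in any operation on an
    // integral/decimal field raises a runtime exception
    // (per docs/sql-ref-ansi-compliance.md).
    ```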
    
    ### Why are the changes needed?
    To supplement the missing version information for the configurations documented in these pages.
    
    ### Does this PR introduce any user-facing change?
    'No'.
    
    ### How was this patch tested?
    Jenkins test
    
    Closes #28096 from beliefer/supplement-version-of-performance.
    
    Authored-by: beliefer <be...@163.com>
    Signed-off-by: HyukjinKwon <gu...@apache.org>
---
 docs/sql-performance-tuning.md  | 30 ++++++++++++++++++++++--------
 docs/sql-ref-ansi-compliance.md |  4 +++-
 2 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/docs/sql-performance-tuning.md b/docs/sql-performance-tuning.md
index 9a1cc89..279aad6 100644
--- a/docs/sql-performance-tuning.md
+++ b/docs/sql-performance-tuning.md
@@ -35,7 +35,7 @@ Configuration of in-memory caching can be done using the `setConf` method on `Sp
 `SET key=value` commands using SQL.
 
 <table class="table">
-<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+<tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr>
 <tr>
   <td><code>spark.sql.inMemoryColumnarStorage.compressed</code></td>
   <td>true</td>
@@ -43,6 +43,7 @@ Configuration of in-memory caching can be done using the `setConf` method on `Sp
     When set to true Spark SQL will automatically select a compression codec for each column based
     on statistics of the data.
   </td>
+  <td>1.0.1</td>
 </tr>
 <tr>
   <td><code>spark.sql.inMemoryColumnarStorage.batchSize</code></td>
@@ -51,6 +52,7 @@ Configuration of in-memory caching can be done using the `setConf` method on `Sp
     Controls the size of batches for columnar caching. Larger batch sizes can improve memory utilization
     and compression, but risk OOMs when caching data.
   </td>
+  <td>1.1.1</td>
 </tr>
 
 </table>
@@ -61,7 +63,7 @@ The following options can also be used to tune the performance of query executio
 that these options will be deprecated in future release as more optimizations are performed automatically.
 
 <table class="table">
-  <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+  <tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr>
   <tr>
     <td><code>spark.sql.files.maxPartitionBytes</code></td>
     <td>134217728 (128 MB)</td>
@@ -69,6 +71,7 @@ that these options will be deprecated in future release as more optimizations ar
       The maximum number of bytes to pack into a single partition when reading files.
       This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.
     </td>
+    <td>2.0.0</td>
   </tr>
   <tr>
     <td><code>spark.sql.files.openCostInBytes</code></td>
@@ -80,15 +83,17 @@ that these options will be deprecated in future release as more optimizations ar
       scheduled first). This configuration is effective only when using file-based sources such as Parquet,
       JSON and ORC.
     </td>
+    <td>2.0.0</td>
   </tr>
   <tr>
     <td><code>spark.sql.broadcastTimeout</code></td>
     <td>300</td>
     <td>
-    <p>
-      Timeout in seconds for the broadcast wait time in broadcast joins
-    </p>
+      <p>
+        Timeout in seconds for the broadcast wait time in broadcast joins
+      </p>
     </td>
+    <td>1.3.0</td>
   </tr>
   <tr>
     <td><code>spark.sql.autoBroadcastJoinThreshold</code></td>
@@ -99,6 +104,7 @@ that these options will be deprecated in future release as more optimizations ar
       statistics are only supported for Hive Metastore tables where the command
       <code>ANALYZE TABLE &lt;tableName&gt; COMPUTE STATISTICS noscan</code> has been run.
     </td>
+    <td>1.1.0</td>
   </tr>
   <tr>
     <td><code>spark.sql.shuffle.partitions</code></td>
@@ -106,6 +112,7 @@ that these options will be deprecated in future release as more optimizations ar
     <td>
       Configures the number of partitions to use when shuffling data for joins or aggregations.
     </td>
+    <td>1.1.0</td>
   </tr>
 </table>
 
@@ -193,13 +200,14 @@ Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that ma
 ### Coalescing Post Shuffle Partitions
 This feature coalesces the post shuffle partitions based on the map output statistics when both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configurations are true. This feature simplifies the tuning of shuffle partition number when running queries. You do not need to set a proper shuffle partition number to fit your dataset. Spark can pick the proper shuffle partition number at runtime once you set a large enough initial number of shuffle partitions  [...]
  <table class="table">
-   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr>
    <tr>
      <td><code>spark.sql.adaptive.coalescePartitions.enabled</code></td>
      <td>true</td>
      <td>
        When true and <code>spark.sql.adaptive.enabled</code> is true, Spark will coalesce contiguous shuffle partitions according to the target size (specified by <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code>), to avoid too many small tasks.
      </td>
+     <td>3.0.0</td>
    </tr>
    <tr>
      <td><code>spark.sql.adaptive.coalescePartitions.minPartitionNum</code></td>
@@ -207,6 +215,7 @@ This feature coalesces the post shuffle partitions based on the map output stati
      <td>
        The minimum number of shuffle partitions after coalescing. If not set, the default value is the default parallelism of the Spark cluster. This configuration only has an effect when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
      </td>
+     <td>3.0.0</td>
    </tr>
    <tr>
      <td><code>spark.sql.adaptive.coalescePartitions.initialPartitionNum</code></td>
@@ -214,6 +223,7 @@ This feature coalesces the post shuffle partitions based on the map output stati
      <td>
        The initial number of shuffle partitions before coalescing. By default it equals to <code>spark.sql.shuffle.partitions</code>. This configuration only has an effect when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
      </td>
+     <td>3.0.0</td>
    </tr>
    <tr>
      <td><code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code></td>
@@ -221,6 +231,7 @@ This feature coalesces the post shuffle partitions based on the map output stati
      <td>
        The advisory size in bytes of the shuffle partition during adaptive optimization (when <code>spark.sql.adaptive.enabled</code> is true). It takes effect when Spark coalesces small shuffle partitions or splits skewed shuffle partition.
      </td>
+     <td>3.0.0</td>
    </tr>
  </table>
  
@@ -230,13 +241,14 @@ AQE converts sort-merge join to broadcast hash join when the runtime statistics
 ### Optimizing Skew Join
 Data skew can severely downgrade the performance of join queries. This feature dynamically handles skew in sort-merge join by splitting (and replicating if needed) skewed tasks into roughly evenly sized tasks. It takes effect when both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.skewJoin.enabled` configurations are enabled.
   <table class="table">
-     <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+     <tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr>
      <tr>
        <td><code>spark.sql.adaptive.skewJoin.enabled</code></td>
        <td>true</td>
        <td>
          When true and <code>spark.sql.adaptive.enabled</code> is true, Spark dynamically handles skew in sort-merge join by splitting (and replicating if needed) skewed partitions.
        </td>
+       <td>3.0.0</td>
      </tr>
      <tr>
        <td><code>spark.sql.adaptive.skewJoin.skewedPartitionFactor</code></td>
@@ -244,6 +256,7 @@ Data skew can severely downgrade the performance of join queries. This feature d
        <td>
          A partition is considered as skewed if its size is larger than this factor multiplying the median partition size and also larger than <code>spark.sql.adaptive.skewedPartitionThresholdInBytes</code>.
        </td>
+       <td>3.0.0</td>
      </tr>
      <tr>
        <td><code>spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes</code></td>
@@ -251,5 +264,6 @@ Data skew can severely downgrade the performance of join queries. This feature d
        <td>
          A partition is considered as skewed if its size in bytes is larger than this threshold and also larger than <code>spark.sql.adaptive.skewJoin.skewedPartitionFactor</code> multiplying the median partition size. Ideally this config should be set larger than <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code>.
        </td>
+       <td>3.0.0</td>
      </tr>
-   </table>
\ No newline at end of file
+   </table>
diff --git a/docs/sql-ref-ansi-compliance.md b/docs/sql-ref-ansi-compliance.md
index bc5bde6..83affb9 100644
--- a/docs/sql-ref-ansi-compliance.md
+++ b/docs/sql-ref-ansi-compliance.md
@@ -28,7 +28,7 @@ The casting behaviours are defined as store assignment rules in the standard.
 When `spark.sql.storeAssignmentPolicy` is set to `ANSI`, Spark SQL complies with the ANSI store assignment rules. This is a separate configuration because its default value is `ANSI`, while the configuration `spark.sql.ansi.enabled` is disabled by default.
 
 <table class="table">
-<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+<tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr>
 <tr>
   <td><code>spark.sql.ansi.enabled</code></td>
   <td>false</td>
@@ -37,6 +37,7 @@ When `spark.sql.storeAssignmentPolicy` is set to `ANSI`, Spark SQL complies with
     1. Spark will throw a runtime exception if an overflow occurs in any operation on integral/decimal field.
     2. Spark will forbid using the reserved keywords of ANSI SQL as identifiers in the SQL parser.
   </td>
+  <td>3.0.0</td>
 </tr>
 <tr>
   <td><code>spark.sql.storeAssignmentPolicy</code></td>
@@ -52,6 +53,7 @@ When `spark.sql.storeAssignmentPolicy` is set to `ANSI`, Spark SQL complies with
     With strict policy, Spark doesn't allow any possible precision loss or data truncation in type coercion,
     e.g. converting double to int or decimal to double is not allowed.
   </td>
+  <td>3.0.0</td>
 </tr>
 </table>
 

