Posted to reviews@spark.apache.org by "nchammas (via GitHub)" <gi...@apache.org> on 2024/03/04 15:59:18 UTC

[PR] [SPARK-47271][DOCS] Explain importance of statistics on SQL performance tuning page [spark]

nchammas opened a new pull request, #45374:
URL: https://github.com/apache/spark/pull/45374

   ### What changes were proposed in this pull request?
   
   Add a new section to the SQL performance tuning page on statistics.
   
   Tweak some existing documentation related to this new section.
   
   ### Why are the changes needed?
   
   Statistics are a basic and critical part of how Spark optimizes queries and achieves high performance. Yet, they receive no discussion in our performance tuning guide.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, new user-facing documentation.
   
   <img width="500" src="https://github.com/apache/spark/assets/1039369/621e1c85-482e-42a0-baad-44e0b11aa172" />
   
   ### How was this patch tested?
   
   Built and reviewed the docs locally.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47271][DOCS] Explain importance of statistics on SQL performance tuning page [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45374:
URL: https://github.com/apache/spark/pull/45374#discussion_r1511987787


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -582,11 +582,7 @@ object SQLConf {
 
   val AUTO_BROADCASTJOIN_THRESHOLD = buildConf("spark.sql.autoBroadcastJoinThreshold")
     .doc("Configures the maximum size in bytes for a table that will be broadcast to all worker " +
-      "nodes when performing a join.  By setting this value to -1 broadcasting can be disabled. " +
-      "Note that currently statistics are only supported for Hive Metastore tables where the " +

Review Comment:
   qq: why do we remove this?




Re: [PR] [SPARK-47271][DOCS] Explain importance of statistics on SQL performance tuning page [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon closed pull request #45374: [SPARK-47271][DOCS] Explain importance of statistics on SQL performance tuning page
URL: https://github.com/apache/spark/pull/45374



Re: [PR] [SPARK-47271][DOCS] Explain importance of statistics on SQL performance tuning page [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on code in PR #45374:
URL: https://github.com/apache/spark/pull/45374#discussion_r1512072983


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -582,11 +582,7 @@ object SQLConf {
 
   val AUTO_BROADCASTJOIN_THRESHOLD = buildConf("spark.sql.autoBroadcastJoinThreshold")
     .doc("Configures the maximum size in bytes for a table that will be broadcast to all worker " +
-      "nodes when performing a join.  By setting this value to -1 broadcasting can be disabled. " +
-      "Note that currently statistics are only supported for Hive Metastore tables where the " +

Review Comment:
   The comment `are only supported for Hive Metastore tables` seems inaccurate.




Re: [PR] [SPARK-47271][DOCS] Explain importance of statistics on SQL performance tuning page [spark]

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.

xinrong-meng commented on code in PR #45374:
URL: https://github.com/apache/spark/pull/45374#discussion_r1511667632


##########
docs/sql-performance-tuning.md:
##########
@@ -157,6 +157,18 @@ SELECT /*+ REBALANCE(3, c) */ * FROM t;
 
 For more details please refer to the documentation of [Partitioning Hints](sql-ref-syntax-qry-select-hints.html#partitioning-hints).
 
+## Leveraging Statistics
+Apache Spark's ability to choose the best execution plan among many possible options is determined in part by its estimates of how many rows will be output by every node in the execution plan (read, filter, join, etc.). Those estimates in turn are based on statistics that are made available to Spark in one of several ways:
+
+- **Data source**: Statistics that Spark reads directly from the underlying data source, like the counts and min/max values in the metadata of Parquet files. These statistics are maintained by the underlying data source.
+- **Catalog**: Statistics that Spark reads from the catalog, like the Hive Metastore. These statistics are collected or updated whenever you run [`ANALYZE TABLE`](sql-ref-syntax-aux-analyze-table.html).
+- **Runtime**: Statistics that Spark computes itself at runtime as a job is running.

Review Comment:
   I'm curious what statistics are considered runtime? Do we want to add examples here?




Re: [PR] [SPARK-47271][DOCS] Explain importance of statistics on SQL performance tuning page [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.

nchammas commented on code in PR #45374:
URL: https://github.com/apache/spark/pull/45374#discussion_r1513144146


##########
docs/sql-performance-tuning.md:
##########
@@ -157,6 +157,18 @@ SELECT /*+ REBALANCE(3, c) */ * FROM t;
 
 For more details please refer to the documentation of [Partitioning Hints](sql-ref-syntax-qry-select-hints.html#partitioning-hints).
 
+## Leveraging Statistics
+Apache Spark's ability to choose the best execution plan among many possible options is determined in part by its estimates of how many rows will be output by every node in the execution plan (read, filter, join, etc.). Those estimates in turn are based on statistics that are made available to Spark in one of several ways:
+
+- **Data source**: Statistics that Spark reads directly from the underlying data source, like the counts and min/max values in the metadata of Parquet files. These statistics are maintained by the underlying data source.
+- **Catalog**: Statistics that Spark reads from the catalog, like the Hive Metastore. These statistics are collected or updated whenever you run [`ANALYZE TABLE`](sql-ref-syntax-aux-analyze-table.html).
+- **Runtime**: Statistics that Spark computes itself at runtime as a job is running.

Review Comment:
   I've explained this a bit more here and a few lines below. I've also updated the screenshot in the PR description.




Re: [PR] [SPARK-47271][DOCS] Explain importance of statistics on SQL performance tuning page [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.

nchammas commented on code in PR #45374:
URL: https://github.com/apache/spark/pull/45374#discussion_r1514934182


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -582,11 +582,7 @@ object SQLConf {
 
   val AUTO_BROADCASTJOIN_THRESHOLD = buildConf("spark.sql.autoBroadcastJoinThreshold")
     .doc("Configures the maximum size in bytes for a table that will be broadcast to all worker " +
-      "nodes when performing a join.  By setting this value to -1 broadcasting can be disabled. " +
-      "Note that currently statistics are only supported for Hive Metastore tables where the " +

Review Comment:
   Yes, this comment was added around 9 years ago and is out of date.




Re: [PR] [SPARK-47271][DOCS] Explain importance of statistics on SQL performance tuning page [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.

nchammas commented on code in PR #45374:
URL: https://github.com/apache/spark/pull/45374#discussion_r1511715423


##########
docs/sql-performance-tuning.md:
##########
@@ -157,6 +157,18 @@ SELECT /*+ REBALANCE(3, c) */ * FROM t;
 
 For more details please refer to the documentation of [Partitioning Hints](sql-ref-syntax-qry-select-hints.html#partitioning-hints).
 
+## Leveraging Statistics
+Apache Spark's ability to choose the best execution plan among many possible options is determined in part by its estimates of how many rows will be output by every node in the execution plan (read, filter, join, etc.). Those estimates in turn are based on statistics that are made available to Spark in one of several ways:
+
+- **Data source**: Statistics that Spark reads directly from the underlying data source, like the counts and min/max values in the metadata of Parquet files. These statistics are maintained by the underlying data source.
+- **Catalog**: Statistics that Spark reads from the catalog, like the Hive Metastore. These statistics are collected or updated whenever you run [`ANALYZE TABLE`](sql-ref-syntax-aux-analyze-table.html).
+- **Runtime**: Statistics that Spark computes itself at runtime as a job is running.

Review Comment:
   I believe this concept is specific to adaptive query execution. It was [introduced here][1]. I'll clarify this point accordingly.
   
   [1]: https://github.com/apache/spark/pull/32391/files#diff-c42595588570b41aeb8e33e98553522a53cd46751abfc98b1af3495cf4f5bc15R50-R51




Re: [PR] [SPARK-47271][DOCS] Explain importance of statistics on SQL performance tuning page [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.

nchammas commented on code in PR #45374:
URL: https://github.com/apache/spark/pull/45374#discussion_r1512030136


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -582,11 +582,7 @@ object SQLConf {
 
   val AUTO_BROADCASTJOIN_THRESHOLD = buildConf("spark.sql.autoBroadcastJoinThreshold")
     .doc("Configures the maximum size in bytes for a table that will be broadcast to all worker " +
-      "nodes when performing a join.  By setting this value to -1 broadcasting can be disabled. " +
-      "Note that currently statistics are only supported for Hive Metastore tables where the " +

Review Comment:
   Fair question. I removed it because I don't think it explains anything.
   
   Across all of Spark, statistics come from one of the three sources I described in this PR: data source, catalog, and runtime. And this applies to all cost-based optimizations, not just to auto-broadcast. Isn't that so?
   
   So I thought it would be better to remove this comment since it indirectly suggests that there is something special about auto-broadcast and statistics, when that isn't the case.
   
   But I confess I am concluding this based on a high-level understanding of the optimizer. I didn't dig into the details of this particular optimization to see if there is anything really special about it.
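
To make the relationship between a size estimate and the broadcast threshold concrete, here is a minimal sketch in plain Python. It is not Spark's implementation: `TableStats` and `can_broadcast` are hypothetical names, and treating a missing estimate as "do not broadcast" is an assumption of this sketch. It only illustrates the behavior described in the config doc quoted above: a table is eligible for broadcast when its estimated size is within the threshold, and a threshold of -1 disables broadcasting, regardless of which of the three sources produced the estimate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TableStats:
    """Hypothetical container for a size estimate (not a Spark class)."""
    source: str                   # "data source", "catalog", or "runtime"
    size_in_bytes: Optional[int]  # None when no estimate is available


def can_broadcast(stats: TableStats, threshold_bytes: int) -> bool:
    """Return True if the estimated size allows a broadcast join."""
    if threshold_bytes < 0:
        # A threshold of -1 disables broadcasting.
        return False
    if stats.size_in_bytes is None:
        # Assumption of this sketch: without an estimate, do not broadcast.
        return False
    return 0 <= stats.size_in_bytes <= threshold_bytes


threshold = 10 * 1024 * 1024  # example threshold of 10 MiB
small = TableStats("catalog", 2 * 1024 * 1024)
large = TableStats("data source", 512 * 1024 * 1024)

print(can_broadcast(small, threshold))  # True
print(can_broadcast(large, threshold))  # False
print(can_broadcast(small, -1))         # False: broadcasting disabled
```

The decision function never inspects `source`, which mirrors the argument in the comment: the threshold check depends on the estimate itself, not on where the statistics came from.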




Re: [PR] [SPARK-47271][DOCS] Explain importance of statistics on SQL performance tuning page [spark]

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.

xinrong-meng commented on code in PR #45374:
URL: https://github.com/apache/spark/pull/45374#discussion_r1513548381


##########
docs/sql-performance-tuning.md:
##########
@@ -157,6 +157,18 @@ SELECT /*+ REBALANCE(3, c) */ * FROM t;
 
 For more details please refer to the documentation of [Partitioning Hints](sql-ref-syntax-qry-select-hints.html#partitioning-hints).
 
+## Leveraging Statistics
+Apache Spark's ability to choose the best execution plan among many possible options is determined in part by its estimates of how many rows will be output by every node in the execution plan (read, filter, join, etc.). Those estimates in turn are based on statistics that are made available to Spark in one of several ways:
+
+- **Data source**: Statistics that Spark reads directly from the underlying data source, like the counts and min/max values in the metadata of Parquet files. These statistics are maintained by the underlying data source.
+- **Catalog**: Statistics that Spark reads from the catalog, like the Hive Metastore. These statistics are collected or updated whenever you run [`ANALYZE TABLE`](sql-ref-syntax-aux-analyze-table.html).
+- **Runtime**: Statistics that Spark computes itself at runtime as a job is running.

Review Comment:
   Thank you @nchammas, that's very helpful!




Re: [PR] [SPARK-47271][DOCS] Explain importance of statistics on SQL performance tuning page [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on PR #45374:
URL: https://github.com/apache/spark/pull/45374#issuecomment-1982061440

   Merged to master.

