Posted to reviews@spark.apache.org by "nchammas (via GitHub)" <gi...@apache.org> on 2023/12/11 21:16:22 UTC

[PR] Assign SQL configs to groups for use in documentation [spark]

nchammas opened a new pull request, #44300:
URL: https://github.com/apache/spark/pull/44300

   ### What changes were proposed in this pull request?
   
   Assign the various SQL configs to groups that can then be referenced in our documentation. This will replace the large number of manually maintained HTML tables with references to generated tables.
   
   For example, the SQL performance tuning page has a section on data caching that references two configs. They are in an HTML table that has to be [manually maintained][1].
   
   [1]: https://github.com/apache/spark/blob/7db85642600b1e3b39ca11e41d4e3e0bf1c8962b/docs/sql-performance-tuning.md#L37-L56
   
   With this new system, we can maintain the same documentation output, but instead reference a generated HTML table as follows:
   
   ```md
   {% include_relative generated-sql-config-table-caching-data.html %}
   ```
   
   And in `SQLConf.scala`, we can assign the configs we want to show up in this table to the `caching-data` group:
   
   ```scala
     val COMPRESS_CACHED = buildConf("spark.sql.inMemoryColumnarStorage.compressed")
       .doc("When set to true Spark SQL will automatically select a compression codec for each " +
         "column based on statistics of the data.")
       .version("1.0.1")
       .withTag("caching-data")
   ```
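
The grouping described above could be realized by a small generator script. Here is a hedged Python sketch (not the actual Spark doc tooling) of how tagged configs might be grouped and rendered into one HTML table per group, with a per-row `id` so each config can be linked directly; the function name and tuple layout are illustrative assumptions:

```python
from collections import defaultdict

def generate_config_tables(configs):
    """Group (key, default, doc, version, tags) tuples by tag and
    render one HTML table per group, keyed by output file name."""
    groups = defaultdict(list)
    for key, default, doc, version, tags in configs:
        for tag in tags:
            groups[tag].append((key, default, doc, version))
    tables = {}
    for tag, rows in groups.items():
        body = "\n".join(
            # An id per row gives each config a direct link anchor.
            f'<tr id="{key}"><td><code>{key}</code></td>'
            f"<td>{default}</td><td>{doc}</td><td>{version}</td></tr>"
            for key, default, doc, version in sorted(rows)
        )
        tables[f"generated-sql-config-table-{tag}.html"] = (
            "<table>\n"
            "<thead><tr><th>Property Name</th><th>Default</th>"
            "<th>Meaning</th><th>Since Version</th></tr></thead>\n"
            f"{body}\n</table>"
        )
    return tables

configs = [
    ("spark.sql.inMemoryColumnarStorage.compressed", "true",
     "Automatically selects a compression codec for each column.", "1.0.1",
     ["caching-data"]),
    ("spark.sql.inMemoryColumnarStorage.batchSize", "10000",
     "Controls the size of batches for columnar caching.", "1.1.1",
     ["caching-data"]),
]
tables = generate_config_tables(configs)
print(sorted(tables))
```

Each generated file can then be pulled into a docs page with `{% include_relative ... %}` as shown above.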
   
   This PR also adds anchors to each config in the generated HTML tables, so that people can link directly to specific configurations in the documentation.
   
   <img width="500" alt="Screenshot 2023-12-11 at 4 13 24 PM" src="https://github.com/apache/spark/assets/1039369/5be2743d-c579-4f23-8b84-0cddcb2b0c91">
   
   This PR builds on the work done in #27459, and is related to the work done in #28274.
   
   ### Why are the changes needed?
   
   To eliminate the need to manually maintain HTML tables for SQL configurations.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, it alters some of the user-facing documentation.
   
   ### How was this patch tested?
   
   Manually built the documentation and viewed it in my browser.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46395][DOCS] Assign Spark configs to groups for use in documentation [spark]

Posted by "bjornjorgensen (via GitHub)" <gi...@apache.org>.
bjornjorgensen commented on PR #44300:
URL: https://github.com/apache/spark/pull/44300#issuecomment-1894495683

   @nchammas there seem to be a number of PRs that have been approved but not yet merged to master, so I think it has something to do with the holidays and so on.
   
   This is not the first PR for fixing the docs. Would an email to @dev be a good idea here?
       






Re: [PR] [SPARK-46395][DOCS] Assign Spark configs to groups for use in documentation [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on PR #44300:
URL: https://github.com/apache/spark/pull/44300#issuecomment-1886262543

   @HyukjinKwon - How are you feeling about this PR? Shall I recruit another committer to take a look instead?
   
   I have the time and motivation to migrate all of our config documentation to use this new system. As I noted in a previous comment, there are around 55 config tables across our documentation that span several thousand lines of manually maintained HTML. They can all be eliminated with this approach. I think it would be a big win for the maintainability of our documentation.




Re: [PR] [SPARK-46395][DOCS] Assign Spark configs to groups for use in documentation [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on code in PR #44300:
URL: https://github.com/apache/spark/pull/44300#discussion_r1423147435


##########
sql/core/src/main/scala/org/apache/spark/sql/api/python/PythonSQLUtils.scala:
##########
@@ -72,6 +72,14 @@ private[sql] object PythonSQLUtils extends Logging {
     FunctionRegistry.functionSet.flatMap(f => FunctionRegistry.builtin.lookupFunction(f)).toArray
   }
 
+  def listAllSQLConfigsWithTags():
+    Array[(String, String, String, String, Array[String])] = {
+    val conf = new SQLConf()
+    conf.getAllDefinedConfsWithTags.map(c =>
+      Tuple5(c._1, c._2, c._3, c._4, c._5.toArray)
+    ).toArray
+  }
+
   private def listAllSQLConfigs(): Seq[(String, String, String, String)] = {

Review Comment:
   We can remove some of these other methods if we go with this PR's approach of assigning arbitrary tags to configs.
   
   That includes:
   - `listAllSQLConfigs`
   - `listRuntimeSQLConfigs`
   - `listStaticSQLConfigs`
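
On the Python side, a consumer of the tuples returned by `listAllSQLConfigsWithTags()` could filter configs down to a single documentation group instead of calling separate per-category listing methods. A minimal sketch, assuming each entry is `(key, default, doc, version, tags)`; the helper name and sample data are illustrative:

```python
def configs_with_tag(all_confs, tag):
    """Return the configs whose tag list contains the given group."""
    return [conf for conf in all_confs if tag in conf[4]]

all_confs = [
    ("spark.sql.inMemoryColumnarStorage.compressed", "true",
     "Selects a compression codec per column.", "1.0.1", ("caching-data",)),
    ("spark.sql.cbo.enabled", "false",
     "Enables cost-based optimization.", "2.2.0", ("cbo",)),
]

print([c[0] for c in configs_with_tag(all_confs, "caching-data")])
```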





Re: [PR] [SPARK-46395][DOCS] Assign Spark configs to groups for use in documentation [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas closed pull request #44300: [SPARK-46395][DOCS] Assign Spark configs to groups for use in documentation
URL: https://github.com/apache/spark/pull/44300




Re: [PR] [SPARK-46395][DOCS] Assign Spark configs to groups for use in documentation [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on PR #44300:
URL: https://github.com/apache/spark/pull/44300#issuecomment-1871661836

   Btw @HyukjinKwon if there is anything I can do to make this PR easier to review, I'm all ears. The key change is the introduction of `withTag` in `ConfigBuilder.scala`, and the rest of the PR flows from that.




Re: [PR] Assign SQL configs to groups for use in documentation [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on code in PR #44300:
URL: https://github.com/apache/spark/pull/44300#discussion_r1423145509


##########
core/src/main/scala/org/apache/spark/internal/config/ConfigBuilder.scala:
##########
@@ -197,6 +199,13 @@ private[spark] case class ConfigBuilder(key: String) {
     this
   }
 
+  def withTag(tag: String): ConfigBuilder = {

Review Comment:
   I am on the fence regarding this name. Maybe something more explicit like `withDocumentationGroup` would be better?





Re: [PR] [SPARK-46395][DOCS] Assign Spark configs to groups for use in documentation [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on PR #44300:
URL: https://github.com/apache/spark/pull/44300#issuecomment-1859284321

   Thank you, Hyukjin. (I hope when I edited [this comment][1] it didn't re-ping you. That wasn't my intention.)
   
   [1]: https://github.com/apache/spark/pull/44300#pullrequestreview-1776188809




Re: [PR] [SPARK-46395][SQL] Assign SQL configs to groups for use in documentation [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on PR #44300:
URL: https://github.com/apache/spark/pull/44300#issuecomment-1858916603

   There are about 55 HTML tables of Spark configs (both SQL and non-SQL) across our documentation. They probably span a few thousand lines of HTML. I believe they can all be replaced with automatically generated HTML tables.
   
   I intend to do that if this PR gets buy-in from committers.




Re: [PR] [SPARK-46395][DOCS] Assign Spark configs to groups for use in documentation [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on PR #44300:
URL: https://github.com/apache/spark/pull/44300#issuecomment-1894096892

   Silence on a PR usually means there is something wrong with it, so I am proactively trying to figure out how to move this idea forward.
   
   The possibilities I am considering are:
   1. Committers disapprove of the abstract idea of automating the generation of config tables.
   2. Committers approve of the abstract idea but disapprove of the implementation proposed in this PR.
   3. Committers approve of the abstract idea and may approve of the implementation proposed here, but find it difficult to review.
   4. Committers are constrained (due to time, energy, other priorities, etc.) and cannot participate at this time.
   
   In the case of 2, I have proposed a completely different approach in #44756 that does not touch Spark core.
   
   In the case of 3, I pushed #44755, which is a stripped-down version of this PR that should be easier to review.
   
   In the case of 1, I need some feedback on why this is a bad idea so I can drop this effort. As you can see, I believe it's a good idea and am eager to push it forward.
   
   In the case of 4, please forgive me for repeatedly pinging this PR and being annoying. 😅 




Re: [PR] [SPARK-46395][DOCS] Assign Spark configs to groups for use in documentation [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on code in PR #44300:
URL: https://github.com/apache/spark/pull/44300#discussion_r1423145979


##########
docs/README.md:
##########
@@ -46,8 +46,6 @@ $ cd docs
 $ bundle install
 ```
 
-Note: If you are on a system with both Ruby 1.9 and Ruby 2.0 you may need to replace gem with gem2.0.

Review Comment:
   Both Ruby 1 and 2 are [EOL](https://www.ruby-lang.org/en/news/2022/04/12/ruby-2-7-6-released/):
   
   > Ruby 2.7 reaches EOL and its official support ends by the end of the security maintenance phase. Therefore, we recommend that you start to plan upgrade to Ruby 3.0 or 3.1.





Re: [PR] Assign SQL configs to groups for use in documentation [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on code in PR #44300:
URL: https://github.com/apache/spark/pull/44300#discussion_r1423139009


##########
docs/sql-performance-tuning.md:
##########
@@ -34,30 +34,9 @@ memory usage and GC pressure. You can call `spark.catalog.uncacheTable("tableNam
 Configuration of in-memory caching can be done using the `setConf` method on `SparkSession` or by running
 `SET key=value` commands using SQL.
 
-<table>
-<thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
-<tr>
-  <td><code>spark.sql.inMemoryColumnarStorage.compressed</code></td>
-  <td>true</td>
-  <td>
-    When set to true, Spark SQL will automatically select a compression codec for each column based
-    on statistics of the data.
-  </td>
-  <td>1.0.1</td>
-</tr>
-<tr>
-  <td><code>spark.sql.inMemoryColumnarStorage.batchSize</code></td>
-  <td>10000</td>
-  <td>
-    Controls the size of batches for columnar caching. Larger batch sizes can improve memory utilization
-    and compression, but risk OOMs when caching data.
-  </td>
-  <td>1.1.1</td>
-</tr>
-
-</table>
+{% include_relative generated-sql-config-table-caching-data.html %}

Review Comment:
   This diff demonstrates the main benefit of this PR. Instead of needing to copy and maintain the full HTML table for these configs, we tag the ones we want to group together in `SQLConf.scala` and then reference that group's generated table here.









Re: [PR] Assign SQL configs to groups for use in documentation [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on code in PR #44300:
URL: https://github.com/apache/spark/pull/44300#discussion_r1423186667


##########
core/src/main/scala/org/apache/spark/internal/config/ConfigBuilder.scala:
##########
@@ -197,6 +199,13 @@ private[spark] case class ConfigBuilder(key: String) {
     this
   }
 
+  def withTag(tag: String): ConfigBuilder = {

Review Comment:
   We could also do away with custom group names and instead select certain prefixes for use as documentation groups, like `spark.sql.cbo`, `spark.sql.statistics`, etc.
   
   However, this won't work for groupings that don't align with config name prefixes, like the breakdown of runtime vs. static configurations that was added in #28274.
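
The prefix-based alternative sketched above could look like the following; the prefix list and helper name are purely illustrative. As the comment notes, a key-prefix scheme cannot express cross-cutting groups such as the runtime vs. static split, since those categories do not align with any key prefix:

```python
# Derive a documentation group from the config key itself.
DOC_GROUP_PREFIXES = ["spark.sql.cbo", "spark.sql.statistics"]

def group_for(key):
    """Return the first matching prefix group, or None if no prefix applies."""
    for prefix in DOC_GROUP_PREFIXES:
        if key.startswith(prefix):
            return prefix
    return None

print(group_for("spark.sql.cbo.enabled"))   # spark.sql.cbo
print(group_for("spark.sql.ansi.enabled"))  # None
```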





Re: [PR] [SPARK-46395][DOCS] Assign Spark configs to groups for use in documentation [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #44300:
URL: https://github.com/apache/spark/pull/44300#issuecomment-1859281306

   (will try to take a look around next week)

