You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@spark.apache.org by gu...@apache.org on 2020/02/09 10:23:21 UTC
[spark] branch branch-3.0 updated: [SPARK-30510][SQL][DOCS]
Publicly document Spark SQL configuration options
This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.0 by this push:
new d586549 [SPARK-30510][SQL][DOCS] Publicly document Spark SQL configuration options
d586549 is described below
commit d5865493ae71e6369e9f3350dd7e694afcf57298
Author: Nicholas Chammas <ni...@liveramp.com>
AuthorDate: Sun Feb 9 19:20:47 2020 +0900
[SPARK-30510][SQL][DOCS] Publicly document Spark SQL configuration options
### What changes were proposed in this pull request?
This PR adds a doc builder for Spark SQL's configuration options.
Here's what the new Spark SQL config docs look like ([configuration.html.zip](https://github.com/apache/spark/files/4172109/configuration.html.zip)):
![Screen Shot 2020-02-07 at 12 13 23 PM](https://user-images.githubusercontent.com/1039369/74050007-425b5480-49a3-11ea-818c-42700c54d1fb.png)
Compare this to the [current docs](http://spark.apache.org/docs/3.0.0-preview2/configuration.html#spark-sql):
![Screen Shot 2020-02-04 at 4 55 10 PM](https://user-images.githubusercontent.com/1039369/73790828-24a5a980-476f-11ea-998c-12cd613883e8.png)
### Why are the changes needed?
There is no visibility into the various Spark SQL configs on [the config docs page](http://spark.apache.org/docs/3.0.0-preview2/configuration.html#spark-sql).
### Does this PR introduce any user-facing change?
No, apart from new documentation.
### How was this patch tested?
I tested this manually by building the docs and reviewing them in my browser.
Closes #27459 from nchammas/SPARK-30510-spark-sql-options.
Authored-by: Nicholas Chammas <ni...@liveramp.com>
Signed-off-by: HyukjinKwon <gu...@apache.org>
(cherry picked from commit 339c0f9a623521acd4d66292b3fe3e6c4ec3b108)
Signed-off-by: HyukjinKwon <gu...@apache.org>
---
docs/.gitignore | 1 +
docs/configuration.md | 46 ++------
sql/README.md | 2 +-
.../org/apache/spark/sql/internal/SQLConf.scala | 35 +++---
.../spark/sql/api/python/PythonSQLUtils.scala | 7 ++
sql/create-docs.sh | 14 +--
sql/{gen-sql-markdown.py => gen-sql-api-docs.py} | 8 +-
sql/gen-sql-config-docs.py | 117 +++++++++++++++++++++
8 files changed, 163 insertions(+), 67 deletions(-)
diff --git a/docs/.gitignore b/docs/.gitignore
new file mode 100644
index 0000000..2260493
--- /dev/null
+++ b/docs/.gitignore
@@ -0,0 +1 @@
+sql-configs.html
diff --git a/docs/configuration.md b/docs/configuration.md
index 5bd3f3e..1343755 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -2399,47 +2399,15 @@ the driver or executor, or, in the absence of that value, the number of cores av
Please refer to the [Security](security.html) page for available options on how to secure different
Spark subsystems.
-### Spark SQL
-
-Running the <code>SET -v</code> command will show the entire list of the SQL configuration.
-
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
-{% highlight scala %}
-// spark is an existing SparkSession
-spark.sql("SET -v").show(numRows = 200, truncate = false)
-{% endhighlight %}
-
-</div>
-
-<div data-lang="java" markdown="1">
-
-{% highlight java %}
-// spark is an existing SparkSession
-spark.sql("SET -v").show(200, false);
-{% endhighlight %}
-</div>
-
-<div data-lang="python" markdown="1">
-
-{% highlight python %}
-# spark is an existing SparkSession
-spark.sql("SET -v").show(n=200, truncate=False)
-{% endhighlight %}
-
-</div>
-
-<div data-lang="r" markdown="1">
-
-{% highlight r %}
-sparkR.session()
-properties <- sql("SET -v")
-showDF(properties, numRows = 200, truncate = FALSE)
-{% endhighlight %}
+{% for static_file in site.static_files %}
+ {% if static_file.name == 'sql-configs.html' %}
+### Spark SQL
-</div>
-</div>
+ {% include_relative sql-configs.html %}
+ {% break %}
+ {% endif %}
+{% endfor %}
### Spark Streaming
diff --git a/sql/README.md b/sql/README.md
index 67e3225..ae5ebd1 100644
--- a/sql/README.md
+++ b/sql/README.md
@@ -9,4 +9,4 @@ Spark SQL is broken up into four subprojects:
- Hive Support (sql/hive) - Includes extensions that allow users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
- HiveServer and CLI support (sql/hive-thriftserver) - Includes support for the SQL CLI (bin/spark-sql) and a HiveServer2 (for JDBC/ODBC) compatible server.
-Running `./sql/create-docs.sh` generates SQL documentation for built-in functions under `sql/site`.
+Running `./sql/create-docs.sh` generates SQL documentation for built-in functions under `sql/site`, and SQL configuration documentation that gets included as part of `configuration.md` in the main `docs` directory.
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index a72bd53..e38fe76 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -324,11 +324,11 @@ object SQLConf {
.doc("Configures the maximum size in bytes for a table that will be broadcast to all worker " +
"nodes when performing a join. By setting this value to -1 broadcasting can be disabled. " +
"Note that currently statistics are only supported for Hive Metastore tables where the " +
- "command <code>ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan</code> has been " +
+ "command `ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan` has been " +
"run, and file-based data source tables where the statistics are computed directly on " +
"the files of data.")
.bytesConf(ByteUnit.BYTE)
- .createWithDefault(10L * 1024 * 1024)
+ .createWithDefaultString("10MB")
val LIMIT_SCALE_UP_FACTOR = buildConf("spark.sql.limit.scaleUpFactor")
.internal()
@@ -402,7 +402,7 @@ object SQLConf {
s"an effect when '${ADAPTIVE_EXECUTION_ENABLED.key}' and " +
s"'${REDUCE_POST_SHUFFLE_PARTITIONS_ENABLED.key}' is enabled.")
.bytesConf(ByteUnit.BYTE)
- .createWithDefault(64 * 1024 * 1024)
+ .createWithDefaultString("64MB")
val SHUFFLE_MAX_NUM_POSTSHUFFLE_PARTITIONS =
buildConf("spark.sql.adaptive.shuffle.maxNumPostShufflePartitions")
@@ -436,7 +436,7 @@ object SQLConf {
.doc("Configures the minimum size in bytes for a partition that is considered as a skewed " +
"partition in adaptive skewed join.")
.bytesConf(ByteUnit.BYTE)
- .createWithDefault(64 * 1024 * 1024)
+ .createWithDefaultString("64MB")
val ADAPTIVE_EXECUTION_SKEWED_PARTITION_FACTOR =
buildConf("spark.sql.adaptive.optimizeSkewedJoin.skewedPartitionFactor")
@@ -770,7 +770,7 @@ object SQLConf {
val BROADCAST_TIMEOUT = buildConf("spark.sql.broadcastTimeout")
.doc("Timeout in seconds for the broadcast wait time in broadcast joins.")
.timeConf(TimeUnit.SECONDS)
- .createWithDefault(5 * 60)
+ .createWithDefaultString(s"${5 * 60}")
// This is only used for the thriftserver
val THRIFTSERVER_POOL = buildConf("spark.sql.thriftserver.scheduler.pool")
@@ -830,7 +830,7 @@ object SQLConf {
.createWithDefault(true)
val BUCKETING_MAX_BUCKETS = buildConf("spark.sql.sources.bucketing.maxBuckets")
- .doc("The maximum number of buckets allowed. Defaults to 100000")
+ .doc("The maximum number of buckets allowed.")
.intConf
.checkValue(_ > 0, "the value of spark.sql.sources.bucketing.maxBuckets must be greater than 0")
.createWithDefault(100000)
@@ -1022,7 +1022,7 @@ object SQLConf {
"This configuration is effective only when using file-based sources such as Parquet, JSON " +
"and ORC.")
.bytesConf(ByteUnit.BYTE)
- .createWithDefault(128 * 1024 * 1024) // parquet.block.size
+ .createWithDefaultString("128MB") // parquet.block.size
val FILES_OPEN_COST_IN_BYTES = buildConf("spark.sql.files.openCostInBytes")
.internal()
@@ -1161,7 +1161,8 @@ object SQLConf {
val VARIABLE_SUBSTITUTE_ENABLED =
buildConf("spark.sql.variable.substitute")
- .doc("This enables substitution using syntax like ${var} ${system:var} and ${env:var}.")
+ .doc("This enables substitution using syntax like `${var}`, `${system:var}`, " +
+ "and `${env:var}`.")
.booleanConf
.createWithDefault(true)
@@ -1171,7 +1172,7 @@ object SQLConf {
.doc("Enable two-level aggregate hash map. When enabled, records will first be " +
"inserted/looked-up at a 1st-level, small, fast map, and then fallback to a " +
"2nd-level, larger, slower map when 1st level is full or keys cannot be found. " +
- "When disabled, records go directly to the 2nd level. Defaults to true.")
+ "When disabled, records go directly to the 2nd level.")
.booleanConf
.createWithDefault(true)
@@ -1325,10 +1326,10 @@ object SQLConf {
val STREAMING_STOP_TIMEOUT =
buildConf("spark.sql.streaming.stopTimeout")
- .doc("How long to wait for the streaming execution thread to stop when calling the " +
- "streaming query's stop() method in milliseconds. 0 or negative values wait indefinitely.")
+ .doc("How long to wait in milliseconds for the streaming execution thread to stop when " +
+ "calling the streaming query's stop() method. 0 or negative values wait indefinitely.")
.timeConf(TimeUnit.MILLISECONDS)
- .createWithDefault(0L)
+ .createWithDefaultString("0")
val STREAMING_NO_DATA_PROGRESS_EVENT_INTERVAL =
buildConf("spark.sql.streaming.noDataProgressEventInterval")
@@ -1611,10 +1612,10 @@ object SQLConf {
val PANDAS_UDF_BUFFER_SIZE =
buildConf("spark.sql.pandas.udf.buffer.size")
.doc(
- s"Same as ${BUFFER_SIZE} but only applies to Pandas UDF executions. If it is not set, " +
- s"the fallback is ${BUFFER_SIZE}. Note that Pandas execution requires more than 4 bytes. " +
- "Lowering this value could make small Pandas UDF batch iterated and pipelined; however, " +
- "it might degrade performance. See SPARK-27870.")
+ s"Same as `${BUFFER_SIZE.key}` but only applies to Pandas UDF executions. If it is not " +
+ s"set, the fallback is `${BUFFER_SIZE.key}`. Note that Pandas execution requires more " +
+ "than 4 bytes. Lowering this value could make small Pandas UDF batch iterated and " +
+ "pipelined; however, it might degrade performance. See SPARK-27870.")
.fallbackConf(BUFFER_SIZE)
val PANDAS_GROUPED_MAP_ASSIGN_COLUMNS_BY_NAME =
@@ -2039,7 +2040,7 @@ object SQLConf {
.checkValue(i => i >= 0 && i <= ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH, "Invalid " +
"value for 'spark.sql.maxPlanStringLength'. Length must be a valid string length " +
"(nonnegative and shorter than the maximum size).")
- .createWithDefault(ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH)
+ .createWithDefaultString(s"${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}")
val SET_COMMAND_REJECTS_SPARK_CORE_CONFS =
buildConf("spark.sql.legacy.setCommandRejectsSparkCoreConfs")
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/api/python/PythonSQLUtils.scala b/sql/core/src/main/scala/org/apache/spark/sql/api/python/PythonSQLUtils.scala
index b232aa1..bf3055d 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/api/python/PythonSQLUtils.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/api/python/PythonSQLUtils.scala
@@ -29,6 +29,7 @@ import org.apache.spark.sql.catalyst.expressions.ExpressionInfo
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.execution.{ExplainMode, QueryExecution}
import org.apache.spark.sql.execution.arrow.ArrowConverters
+import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.DataType
private[sql] object PythonSQLUtils {
@@ -39,6 +40,12 @@ private[sql] object PythonSQLUtils {
FunctionRegistry.functionSet.flatMap(f => FunctionRegistry.builtin.lookupFunction(f)).toArray
}
+ def listSQLConfigs(): Array[(String, String, String)] = {
+ val conf = new SQLConf()
+ // Py4J doesn't seem to translate Seq well, so we convert to an Array.
+ conf.getAllDefinedConfs.toArray
+ }
+
/**
* Python callable function to read a file in Arrow stream format and create a [[RDD]]
* using each serialized ArrowRecordBatch as a partition.
diff --git a/sql/create-docs.sh b/sql/create-docs.sh
index 4353708..44aa877 100755
--- a/sql/create-docs.sh
+++ b/sql/create-docs.sh
@@ -17,7 +17,7 @@
# limitations under the License.
#
-# Script to create SQL API docs. This requires `mkdocs` and to build
+# Script to create SQL API and config docs. This requires `mkdocs` and to build
# Spark first. After running this script the html docs can be found in
# $SPARK_HOME/sql/site
@@ -39,14 +39,16 @@ fi
pushd "$FWDIR" > /dev/null
-# Now create the markdown file
rm -fr docs
mkdir docs
-echo "Generating markdown files for SQL documentation."
-"$SPARK_HOME/bin/spark-submit" gen-sql-markdown.py
-# Now create the HTML files
-echo "Generating HTML files for SQL documentation."
+echo "Generating SQL API Markdown files."
+"$SPARK_HOME/bin/spark-submit" gen-sql-api-docs.py
+
+echo "Generating SQL configuration table HTML file."
+"$SPARK_HOME/bin/spark-submit" gen-sql-config-docs.py
+
+echo "Generating HTML files for SQL API documentation."
mkdocs build --clean
rm -fr docs
diff --git a/sql/gen-sql-markdown.py b/sql/gen-sql-api-docs.py
similarity index 96%
rename from sql/gen-sql-markdown.py
rename to sql/gen-sql-api-docs.py
index e0529f8..4feee7a 100644
--- a/sql/gen-sql-markdown.py
+++ b/sql/gen-sql-api-docs.py
@@ -15,10 +15,11 @@
# limitations under the License.
#
-import sys
import os
from collections import namedtuple
+from pyspark.java_gateway import launch_gateway
+
ExpressionInfo = namedtuple(
"ExpressionInfo", "className name usage arguments examples note since deprecated")
@@ -219,8 +220,7 @@ def generate_sql_markdown(jvm, path):
if __name__ == "__main__":
- from pyspark.java_gateway import launch_gateway
-
jvm = launch_gateway().jvm
- markdown_file_path = "%s/docs/index.md" % os.path.dirname(sys.argv[0])
+ spark_root_dir = os.path.dirname(os.path.dirname(__file__))
+ markdown_file_path = os.path.join(spark_root_dir, "sql/docs/index.md")
generate_sql_markdown(jvm, markdown_file_path)
diff --git a/sql/gen-sql-config-docs.py b/sql/gen-sql-config-docs.py
new file mode 100644
index 0000000..04f5a85
--- /dev/null
+++ b/sql/gen-sql-config-docs.py
@@ -0,0 +1,117 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import os
+import re
+from collections import namedtuple
+from textwrap import dedent
+
+# To avoid adding a new direct dependency, we import markdown from within mkdocs.
+from mkdocs.structure.pages import markdown
+from pyspark.java_gateway import launch_gateway
+
+SQLConfEntry = namedtuple(
+ "SQLConfEntry", ["name", "default", "description"])
+
+
+def get_public_sql_configs(jvm):
+ sql_configs = [
+ SQLConfEntry(
+ name=_sql_config._1(),
+ default=_sql_config._2(),
+ description=_sql_config._3(),
+ )
+ for _sql_config in jvm.org.apache.spark.sql.api.python.PythonSQLUtils.listSQLConfigs()
+ ]
+ return sql_configs
+
+
+def generate_sql_configs_table(sql_configs, path):
+ """
+ Generates an HTML table at `path` that lists all public SQL
+ configuration options.
+
+ The table will look something like this:
+
+ ```html
+ <table class="table">
+ <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+
+ <tr>
+ <td><code>spark.sql.adaptive.enabled</code></td>
+ <td>false</td>
+ <td><p>When true, enable adaptive query execution.</p></td>
+ </tr>
+
+ ...
+
+ </table>
+ ```
+ """
+ value_reference_pattern = re.compile(r"^<value of (\S*)>$")
+
+ with open(path, 'w') as f:
+ f.write(dedent(
+ """
+ <table class="table">
+ <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+ """
+ ))
+ for config in sorted(sql_configs, key=lambda x: x.name):
+ if config.default == "<undefined>":
+ default = "(none)"
+ elif config.default.startswith("<value of "):
+ referenced_config_name = value_reference_pattern.match(config.default).group(1)
+ default = "(value of <code>{}</code>)".format(referenced_config_name)
+ else:
+ default = config.default
+
+ if default.startswith("<"):
+ raise Exception(
+ "Unhandled reference in SQL config docs. Config '{name}' "
+ "has default '{default}' that looks like an HTML tag."
+ .format(
+ name=config.name,
+ default=config.default,
+ )
+ )
+
+ f.write(dedent(
+ """
+ <tr>
+ <td><code>{name}</code></td>
+ <td>{default}</td>
+ <td>{description}</td>
+ </tr>
+ """
+ .format(
+ name=config.name,
+ default=default,
+ description=markdown.markdown(config.description),
+ )
+ ))
+ f.write("</table>\n")
+
+
+if __name__ == "__main__":
+ jvm = launch_gateway().jvm
+ sql_configs = get_public_sql_configs(jvm)
+
+ spark_root_dir = os.path.dirname(os.path.dirname(__file__))
+ sql_configs_table_path = os.path.join(spark_root_dir, "docs/sql-configs.html")
+
+ generate_sql_configs_table(sql_configs, path=sql_configs_table_path)
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org