Posted to commits@spark.apache.org by gu...@apache.org on 2023/03/02 00:19:18 UTC
[spark] branch master updated: [SPARK-42493][DOCS][PYTHON] Make Python the first tab for code examples - Spark SQL, DataFrames and Datasets Guide
This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new ef334ae6a68 [SPARK-42493][DOCS][PYTHON] Make Python the first tab for code examples - Spark SQL, DataFrames and Datasets Guide
ef334ae6a68 is described below
commit ef334ae6a6889bbfba8b6c7afeb71b1ca1df87eb
Author: Allan Folting <al...@databricks.com>
AuthorDate: Thu Mar 2 09:19:05 2023 +0900
[SPARK-42493][DOCS][PYTHON] Make Python the first tab for code examples - Spark SQL, DataFrames and Datasets Guide
### What changes were proposed in this pull request?
Making Python the first tab for code examples in the Spark SQL, DataFrames and Datasets Guide.
### Why are the changes needed?
Python is the most approachable and most popular language, so this change moves it to the first tab (shown by default) for code examples.
### Does this PR introduce _any_ user-facing change?
Yes, the user-facing Spark documentation is updated.
### How was this patch tested?
I built the website locally and manually tested the pages.
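For reference, the reordering pattern applied throughout both pages looks like this. It is a sketch only: `example_label` is a hypothetical stand-in for the real include labels used in the diff below (such as `generic_load_save_functions`), and the Jekyll `include_example` tag pulls the tagged snippet for each language from the examples source tree.

```html
<div class="codetabs">

<!-- Python moves to the first position, so it renders as the default tab -->
<div data-lang="python" markdown="1">
{% include_example example_label python/sql/datasource.py %}
</div>

<div data-lang="scala" markdown="1">
{% include_example example_label scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>

<div data-lang="java" markdown="1">
{% include_example example_label java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>

<div data-lang="r" markdown="1">
{% include_example example_label r/RSparkSQLExample.R %}
</div>

</div>
```

Only the order of the `data-lang` divs changes; the include labels and example file paths are untouched.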
Closes #40087 from allanf-db/spark_docs.
Authored-by: Allan Folting <al...@databricks.com>
Signed-off-by: Hyukjin Kwon <gu...@apache.org>
---
docs/sql-data-sources-load-save-functions.md | 96 +++++++++----------
docs/sql-getting-started.md | 135 ++++++++++++++-------------
2 files changed, 113 insertions(+), 118 deletions(-)
diff --git a/docs/sql-data-sources-load-save-functions.md b/docs/sql-data-sources-load-save-functions.md
index 25df34ef5b0..c6cf8054f5f 100644
--- a/docs/sql-data-sources-load-save-functions.md
+++ b/docs/sql-data-sources-load-save-functions.md
@@ -28,6 +28,11 @@ In the simplest form, the default data source (`parquet` unless otherwise config
<div class="codetabs">
+
+<div data-lang="python" markdown="1">
+{% include_example generic_load_save_functions python/sql/datasource.py %}
+</div>
+
<div data-lang="scala" markdown="1">
{% include_example generic_load_save_functions scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>
@@ -36,16 +41,10 @@ In the simplest form, the default data source (`parquet` unless otherwise config
{% include_example generic_load_save_functions java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>
-<div data-lang="python" markdown="1">
-
-{% include_example generic_load_save_functions python/sql/datasource.py %}
-</div>
-
<div data-lang="r" markdown="1">
-
{% include_example generic_load_save_functions r/RSparkSQLExample.R %}
-
</div>
+
</div>
### Manually Specifying Options
@@ -64,6 +63,11 @@ as well. For other formats, refer to the API documentation of the particular for
To load a JSON file you can use:
<div class="codetabs">
+
+<div data-lang="python" markdown="1">
+{% include_example manual_load_options python/sql/datasource.py %}
+</div>
+
<div data-lang="scala" markdown="1">
{% include_example manual_load_options scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>
@@ -72,18 +76,20 @@ To load a JSON file you can use:
{% include_example manual_load_options java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>
-<div data-lang="python" markdown="1">
-{% include_example manual_load_options python/sql/datasource.py %}
-</div>
-
<div data-lang="r" markdown="1">
{% include_example manual_load_options r/RSparkSQLExample.R %}
</div>
+
</div>
To load a CSV file you can use:
<div class="codetabs">
+
+<div data-lang="python" markdown="1">
+{% include_example manual_load_options_csv python/sql/datasource.py %}
+</div>
+
<div data-lang="scala" markdown="1">
{% include_example manual_load_options_csv scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>
@@ -92,14 +98,10 @@ To load a CSV file you can use:
{% include_example manual_load_options_csv java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>
-<div data-lang="python" markdown="1">
-{% include_example manual_load_options_csv python/sql/datasource.py %}
-</div>
-
<div data-lang="r" markdown="1">
{% include_example manual_load_options_csv r/RSparkSQLExample.R %}
-
</div>
+
</div>
The extra options are also used during write operation.
@@ -113,6 +115,10 @@ ORC data source:
<div class="codetabs">
+<div data-lang="python" markdown="1">
+{% include_example manual_save_options_orc python/sql/datasource.py %}
+</div>
+
<div data-lang="scala" markdown="1">
{% include_example manual_save_options_orc scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>
@@ -121,16 +127,11 @@ ORC data source:
{% include_example manual_save_options_orc java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>
-<div data-lang="python" markdown="1">
-{% include_example manual_save_options_orc python/sql/datasource.py %}
-</div>
-
<div data-lang="r" markdown="1">
{% include_example manual_save_options_orc r/RSparkSQLExample.R %}
</div>
<div data-lang="SQL" markdown="1">
-
{% highlight sql %}
CREATE TABLE users_with_options (
name STRING,
@@ -143,7 +144,6 @@ OPTIONS (
orc.column.encoding.direct 'name'
)
{% endhighlight %}
-
</div>
</div>
@@ -152,6 +152,10 @@ Parquet data source:
<div class="codetabs">
+<div data-lang="python" markdown="1">
+{% include_example manual_save_options_parquet python/sql/datasource.py %}
+</div>
+
<div data-lang="scala" markdown="1">
{% include_example manual_save_options_parquet scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>
@@ -160,16 +164,11 @@ Parquet data source:
{% include_example manual_save_options_parquet java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>
-<div data-lang="python" markdown="1">
-{% include_example manual_save_options_parquet python/sql/datasource.py %}
-</div>
-
<div data-lang="r" markdown="1">
{% include_example manual_save_options_parquet r/RSparkSQLExample.R %}
</div>
<div data-lang="SQL" markdown="1">
-
{% highlight sql %}
CREATE TABLE users_with_options (
name STRING,
@@ -183,7 +182,6 @@ OPTIONS (
parquet.page.write-checksum.enabled true
)
{% endhighlight %}
-
</div>
</div>
@@ -194,6 +192,11 @@ Instead of using read API to load a file into DataFrame and query it, you can al
file directly with SQL.
<div class="codetabs">
+
+<div data-lang="python" markdown="1">
+{% include_example direct_sql python/sql/datasource.py %}
+</div>
+
<div data-lang="scala" markdown="1">
{% include_example direct_sql scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>
@@ -202,14 +205,10 @@ file directly with SQL.
{% include_example direct_sql java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>
-<div data-lang="python" markdown="1">
-{% include_example direct_sql python/sql/datasource.py %}
-</div>
-
<div data-lang="r" markdown="1">
{% include_example direct_sql r/RSparkSQLExample.R %}
-
</div>
+
</div>
### Save Modes
@@ -287,6 +286,10 @@ Bucketing and sorting are applicable only to persistent tables:
<div class="codetabs">
+<div data-lang="python" markdown="1">
+{% include_example write_sorting_and_bucketing python/sql/datasource.py %}
+</div>
+
<div data-lang="scala" markdown="1">
{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>
@@ -295,12 +298,7 @@ Bucketing and sorting are applicable only to persistent tables:
{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>
-<div data-lang="python" markdown="1">
-{% include_example write_sorting_and_bucketing python/sql/datasource.py %}
-</div>
-
<div data-lang="SQL" markdown="1">
-
{% highlight sql %}
CREATE TABLE users_bucketed_by_name(
@@ -311,9 +309,9 @@ CREATE TABLE users_bucketed_by_name(
CLUSTERED BY(name) INTO 42 BUCKETS;
{% endhighlight %}
-
</div>
+
</div>
while partitioning can be used with both `save` and `saveAsTable` when using the Dataset APIs.
@@ -321,6 +319,10 @@ while partitioning can be used with both `save` and `saveAsTable` when using the
<div class="codetabs">
+<div data-lang="python" markdown="1">
+{% include_example write_partitioning python/sql/datasource.py %}
+</div>
+
<div data-lang="scala" markdown="1">
{% include_example write_partitioning scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>
@@ -329,12 +331,7 @@ while partitioning can be used with both `save` and `saveAsTable` when using the
{% include_example write_partitioning java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>
-<div data-lang="python" markdown="1">
-{% include_example write_partitioning python/sql/datasource.py %}
-</div>
-
<div data-lang="SQL" markdown="1">
-
{% highlight sql %}
CREATE TABLE users_by_favorite_color(
@@ -344,7 +341,6 @@ CREATE TABLE users_by_favorite_color(
) USING csv PARTITIONED BY(favorite_color);
{% endhighlight %}
-
</div>
</div>
@@ -353,6 +349,10 @@ It is possible to use both partitioning and bucketing for a single table:
<div class="codetabs">
+<div data-lang="python" markdown="1">
+{% include_example write_partition_and_bucket python/sql/datasource.py %}
+</div>
+
<div data-lang="scala" markdown="1">
{% include_example write_partition_and_bucket scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>
@@ -361,12 +361,7 @@ It is possible to use both partitioning and bucketing for a single table:
{% include_example write_partition_and_bucket java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>
-<div data-lang="python" markdown="1">
-{% include_example write_partition_and_bucket python/sql/datasource.py %}
-</div>
-
<div data-lang="SQL" markdown="1">
-
{% highlight sql %}
CREATE TABLE users_bucketed_and_partitioned(
@@ -378,7 +373,6 @@ PARTITIONED BY (favorite_color)
CLUSTERED BY(name) SORTED BY (favorite_numbers) INTO 42 BUCKETS;
{% endhighlight %}
-
</div>
</div>
diff --git a/docs/sql-getting-started.md b/docs/sql-getting-started.md
index 69396924e35..85da88a15c7 100644
--- a/docs/sql-getting-started.md
+++ b/docs/sql-getting-started.md
@@ -25,6 +25,13 @@ license: |
## Starting Point: SparkSession
<div class="codetabs">
+<div data-lang="python" markdown="1">
+
+The entry point into all functionality in Spark is the [`SparkSession`](api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.html) class. To create a basic `SparkSession`, just use `SparkSession.builder`:
+
+{% include_example init_session python/sql/basic.py %}
+</div>
+
<div data-lang="scala" markdown="1">
The entry point into all functionality in Spark is the [`SparkSession`](api/scala/org/apache/spark/sql/SparkSession.html) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:
@@ -39,13 +46,6 @@ The entry point into all functionality in Spark is the [`SparkSession`](api/java
{% include_example init_session java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>
-<div data-lang="python" markdown="1">
-
-The entry point into all functionality in Spark is the [`SparkSession`](api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.html) class. To create a basic `SparkSession`, just use `SparkSession.builder`:
-
-{% include_example init_session python/sql/basic.py %}
-</div>
-
<div data-lang="r" markdown="1">
The entry point into all functionality in Spark is the [`SparkSession`](api/R/reference/sparkR.session.html) class. To initialize a basic `SparkSession`, just call `sparkR.session()`:
@@ -63,31 +63,31 @@ To use these features, you do not need to have an existing Hive setup.
## Creating DataFrames
<div class="codetabs">
-<div data-lang="scala" markdown="1">
+<div data-lang="python" markdown="1">
With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
from a Hive table, or from [Spark data sources](sql-data-sources.html).
As an example, the following creates a DataFrame based on the content of a JSON file:
-{% include_example create_df scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
+{% include_example create_df python/sql/basic.py %}
</div>
-<div data-lang="java" markdown="1">
+<div data-lang="scala" markdown="1">
With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
from a Hive table, or from [Spark data sources](sql-data-sources.html).
As an example, the following creates a DataFrame based on the content of a JSON file:
-{% include_example create_df java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
+{% include_example create_df scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
</div>
-<div data-lang="python" markdown="1">
+<div data-lang="java" markdown="1">
With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
from a Hive table, or from [Spark data sources](sql-data-sources.html).
As an example, the following creates a DataFrame based on the content of a JSON file:
-{% include_example create_df python/sql/basic.py %}
+{% include_example create_df java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>
<div data-lang="r" markdown="1">
@@ -111,6 +111,21 @@ As mentioned above, in Spark 2.0, DataFrames are just Dataset of `Row`s in Scala
Here we include some basic examples of structured data processing using Datasets:
<div class="codetabs">
+
+<div data-lang="python" markdown="1">
+In Python, it's possible to access a DataFrame's columns either by attribute
+(`df.age`) or by indexing (`df['age']`). While the former is convenient for
+interactive data exploration, users are highly encouraged to use the
+latter form, which is future proof and won't break with column names that
+are also attributes on the DataFrame class.
+
+{% include_example untyped_ops python/sql/basic.py %}
+For a complete list of the types of operations that can be performed on a DataFrame refer to the [API Documentation](api/python/reference/pyspark.sql.html#dataframe-apis).
+
+In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/python/reference/pyspark.sql.html#functions).
+
+</div>
+
<div data-lang="scala" markdown="1">
{% include_example untyped_ops scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
@@ -128,20 +143,6 @@ For a complete list of the types of operations that can be performed on a Datase
In addition to simple column references and expressions, Datasets also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/java/org/apache/spark/sql/functions.html).
</div>
-<div data-lang="python" markdown="1">
-In Python, it's possible to access a DataFrame's columns either by attribute
-(`df.age`) or by indexing (`df['age']`). While the former is convenient for
-interactive data exploration, users are highly encouraged to use the
-latter form, which is future proof and won't break with column names that
-are also attributes on the DataFrame class.
-
-{% include_example untyped_ops python/sql/basic.py %}
-For a complete list of the types of operations that can be performed on a DataFrame refer to the [API Documentation](api/python/reference/pyspark.sql.html#dataframe-apis).
-
-In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/python/reference/pyspark.sql.html#functions).
-
-</div>
-
<div data-lang="r" markdown="1">
{% include_example untyped_ops r/RSparkSQLExample.R %}
@@ -157,6 +158,13 @@ In addition to simple column references and expressions, DataFrames also have a
## Running SQL Queries Programmatically
<div class="codetabs">
+
+<div data-lang="python" markdown="1">
+The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `DataFrame`.
+
+{% include_example run_sql python/sql/basic.py %}
+</div>
+
<div data-lang="scala" markdown="1">
The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `DataFrame`.
@@ -169,12 +177,6 @@ The `sql` function on a `SparkSession` enables applications to run SQL queries p
{% include_example run_sql java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>
-<div data-lang="python" markdown="1">
-The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `DataFrame`.
-
-{% include_example run_sql python/sql/basic.py %}
-</div>
-
<div data-lang="r" markdown="1">
The `sql` function enables applications to run SQL queries programmatically and returns the result as a `SparkDataFrame`.
@@ -193,6 +195,11 @@ view is tied to a system preserved database `global_temp`, and we must use the q
refer it, e.g. `SELECT * FROM global_temp.view1`.
<div class="codetabs">
+
+<div data-lang="python" markdown="1">
+{% include_example global_temp_view python/sql/basic.py %}
+</div>
+
<div data-lang="scala" markdown="1">
{% include_example global_temp_view scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
</div>
@@ -201,21 +208,14 @@ refer it, e.g. `SELECT * FROM global_temp.view1`.
{% include_example global_temp_view java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>
-<div data-lang="python" markdown="1">
-{% include_example global_temp_view python/sql/basic.py %}
-</div>
-
<div data-lang="SQL" markdown="1">
-
{% highlight sql %}
-
CREATE GLOBAL TEMPORARY VIEW temp_view AS SELECT a + 1, b * 2 FROM tbl
SELECT * FROM global_temp.temp_view
-
{% endhighlight %}
-
</div>
+
</div>
@@ -229,6 +229,7 @@ that allows Spark to perform many operations like filtering, sorting and hashing
the bytes back into an object.
<div class="codetabs">
+
<div data-lang="scala" markdown="1">
{% include_example create_ds scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
</div>
@@ -252,6 +253,15 @@ you to construct Datasets when the columns and their types are not known until r
### Inferring the Schema Using Reflection
<div class="codetabs">
+<div data-lang="python" markdown="1">
+
+Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the datatypes. Rows are constructed by passing a list of
+key/value pairs as kwargs to the Row class. The keys of this list define the column names of the table,
+and the types are inferred by sampling the whole dataset, similar to the inference that is performed on JSON files.
+
+{% include_example schema_inferring python/sql/basic.py %}
+</div>
+
<div data-lang="scala" markdown="1">
The Scala interface for Spark SQL supports automatically converting an RDD containing case classes
@@ -276,21 +286,29 @@ Serializable and has getters and setters for all of its fields.
{% include_example schema_inferring java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>
-<div data-lang="python" markdown="1">
-
-Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the datatypes. Rows are constructed by passing a list of
-key/value pairs as kwargs to the Row class. The keys of this list define the column names of the table,
-and the types are inferred by sampling the whole dataset, similar to the inference that is performed on JSON files.
-
-{% include_example schema_inferring python/sql/basic.py %}
-</div>
-
</div>
### Programmatically Specifying the Schema
<div class="codetabs">
+<div data-lang="python" markdown="1">
+
+When a dictionary of kwargs cannot be defined ahead of time (for example,
+the structure of records is encoded in a string, or a text dataset will be parsed and
+fields will be projected differently for different users),
+a `DataFrame` can be created programmatically with three steps.
+
+1. Create an RDD of tuples or lists from the original RDD;
+2. Create the schema represented by a `StructType` matching the structure of
+tuples or lists in the RDD created in the step 1.
+3. Apply the schema to the RDD via `createDataFrame` method provided by `SparkSession`.
+
+For example:
+
+{% include_example programmatic_schema python/sql/basic.py %}
+</div>
+
<div data-lang="scala" markdown="1">
When case classes cannot be defined ahead of time (for example,
@@ -327,23 +345,6 @@ For example:
{% include_example programmatic_schema java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>
-<div data-lang="python" markdown="1">
-
-When a dictionary of kwargs cannot be defined ahead of time (for example,
-the structure of records is encoded in a string, or a text dataset will be parsed and
-fields will be projected differently for different users),
-a `DataFrame` can be created programmatically with three steps.
-
-1. Create an RDD of tuples or lists from the original RDD;
-2. Create the schema represented by a `StructType` matching the structure of
-tuples or lists in the RDD created in the step 1.
-3. Apply the schema to the RDD via `createDataFrame` method provided by `SparkSession`.
-
-For example:
-
-{% include_example programmatic_schema python/sql/basic.py %}
-</div>
-
</div>
## Scalar Functions