Posted to commits@spark.apache.org by gu...@apache.org on 2023/03/02 00:19:18 UTC

[spark] branch master updated: [SPARK-42493][DOCS][PYTHON] Make Python the first tab for code examples - Spark SQL, DataFrames and Datasets Guide

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new ef334ae6a68 [SPARK-42493][DOCS][PYTHON] Make Python the first tab for code examples - Spark SQL, DataFrames and Datasets Guide
ef334ae6a68 is described below

commit ef334ae6a6889bbfba8b6c7afeb71b1ca1df87eb
Author: Allan Folting <al...@databricks.com>
AuthorDate: Thu Mar 2 09:19:05 2023 +0900

    [SPARK-42493][DOCS][PYTHON] Make Python the first tab for code examples - Spark SQL, DataFrames and Datasets Guide
    
    ### What changes were proposed in this pull request?
    Making Python the first tab for code examples in the Spark SQL, DataFrames and Datasets Guide.
    
    ### Why are the changes needed?
    Python is the most approachable and most popular language, so this change moves it to the first tab (shown by default) for code examples.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, the user-facing Spark documentation is updated.
    
    ### How was this patch tested?
    I built the website locally and manually tested the pages.
    
    Closes #40087 from allanf-db/spark_docs.
    
    Authored-by: Allan Folting <al...@databricks.com>
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
---
 docs/sql-data-sources-load-save-functions.md |  96 +++++++++----------
 docs/sql-getting-started.md                  | 135 ++++++++++++++-------------
 2 files changed, 113 insertions(+), 118 deletions(-)

diff --git a/docs/sql-data-sources-load-save-functions.md b/docs/sql-data-sources-load-save-functions.md
index 25df34ef5b0..c6cf8054f5f 100644
--- a/docs/sql-data-sources-load-save-functions.md
+++ b/docs/sql-data-sources-load-save-functions.md
@@ -28,6 +28,11 @@ In the simplest form, the default data source (`parquet` unless otherwise config
 
 
 <div class="codetabs">
+
+<div data-lang="python"  markdown="1">
+{% include_example generic_load_save_functions python/sql/datasource.py %}
+</div>
+
 <div data-lang="scala"  markdown="1">
 {% include_example generic_load_save_functions scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
 </div>
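
For orientation, a minimal PySpark sketch of the kind of generic load/save call this tab documents, assuming an active `SparkSession` named `spark`; the file paths and column names are illustrative, not taken from the committed example file:

{% highlight python %}
# With no format given, the default data source (parquet unless configured otherwise) is used.
df = spark.read.load("examples/users.parquet")
df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
{% endhighlight %}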
@@ -36,16 +41,10 @@ In the simplest form, the default data source (`parquet` unless otherwise config
 {% include_example generic_load_save_functions java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
 </div>
 
-<div data-lang="python"  markdown="1">
-
-{% include_example generic_load_save_functions python/sql/datasource.py %}
-</div>
-
 <div data-lang="r"  markdown="1">
-
 {% include_example generic_load_save_functions r/RSparkSQLExample.R %}
-
 </div>
+
 </div>
 
 ### Manually Specifying Options
@@ -64,6 +63,11 @@ as well. For other formats, refer to the API documentation of the particular for
 To load a JSON file you can use:
 
 <div class="codetabs">
+
+<div data-lang="python"  markdown="1">
+{% include_example manual_load_options python/sql/datasource.py %}
+</div>
+
 <div data-lang="scala"  markdown="1">
 {% include_example manual_load_options scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
 </div>
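
A minimal sketch of manually specifying the JSON format in PySpark, again assuming an active `SparkSession` `spark` and an illustrative path:

{% highlight python %}
# Explicitly pass the data source format instead of relying on the default.
df = spark.read.load("examples/people.json", format="json")
{% endhighlight %}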
@@ -72,18 +76,20 @@ To load a JSON file you can use:
 {% include_example manual_load_options java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
 </div>
 
-<div data-lang="python"  markdown="1">
-{% include_example manual_load_options python/sql/datasource.py %}
-</div>
-
 <div data-lang="r"  markdown="1">
 {% include_example manual_load_options r/RSparkSQLExample.R %}
 </div>
+
 </div>
 
 To load a CSV file you can use:
 
 <div class="codetabs">
+
+<div data-lang="python"  markdown="1">
+{% include_example manual_load_options_csv python/sql/datasource.py %}
+</div>
+
 <div data-lang="scala"  markdown="1">
 {% include_example manual_load_options_csv scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
 </div>
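
A sketch of the equivalent CSV load in PySpark with a few common reader options; the option values and path are illustrative:

{% highlight python %}
df = (spark.read.format("csv")
      .option("sep", ";")            # field delimiter
      .option("header", "true")      # first line contains column names
      .option("inferSchema", "true") # sample the data to infer column types
      .load("examples/people.csv"))
{% endhighlight %}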
@@ -92,14 +98,10 @@ To load a CSV file you can use:
 {% include_example manual_load_options_csv java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
 </div>
 
-<div data-lang="python"  markdown="1">
-{% include_example manual_load_options_csv python/sql/datasource.py %}
-</div>
-
 <div data-lang="r"  markdown="1">
 {% include_example manual_load_options_csv r/RSparkSQLExample.R %}
-
 </div>
+
 </div>
 
 The extra options are also used during write operation.
@@ -113,6 +115,10 @@ ORC data source:
 
 <div class="codetabs">
 
+<div data-lang="python"  markdown="1">
+{% include_example manual_save_options_orc python/sql/datasource.py %}
+</div>
+
 <div data-lang="scala"  markdown="1">
 {% include_example manual_save_options_orc scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
 </div>
@@ -121,16 +127,11 @@ ORC data source:
 {% include_example manual_save_options_orc java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
 </div>
 
-<div data-lang="python"  markdown="1">
-{% include_example manual_save_options_orc python/sql/datasource.py %}
-</div>
-
 <div data-lang="r"  markdown="1">
 {% include_example manual_save_options_orc r/RSparkSQLExample.R %}
 </div>
 
 <div data-lang="SQL"  markdown="1">
-
 {% highlight sql %}
 CREATE TABLE users_with_options (
   name STRING,
@@ -143,7 +144,6 @@ OPTIONS (
   orc.column.encoding.direct 'name'
 )
 {% endhighlight %}
-
 </div>
 
 </div>
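
A hedged PySpark sketch of writing ORC with an extra option; only `orc.column.encoding.direct` is taken from the SQL tab above, while the DataFrame `df` and the output path are assumptions:

{% highlight python %}
(df.write.format("orc")
   .option("orc.column.encoding.direct", "name")  # passed through to the ORC writer
   .save("users_with_options.orc"))
{% endhighlight %}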
@@ -152,6 +152,10 @@ Parquet data source:
 
 <div class="codetabs">
 
+<div data-lang="python"  markdown="1">
+{% include_example manual_save_options_parquet python/sql/datasource.py %}
+</div>
+
 <div data-lang="scala"  markdown="1">
 {% include_example manual_save_options_parquet scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
 </div>
@@ -160,16 +164,11 @@ Parquet data source:
 {% include_example manual_save_options_parquet java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
 </div>
 
-<div data-lang="python"  markdown="1">
-{% include_example manual_save_options_parquet python/sql/datasource.py %}
-</div>
-
 <div data-lang="r"  markdown="1">
 {% include_example manual_save_options_parquet r/RSparkSQLExample.R %}
 </div>
 
 <div data-lang="SQL"  markdown="1">
-
 {% highlight sql %}
 CREATE TABLE users_with_options (
   name STRING,
@@ -183,7 +182,6 @@ OPTIONS (
   parquet.page.write-checksum.enabled true
 )
 {% endhighlight %}
-
 </div>
 
 </div>
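
Similarly, a sketch of writing Parquet with a writer option; only `parquet.page.write-checksum.enabled` comes from the SQL tab above, the rest is assumed:

{% highlight python %}
(df.write.format("parquet")
   .option("parquet.page.write-checksum.enabled", "true")
   .save("users_with_options.parquet"))
{% endhighlight %}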
@@ -194,6 +192,11 @@ Instead of using read API to load a file into DataFrame and query it, you can al
 file directly with SQL.
 
 <div class="codetabs">
+
+<div data-lang="python"  markdown="1">
+{% include_example direct_sql python/sql/datasource.py %}
+</div>
+
 <div data-lang="scala"  markdown="1">
 {% include_example direct_sql scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
 </div>
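
A sketch of querying a file directly with SQL from PySpark; the path is a placeholder:

{% highlight python %}
# The data source format is given before the backquoted path.
df = spark.sql("SELECT * FROM parquet.`examples/users.parquet`")
{% endhighlight %}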
@@ -202,14 +205,10 @@ file directly with SQL.
 {% include_example direct_sql java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
 </div>
 
-<div data-lang="python"  markdown="1">
-{% include_example direct_sql python/sql/datasource.py %}
-</div>
-
 <div data-lang="r"  markdown="1">
 {% include_example direct_sql r/RSparkSQLExample.R %}
-
 </div>
+
 </div>
 
 ### Save Modes
@@ -287,6 +286,10 @@ Bucketing and sorting are applicable only to persistent tables:
 
 <div class="codetabs">
 
+<div data-lang="python"  markdown="1">
+{% include_example write_sorting_and_bucketing python/sql/datasource.py %}
+</div>
+
 <div data-lang="scala"  markdown="1">
 {% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
 </div>
@@ -295,12 +298,7 @@ Bucketing and sorting are applicable only to persistent tables:
 {% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
 </div>
 
-<div data-lang="python"  markdown="1">
-{% include_example write_sorting_and_bucketing python/sql/datasource.py %}
-</div>
-
 <div data-lang="SQL"  markdown="1">
-
 {% highlight sql %}
 
 CREATE TABLE users_bucketed_by_name(
@@ -311,9 +309,9 @@ CREATE TABLE users_bucketed_by_name(
 CLUSTERED BY(name) INTO 42 BUCKETS;
 
 {% endhighlight %}
-
 </div>
 
+
 </div>
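
A sketch of the same bucketing with the DataFrame API, assuming a DataFrame `df` with a `name` column; the sort column is illustrative:

{% highlight python %}
(df.write
   .bucketBy(42, "name")             # matches CLUSTERED BY(name) INTO 42 BUCKETS above
   .sortBy("favorite_numbers")
   .saveAsTable("users_bucketed_by_name"))
{% endhighlight %}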
 
 while partitioning can be used with both `save` and `saveAsTable` when using the Dataset APIs.
@@ -321,6 +319,10 @@ while partitioning can be used with both `save` and `saveAsTable` when using the
 
 <div class="codetabs">
 
+<div data-lang="python"  markdown="1">
+{% include_example write_partitioning python/sql/datasource.py %}
+</div>
+
 <div data-lang="scala"  markdown="1">
 {% include_example write_partitioning scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
 </div>
@@ -329,12 +331,7 @@ while partitioning can be used with both `save` and `saveAsTable` when using the
 {% include_example write_partitioning java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
 </div>
 
-<div data-lang="python"  markdown="1">
-{% include_example write_partitioning python/sql/datasource.py %}
-</div>
-
 <div data-lang="SQL"  markdown="1">
-
 {% highlight sql %}
 
 CREATE TABLE users_by_favorite_color(
@@ -344,7 +341,6 @@ CREATE TABLE users_by_favorite_color(
 ) USING csv PARTITIONED BY(favorite_color);
 
 {% endhighlight %}
-
 </div>
 
 </div>
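
A sketch of partitioning with the DataFrame API; the output path is a placeholder:

{% highlight python %}
(df.write
   .partitionBy("favorite_color")    # one directory per distinct value
   .format("parquet")
   .save("namesPartByColor.parquet"))
{% endhighlight %}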
@@ -353,6 +349,10 @@ It is possible to use both partitioning and bucketing for a single table:
 
 <div class="codetabs">
 
+<div data-lang="python"  markdown="1">
+{% include_example write_partition_and_bucket python/sql/datasource.py %}
+</div>
+
 <div data-lang="scala"  markdown="1">
 {% include_example write_partition_and_bucket scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
 </div>
@@ -361,12 +361,7 @@ It is possible to use both partitioning and bucketing for a single table:
 {% include_example write_partition_and_bucket java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
 </div>
 
-<div data-lang="python"  markdown="1">
-{% include_example write_partition_and_bucket python/sql/datasource.py %}
-</div>
-
 <div data-lang="SQL"  markdown="1">
-
 {% highlight sql %}
 
 CREATE TABLE users_bucketed_and_partitioned(
@@ -378,7 +373,6 @@ PARTITIONED BY (favorite_color)
 CLUSTERED BY(name) SORTED BY (favorite_numbers) INTO 42 BUCKETS;
 
 {% endhighlight %}
-
 </div>
 
 </div>
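
And a sketch combining partitioning with bucketing, mirroring the SQL tab above; the DataFrame `df` is assumed to have the referenced columns:

{% highlight python %}
(df.write
   .partitionBy("favorite_color")
   .bucketBy(42, "name")
   .sortBy("favorite_numbers")
   .saveAsTable("users_bucketed_and_partitioned"))
{% endhighlight %}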
diff --git a/docs/sql-getting-started.md b/docs/sql-getting-started.md
index 69396924e35..85da88a15c7 100644
--- a/docs/sql-getting-started.md
+++ b/docs/sql-getting-started.md
@@ -25,6 +25,13 @@ license: |
 ## Starting Point: SparkSession
 
 <div class="codetabs">
+<div data-lang="python"  markdown="1">
+
+The entry point into all functionality in Spark is the [`SparkSession`](api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.html) class. To create a basic `SparkSession`, just use `SparkSession.builder`:
+
+{% include_example init_session python/sql/basic.py %}
+</div>
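
For reference, a minimal sketch of building a `SparkSession` in Python; the app name and config key are placeholders, not necessarily what the included example uses:

{% highlight python %}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Python Spark SQL basic example")
         .config("spark.some.config.option", "some-value")  # optional, illustrative config
         .getOrCreate())
{% endhighlight %}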
+
 <div data-lang="scala"  markdown="1">
 
 The entry point into all functionality in Spark is the [`SparkSession`](api/scala/org/apache/spark/sql/SparkSession.html) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:
@@ -39,13 +46,6 @@ The entry point into all functionality in Spark is the [`SparkSession`](api/java
 {% include_example init_session java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
 </div>
 
-<div data-lang="python"  markdown="1">
-
-The entry point into all functionality in Spark is the [`SparkSession`](api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.html) class. To create a basic `SparkSession`, just use `SparkSession.builder`:
-
-{% include_example init_session python/sql/basic.py %}
-</div>
-
 <div data-lang="r"  markdown="1">
 
 The entry point into all functionality in Spark is the [`SparkSession`](api/R/reference/sparkR.session.html) class. To initialize a basic `SparkSession`, just call `sparkR.session()`:
@@ -63,31 +63,31 @@ To use these features, you do not need to have an existing Hive setup.
 ## Creating DataFrames
 
 <div class="codetabs">
-<div data-lang="scala"  markdown="1">
+<div data-lang="python"  markdown="1">
 With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
 from a Hive table, or from [Spark data sources](sql-data-sources.html).
 
 As an example, the following creates a DataFrame based on the content of a JSON file:
 
-{% include_example create_df scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
+{% include_example create_df python/sql/basic.py %}
 </div>
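
A minimal sketch of the JSON-based DataFrame creation described above, with an illustrative path:

{% highlight python %}
df = spark.read.json("examples/people.json")
df.show()  # display the DataFrame contents
{% endhighlight %}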
 
-<div data-lang="java" markdown="1">
+<div data-lang="scala"  markdown="1">
 With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
 from a Hive table, or from [Spark data sources](sql-data-sources.html).
 
 As an example, the following creates a DataFrame based on the content of a JSON file:
 
-{% include_example create_df java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
+{% include_example create_df scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
 </div>
 
-<div data-lang="python"  markdown="1">
+<div data-lang="java" markdown="1">
 With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
 from a Hive table, or from [Spark data sources](sql-data-sources.html).
 
 As an example, the following creates a DataFrame based on the content of a JSON file:
 
-{% include_example create_df python/sql/basic.py %}
+{% include_example create_df java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
 </div>
 
 <div data-lang="r"  markdown="1">
@@ -111,6 +111,21 @@ As mentioned above, in Spark 2.0, DataFrames are just Dataset of `Row`s in Scala
 Here we include some basic examples of structured data processing using Datasets:
 
 <div class="codetabs">
+
+<div data-lang="python"  markdown="1">
+In Python, it's possible to access a DataFrame's columns either by attribute
+(`df.age`) or by indexing (`df['age']`). While the former is convenient for
+interactive data exploration, users are highly encouraged to use the
+latter form, which is future proof and won't break with column names that
+are also attributes on the DataFrame class.
+
+{% include_example untyped_ops python/sql/basic.py %}
+For a complete list of the types of operations that can be performed on a DataFrame refer to the [API Documentation](api/python/reference/pyspark.sql.html#dataframe-apis).
+
+In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/python/reference/pyspark.sql.html#functions).
+
+</div>
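
A few hedged examples of the two column-access styles discussed above, assuming a DataFrame `df` with `name` and `age` columns:

{% highlight python %}
df.printSchema()
df.select(df['name'], df['age'] + 1).show()  # indexing form, recommended above
df.filter(df['age'] > 21).show()
df.groupBy("age").count().show()
{% endhighlight %}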
+
 <div data-lang="scala"  markdown="1">
 {% include_example untyped_ops scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
 
@@ -128,20 +143,6 @@ For a complete list of the types of operations that can be performed on a Datase
 In addition to simple column references and expressions, Datasets also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/java/org/apache/spark/sql/functions.html).
 </div>
 
-<div data-lang="python"  markdown="1">
-In Python, it's possible to access a DataFrame's columns either by attribute
-(`df.age`) or by indexing (`df['age']`). While the former is convenient for
-interactive data exploration, users are highly encouraged to use the
-latter form, which is future proof and won't break with column names that
-are also attributes on the DataFrame class.
-
-{% include_example untyped_ops python/sql/basic.py %}
-For a complete list of the types of operations that can be performed on a DataFrame refer to the [API Documentation](api/python/reference/pyspark.sql.html#dataframe-apis).
-
-In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/python/reference/pyspark.sql.html#functions).
-
-</div>
-
 <div data-lang="r"  markdown="1">
 
 {% include_example untyped_ops r/RSparkSQLExample.R %}
@@ -157,6 +158,13 @@ In addition to simple column references and expressions, DataFrames also have a
 ## Running SQL Queries Programmatically
 
 <div class="codetabs">
+
+<div data-lang="python"  markdown="1">
+The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `DataFrame`.
+
+{% include_example run_sql python/sql/basic.py %}
+</div>
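
A sketch of running SQL programmatically, assuming a DataFrame `df`; the view name is illustrative:

{% highlight python %}
df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people").show()
{% endhighlight %}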
+
 <div data-lang="scala"  markdown="1">
 The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `DataFrame`.
 
@@ -169,12 +177,6 @@ The `sql` function on a `SparkSession` enables applications to run SQL queries p
 {% include_example run_sql java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
 </div>
 
-<div data-lang="python"  markdown="1">
-The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `DataFrame`.
-
-{% include_example run_sql python/sql/basic.py %}
-</div>
-
 <div data-lang="r"  markdown="1">
 The `sql` function enables applications to run SQL queries programmatically and returns the result as a `SparkDataFrame`.
 
@@ -193,6 +195,11 @@ view is tied to a system preserved database `global_temp`, and we must use the q
 refer it, e.g. `SELECT * FROM global_temp.view1`.
 
 <div class="codetabs">
+
+<div data-lang="python"  markdown="1">
+{% include_example global_temp_view python/sql/basic.py %}
+</div>
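
A sketch of registering and querying a global temporary view, assuming a DataFrame `df`; the view name is illustrative:

{% highlight python %}
df.createGlobalTempView("people")
spark.sql("SELECT * FROM global_temp.people").show()
spark.newSession().sql("SELECT * FROM global_temp.people").show()  # visible across sessions
{% endhighlight %}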
+
 <div data-lang="scala"  markdown="1">
 {% include_example global_temp_view scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
 </div>
@@ -201,21 +208,14 @@ refer it, e.g. `SELECT * FROM global_temp.view1`.
 {% include_example global_temp_view java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
 </div>
 
-<div data-lang="python"  markdown="1">
-{% include_example global_temp_view python/sql/basic.py %}
-</div>
-
 <div data-lang="SQL"  markdown="1">
-
 {% highlight sql %}
-
 CREATE GLOBAL TEMPORARY VIEW temp_view AS SELECT a + 1, b * 2 FROM tbl
 
 SELECT * FROM global_temp.temp_view
-
 {% endhighlight %}
-
 </div>
+
 </div>
 
 
@@ -229,6 +229,7 @@ that allows Spark to perform many operations like filtering, sorting and hashing
 the bytes back into an object.
 
 <div class="codetabs">
+
 <div data-lang="scala"  markdown="1">
 {% include_example create_ds scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
 </div>
@@ -252,6 +253,15 @@ you to construct Datasets when the columns and their types are not known until r
 ### Inferring the Schema Using Reflection
 <div class="codetabs">
 
+<div data-lang="python"  markdown="1">
+
+Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the datatypes. Rows are constructed by passing a list of
+key/value pairs as kwargs to the Row class. The keys of this list define the column names of the table,
+and the types are inferred by sampling the whole dataset, similar to the inference that is performed on JSON files.
+
+{% include_example schema_inferring python/sql/basic.py %}
+</div>
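
A minimal sketch of the Row-based inference described above; the sample data is made up:

{% highlight python %}
from pyspark.sql import Row

lines = spark.sparkContext.parallelize([("Alice", "25"), ("Bob", "30")])
people = lines.map(lambda p: Row(name=p[0], age=int(p[1])))
df = spark.createDataFrame(people)  # column names and types are inferred from the Rows
{% endhighlight %}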
+
 <div data-lang="scala"  markdown="1">
 
 The Scala interface for Spark SQL supports automatically converting an RDD containing case classes
@@ -276,21 +286,29 @@ Serializable and has getters and setters for all of its fields.
 {% include_example schema_inferring java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
 </div>
 
-<div data-lang="python"  markdown="1">
-
-Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the datatypes. Rows are constructed by passing a list of
-key/value pairs as kwargs to the Row class. The keys of this list define the column names of the table,
-and the types are inferred by sampling the whole dataset, similar to the inference that is performed on JSON files.
-
-{% include_example schema_inferring python/sql/basic.py %}
-</div>
-
 </div>
 
 ### Programmatically Specifying the Schema
 
 <div class="codetabs">
 
+<div data-lang="python"  markdown="1">
+
+When a dictionary of kwargs cannot be defined ahead of time (for example,
+the structure of records is encoded in a string, or a text dataset will be parsed and
+fields will be projected differently for different users),
+a `DataFrame` can be created programmatically with three steps.
+
+1. Create an RDD of tuples or lists from the original RDD;
+2. Create the schema represented by a `StructType` matching the structure of
+tuples or lists in the RDD created in the step 1.
+3. Apply the schema to the RDD via `createDataFrame` method provided by `SparkSession`.
+
+For example:
+
+{% include_example programmatic_schema python/sql/basic.py %}
+</div>
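
A minimal sketch of the three steps above; the column names and sample data are illustrative:

{% highlight python %}
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

rows = spark.sparkContext.parallelize([("Alice", 25), ("Bob", 30)])      # 1. RDD of tuples
schema = StructType([StructField("name", StringType(), True),
                     StructField("age", IntegerType(), True)])           # 2. matching StructType
df = spark.createDataFrame(rows, schema)                                 # 3. apply the schema
df.show()
{% endhighlight %}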
+
 <div data-lang="scala"  markdown="1">
 
 When case classes cannot be defined ahead of time (for example,
@@ -327,23 +345,6 @@ For example:
 {% include_example programmatic_schema java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
 </div>
 
-<div data-lang="python"  markdown="1">
-
-When a dictionary of kwargs cannot be defined ahead of time (for example,
-the structure of records is encoded in a string, or a text dataset will be parsed and
-fields will be projected differently for different users),
-a `DataFrame` can be created programmatically with three steps.
-
-1. Create an RDD of tuples or lists from the original RDD;
-2. Create the schema represented by a `StructType` matching the structure of
-tuples or lists in the RDD created in the step 1.
-3. Apply the schema to the RDD via `createDataFrame` method provided by `SparkSession`.
-
-For example:
-
-{% include_example programmatic_schema python/sql/basic.py %}
-</div>
-
 </div>
 
 ## Scalar Functions

