Posted to commits@spark.apache.org by gu...@apache.org on 2023/02/14 13:00:35 UTC

[spark] branch branch-3.4 updated: [SPARK-42418][DOCS][PYTHON] PySpark documentation updates to improve discoverability and add more guidance

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
     new 25bbb5c9126 [SPARK-42418][DOCS][PYTHON] PySpark documentation updates to improve discoverability and add more guidance
25bbb5c9126 is described below

commit 25bbb5c9126054280c464b64138d0ab145305a2b
Author: Allan Folting <al...@databricks.com>
AuthorDate: Tue Feb 14 22:00:07 2023 +0900

    [SPARK-42418][DOCS][PYTHON] PySpark documentation updates to improve discoverability and add more guidance
    
    ### What changes were proposed in this pull request?
    Updates to the PySpark documentation web pages that help users choose which API to use when, and that make it easier to discover relevant content and navigate the documentation pages. This is the first of a series of updates.
    
    ### Why are the changes needed?
    The PySpark documentation web site does not do enough to help users choose which API to use, and it is not easy to navigate.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, the user-facing PySpark documentation is updated.
    
    ### How was this patch tested?
    Built and tested the PySpark documentation web site locally.
    
    Closes #39992 from allanf-db/pyspark_doc_updates.
    
    Authored-by: Allan Folting <al...@databricks.com>
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
    (cherry picked from commit fa7c13add487984ae84527ecd254b5c396ba9fee)
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
---
 docs/img/pyspark-components.png               | Bin 32727 -> 0 bytes
 docs/img/pyspark-machine_learning.png         | Bin 0 -> 14963 bytes
 docs/img/pyspark-pandas_api_on_spark.png      | Bin 0 -> 15765 bytes
 docs/img/pyspark-spark_core_and_rdds.png      | Bin 0 -> 19844 bytes
 docs/img/pyspark-spark_sql_and_dataframes.png | Bin 0 -> 17644 bytes
 docs/img/pyspark-structured_streaming.png     | Bin 0 -> 15271 bytes
 python/docs/source/_static/css/pyspark.css    |   3 +
 python/docs/source/index.rst                  | 158 ++++++++++++++++++++------
 8 files changed, 127 insertions(+), 34 deletions(-)

diff --git a/docs/img/pyspark-components.png b/docs/img/pyspark-components.png
deleted file mode 100644
index a0979d3465a..00000000000
Binary files a/docs/img/pyspark-components.png and /dev/null differ
diff --git a/docs/img/pyspark-machine_learning.png b/docs/img/pyspark-machine_learning.png
new file mode 100644
index 00000000000..7f4e6286f20
Binary files /dev/null and b/docs/img/pyspark-machine_learning.png differ
diff --git a/docs/img/pyspark-pandas_api_on_spark.png b/docs/img/pyspark-pandas_api_on_spark.png
new file mode 100644
index 00000000000..b4b291b3440
Binary files /dev/null and b/docs/img/pyspark-pandas_api_on_spark.png differ
diff --git a/docs/img/pyspark-spark_core_and_rdds.png b/docs/img/pyspark-spark_core_and_rdds.png
new file mode 100644
index 00000000000..8d06a446c1a
Binary files /dev/null and b/docs/img/pyspark-spark_core_and_rdds.png differ
diff --git a/docs/img/pyspark-spark_sql_and_dataframes.png b/docs/img/pyspark-spark_sql_and_dataframes.png
new file mode 100644
index 00000000000..acd8b280de1
Binary files /dev/null and b/docs/img/pyspark-spark_sql_and_dataframes.png differ
diff --git a/docs/img/pyspark-structured_streaming.png b/docs/img/pyspark-structured_streaming.png
new file mode 100644
index 00000000000..b49bb5b2755
Binary files /dev/null and b/docs/img/pyspark-structured_streaming.png differ
diff --git a/python/docs/source/_static/css/pyspark.css b/python/docs/source/_static/css/pyspark.css
index 1e493c4c868..89b7c65f27a 100644
--- a/python/docs/source/_static/css/pyspark.css
+++ b/python/docs/source/_static/css/pyspark.css
@@ -92,3 +92,6 @@ u.bd-sidebar .nav>li>ul>.active:hover>a,.bd-sidebar .nav>li>ul>.active>a {
     border-left: 2px solid #1B5162!important;
 }
 
+/* Remove top borders inside the overview page spec tables. */
+.spec_table tr, .spec_table td, .spec_table th {
+    border-top: none !important;
+}
diff --git a/python/docs/source/index.rst b/python/docs/source/index.rst
index 7f650b79a1a..b3233744c5e 100644
--- a/python/docs/source/index.rst
+++ b/python/docs/source/index.rst
@@ -17,58 +17,148 @@
 
 .. PySpark documentation master file
 
-=====================
-PySpark Documentation
-=====================
+=================
+PySpark Overview
+=================
 
-|binder|_ | `GitHub <https://github.com/apache/spark>`_ | `Issues <https://issues.apache.org/jira/projects/SPARK/issues>`_ | |examples|_ | `Community <https://spark.apache.org/community.html>`_
-
-PySpark is an interface for Apache Spark in Python. It not only allows you to write
-Spark applications using Python APIs, but also provides the PySpark shell for
-interactively analyzing your data in a distributed environment. PySpark supports most
-of Spark's features such as Spark SQL, DataFrame, Streaming, MLlib
-(Machine Learning) and Spark Core.
+**Date**: |today| **Version**: |release|
 
-.. image:: ../../../docs/img/pyspark-components.png
-  :alt: PySpark Components
+**Useful links**:
+|binder|_ | `GitHub <https://github.com/apache/spark>`_ | `Issues <https://issues.apache.org/jira/projects/SPARK/issues>`_ | |examples|_ | `Community <https://spark.apache.org/community.html>`_
 
-**Spark SQL and DataFrame**
+PySpark is the Python API for Apache Spark. It enables you to perform real-time,
+large-scale data processing in a distributed environment using Python. It also provides a PySpark
+shell for interactively analyzing your data.
+
+PySpark combines Python's ease of learning and use with the power of Apache Spark,
+enabling anyone familiar with Python to process and analyze data at any scale.
+
+PySpark supports all of Spark's features, such as Spark SQL,
+DataFrames, Structured Streaming, Machine Learning (MLlib), and Spark Core.
+
+.. list-table::
+   :widths: 10 20 20 20 20 10
+   :header-rows: 0
+   :class: borderless spec_table
+
+   * -
+     - .. image:: ../../../docs/img/pyspark-spark_sql_and_dataframes.png
+          :target: reference/pyspark.sql/index.html
+          :width: 100%
+          :alt: Spark SQL
+     - .. image:: ../../../docs/img/pyspark-pandas_api_on_spark.png
+          :target: reference/pyspark.pandas/index.html
+          :width: 100%
+          :alt: Pandas API on Spark
+     - .. image:: ../../../docs/img/pyspark-structured_streaming.png
+          :target: reference/pyspark.ss/index.html
+          :width: 100%
+          :alt: Streaming
+     - .. image:: ../../../docs/img/pyspark-machine_learning.png
+          :target: reference/pyspark.ml.html
+          :width: 100%
+          :alt: Machine Learning
+     -
+
+.. list-table::
+   :widths: 10 80 10
+   :header-rows: 0
+   :class: borderless spec_table
+
+   * -
+     - .. image:: ../../../docs/img/pyspark-spark_core_and_rdds.png
+          :target: reference/pyspark.html
+          :width: 100%
+          :alt: Spark Core and RDDs
+     -
+
+.. _Index Page - Spark SQL and DataFrames:
+
+**Spark SQL and DataFrames**
+
+Spark SQL is Apache Spark's module for working with structured data.
+It allows you to seamlessly mix SQL queries with Spark programs.
+With PySpark DataFrames you can efficiently read, write, transform,
+and analyze data using Python and SQL.
+Whether you use Python or SQL, the same underlying execution
+engine is used, so you will always leverage the full power of Spark.
+
+- :ref:`/getting_started/quickstart_df.ipynb`
+- |binder_df|_
+- :ref:`Spark SQL API Reference</reference/pyspark.sql/index.rst>`
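+
+For illustration, a minimal sketch of mixing the DataFrame API with SQL
+(this assumes an active ``SparkSession`` named ``spark``, as in the PySpark shell):
+
+.. code-block:: python
+
+    # Create a DataFrame from local data and register it as a temporary view.
+    df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], schema=["name", "age"])
+    df.createOrReplaceTempView("people")
+
+    # The same engine runs both the DataFrame API and SQL queries.
+    df.filter(df.age > 40).show()
+    spark.sql("SELECT name FROM people WHERE age > 40").show()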
+
+**Pandas API on Spark**
+
+Pandas API on Spark allows you to scale your pandas workload to any size
+by running it distributed across multiple nodes. If you are already familiar
+with pandas and want to leverage Spark for big data, pandas API on Spark makes
+you immediately productive and lets you migrate your applications without modifying the code.
+You can have a single codebase that works both with pandas (tests, smaller datasets)
+and with Spark (production, distributed datasets), and you can switch between
+pandas and the pandas API on Spark easily and without overhead.
+
+Pandas API on Spark aims to make the transition from pandas to Spark easy, but
+if you are new to Spark or deciding which API to use, we recommend using PySpark DataFrames
+(see :ref:`Spark SQL and DataFrames <Index Page - Spark SQL and DataFrames>`).
+
+- :ref:`/getting_started/quickstart_ps.ipynb`
+- |binder_ps|_
+- :ref:`Pandas API on Spark Reference</reference/pyspark.pandas/index.rst>`
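+
+As a minimal pandas-on-Spark sketch (with a toy dataset defined inline;
+a SparkSession is created automatically if none is active):
+
+.. code-block:: python
+
+    import pyspark.pandas as ps
+
+    # pandas-on-Spark DataFrames mirror the pandas API but execute on Spark.
+    psdf = ps.DataFrame({"year": [2021, 2022, 2023], "visitors": [100, 150, 210]})
+    print(psdf["visitors"].mean())
+
+    # Conversion to plain pandas is available when the data fits in local memory.
+    pdf = psdf.to_pandas()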
+
+.. _Index Page - Structured Streaming:
+
+**Structured Streaming**
+
+Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.
+You can express your streaming computation the same way you would express a batch computation on static data.
+The Spark SQL engine will take care of running it incrementally and continuously and updating the final result
+as streaming data continues to arrive.
+
+- `Structured Streaming Programming Guide <https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html>`_
+- :ref:`Structured Streaming API Reference</reference/pyspark.ss/index.rst>`
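+
+As a rough sketch of a streaming word count (assuming an active ``SparkSession``
+named ``spark`` and a socket source on ``localhost:9999``; the source and sink are placeholders):
+
+.. code-block:: python
+
+    from pyspark.sql.functions import explode, split
+
+    # Read a stream of lines, split them into words, and keep a running count.
+    lines = (spark.readStream.format("socket")
+             .option("host", "localhost").option("port", 9999).load())
+    words = lines.select(explode(split(lines.value, " ")).alias("word"))
+    counts = words.groupBy("word").count()
+
+    # Continuously print the updated counts to the console.
+    query = counts.writeStream.outputMode("complete").format("console").start()
+    query.awaitTermination()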
+
+**Machine Learning (MLlib)**
 
-Spark SQL is a Spark module for structured data processing. It provides
-a programming abstraction called DataFrame and can also act as distributed
-SQL query engine.
+Built on top of Spark, MLlib is a scalable machine learning library that provides
+a uniform set of high-level APIs that help users create and tune practical machine
+learning pipelines.
 
-**pandas API on Spark**
+- `Machine Learning Library (MLlib) Programming Guide <https://spark.apache.org/docs/latest/ml-guide.html>`_
+- :ref:`Machine Learning (MLlib) API Reference</reference/pyspark.ml.rst>`
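+
+A minimal MLlib pipeline sketch (assuming an active ``SparkSession`` named ``spark``
+and a toy dataset defined inline):
+
+.. code-block:: python
+
+    from pyspark.ml import Pipeline
+    from pyspark.ml.classification import LogisticRegression
+    from pyspark.ml.feature import VectorAssembler
+
+    # A toy training set with two numeric features and a binary label.
+    train = spark.createDataFrame(
+        [(0.0, 1.1, 0.0), (2.0, 1.0, 1.0), (2.0, 1.3, 1.0), (0.0, 1.2, 0.0)],
+        ["f1", "f2", "label"],
+    )
+
+    # Assemble the feature columns into a vector, then fit a logistic regression.
+    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
+    lr = LogisticRegression(maxIter=10)
+    model = Pipeline(stages=[assembler, lr]).fit(train)
+    model.transform(train).select("f1", "f2", "prediction").show()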
 
-pandas API on Spark allows you to scale your pandas workload out.
-With this package, you can:
+**Spark Core and RDDs**
 
-* Be immediately productive with Spark, with no learning curve, if you are already familiar with pandas.
-* Have a single codebase that works both with pandas (tests, smaller datasets) and with Spark (distributed datasets).
-* Switch to pandas API and PySpark API contexts easily without any overhead.
+Spark Core is the underlying general execution engine for the Spark platform that all
+other functionality is built on top of. It provides RDDs (Resilient Distributed Datasets)
+and in-memory computing capabilities.
 
-**Streaming**
+Note that the RDD API is a low-level API, which can be difficult to use, and it does not
+benefit from Spark's automatic query optimization capabilities.
+We recommend using DataFrames (see :ref:`Spark SQL and DataFrames <Index Page - Spark SQL and DataFrames>` above)
+instead of RDDs, as they allow you to express what you want more easily and let Spark automatically
+construct the most efficient query for you.
 
-Running on top of Spark, the streaming feature in Apache Spark enables powerful
-interactive and analytical applications across both streaming and historical data,
-while inheriting Spark's ease of use and fault tolerance characteristics.
+- :ref:`Spark Core API Reference</reference/pyspark.rst>`
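+
+A minimal RDD sketch (assuming a ``SparkContext`` named ``sc``,
+for example obtained via ``spark.sparkContext``):
+
+.. code-block:: python
+
+    # Distribute a local collection and run a simple map/reduce over it.
+    rdd = sc.parallelize(range(1, 101))
+    total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
+    print(total)  # sum of squares of 1..100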
 
-**MLlib**
+**Spark Streaming (Legacy)**
 
-Built on top of Spark, MLlib is a scalable machine learning library that provides
-a uniform set of high-level APIs that help users create and tune practical machine
-learning pipelines.
+Spark Streaming is an extension of the core Spark API that enables scalable,
+high-throughput, fault-tolerant stream processing of live data streams.
 
-**Spark Core**
+Note that Spark Streaming is the previous generation of Spark's streaming engine.
+It is a legacy project and is no longer being updated.
+There is a newer and easier-to-use streaming engine in Spark called
+:ref:`Structured Streaming <Index Page - Structured Streaming>`, which you
+should use for your streaming applications and pipelines.
 
-Spark Core is the underlying general execution engine for the Spark platform that all
-other functionality is built on top of. It provides an RDD (Resilient Distributed Dataset)
-and in-memory computing capabilities.
+- `Spark Streaming Programming Guide (Legacy) <https://spark.apache.org/docs/latest/streaming-programming-guide.html>`_
+- :ref:`Spark Streaming API Reference (Legacy)</reference/pyspark.streaming.rst>`
 
 .. toctree::
     :maxdepth: 2
     :hidden:
 
+    Overview <self>
     getting_started/index
     user_guide/index
     reference/index


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org