Posted to commits@spark.apache.org by gu...@apache.org on 2023/02/24 02:33:15 UTC

[spark] branch branch-3.4 updated: [SPARK-42475][CONNECT][DOCS] Getting Started: Live Notebook for Spark Connect

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
     new 12d94bd47c2 [SPARK-42475][CONNECT][DOCS] Getting Started: Live Notebook for Spark Connect
12d94bd47c2 is described below

commit 12d94bd47c2cf96c89ff27c43e220ea0fedb37f2
Author: itholic <ha...@databricks.com>
AuthorDate: Fri Feb 24 11:32:53 2023 +0900

    [SPARK-42475][CONNECT][DOCS] Getting Started: Live Notebook for Spark Connect
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to add "Live Notebook: DataFrame with Spark Connect" to the [Getting Started](https://spark.apache.org/docs/latest/api/python/getting_started/index.html) documentation, as shown below:
    
    <img width="794" alt="Screen Shot 2023-02-23 at 1 15 41 PM" src="https://user-images.githubusercontent.com/44108233/220820191-ca0e5705-1694-4eaa-8658-67d522af1bf8.png">
    
    The notebook copies the contents of 'Live Notebook: DataFrame' and updates the parts related to Spark Connect.
    
    The notebook looks like the following:
    
    <img width="814" alt="Screen Shot 2023-02-23 at 1 15 54 PM" src="https://user-images.githubusercontent.com/44108233/220820218-bbfb6a58-7009-4327-aea4-72ed6496d77c.png">
    
    ### Why are the changes needed?
    
    To help those who are new to Spark Connect get started quickly with the DataFrame API over Spark Connect.
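
    For reference, a minimal sketch of the flow the new notebook walks through (assuming a Spark Connect server is already listening on the default `localhost:15002`, as set up by `start-connect-server.sh` in the notebook):

    ```python
    from datetime import date, datetime
    from pyspark.sql import Row, SparkSession

    # Connect to a Spark Connect server assumed to be running on the default port.
    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

    # The remote session is used like a regular SparkSession.
    df = spark.createDataFrame([
        Row(a=1, b=2.0, c="string1", d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    ])
    df.show()
    ```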
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, it's documentation.
    
    ### How was this patch tested?
    
    Manually built the docs and ran the CI.
    
    Closes #40092 from itholic/SPARK-42475.
    
    Authored-by: itholic <ha...@databricks.com>
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
    (cherry picked from commit 3830285d56d8971c6948763aeb3d99c5bf5eca91)
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
---
 binder/postBuild                                   |  14 +-
 python/docs/source/conf.py                         |   2 +
 python/docs/source/getting_started/index.rst       |   2 +
 .../getting_started/quickstart_connect.ipynb       | 142 +++++++++++++++++++++
 .../source/getting_started/quickstart_df.ipynb     |  33 -----
 5 files changed, 159 insertions(+), 34 deletions(-)

diff --git a/binder/postBuild b/binder/postBuild
index 34ead09f692..8ff3aaf5a14 100644
--- a/binder/postBuild
+++ b/binder/postBuild
@@ -32,11 +32,23 @@ else
   SPECIFIER="<="
 fi
 
-pip install plotly "pyspark[sql,ml,mllib,pandas_on_spark]$SPECIFIER$VERSION"
+if [[ ! $VERSION < "3.4.0" ]]; then
+  pip install plotly "pyspark[sql,ml,mllib,pandas_on_spark,connect]$SPECIFIER$VERSION"
+else
+  pip install plotly "pyspark[sql,ml,mllib,pandas_on_spark]$SPECIFIER$VERSION"
+fi
 
 # Set 'PYARROW_IGNORE_TIMEZONE' to suppress warnings from PyArrow.
 echo "export PYARROW_IGNORE_TIMEZONE=1" >> ~/.profile
 
+# Add sbin to PATH to run `start-connect-server.sh`.
+SPARK_HOME=$(python -c "from pyspark.find_spark_home import _find_spark_home; print(_find_spark_home())")
+echo "export PATH=${PATH}:${SPARK_HOME}/sbin" >> ~/.profile
+
+# Add the Spark version to the environment so commands can be run dynamically based on it.
+SPARK_VERSION=$(python -c "import pyspark; print(pyspark.__version__)")
+echo "export SPARK_VERSION=${SPARK_VERSION}" >> ~/.profile
+
 # Suppress warnings from Spark jobs, and UI progress bar.
 mkdir -p ~/.ipython/profile_default/startup
 echo """from pyspark.sql import SparkSession
diff --git a/python/docs/source/conf.py b/python/docs/source/conf.py
index 5e1b14b574d..8203a802053 100644
--- a/python/docs/source/conf.py
+++ b/python/docs/source/conf.py
@@ -89,6 +89,8 @@ rst_epilog = """
 .. _binder_df: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_df.ipynb
 .. |binder_ps| replace:: Live Notebook: pandas API on Spark
 .. _binder_ps: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb
+.. |binder_connect| replace:: Live Notebook: Spark Connect
+.. _binder_connect: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_connect.ipynb
 .. |examples| replace:: Examples
 .. _examples: https://github.com/apache/spark/tree/{0}/examples/src/main/python
 .. |downloading| replace:: Downloading
diff --git a/python/docs/source/getting_started/index.rst b/python/docs/source/getting_started/index.rst
index a216e3bbe3b..3c1c7d80863 100644
--- a/python/docs/source/getting_started/index.rst
+++ b/python/docs/source/getting_started/index.rst
@@ -28,6 +28,7 @@ at `the Spark documentation <https://spark.apache.org/docs/latest/index.html#whe
 There are live notebooks where you can try PySpark out without any other step:
 
 * |binder_df|_
+* |binder_connect|_
 * |binder_ps|_
 
 The list below is the contents of this quickstart page:
@@ -37,4 +38,5 @@ The list below is the contents of this quickstart page:
 
    install
    quickstart_df
+   quickstart_connect
    quickstart_ps
diff --git a/python/docs/source/getting_started/quickstart_connect.ipynb b/python/docs/source/getting_started/quickstart_connect.ipynb
new file mode 100644
index 00000000000..54b4614796c
--- /dev/null
+++ b/python/docs/source/getting_started/quickstart_connect.ipynb
@@ -0,0 +1,142 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Quickstart: Spark Connect\n",
+    "\n",
+    "Spark Connect introduced a decoupled client-server architecture for Spark that allows remote connectivity to Spark clusters using the [DataFrame API](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html?highlight=dataframe#pyspark.sql.DataFrame).\n",
+    "\n",
+    "This notebook walks through a simple step-by-step example of how to use Spark Connect to build any type of application that needs to leverage the power of Spark when working with data.\n",
+    "\n",
+    "Spark Connect includes both client and server components and we will show you how to set up and use both."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Launch Spark server with Spark Connect\n",
+    "\n",
+    "To launch Spark with support for Spark Connect sessions, run the `start-connect-server.sh` script."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:$SPARK_VERSION"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Connect to Spark Connect server\n",
+    "\n",
+    "Now that the Spark server is running, we can connect to it remotely using Spark Connect. We do this by creating a remote Spark session on the client where our application runs. Before we can do that, we need to make sure to stop the existing regular Spark session because it cannot coexist with the remote Spark Connect session we are about to create."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pyspark.sql import SparkSession\n",
+    "\n",
+    "SparkSession.builder.master(\"local[*]\").getOrCreate().stop()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The command we used above to launch the server configured Spark to run as `localhost:15002`. So now we can create a remote Spark session on the client using the following command."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "spark = SparkSession.builder.remote(\"sc://localhost:15002\").getOrCreate()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Create DataFrame\n",
+    "\n",
+    "Once the remote Spark session is created successfully, it can be used the same way as a regular Spark session. Therefore, you can create a DataFrame with the following command."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "+---+---+-------+----------+-------------------+\n",
+      "|  a|  b|      c|         d|                  e|\n",
+      "+---+---+-------+----------+-------------------+\n",
+      "|  1|2.0|string1|2000-01-01|2000-01-01 12:00:00|\n",
+      "|  2|3.0|string2|2000-02-01|2000-01-02 12:00:00|\n",
+      "|  4|5.0|string3|2000-03-01|2000-01-03 12:00:00|\n",
+      "+---+---+-------+----------+-------------------+\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "from datetime import datetime, date\n",
+    "from pyspark.sql import Row\n",
+    "\n",
+    "df = spark.createDataFrame([\n",
+    "    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),\n",
+    "    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),\n",
+    "    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))\n",
+    "])\n",
+    "df.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "See 'Live Notebook: DataFrame' at [the quickstart page](https://spark.apache.org/docs/latest/api/python/getting_started/index.html) for more detail usage of DataFrame API."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.11"
+  },
+  "name": "quickstart",
+  "notebookId": 1927513300154480
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
diff --git a/python/docs/source/getting_started/quickstart_df.ipynb b/python/docs/source/getting_started/quickstart_df.ipynb
index af0e5ee5052..f1c04c8bf11 100644
--- a/python/docs/source/getting_started/quickstart_df.ipynb
+++ b/python/docs/source/getting_started/quickstart_df.ipynb
@@ -133,39 +133,6 @@
     "df"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Create a PySpark DataFrame from an RDD consisting of a list of tuples."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 5,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]"
-      ]
-     },
-     "execution_count": 5,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "rdd = spark.sparkContext.parallelize([\n",
-    "    (1, 2., 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),\n",
-    "    (2, 3., 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),\n",
-    "    (3, 4., 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0))\n",
-    "])\n",
-    "df = spark.createDataFrame(rdd, schema=['a', 'b', 'c', 'd', 'e'])\n",
-    "df"
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {},

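For context on the `quickstart_df.ipynb` hunk above: the removed cell built the DataFrame from an RDD via `spark.sparkContext.parallelize`, presumably dropped to keep the quickstart aligned with the new Spark Connect notebook, where the RDD API (and `spark.sparkContext`) is not available. A minimal sketch of the DataFrame-only equivalent, assuming a remote session:

```python
from datetime import date, datetime
from pyspark.sql import SparkSession

# Assumes a Spark Connect server is reachable on the default port.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# A Spark Connect session does not expose spark.sparkContext, so the DataFrame
# is created directly from local tuples instead of parallelizing them into an RDD.
df = spark.createDataFrame(
    [(1, 2.0, "string1", date(2000, 1, 1), datetime(2000, 1, 1, 12, 0))],
    schema=["a", "b", "c", "d", "e"],
)
df.show()
```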

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org