You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "allanf-db (via GitHub)" <gi...@apache.org> on 2023/03/08 02:44:27 UTC

[GitHub] [spark] allanf-db opened a new pull request, #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

allanf-db opened a new pull request, #40324:
URL: https://github.com/apache/spark/pull/40324

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   Adding a Spark Connect overview page to the Spark 3.4 documentation.
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   The first version of Spark Connect is released as part of Spark 3.4.0 and this adds an overview for it to the documentation.
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, the user facing documentation is updated.
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   Built the doc website locally and tested the pages.
   SKIP_SCALADOC=1 SKIP_RDOC=1 bundle exec jekyll build
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] allanf-db commented on a diff in pull request #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

Posted by "allanf-db (via GitHub)" <gi...@apache.org>.
allanf-db commented on code in PR #40324:
URL: https://github.com/apache/spark/pull/40324#discussion_r1130590417


##########
docs/spark-connect-overview.md:
##########
@@ -0,0 +1,108 @@
+---
+layout: global
+title: Spark Connect Overview - Building client-side Spark applications
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+     http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+In Apache Spark 3.4, Spark Connect introduced a decoupled client-server architecture that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol. The separation between client and server allows Spark and its open ecosystem to be leveraged from everywhere. It can be embedded in modern data applications, in IDEs, Notebooks and programming languages.
+
+<p style="text-align: center;">
+  <img src="img/spark-connect-api.png" title="Spark Connect API" alt="Spark Connect API Diagram" />
+</p>
+
+# How Spark Connect Works
+
+The Spark Connect client library is designed to simplify Spark application development. It is a thin API that can be embedded everywhere: in application servers, IDEs, notebooks, and programming languages. The Spark Connect API builds on Spark's DataFrame API using unresolved logical plans as a language-agnostic protocol between the client and the Spark driver.
+
+The Spark Connect client translates DataFrame operations into unresolved logical query plans which are encoded using protocol buffers. These are sent to the server using the gRPC framework.
+
+The Spark Connect endpoint embedded on the Spark Server, receives and translates unresolved logical plans into Spark's logical plan operators. This is similar to parsing a SQL query, where attributes and relations are parsed and an initial parse plan is built. From there, the standard Spark execution process kicks in, ensuring that Spark Connect leverages all of Spark's optimizations and enhancements. Results are streamed back to the client via gRPC as Apache Arrow-encoded row batches.
+
+<p style="text-align: center;">
+  <img src="img/spark-connect-communication.png" title="Spark Connect communication" alt="Spark Connect communication" />
+</p>
+
+# Operational Benefits of Spark Connect
+
+With this new architecture, Spark Connect mitigates several operational issues:
+
+**Stability**: Applications that use too much memory will now only impact their own environment as they can run in their own processes. Users can define their own dependencies on the client and don't need to worry about potential conflicts with the Spark driver.
+
+**Upgradability**: The Spark driver can now seamlessly be upgraded independently of applications, e.g. to benefit from performance improvements and security fixes. This means applications can be forward-compatible, as long as the server-side RPC definitions are designed to be backwards compatible.
+
+**Debuggability and Observability**: Spark Connect enables interactive debugging during development directly from your favorite IDE. Similarly, applications can be monitored using the application's framework native metrics and logging libraries.
+
+# How to use Spark Connect
+
+Starting with Spark 3.4, Spark Connect is available and supports PySpark applications. When creating a Spark session, you can specify that you want to use Spark Connect and there are a few ways to do that as outlined below.
+
+If you do not use one of the mechanisms outlined below, your Spark session will work just like before, without leveraging Spark Connect, and your application code will run on the Spark driver node.
+
+## Set SPARK_REMOTE environment variable
+
+If you set the SPARK_REMOTE environment variable on the client machine where your Spark client application is running and create a new Spark Session as illustrated below, the session will be a Spark Connect session. With this approach, there is no code change needed to start using Spark Connect.
+
+Set SPARK_REMOTE environment variable:
+
+{% highlight bash %}
+    export SPARK_REMOTE="sc://localhost/"
+{% endhighlight %}

Review Comment:
   Fixed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a diff in pull request #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40324:
URL: https://github.com/apache/spark/pull/40324#discussion_r1130622567


##########
docs/index.md:
##########
@@ -49,8 +49,19 @@ For Java 11, `-Dio.netty.tryReflectionSetAccessible=true` is required additional
 
 # Running the Examples and Shell
 
-Spark comes with several sample programs.  Scala, Java, Python and R examples are in the
-`examples/src/main` directory. To run one of the Java or Scala sample programs, use
+Spark comes with several sample programs. Python, Scala, Java and R examples are in the
+`examples/src/main` directory.
+
+To run Spark interactively in a Python interpreter, use
+`bin/pyspark`:
+
+    ./bin/pyspark --master local[2]

Review Comment:
   Sounds good!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40324: [SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40324:
URL: https://github.com/apache/spark/pull/40324#discussion_r1136451520


##########
docs/spark-connect-overview.md:
##########
@@ -0,0 +1,259 @@
+---
+layout: global
+title: Spark Connect Overview
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+     http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+**Building client-side Spark applications**
+
+In Apache Spark 3.4, Spark Connect introduced a decoupled client-server
+architecture that allows remote connectivity to Spark clusters using the
+DataFrame API and unresolved logical plans as the protocol. The separation
+between client and server allows Spark and its open ecosystem to be
+leveraged from everywhere. It can be embedded in modern data applications,
+in IDEs, Notebooks and programming languages.
+
+To get started, see [Quickstart: Spark Connect](api/python/getting_started/quickstart_connect.html).
+
+<p style="text-align: center;">
+  <img src="img/spark-connect-api.png" title="Spark Connect API" alt="Spark Connect API Diagram" />
+</p>
+
+# How Spark Connect works
+
+The Spark Connect client library is designed to simplify Spark application
+development. It is a thin API that can be embedded everywhere: in application
+servers, IDEs, notebooks, and programming languages. The Spark Connect API
+builds on Spark's DataFrame API using unresolved logical plans as a
+language-agnostic protocol between the client and the Spark driver.
+
+The Spark Connect client translates DataFrame operations into unresolved
+logical query plans which are encoded using protocol buffers. These are sent
+to the server using the gRPC framework.
+
+The Spark Connect endpoint embedded on the Spark Server, receives and
+translates unresolved logical plans into Spark's logical plan operators.
+This is similar to parsing a SQL query, where attributes and relations are
+parsed and an initial parse plan is built. From there, the standard Spark
+execution process kicks in, ensuring that Spark Connect leverages all of
+Spark's optimizations and enhancements. Results are streamed back to the
+client via gRPC as Apache Arrow-encoded row batches.
+
+<p style="text-align: center;">
+  <img src="img/spark-connect-communication.png" title="Spark Connect communication" alt="Spark Connect communication" />
+</p>
+
+# Operational benefits of Spark Connect
+
+With this new architecture, Spark Connect mitigates several multi-tenant
+operational issues:
+
+**Stability**: Applications that use too much memory will now only impact their
+own environment as they can run in their own processes. Users can define their
+own dependencies on the client and don't need to worry about potential conflicts
+with the Spark driver.
+
+**Upgradability**: The Spark driver can now seamlessly be upgraded independently
+of applications, e.g. to benefit from performance improvements and security fixes.
+This means applications can be forward-compatible, as long as the server-side RPC
+definitions are designed to be backwards compatible.
+
+**Debuggability and Observability**: Spark Connect enables interactive debugging
+during development directly from your favorite IDE. Similarly, applications can
+be monitored using the application's framework native metrics and logging libraries.
+
+# How to use Spark Connect
+
+Starting with Spark 3.4, Spark Connect is available and supports PySpark and Scala
+applications. We will walk through how to run an Apache Spark server with Spark
+Connect and connect to it from a client application using the Spark Connect client
+library.
+
+## Download and start Spark server with Spark Connect
+
+First, download Spark from the
+[Download Apache Spark](https://spark.apache.org/downloads.html) page. Spark Connect
+was introduced in Apache Spark version 3.4 so make sure you choose 3.4.0 or newer in
+the release drop down at the top of the page. Then choose your package type, typically
+“Pre-built for Apache Hadoop 3.3 and later”, and click the link to download.
+
+Now extract the Spark package you just downloaded on your computer, for example:
+
+{% highlight bash %}
+tar -xvf spark-3.4.0-bin-hadoop3.tgz
+{% endhighlight %}
+
+In a terminal window, go to the `spark` folder in the location where you extracted
+Spark before and run the `start-connect-server.sh` script to start Spark server with
+Spark Connect, like in this example:
+
+{% highlight bash %}
+./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.0
+{% endhighlight %}
+
+Note that we include a Spark Connect package (`spark-connect_2.12:3.4.0`), when starting
+Spark server. This is required to use Spark Connect. Make sure to use the same version
+of the package as the Spark version you downloaded above. In the example here, Spark 3.4.0
+with Scala 2.12.
+
+Now Spark server is running and ready to accept Spark Connect sessions from client
+applications. In the next section we will walk through how to use Spark Connect
+when writing client applications.
+
+## Use Spark Connect in client applications
+
+When creating a Spark session, you can specify that you want to use Spark Connect
+and there are a few ways to do that as outlined below.
+
+If you do not use one of the mechanisms outlined here, your Spark session will
+work just like before, without leveraging Spark Connect, and your application code
+will run on the Spark driver node.
+
+### Set SPARK_REMOTE environment variable
+
+If you set the `SPARK_REMOTE` environment variable on the client machine where your
+Spark client application is running and create a new Spark Session as illustrated
+below, the session will be a Spark Connect session. With this approach, there is
+no code change needed to start using Spark Connect.
+
+In a terminal window, set the `SPARK_REMOTE` environment variable to point to the
+local Spark server you started on your computer above:
+
+{% highlight bash %}
+export SPARK_REMOTE="sc://localhost"
+{% endhighlight %}
+
+And start the Spark shell as usual:
+
+<div class="codetabs">
+
+<div data-lang="python"  markdown="1">
+{% highlight bash %}
+./bin/pyspark
+{% endhighlight %}
+
+The PySpark shell is now connected to Spark using Spark Connect as indicated in the welcome
+message.
+</div>
+
+</div>
+
+And if you write your own program, create a Spark session as shown in this example:
+
+<div class="codetabs">
+
+<div data-lang="python"  markdown="1">
+{% highlight python %}
+from pyspark.sql import SparkSession
+spark = SparkSession.builder.getOrCreate()
+{% endhighlight %}
+</div>
+
+</div>
+
+Which will create a Spark Connect session from your application by reading the
+`SPARK_REMOTE` environment variable we set above.
+
+### Specify Spark Connect when creating Spark session
+
+You can also specify that you want to use Spark Connect explicitly when you
+create a Spark session.
+
+For example, you can launch the PySpark shell with Spark Connect as
+illustrated here.
+
+<div class="codetabs">
+
+<div data-lang="python"  markdown="1">
+To launch the PySpark shell with Spark Connect, simply include the `remote`
+parameter and specify the location of your Spark server. We are using `localhost`
+in this example to connect to the local Spark server we started above.
+
+{% highlight bash %}
+./bin/pyspark --remote "sc://localhost"
+{% endhighlight %}
+
+And you will notice that the PySpark shell welcome message tells you that
+you have connected to Spark using Spark Connect.
+
+Now you can run PySpark code in the shell to see Spark Connect in action:
+
+{% highlight python %}
+>>> columns = ["id","name"]
+>>> data = [(1,"Sarah"),(2,"Maria")]
+>>> df = spark.createDataFrame(data).toDF(*columns)
+>>> df.show()
++---+-----+
+| id| name|
++---+-----+
+|  1|Sarah|
+|  2|Maria|
++---+-----+
+
+>>>

Review Comment:
   nit but I would remove this :-). Please make a followup PR if you find some time.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] allanf-db commented on a diff in pull request #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

Posted by "allanf-db (via GitHub)" <gi...@apache.org>.
allanf-db commented on code in PR #40324:
URL: https://github.com/apache/spark/pull/40324#discussion_r1130508482


##########
docs/index.md:
##########
@@ -49,8 +49,19 @@ For Java 11, `-Dio.netty.tryReflectionSetAccessible=true` is required additional
 
 # Running the Examples and Shell
 
-Spark comes with several sample programs.  Scala, Java, Python and R examples are in the
-`examples/src/main` directory. To run one of the Java or Scala sample programs, use
+Spark comes with several sample programs. Python, Scala, Java and R examples are in the
+`examples/src/main` directory.
+
+To run Spark interactively in a Python interpreter, use
+`bin/pyspark`:
+
+    ./bin/pyspark --master local[2]

Review Comment:
   I tried on my machine and it fails for me as well so I removed "--master local[2]" from these examples and then they work for me.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] allanf-db commented on a diff in pull request #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

Posted by "allanf-db (via GitHub)" <gi...@apache.org>.
allanf-db commented on code in PR #40324:
URL: https://github.com/apache/spark/pull/40324#discussion_r1130275977


##########
docs/spark-connect-overview.md:
##########
@@ -0,0 +1,108 @@
+---
+layout: global
+title: Spark Connect Overview - Building client-side Spark applications
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+     http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+In Apache Spark 3.4, Spark Connect introduced a decoupled client-server architecture that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol. The separation between client and server allows Spark and its open ecosystem to be leveraged from everywhere. It can be embedded in modern data applications, in IDEs, Notebooks and programming languages.
+
+<p style="text-align: center;">
+  <img src="img/spark-connect-api.png" title="Spark Connect API" alt="Spark Connect API Diagram" />
+</p>
+
+# How Spark Connect Works
+
+The Spark Connect client library is designed to simplify Spark application development. It is a thin API that can be embedded everywhere: in application servers, IDEs, notebooks, and programming languages. The Spark Connect API builds on Spark's DataFrame API using unresolved logical plans as a language-agnostic protocol between the client and the Spark driver.
+
+The Spark Connect client translates DataFrame operations into unresolved logical query plans which are encoded using protocol buffers. These are sent to the server using the gRPC framework.
+
+The Spark Connect endpoint embedded on the Spark Server, receives and translates unresolved logical plans into Spark's logical plan operators. This is similar to parsing a SQL query, where attributes and relations are parsed and an initial parse plan is built. From there, the standard Spark execution process kicks in, ensuring that Spark Connect leverages all of Spark's optimizations and enhancements. Results are streamed back to the client via gRPC as Apache Arrow-encoded row batches.
+
+<p style="text-align: center;">
+  <img src="img/spark-connect-communication.png" title="Spark Connect communication" alt="Spark Connect communication" />
+</p>
+
+# Operational Benefits of Spark Connect
+
+With this new architecture, Spark Connect mitigates several operational issues:
+
+**Stability**: Applications that use too much memory will now only impact their own environment as they can run in their own processes. Users can define their own dependencies on the client and don't need to worry about potential conflicts with the Spark driver.
+
+**Upgradability**: The Spark driver can now seamlessly be upgraded independently of applications, e.g. to benefit from performance improvements and security fixes. This means applications can be forward-compatible, as long as the server-side RPC definitions are designed to be backwards compatible.
+
+**Debuggability and Observability**: Spark Connect enables interactive debugging during development directly from your favorite IDE. Similarly, applications can be monitored using the application's framework native metrics and logging libraries.
+
+# How to use Spark Connect
+
+Starting with Spark 3.4, Spark Connect is available and supports PySpark applications. When creating a Spark session, you can specify that you want to use Spark Connect and there are a few ways to do that as outlined below.
+
+If you do not use one of the mechanisms outlined below, your Spark session will work just like before, without leveraging Spark Connect, and your application code will run on the Spark driver node.
+
+## Set SPARK_REMOTE environment variable
+
+If you set the SPARK_REMOTE environment variable on the client machine where your Spark client application is running and create a new Spark Session as illustrated below, the session will be a Spark Connect session. With this approach, there is no code change needed to start using Spark Connect.
+
+Set SPARK_REMOTE environment variable:
+
+{% highlight bash %}
+    export SPARK_REMOTE="sc://localhost/"
+{% endhighlight %}

Review Comment:
   Good point, fixing this



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] allanf-db commented on a diff in pull request #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

Posted by "allanf-db (via GitHub)" <gi...@apache.org>.
allanf-db commented on code in PR #40324:
URL: https://github.com/apache/spark/pull/40324#discussion_r1130361509


##########
docs/index.md:
##########
@@ -49,8 +49,19 @@ For Java 11, `-Dio.netty.tryReflectionSetAccessible=true` is required additional
 
 # Running the Examples and Shell
 
-Spark comes with several sample programs.  Scala, Java, Python and R examples are in the
-`examples/src/main` directory. To run one of the Java or Scala sample programs, use
+Spark comes with several sample programs. Python, Scala, Java and R examples are in the
+`examples/src/main` directory.
+
+To run Spark interactively in a Python interpreter, use
+`bin/pyspark`:
+
+    ./bin/pyspark --master local[2]

Review Comment:
   Yeah, I did not update the existing code samples on this page (yet)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40324:
URL: https://github.com/apache/spark/pull/40324#discussion_r1131838714


##########
docs/spark-connect-overview.md:
##########
@@ -0,0 +1,244 @@
+---
+layout: global
+title: Spark Connect Overview
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+     http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+**Building client-side Spark applications**
+
+In Apache Spark 3.4, Spark Connect introduced a decoupled client-server
+architecture that allows remote connectivity to Spark clusters using the
+DataFrame API and unresolved logical plans as the protocol. The separation
+between client and server allows Spark and its open ecosystem to be
+leveraged from everywhere. It can be embedded in modern data applications,
+in IDEs, Notebooks and programming languages.
+
+To get started, see [Quickstart: Spark Connect](api/python/getting_started/quickstart_connect.html).
+
+<p style="text-align: center;">
+  <img src="img/spark-connect-api.png" title="Spark Connect API" alt="Spark Connect API Diagram" />
+</p>
+
+# How Spark Connect works
+
+The Spark Connect client library is designed to simplify Spark application
+development. It is a thin API that can be embedded everywhere: in application
+servers, IDEs, notebooks, and programming languages. The Spark Connect API
+builds on Spark's DataFrame API using unresolved logical plans as a
+language-agnostic protocol between the client and the Spark driver.
+
+The Spark Connect client translates DataFrame operations into unresolved
+logical query plans which are encoded using protocol buffers. These are sent
+to the server using the gRPC framework.
+
+The Spark Connect endpoint embedded on the Spark Server, receives and
+translates unresolved logical plans into Spark's logical plan operators.
+This is similar to parsing a SQL query, where attributes and relations are
+parsed and an initial parse plan is built. From there, the standard Spark
+execution process kicks in, ensuring that Spark Connect leverages all of
+Spark's optimizations and enhancements. Results are streamed back to the
+client via gRPC as Apache Arrow-encoded row batches.
+
+<p style="text-align: center;">
+  <img src="img/spark-connect-communication.png" title="Spark Connect communication" alt="Spark Connect communication" />
+</p>
+
+# Operational benefits of Spark Connect
+
+With this new architecture, Spark Connect mitigates several operational issues:
+
+**Stability**: Applications that use too much memory will now only impact their
+own environment as they can run in their own processes. Users can define their
+own dependencies on the client and don't need to worry about potential conflicts
+with the Spark driver.
+
+**Upgradability**: The Spark driver can now seamlessly be upgraded independently
+of applications, e.g. to benefit from performance improvements and security fixes.
+This means applications can be forward-compatible, as long as the server-side RPC
+definitions are designed to be backwards compatible.
+
+**Debuggability and Observability**: Spark Connect enables interactive debugging
+during development directly from your favorite IDE. Similarly, applications can
+be monitored using the application's framework native metrics and logging libraries.
+
+# How to use Spark Connect
+
+Starting with Spark 3.4, Spark Connect is available and supports PySpark and Scala
+applications. we will walk through how to run an Apache Spark server with Spark
+Connect and connect to it from a client application using the Spark Connect client
+library.
+
+## Download and start Spark server with Spark Connect
+
+First, download Spark from the
+[Download Apache Spark](https://spark.apache.org/downloads.html) page. Spark Connect
+was introduced in Apache Spark version 3.4 so make sure you choose 3.4.0 or newer in
+the release drop down at the top of the page. Then choose your package type, typically
+“Pre-built for Apache Hadoop 3.3 and later”, and click the link to download.
+
+Now extract the Spark package you just downloaded on your computer, for example:
+
+{% highlight bash %}
+tar -xvf spark-3.4.0-bin-hadoop3.tgz
+{% endhighlight %}
+
+In a terminal window, now go to the `spark` folder in the location where you extracted
+Spark before and run the `start-connect-server.sh` script to start Spark server with
+Spark Connect, like in this example:
+
+{% highlight bash %}
+./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.0
+{% endhighlight %}
+
+Note that we include a Spark Connect package (`spark-connect_2.12:3.4.0`), when starting
+Spark server. This is required to use Spark Connect. Make sure to use the same version
+of the package as the Spark version you downloaded above. In the example here, Spark 3.4.0
+with Scala 2.12.
+
+Now Spark server is running and ready to accept Spark Connect sessions from client
+applications. In the next section we will walk through how to use Spark Connect
+when writing client applications.
+
+## Use Spark Connect in client applications
+
+When creating a Spark session, you can specify that you want to use Spark Connect
+and there are a few ways to do that as outlined below.
+
+If you do not use one of the mechanisms outlined here, your Spark session will
+work just like before, without leveraging Spark Connect, and your application code
+will run on the Spark driver node.
+
+### Set SPARK_REMOTE environment variable
+
+If you set the `SPARK_REMOTE` environment variable on the client machine where your
+Spark client application is running and create a new Spark Session as illustrated
+below, the session will be a Spark Connect session. With this approach, there is
+no code change needed to start using Spark Connect.
+
+Set the `SPARK_REMOTE` environment variable to point to the Spark server we started
+above:
+
+{% highlight bash %}
+export SPARK_REMOTE="sc://localhost/"

Review Comment:
   nit but I would remove `/` in the end at `sc://localhost/"` -> `sc://localhost"`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a diff in pull request #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40324:
URL: https://github.com/apache/spark/pull/40324#discussion_r1129486098


##########
docs/spark-connect-overview.md:
##########
@@ -0,0 +1,108 @@
+---
+layout: global
+title: Spark Connect Overview - Building client-side Spark applications
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+     http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+In Apache Spark 3.4, Spark Connect introduced a decoupled client-server architecture that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol. The separation between client and server allows Spark and its open ecosystem to be leveraged from everywhere. It can be embedded in modern data applications, in IDEs, Notebooks and programming languages.
+
+<p style="text-align: center;">
+  <img src="img/spark-connect-api.png" title="Spark Connect API" alt="Spark Connect API Diagram" />
+</p>
+
+# How Spark Connect Works
+
+The Spark Connect client library is designed to simplify Spark application development. It is a thin API that can be embedded everywhere: in application servers, IDEs, notebooks, and programming languages. The Spark Connect API builds on Spark's DataFrame API using unresolved logical plans as a language-agnostic protocol between the client and the Spark driver.
+
+The Spark Connect client translates DataFrame operations into unresolved logical query plans which are encoded using protocol buffers. These are sent to the server using the gRPC framework.
+
+The Spark Connect endpoint embedded on the Spark Server, receives and translates unresolved logical plans into Spark's logical plan operators. This is similar to parsing a SQL query, where attributes and relations are parsed and an initial parse plan is built. From there, the standard Spark execution process kicks in, ensuring that Spark Connect leverages all of Spark's optimizations and enhancements. Results are streamed back to the client via gRPC as Apache Arrow-encoded row batches.
+
+<p style="text-align: center;">
+  <img src="img/spark-connect-communication.png" title="Spark Connect communication" alt="Spark Connect communication" />
+</p>
+
+# Operational Benefits of Spark Connect
+
+With this new architecture, Spark Connect mitigates several operational issues:
+
+**Stability**: Applications that use too much memory will now only impact their own environment as they can run in their own processes. Users can define their own dependencies on the client and don't need to worry about potential conflicts with the Spark driver.
+
+**Upgradability**: The Spark driver can now seamlessly be upgraded independently of applications, e.g. to benefit from performance improvements and security fixes. This means applications can be forward-compatible, as long as the server-side RPC definitions are designed to be backwards compatible.
+
+**Debuggability and Observability**: Spark Connect enables interactive debugging during development directly from your favorite IDE. Similarly, applications can be monitored using the application's framework native metrics and logging libraries.
+
+# How to use Spark Connect
+
+Starting with Spark 3.4, Spark Connect is available and supports PySpark applications. When creating a Spark session, you can specify that you want to use Spark Connect and there are a few ways to do that as outlined below.
+
+If you do not use one of the mechanisms outlined below, your Spark session will work just like before, without leveraging Spark Connect, and your application code will run on the Spark driver node.
+
+## Set SPARK_REMOTE environment variable
+
+If you set the SPARK_REMOTE environment variable on the client machine where your Spark client application is running and create a new Spark Session as illustrated below, the session will be a Spark Connect session. With this approach, there is no code change needed to start using Spark Connect.
+
+Set SPARK_REMOTE environment variable:
+
+{% highlight bash %}
+    export SPARK_REMOTE="sc://localhost/"
+{% endhighlight %}
+
+Start the PySpark CLI, for example:
+
+{% highlight bash %}
+    ./bin/pyspark
+{% endhighlight %}
+
+And notice that the PySpark CLI is now connected to Spark using Spark Connect as indicated in the welcome message: “Client connected to the Spark Connect server at...”.
+
+And if you write your own Python program, create a Spark Session as shown in this example:
+
+{% highlight python %}
+    from pyspark.sql import SparkSession
+    spark = SparkSession.builder.getOrCreate()
+{% endhighlight %}
+
+Which will create a Spark Connect session by reading the SPARK_REMOTE environment variable we set above.
+
+## Specify Spark Connect when creating Spark session
+
+You can also specify that you want to use Spark Connect when you create a Spark session explicitly.
+
+For example, when launching the PySpark CLI, simply include the remote parameter as illustrated here:
+
+{% highlight bash %}
+    ./bin/pyspark --remote "sc://localhost"
+{% endhighlight %}
+
+And again you will notice that the PySpark welcome message tells you that you are connected to Spark using Spark Connect.
+
+Or, in your code, include the remote function with a reference to your Spark server when you create a Spark session, as in this example:
+
+{% highlight python %}
+    from pyspark.sql import SparkSession
+    spark = SparkSession.builder.remote("sc://localhost/").getOrCreate()
+{% endhighlight %}
+
+# Client application authentication
+
+While Spark Connect does not have built-in authentication, it is designed to work seamlessly with your existing authentication infrastructure. Its gRPC HTTP/2 interface allows for the use of authenticating proxies, which makes it possible to secure Spark Connect without having to implement authentication logic in Spark directly.
+
+# What is included in Spark 3.4
+
+In Spark 3.4, Spark Connect provides DataFrame API coverage for PySpark and 
+Spark Connect clients for other languages are planned for the future.

Review Comment:
   I believe it is important to inform users in advance that while Spark Connect supports many of the key features of PySpark, there are also some features that are not supported.
   
   For example, we can write something like this:
   
   In Spark 3.4, Spark Connect supports most of the key APIs of PySpark, including DataFrame, Functions, and Column. However, some APIs such as SparkContext and RDD are not supported yet. The support for specific PySpark APIs in Spark Connect is indicated in all PySpark API references with phrases like “Support Spark Connect”. Therefore, it is recommended to check in advance whether the APIs you are using are supported by Spark Connect before migrating existing code. Support for Spark Connect clients in other languages is planned for the future.
   
   the below is including the link as well just for your convenience (when you built the docs, the link would work):
   
   ```md
   In Spark 3.4, Spark Connect supports most of the key APIs of PySpark, including [DataFrame](api/python/reference/pyspark.sql/dataframe.html), [Functions](api/python/reference/pyspark.sql/functions.html), and [Column](api/python/reference/pyspark.sql/column.html). However, some APIs such as [SparkContext](api/python/reference/api/pyspark.SparkContext.html#pyspark.SparkContext) and [RDD](api/python/reference/api/pyspark.RDD.html#pyspark.RDD) are not supported yet. The support for specific PySpark APIs in Spark Connect is indicated in all PySpark API references with phrases like "Support Spark Connect". Therefore, it is recommended to check in advance whether the APIs you are using are supported by Spark Connect before migrating existing code. Support for Spark Connect clients in other languages is planned for the future.
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a diff in pull request #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40324:
URL: https://github.com/apache/spark/pull/40324#discussion_r1129421316


##########
docs/index.md:
##########
@@ -86,6 +88,15 @@ Example applications are also provided in R. For example,
 
     ./bin/spark-submit examples/src/main/r/dataframe.R
 
+## Running Spark Client Applications Anywhere with Spark Connect

Review Comment:
   nit: maybe new line after the title?



##########
docs/index.md:
##########
@@ -49,8 +49,19 @@ For Java 11, `-Dio.netty.tryReflectionSetAccessible=true` is required additional
 
 # Running the Examples and Shell
 
-Spark comes with several sample programs.  Scala, Java, Python and R examples are in the
-`examples/src/main` directory. To run one of the Java or Scala sample programs, use
+Spark comes with several sample programs. Python, Scala, Java and R examples are in the
+`examples/src/main` directory.
+
+To run Spark interactively in a Python interpreter, use
+`bin/pyspark`:
+
+    ./bin/pyspark --master local[2]

Review Comment:
   Seems like it's not working in my local workspace:
   ```shell
   haejoon.lee spark % ./bin/pyspark --master local[2]
   zsh: no matches found: local[2]
   ```
   Maybe do we need some more context for this example or do we just say `./bin/pyspark` ?
   
   I recognize that it's not added from this PR, though.



##########
docs/spark-connect-overview.md:
##########
@@ -0,0 +1,108 @@
+---
+layout: global
+title: Spark Connect Overview - Building client-side Spark applications
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+     http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+In Apache Spark 3.4, Spark Connect introduced a decoupled client-server architecture that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol. The separation between client and server allows Spark and its open ecosystem to be leveraged from everywhere. It can be embedded in modern data applications, in IDEs, Notebooks and programming languages.
+
+<p style="text-align: center;">
+  <img src="img/spark-connect-api.png" title="Spark Connect API" alt="Spark Connect API Diagram" />
+</p>
+
+# How Spark Connect Works
+
+The Spark Connect client library is designed to simplify Spark application development. It is a thin API that can be embedded everywhere: in application servers, IDEs, notebooks, and programming languages. The Spark Connect API builds on Spark's DataFrame API using unresolved logical plans as a language-agnostic protocol between the client and the Spark driver.
+
+The Spark Connect client translates DataFrame operations into unresolved logical query plans which are encoded using protocol buffers. These are sent to the server using the gRPC framework.
+
+The Spark Connect endpoint embedded on the Spark Server, receives and translates unresolved logical plans into Spark's logical plan operators. This is similar to parsing a SQL query, where attributes and relations are parsed and an initial parse plan is built. From there, the standard Spark execution process kicks in, ensuring that Spark Connect leverages all of Spark's optimizations and enhancements. Results are streamed back to the client via gRPC as Apache Arrow-encoded row batches.
+
+<p style="text-align: center;">
+  <img src="img/spark-connect-communication.png" title="Spark Connect communication" alt="Spark Connect communication" />
+</p>
+
+# Operational Benefits of Spark Connect
+
+With this new architecture, Spark Connect mitigates several operational issues:
+
+**Stability**: Applications that use too much memory will now only impact their own environment as they can run in their own processes. Users can define their own dependencies on the client and don't need to worry about potential conflicts with the Spark driver.
+
+**Upgradability**: The Spark driver can now seamlessly be upgraded independently of applications, e.g. to benefit from performance improvements and security fixes. This means applications can be forward-compatible, as long as the server-side RPC definitions are designed to be backwards compatible.
+
+**Debuggability and Observability**: Spark Connect enables interactive debugging during development directly from your favorite IDE. Similarly, applications can be monitored using the application's framework native metrics and logging libraries.
+
+# How to use Spark Connect
+
+Starting with Spark 3.4, Spark Connect is available and supports PySpark applications. When creating a Spark session, you can specify that you want to use Spark Connect and there are a few ways to do that as outlined below.
+
+If you do not use one of the mechanisms outlined below, your Spark session will work just like before, without leveraging Spark Connect, and your application code will run on the Spark driver node.
+
+## Set SPARK_REMOTE environment variable
+
+If you set the SPARK_REMOTE environment variable on the client machine where your Spark client application is running and create a new Spark Session as illustrated below, the session will be a Spark Connect session. With this approach, there is no code change needed to start using Spark Connect.
+
+Set SPARK_REMOTE environment variable:
+
+{% highlight bash %}
+    export SPARK_REMOTE="sc://localhost/"
+{% endhighlight %}

Review Comment:
   Not a very big deal, but maybe can we remove the leading space? It displays a bit awkward in the document as below:
   
   <img width="401" alt="Screen Shot 2023-03-08 at 10 29 40 PM" src="https://user-images.githubusercontent.com/44108233/223725668-ce14824d-9420-4182-b34a-13c02d8c9da6.png">
   
   Actually, more important is that it may not run properly if we simply copy and paste it.
   
   The following examples as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #40324: [SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #40324:
URL: https://github.com/apache/spark/pull/40324#issuecomment-1467450414

   Merged to master and branch-3.4.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon closed pull request #40324: [SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #40324: [SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation
URL: https://github.com/apache/spark/pull/40324


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] allanf-db commented on a diff in pull request #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

Posted by "allanf-db (via GitHub)" <gi...@apache.org>.
allanf-db commented on code in PR #40324:
URL: https://github.com/apache/spark/pull/40324#discussion_r1130361771


##########
docs/spark-connect-overview.md:
##########
@@ -0,0 +1,108 @@
+---
+layout: global
+title: Spark Connect Overview - Building client-side Spark applications
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+     http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+In Apache Spark 3.4, Spark Connect introduced a decoupled client-server architecture that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol. The separation between client and server allows Spark and its open ecosystem to be leveraged from everywhere. It can be embedded in modern data applications, in IDEs, Notebooks and programming languages.
+
+<p style="text-align: center;">
+  <img src="img/spark-connect-api.png" title="Spark Connect API" alt="Spark Connect API Diagram" />
+</p>
+
+# How Spark Connect Works
+
+The Spark Connect client library is designed to simplify Spark application development. It is a thin API that can be embedded everywhere: in application servers, IDEs, notebooks, and programming languages. The Spark Connect API builds on Spark's DataFrame API using unresolved logical plans as a language-agnostic protocol between the client and the Spark driver.
+
+The Spark Connect client translates DataFrame operations into unresolved logical query plans which are encoded using protocol buffers. These are sent to the server using the gRPC framework.
+
+The Spark Connect endpoint embedded on the Spark Server, receives and translates unresolved logical plans into Spark's logical plan operators. This is similar to parsing a SQL query, where attributes and relations are parsed and an initial parse plan is built. From there, the standard Spark execution process kicks in, ensuring that Spark Connect leverages all of Spark's optimizations and enhancements. Results are streamed back to the client via gRPC as Apache Arrow-encoded row batches.
+
+<p style="text-align: center;">
+  <img src="img/spark-connect-communication.png" title="Spark Connect communication" alt="Spark Connect communication" />
+</p>
+
+# Operational Benefits of Spark Connect
+
+With this new architecture, Spark Connect mitigates several operational issues:
+
+**Stability**: Applications that use too much memory will now only impact their own environment as they can run in their own processes. Users can define their own dependencies on the client and don't need to worry about potential conflicts with the Spark driver.
+
+**Upgradability**: The Spark driver can now seamlessly be upgraded independently of applications, e.g. to benefit from performance improvements and security fixes. This means applications can be forward-compatible, as long as the server-side RPC definitions are designed to be backwards compatible.
+
+**Debuggability and Observability**: Spark Connect enables interactive debugging during development directly from your favorite IDE. Similarly, applications can be monitored using the application's framework native metrics and logging libraries.
+
+# How to use Spark Connect
+
+Starting with Spark 3.4, Spark Connect is available and supports PySpark applications. When creating a Spark session, you can specify that you want to use Spark Connect and there are a few ways to do that as outlined below.
+
+If you do not use one of the mechanisms outlined below, your Spark session will work just like before, without leveraging Spark Connect, and your application code will run on the Spark driver node.
+
+## Set SPARK_REMOTE environment variable
+
+If you set the SPARK_REMOTE environment variable on the client machine where your Spark client application is running and create a new Spark Session as illustrated below, the session will be a Spark Connect session. With this approach, there is no code change needed to start using Spark Connect.
+
+Set SPARK_REMOTE environment variable:
+
+{% highlight bash %}
+    export SPARK_REMOTE="sc://localhost/"
+{% endhighlight %}
+
+Start the PySpark CLI, for example:
+
+{% highlight bash %}
+    ./bin/pyspark
+{% endhighlight %}
+
+And notice that the PySpark CLI is now connected to Spark using Spark Connect as indicated in the welcome message: “Client connected to the Spark Connect server at...”.
+
+And if you write your own Python program, create a Spark Session as shown in this example:
+
+{% highlight python %}
+    from pyspark.sql import SparkSession
+    spark = SparkSession.builder.getOrCreate()
+{% endhighlight %}
+
+Which will create a Spark Connect session by reading the SPARK_REMOTE environment variable we set above.
+
+## Specify Spark Connect when creating Spark session
+
+You can also specify that you want to use Spark Connect when you create a Spark session explicitly.
+
+For example, when launching the PySpark CLI, simply include the remote parameter as illustrated here:
+
+{% highlight bash %}
+    ./bin/pyspark --remote "sc://localhost"
+{% endhighlight %}
+
+And again you will notice that the PySpark welcome message tells you that you are connected to Spark using Spark Connect.
+
+Or, in your code, include the remote function with a reference to your Spark server when you create a Spark session, as in this example:
+
+{% highlight python %}
+    from pyspark.sql import SparkSession
+    spark = SparkSession.builder.remote("sc://localhost/").getOrCreate()
+{% endhighlight %}
+
+# Client application authentication
+
+While Spark Connect does not have built-in authentication, it is designed to work seamlessly with your existing authentication infrastructure. Its gRPC HTTP/2 interface allows for the use of authenticating proxies, which makes it possible to secure Spark Connect without having to implement authentication logic in Spark directly.
+
+# What is included in Spark 3.4
+
+In Spark 3.4, Spark Connect provides DataFrame API coverage for PySpark and 
+Spark Connect clients for other languages are planned for the future.

Review Comment:
   I added this content, thanks



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a diff in pull request #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40324:
URL: https://github.com/apache/spark/pull/40324#discussion_r1130421548


##########
docs/index.md:
##########
@@ -49,8 +49,19 @@ For Java 11, `-Dio.netty.tryReflectionSetAccessible=true` is required additional
 
 # Running the Examples and Shell
 
-Spark comes with several sample programs.  Scala, Java, Python and R examples are in the
-`examples/src/main` directory. To run one of the Java or Scala sample programs, use
+Spark comes with several sample programs. Python, Scala, Java and R examples are in the
+`examples/src/main` directory.
+
+To run Spark interactively in a Python interpreter, use
+`bin/pyspark`:
+
+    ./bin/pyspark --master local[2]

Review Comment:
   Yeah, I think we can just update this to `./bin/pyspark --master local` or just simply `./bin/pyspark` while we're here :-)
   
   Both way looks fine to me if it's working properly (I checked both work fine in my local env)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] allanf-db commented on a diff in pull request #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

Posted by "allanf-db (via GitHub)" <gi...@apache.org>.
allanf-db commented on code in PR #40324:
URL: https://github.com/apache/spark/pull/40324#discussion_r1130279011


##########
docs/index.md:
##########
@@ -86,6 +88,15 @@ Example applications are also provided in R. For example,
 
     ./bin/spark-submit examples/src/main/r/dataframe.R
 
+## Running Spark Client Applications Anywhere with Spark Connect

Review Comment:
   Fixing, thanks



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] allanf-db commented on a diff in pull request #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

Posted by "allanf-db (via GitHub)" <gi...@apache.org>.
allanf-db commented on code in PR #40324:
URL: https://github.com/apache/spark/pull/40324#discussion_r1130437989


##########
docs/index.md:
##########
@@ -86,6 +88,15 @@ Example applications are also provided in R. For example,
 
     ./bin/spark-submit examples/src/main/r/dataframe.R
 
+## Running Spark Client Applications Anywhere with Spark Connect

Review Comment:
   Fixed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a diff in pull request #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40324:
URL: https://github.com/apache/spark/pull/40324#discussion_r1130418343


##########
docs/index.md:
##########
@@ -86,6 +88,16 @@ Example applications are also provided in R. For example,
 
     ./bin/spark-submit examples/src/main/r/dataframe.R
 
+## Running Spark Client Applications Anywhere with Spark Connect
+
+Spark Connect is a new client-server architecture introduced in Spark 3.4 that decouples Spark
+client applications and allows remote connectivity to Spark clusters. The separation between
+client and server allows Spark and its open ecosystem to be leveraged from anywhere, embedded
+in any application. In Spark 3.4, Spark Connect provides DataFrame API coverage for PySpark.
+Spark Connect clients for other languages are planned for the future.
+
+To learn more about Spark Connect and how to use it, visit [Spark Connect Overview](spark-connect-overview.html).

Review Comment:
   qq: It seems that the only way to access the Spark Connect Overview is through the link provided at this point.
   Do you think that's enough, or maybe should we also place the link in a more prominent location to expose it more actively to the users?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org