Posted to commits@kyuubi.apache.org by ch...@apache.org on 2022/09/24 15:37:56 UTC

[incubator-kyuubi] branch branch-1.6 updated: [KYUUBI #3406] [Subtask] [Doc] Add PySpark client docs

This is an automated email from the ASF dual-hosted git repository.

chengpan pushed a commit to branch branch-1.6
in repository https://gitbox.apache.org/repos/asf/incubator-kyuubi.git


The following commit(s) were added to refs/heads/branch-1.6 by this push:
     new 6d437a0c5 [KYUUBI #3406] [Subtask] [Doc] Add PySpark client docs
6d437a0c5 is described below

commit 6d437a0c5ce43b8dd88540c06413959a44b3c67a
Author: Bowen Liang <li...@gf.com.cn>
AuthorDate: Sat Sep 24 23:37:30 2022 +0800

    [KYUUBI #3406] [Subtask] [Doc] Add PySpark client docs
    
    ### _Why are the changes needed?_
    
    close #3406.
    
    Add PySpark client docs.
    
    ### _How was this patch tested?_
    - [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible
    
    - [ ] Add screenshots for manual tests if appropriate
    
    - [ ] [Run test](https://kyuubi.apache.org/docs/latest/develop_tools/testing.html#running-tests) locally before making a pull request
    
    Closes #3407 from bowenliang123/3406-pyspark-docs.
    
    Closes #3406
    
    a181a5ba [Bowen Liang] nit
    fb0cfcdf [Bowen Liang] nit
    378ca025 [Bowen Liang] nit
    70b007a8 [Bowen Liang] update docs of including jdbc jars
    170a3b4b [liangbowen] nit
    ebb56e14 [liangbowen] add pyspark link to python page
    76bef457 [liangbowen] add start shell docs for adding jars
    65d8dbf9 [liangbowen] add docs
    c55d3ae2 [Bowen Liang] update jdbc usage sample
    692197e0 [Bowen Liang] init pyspark client docs
    
    Lead-authored-by: Bowen Liang <li...@gf.com.cn>
    Co-authored-by: liangbowen <li...@gf.com.cn>
    Signed-off-by: Cheng Pan <ch...@apache.org>
    (cherry picked from commit f1c49bb75c36079efd7feb0a61f284069bbab977)
    Signed-off-by: Cheng Pan <ch...@apache.org>
---
 docs/client/python/index.rst  |   3 +-
 docs/client/python/pyspark.md | 122 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 124 insertions(+), 1 deletion(-)

diff --git a/docs/client/python/index.rst b/docs/client/python/index.rst
index 6dfbec071..70d2bc9e3 100644
--- a/docs/client/python/index.rst
+++ b/docs/client/python/index.rst
@@ -14,11 +14,12 @@
    limitations under the License.
 
 
-Python DB-APIs
+Python
 ==============
 
 .. toctree::
     :maxdepth: 2
 
     pyhive
+    pyspark
 
diff --git a/docs/client/python/pyspark.md b/docs/client/python/pyspark.md
new file mode 100644
index 000000000..a829a08f6
--- /dev/null
+++ b/docs/client/python/pyspark.md
@@ -0,0 +1,122 @@
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+
+# PySpark
+
+[PySpark](https://spark.apache.org/docs/latest/api/python/index.html) is an interface for Apache Spark in Python. Kyuubi can be used as a JDBC source in PySpark.
+
+## Requirements
+
+PySpark works with Python 3.7 and above.
+
+Install PySpark with Spark SQL and optional pandas-on-Spark support from PyPI as follows:
+
+```shell
+pip install pyspark 'pyspark[sql]' 'pyspark[pandas_on_spark]'
+```
+
+For installation using Conda or manually downloading, please refer to [PySpark installation](https://spark.apache.org/docs/latest/api/python/getting_started/install.html).
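+
+For example, a minimal Conda-based install (the `pyspark` package is published on the `conda-forge` channel):
+
+```shell
+conda install -c conda-forge pyspark
+```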
+
+## Preparation
+
+
+### Prepare JDBC driver
+
+Prepare the JDBC driver jar file. The supported Hive-compatible JDBC drivers are listed below:
+
+| Driver | Driver Class Name | Remarks |
+| ------ | ----------------- | ------- |
+| Kyuubi Hive Driver ([doc](../jdbc/kyuubi_jdbc.html)) | org.apache.kyuubi.jdbc.KyuubiHiveDriver | Compile the driver from the master branch, since [KYUUBI #3484](https://github.com/apache/incubator-kyuubi/pull/3485), which the Spark JDBC source requires, is not yet included in any released version. |
+| Hive Driver ([doc](../jdbc/hive_jdbc.html)) | org.apache.hive.jdbc.HiveDriver | |
+
+Refer to the driver's docs to prepare the JDBC driver jar file.
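+
+For example, a sketch of fetching the Hive JDBC standalone jar from Maven Central (replace the placeholder version `x.y.z` with the Hive version matching your deployment):
+
+```shell
+wget https://repo1.maven.org/maven2/org/apache/hive/hive-jdbc/x.y.z/hive-jdbc-x.y.z-standalone.jar
+```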
+
+### Prepare JDBC Hive Dialect extension
+
+Hive Dialect support is required by Spark for wrapping SQL correctly and sending it to the JDBC driver. Kyuubi provides a JDBC dialect extension with auto-registered Hive Dialect support for Spark. Follow the instructions in [Hive Dialect Support](../../engines/spark/jdbc-dialect.html) to prepare the plugin jar file `kyuubi-extension-spark-jdbc-dialect_-*.jar`.
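+
+A minimal build sketch, assuming the plugin is built from the Kyuubi source tree (the module path here is an assumption and may differ between versions; see the linked doc for authoritative instructions):
+
+```shell
+# in the Kyuubi source root; module path is an assumption
+build/mvn clean package -DskipTests -pl extensions/spark/kyuubi-extension-spark-jdbc-dialect
+```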
+
+### Including jars of the JDBC driver and Hive Dialect extension
+
+Choose one of the following ways to include the jar files in Spark.
+
+- Put the jar files of the JDBC driver and the Hive Dialect extension into the `$SPARK_HOME/jars` directory to make them visible to the classpath of PySpark, and add `spark.sql.extensions = org.apache.spark.sql.dialect.KyuubiSparkJdbcDialectExtension` to `$SPARK_HOME/conf/spark-defaults.conf` (a shell sketch follows this list).
+
+- With Spark's startup shell, include the JDBC driver with `--packages` and the Hive Dialect plugin with `--jars` when submitting the application:
+
+```shell
+$SPARK_HOME/bin/pyspark --py-files PY_FILES \
+  --packages org.apache.hive:hive-jdbc:x.y.z \
+  --jars /path/kyuubi-extension-spark-jdbc-dialect_-*.jar
+```
+
+- Set the jars and config with the SparkSession builder:
+
+```python
+from pyspark.sql import SparkSession
+
+spark = SparkSession.builder \
+        .config("spark.jars", "/path/hive-jdbc-x.y.z.jar,/path/kyuubi-extension-spark-jdbc-dialect_-*.jar") \
+        .config("spark.sql.extensions", "org.apache.spark.sql.dialect.KyuubiSparkJdbcDialectExtension") \
+        .getOrCreate()
+```
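+
+For the first option, a minimal shell sketch (jar names and paths are placeholders to adapt):
+
+```shell
+# make the jars visible on the PySpark classpath
+cp /path/hive-jdbc-x.y.z.jar /path/kyuubi-extension-spark-jdbc-dialect_-*.jar $SPARK_HOME/jars/
+# register the Hive Dialect extension
+echo "spark.sql.extensions=org.apache.spark.sql.dialect.KyuubiSparkJdbcDialectExtension" >> $SPARK_HOME/conf/spark-defaults.conf
+```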
+
+
+
+## Usage
+
+For further information about PySpark JDBC usage and options, please refer to Spark's [JDBC To Other Databases](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html).
+
+### Reading and Writing via JDBC data source
+
+```python
+# Loading data from Kyuubi via HiveDriver as JDBC source
+jdbcDF = spark.read \
+  .format("jdbc") \
+  .options(driver="org.apache.hive.jdbc.HiveDriver",
+           url="jdbc:hive2://kyuubi_server_ip:port",
+           user="user",
+           password="password",
+           query="select * from testdb.src_table"
+           ) \
+  .load()
+
+
+# Saving data to Kyuubi via HiveDriver as JDBC source
+jdbcDF.write \
+    .format("jdbc") \
+    .options(driver="org.apache.hive.jdbc.HiveDriver",
+             url="jdbc:hive2://kyuubi_server_ip:port",
+             user="user",
+             password="password",
+             dbtable="testdb.tgt_table"
+             ) \
+    .save()
+```
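+
+To use the Kyuubi Hive Driver from the table above instead, only the driver class name changes; a sketch (the driver jar must be on the classpath as described in the Preparation section):
+
+```python
+# Loading data from Kyuubi via KyuubiHiveDriver as JDBC source
+jdbcDF = spark.read \
+  .format("jdbc") \
+  .options(driver="org.apache.kyuubi.jdbc.KyuubiHiveDriver",
+           url="jdbc:hive2://kyuubi_server_ip:port",
+           user="user",
+           password="password",
+           query="select * from testdb.src_table"
+           ) \
+  .load()
+```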
+
+
+### Use PySpark with Pandas
+Since PySpark 3.2.0, PySpark supports the pandas API on Spark, which allows you to scale out your pandas workload.
+
+Pandas-on-Spark DataFrames and Spark DataFrames are virtually interchangeable. For more instructions, see [From/to pandas and PySpark DataFrames](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/pandas_pyspark.html#pyspark).
+
+
+```python
+import pyspark.pandas as ps
+
+psdf = ps.range(10)
+sdf = psdf.to_spark().filter("id > 5")
+sdf.show()
+```