Posted to commits@kyuubi.apache.org by ch...@apache.org on 2022/09/26 07:13:33 UTC
[incubator-kyuubi] branch master updated: [KYUUBI #3406] [FOLLOWUP] Add create datasource table DDL usage to Pyspark docs

This is an automated email from the ASF dual-hosted git repository.

chengpan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-kyuubi.git


The following commit(s) were added to refs/heads/master by this push:
     new eb04c7f2e [KYUUBI #3406] [FOLLOWUP] Add create datasource table DDL usage to Pyspark docs
eb04c7f2e is described below

commit eb04c7f2ef8e53989c18a8637798e948d14779e0
Author: Bowen Liang <li...@gf.com.cn>
AuthorDate: Mon Sep 26 15:13:22 2022 +0800

    [KYUUBI #3406] [FOLLOWUP] Add create datasource table DDL usage to Pyspark docs
    
    ### _Why are the changes needed?_
    
    Following #3406, this fixes spelling mistakes and adds new DDL usage for the JDBC source in the PySpark client docs.
    
    ### _How was this patch tested?_
    - [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible
    
    - [ ] Add screenshots for manual tests if appropriate
    
    - [ ] [Run test](https://kyuubi.apache.org/docs/latest/develop_tools/testing.html#running-tests) locally before making a pull request
    
    Closes #3552 from bowenliang123/pyspark-docs-improve.
    
    Closes #3406
    
    eb05a302 [Bowen Liang] add docs for using as JDBC Datasource table with DDL. and minor spelling fix.
    
    Authored-by: Bowen Liang <li...@gf.com.cn>
    Signed-off-by: Cheng Pan <ch...@apache.org>
---
 docs/client/python/pyspark.md | 44 ++++++++++++++++++++++++++++++++++---------
 1 file changed, 35 insertions(+), 9 deletions(-)

diff --git a/docs/client/python/pyspark.md b/docs/client/python/pyspark.md
index a829a08f6..01427940f 100644
--- a/docs/client/python/pyspark.md
+++ b/docs/client/python/pyspark.md
@@ -23,7 +23,7 @@
 ## Requirements
 PySpark works with Python 3.7 and above.
 
-Install PySpark with Spark SQL and optional pandas on Spark using PyPI as follows:
+Install PySpark with Spark SQL and optional pandas support on Spark using PyPI as follows:
 
 ```shell
 pip install pyspark 'pyspark[sql]' 'pyspark[pandas_on_spark]'
@@ -31,7 +31,7 @@ pip install pyspark 'pyspark[sql]' 'pyspark[pandas_on_spark]'
 
 For installation using Conda or manually downloading, please refer to [PySpark installation](https://spark.apache.org/docs/latest/api/python/getting_started/install.html).
 
-## Preperation
+## Preparation
 
 
 ### Prepare JDBC driver 
@@ -46,15 +46,15 @@ Refer to docs of the driver and prepare the JDBC driver jar file.
 
 ### Prepare JDBC Hive Dialect extension
 
-Hive Dialect support is requried by Spark for wraping SQL correctly and sending to JDBC driver. Kyuubi provides a JDBC dialect extension with auto regiested Hive Daliect support for Spark. Follow the instrunctions in [Hive Dialect Support](../../engines/spark/jdbc-dialect.html) to prepare the plugin jar file `kyuubi-extension-spark-jdbc-dialect_-*.jar`.
+Hive Dialect support is required by Spark for wrapping SQL correctly and sending it to the JDBC driver. Kyuubi provides a JDBC dialect extension with auto-registered Hive Dialect support for Spark. Follow the instructions in [Hive Dialect Support](../../engines/spark/jdbc-dialect.html) to prepare the plugin jar file `kyuubi-extension-spark-jdbc-dialect_-*.jar`.
 
-### Including jars of JDBC driver and Hive Dialect extention
+### Including jars of JDBC driver and Hive Dialect extension
 
-Choose one of following ways to include jar files to Spark.
+Choose one of the following ways to include jar files in Spark.
 
 - Put the jar files of the JDBC driver and Hive Dialect into the `$SPARK_HOME/jars` directory to make them visible on the PySpark classpath, and add `spark.sql.extensions = org.apache.spark.sql.dialect.KyuubiSparkJdbcDialectExtension` to `$SPARK_HOME/conf/spark-defaults.conf`.
 
-- With spark's start shell, include JDBC driver when you submit the application with `--packages`, and the Hive Dialect plugins with `--jars`
+- With Spark's start shell, include the JDBC driver when submitting the application with `--packages`, and the Hive Dialect plugins with `--jars`
 
 ```
 $SPARK_HOME/bin/pyspark --py-files PY_FILES \
@@ -79,10 +79,10 @@ spark = SparkSession.builder \
 
 For further information about PySpark JDBC usage and options, please refer to Spark's [JDBC To Other Databases](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html).
 
-### Reading and Writing via JDBC data source
+### Using as JDBC Datasource programmatically
 
 ```python
-# Loading data from Kyuubi via HiveDriver as JDBC source
+# Loading data from Kyuubi via HiveDriver as JDBC datasource
 jdbcDF = spark.read \
   .format("jdbc") \
   .options(driver="org.apache.hive.jdbc.HiveDriver",
@@ -94,7 +94,7 @@ jdbcDF = spark.read \
   .load()
 
 
-# Saving data to Kyuubi via HiveDriver as JDBC source
+# Saving data to Kyuubi via HiveDriver as JDBC datasource
 jdbcDF.write \
     .format("jdbc") \
     .options(driver="org.apache.hive.jdbc.HiveDriver",
@@ -106,6 +106,32 @@ jdbcDF.write \
     .save()
 ```
 
+### Using as JDBC Datasource table with SQL
+
+Since Spark 3.2.0, [`CREATE DATASOURCE TABLE`](https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-table-datasource.html) can be used to create a JDBC source table with SQL.
+
+
+```python
+# create JDBC Datasource table with DDL
+spark.sql("""CREATE TABLE kyuubi_table USING JDBC
+OPTIONS (
+    driver='org.apache.hive.jdbc.HiveDriver',
+    url='jdbc:hive2://kyuubi_server_ip:port',
+    user='user',
+    password='password',
+    dbtable='testdb.some_table'
+)""")
+
+# read data to dataframe
+jdbcDF = spark.sql("SELECT * FROM kyuubi_table")
+
+# write data from dataframe in overwrite mode
+df.write.insertInto("kyuubi_table", overwrite=True)
+
+# write data from query
+spark.sql("INSERT INTO kyuubi_table SELECT * FROM some_table")
+```
+
 
 ### Use PySpark with Pandas
 From PySpark 3.2.0, PySpark supports pandas API on Spark which allows you to scale your pandas workload out.
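
Note on the `CREATE TABLE ... USING JDBC` statement added in this patch: it is written out by hand. The following is a minimal sketch, not part of Kyuubi or PySpark, of how the same statement could be assembled from a dict of options so that connection settings live in one place. The helper name `build_jdbc_table_ddl` is hypothetical, and the escaping assumes Spark's default string-literal parsing (backslash escapes enabled).

```python
def build_jdbc_table_ddl(table_name, options):
    """Return a CREATE TABLE ... USING JDBC statement for the given options.

    Hypothetical helper for illustration only; it builds a string and
    does not talk to Spark or Kyuubi.
    """
    rendered = []
    for key, value in options.items():
        # Escape backslashes first, then single quotes, so a value cannot
        # terminate the SQL string literal early (assumes default parser settings).
        escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
        rendered.append(f"    {key}='{escaped}'")
    body = ",\n".join(rendered)
    return f"CREATE TABLE {table_name} USING JDBC\nOPTIONS (\n{body}\n)"


# Same placeholder values as in the documentation change above.
ddl = build_jdbc_table_ddl(
    "kyuubi_table",
    {
        "driver": "org.apache.hive.jdbc.HiveDriver",
        "url": "jdbc:hive2://kyuubi_server_ip:port",
        "user": "user",
        "password": "password",
        "dbtable": "testdb.some_table",
    },
)
print(ddl)
```

With an active `SparkSession` named `spark`, the resulting string would be passed to `spark.sql(ddl)`; the table name itself is not escaped here and is assumed to come from trusted code.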