Posted to commits@kudu.apache.org by gr...@apache.org on 2019/06/12 15:23:10 UTC

[kudu] 01/02: docs: Add simplest possible Spark SQL example

This is an automated email from the ASF dual-hosted git repository.

granthenke pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/kudu.git

commit d25c6f11ca8611bbe30b6951fbda7397b0a3ac88
Author: Mike Percy <mp...@apache.org>
AuthorDate: Tue Jul 31 15:53:52 2018 -0700

    docs: Add simplest possible Spark SQL example
    
    Often I look for a simple "hello world" example in the Kudu Spark docs
    and I remember that there isn't one. I've added a quick-and-dirty Spark
    SQL example.
    
    Change-Id: I2cf4c00f3a1dc92fd93458aa3c1b1d2cd4f38f78
    Reviewed-on: http://gerrit.cloudera.org:8080/11095
    Tested-by: Kudu Jenkins
    Reviewed-by: Grant Henke <gr...@apache.org>
    Reviewed-by: Alexey Serbin <as...@cloudera.com>
---
 docs/developing.adoc | 37 ++++++++++++++++++++++++++++---------
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/docs/developing.adoc b/docs/developing.adoc
index 89db3e8..210db59 100644
--- a/docs/developing.adoc
+++ b/docs/developing.adoc
@@ -109,7 +109,26 @@ on the link:http://kudu.apache.org/releases/[releases page].
 spark-shell --packages org.apache.kudu:kudu-spark2_2.11:1.10.0
 ----
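+
+For projects built with sbt rather than run through spark-shell, a dependency
+along these lines should work (a sketch, assuming `scalaVersion` is set to
+2.11 so that `%%` resolves to the kudu-spark2_2.11 artifact above):
+
+[source,scala]
+----
+// Build definition snippet (build.sbt), not Spark code.
+libraryDependencies += "org.apache.kudu" %% "kudu-spark2" % "1.10.0"
+----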
 
-then import kudu-spark and create a dataframe:
+Below is a minimal Spark SQL "select" example for a Kudu table created with
+Impala in the "default" database. We first import the kudu-spark package, then
+create a DataFrame backed by the Kudu table, and then register a temporary view
+from that DataFrame. After those steps, the table is accessible from Spark SQL.
+
+[source,scala]
+----
+import org.apache.kudu.spark.kudu._
+
+// Create a DataFrame that points to the Kudu table we want to query.
+val df = spark.read.options(Map("kudu.master" -> "master1.foo.com,master2.foo.com,master3.foo.com",
+                                "kudu.table" -> "default.my_table")).kudu
+// Create a view from the DataFrame to make it accessible from Spark SQL.
+df.createOrReplaceTempView("my_table")
+// Now we can run Spark SQL queries against our view of the Kudu table.
+spark.sql("select * from my_table").show()
+----
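+
+When the view is no longer needed, it can be dropped through the catalog API.
+A minimal sketch, assuming the `my_table` view registered above:
+
+[source,scala]
+----
+// Drop the temporary view once we're done querying it.
+spark.catalog.dropTempView("my_table")
+----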
+
+Below is a more sophisticated example that includes both reads and writes:
+
 [source,scala]
 ----
 import org.apache.kudu.client._
@@ -124,14 +143,14 @@ val df = spark.read
 df.select("id").filter("id >= 5").show()
 
 // ...or register a temporary table and use SQL
-df.registerTempTable("kudu_table")
+df.createOrReplaceTempView("kudu_table")
 val filteredDF = spark.sql("select id from kudu_table where id >= 5").show()
 
 // Use KuduContext to create, delete, or write to Kudu tables
 val kuduContext = new KuduContext("kudu.master:7051", spark.sparkContext)
 
-// Create a new Kudu table from a dataframe schema
-// NB: No rows from the dataframe are inserted into the table
+// Create a new Kudu table from a DataFrame schema
+// NB: No rows from the DataFrame are inserted into the table
 kuduContext.createTable(
     "test_table", df.schema, Seq("key"),
     new CreateTableOptions()
@@ -170,15 +189,15 @@ kuduContext.deleteTable("unwanted_table")
 
 === Upsert option in Kudu Spark
 The upsert operation in kudu-spark supports an extra write option of `ignoreNull`. If set to true,
-it will avoid setting existing column values in Kudu table to Null if the corresponding dataframe
+it will avoid setting existing column values in the Kudu table to Null if the corresponding DataFrame
 column values are Null. If unspecified, `ignoreNull` is false by default.
 [source,scala]
 ----
-val dataDF = spark.read
+val dataFrame = spark.read
   .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> simpleTableName))
   .format("kudu").load
-dataDF.registerTempTable(simpleTableName)
-dataDF.show()
+dataFrame.createOrReplaceTempView(simpleTableName)
+dataFrame.show()
 // Below is the original data in the table 'simpleTableName'
 +---+---+
 |key|val|
@@ -191,7 +210,7 @@ val nullDF = spark.createDataFrame(Seq((0, null.asInstanceOf[String]))).toDF("key", "val")
 val wo = new KuduWriteOptions
 wo.ignoreNull = true
 kuduContext.upsertRows(nullDF, simpleTableName, wo)
-dataDF.show()
+dataFrame.show()
 // The val field stays unchanged
 +---+---+
 |key|val|