Posted to commits@kylin.apache.org by sh...@apache.org on 2018/05/18 03:09:41 UTC

[kylin] branch document updated: update spark doc

This is an automated email from the ASF dual-hosted git repository.

shaofengshi pushed a commit to branch document
in repository https://gitbox.apache.org/repos/asf/kylin.git


The following commit(s) were added to refs/heads/document by this push:
     new abda369  update spark doc
abda369 is described below

commit abda3691b0210b30dd2a03079669bf3be7d932e9
Author: shaofengshi <sh...@apache.org>
AuthorDate: Fri May 18 11:09:30 2018 +0800

    update spark doc
---
 website/_docs23/tutorial/spark.md | 152 ++++++--------------------------------
 1 file changed, 23 insertions(+), 129 deletions(-)

diff --git a/website/_docs23/tutorial/spark.md b/website/_docs23/tutorial/spark.md
index 6d599ca..3553cf0 100644
--- a/website/_docs23/tutorial/spark.md
+++ b/website/_docs23/tutorial/spark.md
@@ -8,184 +8,78 @@ permalink: /docs23/tutorial/spark.html
 
 ### Introduction
 
-Kylin provides JDBC driver to query the Cube data. Spark can query SQL databases using JDBC driver. With this, you can query Kylin's Cube from Spark and then do the analysis.
+Kylin provides a JDBC driver to query the Cube data, and Spark can query SQL databases through JDBC. With this, you can query Kylin's Cube from Spark and then do the analysis over a very large data set.
 
-But, Kylin is an OLAP system, it is not a real database: Kylin only has aggregated data, no raw data. If you simply load the source table into Spark as a data frame, some operations like "count" might be wrong if you expect to count the raw data. 
+But Kylin is an OLAP system, not a regular database: Kylin only has aggregated data, no raw data. If you simply load the source table into Spark as a data frame, it may not work, because the Cube data can be very large and some operations like "count" might be wrong.
 
-Besides, the Cube data can be very huge which is different with normal database. 
+This document describes how to use Kylin as a data source in Apache Spark. You need to install Kylin, build a Cube, and then put Kylin's JDBC driver onto your Spark application's classpath. 
 
-This document describes how to use Kylin as a data source in Apache Spark. You need install Kylin and build a Cube as the prerequisite. 
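+For example, here is a minimal sketch that puts the driver on the classpath through the Spark configuration; the jar names assume Kylin 2.3.1 and its Jersey dependencies, so adjust them to your installation and place the jars next to your application:
+
+{% highlight Groff markup %}
+    import os
+    from pyspark import SparkConf
+
+    # Kylin JDBC driver and its Jersey dependencies (names depend on your Kylin version)
+    jars = ["kylin-jdbc-2.3.1.jar", "jersey-client-1.9.jar", "jersey-core-1.9.jar"]
+
+    wdir = os.path.dirname(os.path.realpath(__file__))
+    jars_with_path = ','.join([wdir + '/' + x for x in jars])
+
+    conf = SparkConf()
+    # ship the jars to the executors and add them to the driver's classpath
+    conf.set("spark.jars", jars_with_path)
+    conf.set("spark.driver.extraClassPath", jars_with_path.replace(",", ":"))
+{% endhighlight %}
+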
+### The wrong way
 
-### The wrong application
-
-The below Python application tries to load Kylin's table as a data frame, and then expect to get the total row count with "df.count()", but the result is incorrect.
+The Python application below tries to load Kylin's table directly as a data frame, expecting to get the total row count with "df.count()", but the result is incorrect.
 
 {% highlight Groff markup %}
-#!/usr/bin/env python
-
-import os
-import sys
-import traceback
-import time
-import subprocess
-import json
-import re
-
-os.environ["SPARK_HOME"] = "/usr/local/spark/"
-sys.path.append(os.environ["SPARK_HOME"]+"/python")
-
-from pyspark import SparkConf, SparkContext
-from pyspark.sql import SQLContext
-
-from pyspark.sql.functions import *
-from pyspark.sql.types import *
-
-jars = ["kylin-jdbc-2.3.1.jar", "jersey-client-1.9.jar", "jersey-core-1.9.jar"]
-
-class Kap(object):
-    def __init__(self):
-        print 'initializing Spark context ...'
-        sys.stdout.flush()
-
-        conf = SparkConf() 
-        conf.setMaster('yarn')
-        conf.setAppName('kap test')
-
-        wdir = os.path.dirname(os.path.realpath(__file__))
-        jars_with_path = ','.join([wdir + '/' + x for x in jars])
-
-        conf.set("spark.jars", jars_with_path)
-        conf.set("spark.yarn.archive", "hdfs://sandbox.hortonworks.com:8020/kylin/spark/spark-libs.jar")
-        conf.set("spark.driver.extraClassPath", jars_with_path.replace(",",":"))
-
-        self.sc = SparkContext(conf=conf)
-        self.sqlContext = SQLContext(self.sc)
-        print 'Spark context is initialized'
-
-        self.df = self.sqlContext.read.format('jdbc').options(
-            url='jdbc:kylin://sandbox:7070/default',
-            user='ADMIN', password='KYLIN',
-            dbtable='test_kylin_fact', driver='org.apache.kylin.jdbc.Driver').load()
-
-        self.df.registerTempTable("loltab")
-        print self.df.count()
-
-    def sql(self, cmd, result_tab_name='tmptable'):
-        df = self.sqlContext.sql(cmd) 
-        if df is not None:
-            df.registerTempTable(result_tab_name)
-        return df
+    from pyspark import SparkConf, SparkContext
+    from pyspark.sql import SQLContext
+
+    conf = SparkConf()
+    conf.setMaster('yarn')
+    conf.setAppName('Kylin jdbc example')
 
-    def stop(self):
-        self.sc.stop()
+    sc = SparkContext(conf=conf)
+    sql_ctx = SQLContext(sc)
 
-kap = Kap()
-try:
-    df = kap.sql(r"select count(*) from loltab")
-    df.show(truncate=False)
-except:
-    pass
-finally:
-    kap.stop()
+    # load the whole table as a data frame (this is the wrong approach)
+    df = sql_ctx.read.format('jdbc').options(
+        url='jdbc:kylin://sandbox:7070/default',
+        user='ADMIN', password='KYLIN',
+        dbtable='kylin_sales', driver='org.apache.kylin.jdbc.Driver').load()
 
+    print(df.count())
 
 {% endhighlight %}
 
 The output is:
 {% highlight Groff markup %}
-Spark context is initialized
 132
-+--------+
-|count(1)|
-+--------+
-|132     |
-+--------+
 
 {% endhighlight %}
 
 
-The result "132" here is not the total count of the origin table. The reason is that, Spark sends "select * from " query to Kylin, Kylin doesn't have the raw data, but will answer the query with aggregated data in the base Cuboid. The "132" is the row number of the base Cuboid, not source data. 
+The result "132" here is not the total count of the origin table. The reason is that, Spark sends "select * " or "select 1 " query to Kylin, Kylin doesn't have the raw data, but will answer the query with aggregated data in the base Cuboid. The "132" is the row number of the base Cuboid, not original data. 
 
 
-### The right code
+### The right way
 
-The right behavior is to push down the aggregation to Kylin, so that the Cube can be leveraged. Below is the correct code:
+The right behavior is to push down all possible aggregations to Kylin, so that the Cube can be leveraged; the performance will be much better than computing over the source data. Below is the correct code:
 
 {% highlight Groff markup %}
-#!/usr/bin/env python
-
-import os
-import sys
-import json
-
-os.environ["SPARK_HOME"] = "/usr/local/spark/"
-sys.path.append(os.environ["SPARK_HOME"]+"/python")
-
-from pyspark import SparkConf, SparkContext
-from pyspark.sql import SQLContext
-
-from pyspark.sql.functions import *
-from pyspark.sql.types import *
-
-jars = ["kylin-jdbc-2.3.1.jar", "jersey-client-1.9.jar", "jersey-core-1.9.jar"]
-
 
-def demo():
-    # step 1: init
-    print 'initializing ...',
+    from pyspark import SparkConf, SparkContext
+    from pyspark.sql import SQLContext
+
     conf = SparkConf()
     conf.setMaster('yarn')
-    conf.setAppName('jdbc example')
-
-    wdir = os.path.dirname(os.path.realpath(__file__))
-    jars_with_path = ','.join([wdir + '/' + x for x in jars])
-
-    conf.set("spark.jars", jars_with_path)
-    conf.set("spark.yarn.archive", "hdfs://sandbox.hortonworks.com:8020/kylin/spark/spark-libs.jar")
-        
-    conf.set("spark.driver.extraClassPath", jars_with_path.replace(",",":"))
+    conf.setAppName('Kylin jdbc example')
 
     sc = SparkContext(conf=conf)
     sql_ctx = SQLContext(sc)
-    print 'done'
-
+  
     url='jdbc:kylin://sandbox:7070/default'
-    tab_name = '(select count(*) as total from test_kylin_fact) the_alias'
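+    # wrap the aggregation in the "dbtable" subquery, so that Kylin computes it from the Cube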
+    tab_name = '(select count(*) as total from kylin_sales) the_alias'
 
-    # step 2: initiate the sql
     df = sql_ctx.read.format('jdbc').options(
             url=url, user='ADMIN', password='KYLIN',
             driver='org.apache.kylin.jdbc.Driver',
             dbtable=tab_name).load()
 
-    # many ways to obtain the results
     df.show()
 
-    print "df.count()", df.count()  # must be 1, as there is only one row
-
-    for record in df.toJSON().collect():
-        # this loop has only one iteration
-        # reach record is a string; need to be decoded to JSON
-        print 'the total column: ', json.loads(record)['TOTAL']
-
-    sc.stop()
-
-demo()
-
 {% endhighlight %}
 
-Here is the output, which is expected:
+Here is the output. The result is correct, because Spark pushes down the aggregation to Kylin:
 
 {% highlight Groff markup %}
-initializing ... done
 +-----+
 |TOTAL|
 +-----+
 | 2000|
 +-----+
 
-df.count() 1
-the total column:  2000
 {% endhighlight %}
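+
+Besides "df.show()", the aggregated value can also be read back in the driver program. A minimal sketch, following the single-row data frame above (the column is named "TOTAL" because of the "as total" alias):
+
+{% highlight Groff markup %}
+    # the data frame holds exactly one row, so collect() is cheap here
+    total = df.collect()[0]['TOTAL']
+    print(total)   # 2000
+{% endhighlight %}
+
+The same pattern works for other aggregations: put the "group by" or "sum" query into the "dbtable" subquery, so that Kylin answers it from the Cube instead of Spark scanning the data.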
 
 Thanks for the input and sample code from Shuxin Yang (shuxinyang.oss@gmail.com).

-- 
To stop receiving notification emails like this one, please contact
shaofengshi@apache.org.