Posted to commits@kylin.apache.org by li...@apache.org on 2018/05/18 03:17:10 UTC

svn commit: r1831821 - in /kylin/site: docs23/tutorial/spark.html feed.xml

Author: lidong
Date: Fri May 18 03:17:10 2018
New Revision: 1831821

URL: http://svn.apache.org/viewvc?rev=1831821&view=rev
Log:
update spark doc

Modified:
    kylin/site/docs23/tutorial/spark.html
    kylin/site/feed.xml

Modified: kylin/site/docs23/tutorial/spark.html
URL: http://svn.apache.org/viewvc/kylin/site/docs23/tutorial/spark.html?rev=1831821&r1=1831820&r2=1831821&view=diff
==============================================================================
--- kylin/site/docs23/tutorial/spark.html (original)
+++ kylin/site/docs23/tutorial/spark.html Fri May 18 03:17:10 2018
@@ -4372,172 +4372,64 @@
 							<article class="post-content" >	
 							<h3 id="introduction">Introduction</h3>
 
-<p>Kylin provides JDBC driver to query the Cube data. Spark can query SQL databases using JDBC driver. With this, you can query Kylin’s Cube from Spark and then do the analysis.</p>
+<p>Kylin provides a JDBC driver to query the Cube data, and Spark can query SQL databases through JDBC. With this, you can query Kylin’s Cube from Spark and then run your analysis over a very large data set.</p>
 
-<p>But, Kylin is an OLAP system, it is not a real database: Kylin only has aggregated data, no raw data. If you simply load the source table into Spark as a data frame, some operations like “count” might be wrong if you expect to count the raw data.</p>
+<p>However, Kylin is an OLAP system, not a general-purpose database: it only has aggregated data, no raw data. If you simply load the source table into Spark as a data frame, it may not work as expected, because the Cube data can be very large and some operations like “count” might return wrong results.</p>
 
-<p>Besides, the Cube data can be very huge which is different with normal database.</p>
+<p>This document describes how to use Kylin as a data source in Apache Spark. You need to install Kylin, build a Cube, and then put Kylin’s JDBC driver onto your Spark application’s classpath.</p>
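+
+<p>If you launch the application yourself, one way to put the driver jars on the classpath is through the Spark configuration, as the earlier, longer version of this example did. Below is a minimal sketch; the jar names, versions and their location are illustrative and depend on your environment:</p>
+
+<div class="highlight"><pre><code class="language-groff" data-lang="groff">import os
+
+from pyspark import SparkConf, SparkContext
+
+# Kylin JDBC driver and its dependencies; adjust names and versions to your installation
+jars = ["kylin-jdbc-2.3.1.jar", "jersey-client-1.9.jar", "jersey-core-1.9.jar"]
+
+wdir = os.path.dirname(os.path.realpath(__file__))
+jars_with_path = ','.join([wdir + '/' + x for x in jars])
+
+conf = SparkConf()
+conf.setMaster('yarn')
+conf.setAppName('Kylin jdbc example')
+# ship the jars to the executors and add them to the driver classpath
+conf.set("spark.jars", jars_with_path)
+conf.set("spark.driver.extraClassPath", jars_with_path.replace(",", ":"))
+
+sc = SparkContext(conf=conf)</code></pre></div>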
 
-<p>This document describes how to use Kylin as a data source in Apache Spark. You need install Kylin and build a Cube as the prerequisite.</p>
+<h3 id="the-wrong-way">The wrong way</h3>
 
-<h3 id="the-wrong-application">The wrong application</h3>
+<p>The Python application below tries to load Kylin’s table directly as a data frame and then expects to get the total row count with “df.count()”, but the result is incorrect.</p>
 
-<p>The below Python application tries to load Kylin’s table as a data frame, and then expect to get the total row count with “df.count()”, but the result is incorrect.</p>
-
-<div class="highlight"><pre><code class="language-groff" data-lang="groff">#!/usr/bin/env python
-
-import os
-import sys
-import traceback
-import time
-import subprocess
-import json
-import re
-
-os.environ["SPARK_HOME"] = "/usr/local/spark/"
-sys.path.append(os.environ["SPARK_HOME"]+"/python")
-
-from pyspark import SparkConf, SparkContext
-from pyspark.sql import SQLContext
-
-from pyspark.sql.functions import *
-from pyspark.sql.types import *
-
-jars = ["kylin-jdbc-2.3.1.jar", "jersey-client-1.9.jar", "jersey-core-1.9.jar"]
-
-class Kap(object):
-    def __init__(self):
-        print 'initializing Spark context ...'
-        sys.stdout.flush()
-
-        conf = SparkConf() 
-        conf.setMaster('yarn')
-        conf.setAppName('kap test')
-
-        wdir = os.path.dirname(os.path.realpath(__file__))
-        jars_with_path = ','.join([wdir + '/' + x for x in jars])
-
-        conf.set("spark.jars", jars_with_path)
-        conf.set("spark.yarn.archive", "hdfs://sandbox.hortonworks.com:8020/kylin/spark/spark-libs.jar")
-        conf.set("spark.driver.extraClassPath", jars_with_path.replace(",",":"))
-
-        self.sc = SparkContext(conf=conf)
-        self.sqlContext = SQLContext(self.sc)
-        print 'Spark context is initialized'
-
-        self.df = self.sqlContext.read.format('jdbc').options(
-            url='jdbc:kylin://sandbox:7070/default',
-            user='ADMIN', password='KYLIN',
-            dbtable='test_kylin_fact', driver='org.apache.kylin.jdbc.Driver').load()
-
-        self.df.registerTempTable("loltab")
-        print self.df.count()
+<div class="highlight"><pre><code class="language-groff" data-lang="groff">conf = SparkConf() 
+    conf.setMaster('yarn')
+    conf.setAppName('Kylin jdbc example')
 
-    def sql(self, cmd, result_tab_name='tmptable'):
-        df = self.sqlContext.sql(cmd) 
-        if df is not None:
-            df.registerTempTable(result_tab_name)
-        return df
+    sc = SparkContext(conf=conf)
+    sqlContext = SQLContext(sc)
 
-    def stop(self):
-        self.sc.stop()
+    df = sqlContext.read.format('jdbc').options(
+        url='jdbc:kylin://sandbox:7070/default',
+        user='ADMIN', password='KYLIN',
+        dbtable='kylin_sales', driver='org.apache.kylin.jdbc.Driver').load()
 
-kap = Kap()
-try:
-    df = kap.sql(r"select count(*) from loltab")
-    df.show(truncate=False)
-except:
-    pass
-finally:
-    kap.stop()</code></pre></div>
+    print df.count()</code></pre></div>
 
 <p>The output is:</p>
 
-<div class="highlight"><pre><code class="language-groff" data-lang="groff">Spark context is initialized
-132
-+--------+
-|count(1)|
-+--------+
-|132     |
-+--------+</code></pre></div>
-
-<p>The result “132” here is not the total count of the origin table. The reason is that, Spark sends “select * from “ query to Kylin, Kylin doesn’t have the raw data, but will answer the query with aggregated data in the base Cuboid. The “132” is the row number of the base Cuboid, not source data.</p>
-
-<h3 id="the-right-code">The right code</h3>
+<div class="highlight"><pre><code class="language-groff" data-lang="groff">132</code></pre></div>
 
-<p>The right behavior is to push down the aggregation to Kylin, so that the Cube can be leveraged. Below is the correct code:</p>
+<p>The result “132” here is not the total row count of the original table. The reason is that Spark sends a “select *” or “select 1” query to Kylin; Kylin has no raw data, so it answers with the aggregated data in the base Cuboid. The “132” is the number of rows in the base Cuboid (one row per distinct combination of dimension values), not the number of rows in the original data.</p>
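+
+<p>You can see this by inspecting the frame loaded in the wrong way above; the rows that come back are already aggregated rows of the base Cuboid, not the original transactions (what they look like depends on your Cube definition):</p>
+
+<div class="highlight"><pre><code class="language-groff" data-lang="groff"># inspect what Kylin actually returned for the "select *" query
+df.printSchema()
+df.show(5)</code></pre></div>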
 
-<div class="highlight"><pre><code class="language-groff" data-lang="groff">#!/usr/bin/env python
+<h3 id="the-right-way">The right way</h3>
 
-import os
-import sys
-import json
+<p>The right approach is to push down all possible aggregations to Kylin, so that the Cube can be leveraged and the performance is much better than computing over the source data. Below is the correct code:</p>
 
-os.environ["SPARK_HOME"] = "/usr/local/spark/"
-sys.path.append(os.environ["SPARK_HOME"]+"/python")
-
-from pyspark import SparkConf, SparkContext
-from pyspark.sql import SQLContext
-
-from pyspark.sql.functions import *
-from pyspark.sql.types import *
-
-jars = ["kylin-jdbc-2.3.1.jar", "jersey-client-1.9.jar", "jersey-core-1.9.jar"]
-
-
-def demo():
-    # step 1: init
-    print 'initializing ...',
-    conf = SparkConf() 
+<div class="highlight"><pre><code class="language-groff" data-lang="groff">conf = SparkConf() 
     conf.setMaster('yarn')
-    conf.setAppName('jdbc example')
-
-    wdir = os.path.dirname(os.path.realpath(__file__))
-    jars_with_path = ','.join([wdir + '/' + x for x in jars])
-
-    conf.set("spark.jars", jars_with_path)
-    conf.set("spark.yarn.archive", "hdfs://sandbox.hortonworks.com:8020/kylin/spark/spark-libs.jar")
-        
-    conf.set("spark.driver.extraClassPath", jars_with_path.replace(",",":"))
+    conf.setAppName('Kylin jdbc example')
 
     sc = SparkContext(conf=conf)
     sql_ctx = SQLContext(sc)
-    print 'done'
-
+  
     url='jdbc:kylin://sandbox:7070/default'
-    tab_name = '(select count(*) as total from test_kylin_fact) the_alias'
+    tab_name = '(select count(*) as total from kylin_sales) the_alias'
 
-    # step 2: initiate the sql
     df = sql_ctx.read.format('jdbc').options(
             url=url, user='ADMIN', password='KYLIN',
             driver='org.apache.kylin.jdbc.Driver',
             dbtable=tab_name).load()
 
-    # many ways to obtain the results
-    df.show()
+    df.show()</code></pre></div>
 
-    print "df.count()", df.count()  # must be 1, as there is only one row
+<p>Here is the output; the result is correct because Spark pushes the aggregation down to Kylin:</p>
 
-    for record in df.toJSON().collect():
-        # this loop has only one iteration
-        # reach record is a string; need to be decoded to JSON
-        print 'the total column: ', json.loads(record)['TOTAL']
-
-    sc.stop()
-
-demo()</code></pre></div>
-
-<p>Here is the output, which is expected:</p>
-
-<div class="highlight"><pre><code class="language-groff" data-lang="groff">initializing ... done
-+-----+
+<div class="highlight"><pre><code class="language-groff" data-lang="groff">+-----+
 |TOTAL|
 +-----+
 | 2000|
-+-----+
-
-df.count() 1
-the total column:  2000</code></pre></div>
++-----+</code></pre></div>
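+
+<p>The same pattern works for other aggregations: wrap any group-by query that the Cube can answer in the “dbtable” subquery, so Kylin does the heavy lifting. Below is a sketch that reuses the “sql_ctx” and “url” from the code above, assuming the sample “kylin_sales” table with its “part_dt” and “price” columns:</p>
+
+<div class="highlight"><pre><code class="language-groff" data-lang="groff"># push a group-by down to Kylin via the dbtable subquery
+tab_name = '(select part_dt, sum(price) as total_price, count(*) as cnt ' \
+           'from kylin_sales group by part_dt) the_alias'
+
+df = sql_ctx.read.format('jdbc').options(
+        url=url, user='ADMIN', password='KYLIN',
+        driver='org.apache.kylin.jdbc.Driver',
+        dbtable=tab_name).load()
+
+df.show()</code></pre></div>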
 
 <p>Thanks for the input and sample code from Shuxin Yang (shuxinyang.oss@gmail.com).</p>
 

Modified: kylin/site/feed.xml
URL: http://svn.apache.org/viewvc/kylin/site/feed.xml?rev=1831821&r1=1831820&r2=1831821&view=diff
==============================================================================
--- kylin/site/feed.xml (original)
+++ kylin/site/feed.xml Fri May 18 03:17:10 2018
@@ -19,8 +19,8 @@
     <description>Apache Kylin Home</description>
     <link>http://kylin.apache.org/</link>
     <atom:link href="http://kylin.apache.org/feed.xml" rel="self" type="application/rss+xml"/>
-    <pubDate>Thu, 17 May 2018 06:59:24 -0700</pubDate>
-    <lastBuildDate>Thu, 17 May 2018 06:59:24 -0700</lastBuildDate>
+    <pubDate>Thu, 17 May 2018 20:11:39 -0700</pubDate>
+    <lastBuildDate>Thu, 17 May 2018 20:11:39 -0700</lastBuildDate>
     <generator>Jekyll v2.5.3</generator>
     
       <item>