Posted to commits@kylin.apache.org by li...@apache.org on 2018/05/18 03:45:25 UTC

svn commit: r1831822 - in /kylin/site: docs23/tutorial/spark.html feed.xml

Author: lidong
Date: Fri May 18 03:45:24 2018
New Revision: 1831822

URL: http://svn.apache.org/viewvc?rev=1831822&view=rev
Log:
KYLIN-3383 add document for Spark JDBC

Modified:
    kylin/site/docs23/tutorial/spark.html
    kylin/site/feed.xml

Modified: kylin/site/docs23/tutorial/spark.html
URL: http://svn.apache.org/viewvc/kylin/site/docs23/tutorial/spark.html?rev=1831822&r1=1831821&r2=1831822&view=diff
==============================================================================
--- kylin/site/docs23/tutorial/spark.html (original)
+++ kylin/site/docs23/tutorial/spark.html Fri May 18 03:45:24 2018
@@ -4372,56 +4372,59 @@
 							<article class="post-content" >	
 							<h3 id="introduction">Introduction</h3>
 
-<p>Kylin provides JDBC driver to query the Cube data. Spark can query SQL databases using JDBC driver. With this, you can query Kylin’s Cube from Spark and then do the analysis over a very huge data set.</p>
+<p>Apache Kylin provides a JDBC driver to query Cube data, and Apache Spark supports JDBC data sources. With them, you can connect to Kylin from your Spark application and then do analysis over a very huge data set in an interactive way.</p>
 
-<p>But, Kylin is an OLAP system, it is not a real database: Kylin only has aggregated data, no raw data. If you simply load the source table into Spark as a data frame, it may not work as the Cube data can be very huge, and some operations like “count” might be wrong.</p>
+<p>Please keep in mind that Kylin is an OLAP system, which has already aggregated the raw data by the given dimensions. If you simply load the source table as you would from a normal database, you may not gain the benefit of the Cubes, and the query may crash your application.</p>
 
-<p>This document describes how to use Kylin as a data source in Apache Spark. You need to install Kylin, build a Cube, and then put Kylin’s JDBC driver onto your Spark application’s classpath.</p>
+<p>The right way is to start from a summarized view (e.g., a query with “group by”), load it as a data frame, and then do the transformations and other actions, as shown in the sketch below.</p>
+
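+<p>For example, you can wrap a “group by” query as a derived table and load that as the data frame instead of the raw table. This is a minimal sketch, assuming the “sqlContext” set up as in the sections below; “part_dt” and “price” are columns of Kylin’s sample “kylin_sales” table, so substitute your own query:</p>
+
+<div class="highlight"><pre><code class="language-groff" data-lang="groff"># Kylin computes the aggregation from the Cube; Spark only receives summarized rows
+summary = '(select part_dt, sum(price) as total_sold from kylin_sales group by part_dt) t'
+df = sqlContext.read.format('jdbc').options(
+    url='jdbc:kylin://sandbox:7070/default',
+    user='ADMIN', password='KYLIN',
+    driver='org.apache.kylin.jdbc.Driver',
+    dbtable=summary).load()</code></pre></div>
+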
+<p>This document describes how to use Kylin as a data source in Apache Spark. You need to install Kylin and build a Cube before running the sample code. Also remember to put Kylin’s JDBC driver (in the ‘lib’ folder of the Kylin binary package) onto Spark’s class path.</p>
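+
+<p>For example, the driver jar can be attached programmatically when building the Spark configuration. This is a sketch, not the only way; the actual jar name under Kylin’s ‘lib’ folder carries a version suffix, and the path here is a placeholder:</p>
+
+<div class="highlight"><pre><code class="language-groff" data-lang="groff"># Put the Kylin JDBC driver on the driver and executor class paths.
+# Equivalently, pass it via "spark-submit --jars ..." on the command line.
+conf = SparkConf()
+conf.set('spark.jars', '/usr/local/kylin/lib/kylin-jdbc-x.y.z.jar')</code></pre></div>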
 
 <h3 id="the-wrong-way">The wrong way</h3>
 
-<p>The below Python application tries to directly load Kylin’s table as a data frame, and then expect to get the total row count with “df.count()”, but the result is incorrect.</p>
+<p>The Python application below tries to directly load Kylin’s table as a data frame and then get the total row count with “df.count()”, but the result is incorrect.</p>
 
 <div class="highlight"><pre><code class="language-groff" data-lang="groff">conf = SparkConf() 
-    conf.setMaster('yarn')
-    conf.setAppName('Kylin jdbc example')
+conf.setMaster('yarn')
+conf.setAppName('Kylin jdbc example')
 
-    self.sc = SparkContext(conf=conf)
-    self.sqlContext = SQLContext(self.sc)
+sc = SparkContext(conf=conf)
+sqlContext = SQLContext(sc)
 
-    self.df = self.sqlContext.read.format('jdbc').options(
-        url='jdbc:kylin://sandbox:7070/default',
-        user='ADMIN', password='KYLIN',
-        dbtable='kylin_sales', driver='org.apache.kylin.jdbc.Driver').load()
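+# JDBC URL format: 'jdbc:kylin://host:port/project'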
+url='jdbc:kylin://sandbox:7070/default'
+df = sqlContext.read.format('jdbc').options(
+    url=url, user='ADMIN', password='KYLIN',
+    driver='org.apache.kylin.jdbc.Driver',
+    dbtable='kylin_sales').load()
 
-    print self.df.count()</code></pre></div>
+print(df.count())</code></pre></div>
 
 <p>The output is:</p>
 
 <div class="highlight"><pre><code class="language-groff" data-lang="groff">132</code></pre></div>
 
-<p>The result “132” here is not the total count of the origin table. The reason is that, Spark sends “select * “ or “select 1 “ query to Kylin, Kylin doesn’t have the raw data, but will answer the query with aggregated data in the base Cuboid. The “132” is the row number of the base Cuboid, not original data.</p>
+<p>The result “132” is not the total count of the original table. The reason is that Spark doesn’t send a “select count(*)” query to Kylin as you might expect; it sends “select *” and then counts the rows within Spark. This is inefficient, and, as Kylin doesn’t have the raw data, the “select *” query is answered with the base Cuboid (summarized by all dimensions). The “132” is the row count of the base Cuboid, not of the original data.</p>
 
 <h3 id="the-right-way">The right way</h3>
 
-<p>The right behavior is to push down all possible aggregations to Kylin, so that the Cube can be leveraged, the performance would be much better than from source data. Below is the correct code:</p>
+<p>The right behavior is to push down possible aggregations to Kylin, so that the Cube can be leveraged and the performance will be much better. Below is the correct code:</p>
 
 <div class="highlight"><pre><code class="language-groff" data-lang="groff">conf = SparkConf() 
-    conf.setMaster('yarn')
-    conf.setAppName('Kylin jdbc example')
+conf.setMaster('yarn')
+conf.setAppName('Kylin jdbc example')
 
-    sc = SparkContext(conf=conf)
-    sql_ctx = SQLContext(sc)
+sc = SparkContext(conf=conf)
+sqlContext = SQLContext(sc)
   
-    url='jdbc:kylin://sandbox:7070/default'
-    tab_name = '(select count(*) as total from kylin_sales) the_alias'
+url='jdbc:kylin://sandbox:7070/default'
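+# Wrap the aggregation in parentheses with an alias so Spark can treat it as a table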
+tab_name = '(select count(*) as total from kylin_sales) the_alias'
 
-    df = sql_ctx.read.format('jdbc').options(
-            url=url, user='ADMIN', password='KYLIN',
-            driver='org.apache.kylin.jdbc.Driver',
-            dbtable=tab_name).load()
+df = sqlContext.read.format('jdbc').options(
+        url=url, user='ADMIN', password='KYLIN',
+        driver='org.apache.kylin.jdbc.Driver',
+        dbtable=tab_name).load()
 
-    df.show()</code></pre></div>
+df.show()</code></pre></div>
 
 <p>Here is the output; the result is correct, as Spark pushed down the aggregation to Kylin:</p>
 
@@ -4433,6 +4436,7 @@
 
 <p>Thanks for the input and sample code from Shuxin Yang (shuxinyang.oss@gmail.com).</p>
 
+
 							</article>
 						</div>
 					</div>

Modified: kylin/site/feed.xml
URL: http://svn.apache.org/viewvc/kylin/site/feed.xml?rev=1831822&r1=1831821&r2=1831822&view=diff
==============================================================================
--- kylin/site/feed.xml (original)
+++ kylin/site/feed.xml Fri May 18 03:45:24 2018
@@ -19,8 +19,8 @@
     <description>Apache Kylin Home</description>
     <link>http://kylin.apache.org/</link>
     <atom:link href="http://kylin.apache.org/feed.xml" rel="self" type="application/rss+xml"/>
-    <pubDate>Thu, 17 May 2018 20:11:39 -0700</pubDate>
-    <lastBuildDate>Thu, 17 May 2018 20:11:39 -0700</lastBuildDate>
+    <pubDate>Thu, 17 May 2018 20:42:51 -0700</pubDate>
+    <lastBuildDate>Thu, 17 May 2018 20:42:51 -0700</lastBuildDate>
     <generator>Jekyll v2.5.3</generator>
     
       <item>