Posted to commits@kylin.apache.org by lu...@apache.org on 2015/09/15 04:29:23 UTC
svn commit: r1703083 - in /incubator/kylin/site: blog/2015/09/06/release-v1.0-incubating/index.html blog/2015/09/09/ blog/2015/09/09/fast-cubing-on-spark/ blog/2015/09/09/fast-cubing-on-spark/index.html blog/index.html feed.xml

Author: lukehan
Date: Tue Sep 15 02:29:22 2015
New Revision: 1703083

URL: http://svn.apache.org/r1703083
Log:
publish fast cubing blog and fix some typo

Added:
    incubator/kylin/site/blog/2015/09/09/
    incubator/kylin/site/blog/2015/09/09/fast-cubing-on-spark/
    incubator/kylin/site/blog/2015/09/09/fast-cubing-on-spark/index.html
Modified:
    incubator/kylin/site/blog/2015/09/06/release-v1.0-incubating/index.html
    incubator/kylin/site/blog/index.html
    incubator/kylin/site/feed.xml

Modified: incubator/kylin/site/blog/2015/09/06/release-v1.0-incubating/index.html
URL: http://svn.apache.org/viewvc/incubator/kylin/site/blog/2015/09/06/release-v1.0-incubating/index.html?rev=1703083&r1=1703082&r2=1703083&view=diff
==============================================================================
--- incubator/kylin/site/blog/2015/09/06/release-v1.0-incubating/index.html (original)
+++ incubator/kylin/site/blog/2015/09/06/release-v1.0-incubating/index.html Tue Sep 15 02:29:22 2015
@@ -196,11 +196,11 @@
 <p><strong>Kylin Core Improvement</strong></p>
 
 <ul>
-  <li>Dynamic Data Model has been supported for new added or removed column in data model without rebuild cube from the beginning <a href="https://issues.apache.org/jira/browse/KYLIN-867">KYLIN-867</a></li>
+  <li>Dynamic Data Model has been added to support adding or removing columns in the data model without rebuilding the cube from the beginning <a href="https://issues.apache.org/jira/browse/KYLIN-867">KYLIN-867</a></li>
   <li>Upgraded Apache Calcite to 1.3 for more bug fixes and new SQL functions <a href="https://issues.apache.org/jira/browse/KYLIN-881">KYLIN-881</a></li>
   <li>Cleanup job enhanced to make sure there’s no garbage files left in OS and HDFS/HBase after job build <a href="https://issues.apache.org/jira/browse/KYLIN-926">KYLIN-926</a></li>
   <li>Added setting option for Hive intermediate tables created by Kylin <a href="https://issues.apache.org/jira/browse/KYLIN-883">KYLIN-883</a></li>
-  <li>HBase Corprocessor enhanced to imrpove query performance <a href="https://issues.apache.org/jira/browse/KYLIN-857">KYLIN-857</a></li>
+  <li>HBase coprocessor enhanced to improve query performance <a href="https://issues.apache.org/jira/browse/KYLIN-857">KYLIN-857</a></li>
   <li>Kylin System Dashboard for usage, storage, performance <a href="https://issues.apache.org/jira/browse/KYLIN-792">KYLIN-792</a></li>
 </ul>
 
@@ -217,11 +217,11 @@
 
 <p><strong>Zeppelin Integration</strong></p>
 
-<p><a href="http://zeppelin.incubator.apache.org/">Apache Zeppelin</a> is a web-based notebook that enables interactive data analytics. The Apache Kylin team has contributed Kylin Interpreter which enable Zeppelin interactive with Kylin from notebook using ANSI SQL, this interpreter could be found from Zeppelin master code repo <a href="https://github.com/apache/incubator-zeppelin/tree/master/kylin">here</a>.</p>
+<p><a href="http://zeppelin.incubator.apache.org/">Apache Zeppelin</a> is a web-based notebook that enables interactive data analytics. The Apache Kylin team has contributed a Kylin interpreter, which enables Zeppelin to interact with Kylin from a notebook using ANSI SQL. The interpreter can be found in the Zeppelin master code repo <a href="https://github.com/apache/incubator-zeppelin/tree/master/kylin">here</a>.</p>
 
 <p><strong>Upgrade</strong></p>
 
-<p>We recommend to upgrade to this version from v0.7.x or even more early version for better performance, stablility and more clear one (most of the intermediate files will be cleaned up automatically). Also to keep up to date with community with latest features and supports.<br />
+<p>We recommend upgrading to this version from v0.7.x or earlier for better performance, stability, and a cleaner environment (most of the intermediate files will be cleaned up automatically), and to keep up to date with the community's latest features and support.<br />
 Any issue or question during upgrade, please send to Apache Kylin dev mailing list: <a href="&#109;&#097;&#105;&#108;&#116;&#111;:&#100;&#101;&#118;&#064;&#107;&#121;&#108;&#105;&#110;&#046;&#105;&#110;&#099;&#117;&#098;&#097;&#116;&#111;&#114;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&#103;">&#100;&#101;&#118;&#064;&#107;&#121;&#108;&#105;&#110;&#046;&#105;&#110;&#099;&#117;&#098;&#097;&#116;&#111;&#114;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&#103;</a></p>
 
 <p><em>Great thanks to everyone who contributed!</em></p>

Added: incubator/kylin/site/blog/2015/09/09/fast-cubing-on-spark/index.html
URL: http://svn.apache.org/viewvc/incubator/kylin/site/blog/2015/09/09/fast-cubing-on-spark/index.html?rev=1703083&view=auto
==============================================================================
--- incubator/kylin/site/blog/2015/09/09/fast-cubing-on-spark/index.html (added)
+++ incubator/kylin/site/blog/2015/09/09/fast-cubing-on-spark/index.html Tue Sep 15 02:29:22 2015
@@ -0,0 +1,373 @@
+<!--
+* Licensed to the Apache Software Foundation (ASF) under one
+* or more contributor license agreements.  See the NOTICE file
+* distributed with this work for additional information
+* regarding copyright ownership.  The ASF licenses this file
+* to you under the Apache License, Version 2.0 (the
+* "License"); you may not use this file except in compliance
+* with the License.  You may obtain a copy of the License at
+*
+*     http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+-->
+<!doctype html>
+<html>
+	<!--
+* Licensed to the Apache Software Foundation (ASF) under one
+* or more contributor license agreements.  See the NOTICE file
+* distributed with this work for additional information
+* regarding copyright ownership.  The ASF licenses this file
+* to you under the Apache License, Version 2.0 (the
+* "License"); you may not use this file except in compliance
+* with the License.  You may obtain a copy of the License at
+*
+*     http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+-->
+
+<head>
+  <meta charset="utf-8">
+  <meta http-equiv="X-UA-Compatible" content="IE=edge">
+  <meta name="viewport" content="width=device-width, initial-scale=1">
+
+  <title>Apache Kylin | Fast Cubing on Spark in Apache Kylin</title>
+  <meta name="description" content="Preparation">
+  <meta name="author"      content="Apache Kylin">
+  <link rel="shortcut icon" href="fav.png" type="image/png">
+
+
+
+<link rel="stylesheet" href="/assets/css/animate.css">
+<!-- Bootstrap -->
+<link rel="stylesheet" href="/assets/css/bootstrap.min.css">
+
+<!-- Fonts -->
+<!-- <link rel="stylesheet" href="http://fonts.googleapis.com/css?family=Alice|Open+Sans:400,300,700"> -->
+
+<!-- Icons -->
+<link rel="stylesheet" href="/assets/css/font-awesome.min.css">
+
+  <!-- Custom styles -->
+  <link rel="stylesheet" href="/assets/css/styles.css">
+  <link rel="stylesheet" href="/assets/css/docs.css">
+  <link rel="stylesheet" href="/assets/css/pygments.css">
+
+  <link rel="canonical" href="http://kylin.incubator.apache.org/blog/2015/09/09/fast-cubing-on-spark/">
+  <link rel="alternate" type="application/rss+xml" title="Apache Kylin" href="http://kylin.incubator.apache.org/feed.xml" />
+
+<!--[if lt IE 9]> <script src="assets/js/html5shiv.js"></script> <![endif]-->
+<script>
+  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
+  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
+  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
+  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
+
+  //original tracker for kylin.io
+  ga('create', 'UA-55534813-1', 'auto');
+  //new tracker for kylin.incubator.apache.org
+  ga('create', 'UA-55534813-2', 'auto', {'name':'incubator'});
+
+  ga('send', 'pageview');
+  ga('incubator.send', 'pageview');
+
+
+</script>
+<script type="text/javascript" src="/assets/js/jquery-1.9.1.min.js"></script>
+<script type="text/javascript" src="/assets/js/nside.js"></script>
+<script type="text/javascript" src="/assets/js/nnav.js"></script>
+</head>
+
+	<body>
+		<!--
+* Licensed to the Apache Software Foundation (ASF) under one
+* or more contributor license agreements.  See the NOTICE file
+* distributed with this work for additional information
+* regarding copyright ownership.  The ASF licenses this file
+* to you under the Apache License, Version 2.0 (the
+* "License"); you may not use this file except in compliance
+* with the License.  You may obtain a copy of the License at
+*
+*     http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+-->
+
+<header id="header" >
+  
+  <div id="head" class="parallax" parallax-speed="3" >
+    <div id="logo" class="text-center"> <img class="img-circle" id="circlelogo" src="/assets/images/kylin_logo.jpg"> <span class="title" >Apache Kylin</span> <span class="tagline">Extreme OLAP Engine for Big Data</span> 
+    </div>
+  </div>
+  
+
+  <!-- Main Menu -->
+  <nav class="navbar navbar-default" role="navigation" id="nav-wrapper">
+  <div class="container-fluid" id="nav">
+    <!--
+    <img class="img-circle" width="40px" height="40px" id="circlelogo" src="/assets/images/kylin_logo.jpg">
+    -->
+    <!-- Brand and toggle get grouped for better mobile display -->
+    <div class="navbar-header">
+      <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">
+        <span class="sr-only">Toggle navigation</span>
+        <span class="icon-bar"></span>
+        <span class="icon-bar"></span>
+        <span class="icon-bar"></span>
+      </button>
+     
+    </div>
+
+    <!-- Collect the nav links, forms, and other content for toggling -->
+    <div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
+      <ul class="nav navbar-nav">
+     <li><a href="/">Home</a></li>
+          <li><a href="/docs" >Docs</a></li>
+          <li><a href="/download">Download</a></li>
+          <li><a href="/community" >Community</a></li>
+          <li><a href="/development" >Development</a></li>
+          <li><a href="/blog">Blog</a></li>
+          <li><a href="/cn" >中文版</a></li>
+          <li><a href="https://twitter.com/apachekylin" target="_blank" class="fa fa-twitter fa-lg" title="Twitter: @ApacheKylin" ></a></li>
+          <li><a href="https://github.com/apache/incubator-kylin" target="_blank" class="fa fa-github-alt fa-lg" title="Github: apache/incubator-kylin" ></a></li>          
+          <li><a href="https://www.facebook.com/kylinio" target="_blank" class="fa fa-facebook fa-lg" title="Facebook: kylin.io" ></a></li>   
+      </ul>      
+    </div><!-- /.navbar-collapse -->
+  </div><!-- /.container-fluid -->
+</nav>
+ </header>
+
+		<div class="page-content">
+			<header style=" padding:2em 0 0 0">
+			<div class="container" >
+				<h4 class="section-title"><span>Kylin Technical Blog</span></h4>
+			</div>
+		</div>
+
+		<div class="container">
+			<div>
+				<!--
+* Licensed to the Apache Software Foundation (ASF) under one
+* or more contributor license agreements.  See the NOTICE file
+* distributed with this work for additional information
+* regarding copyright ownership.  The ASF licenses this file
+* to you under the Apache License, Version 2.0 (the
+* "License"); you may not use this file except in compliance
+* with the License.  You may obtain a copy of the License at
+*
+*     http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+-->
+
+<div class="post" style=" padding:2em 4em 4em 4em">
+
+  <header class="post-header">
+    <h1 class="post-title">Fast Cubing on Spark in Apache Kylin</h1>
+    <p class="post-meta" >Sep 9, 2015 • Qianhao Zhou</p>
+  </header>
+
+  <article class="post-content" >
+    <h2 id="preparation">Preparation</h2>
+
+<p>To keep the POC phase as simple as possible, a standalone Spark cluster is the simplest choice.<br />
+The environment is set up as follows:</p>
+
+<ol>
+  <li>
+    <p>Hadoop sandbox (Hortonworks HDP 2.2.0)</p>
+
+    <p>(8 cores, 16G) * 1</p>
+  </li>
+  <li>
+    <p>Spark (1.4.1)</p>
+
+    <p>master:(4 cores, 8G)</p>
+
+    <p>worker:(4 cores, 8G) * 2</p>
+  </li>
+</ol>
+
+<p>The Hadoop configuration files should also be placed in SPARK_HOME/conf.</p>
+
+<h2 id="fast-cubing-implementation-on-spark">Fast Cubing Implementation on Spark</h2>
+
+<p>As a computation framework, Spark provides a much richer set of operators than MapReduce, and some of them are well suited to the cubing algorithm, for instance <strong>aggregate</strong>.</p>
+
+<p>As described in <a href="http://kylin.incubator.apache.org/blog/2015/08/15/fast-cubing/" title="Fast Cubing Algorithm in Apache Kylin">Fast cubing algorithm</a>, the process contains several steps:</p>
+
+<ol>
+  <li>build dictionary</li>
+  <li>calculate region split for HBase</li>
+  <li>build &amp; output cuboid data</li>
+</ol>
+
+<hr />
+
+<p><strong>build dictionary</strong></p>
+
+<p>Building a dictionary requires the distinct values of each column, which the <strong><em>DataFrame</em></strong> API (available since Spark 1.3.0) already provides.</p>
+
+<p>So after loading the data from Hive through Spark SQL, it is natural to use this API directly to build the dictionary.</p>
+
+<hr />
+
+<p><strong>calculate region split</strong></p>
+
+<p>To calculate the distribution of all cuboids, Kylin uses a HyperLogLog implementation. Each record has a counter whose size is 16KB by default, so shuffling these counters across the cluster would be very expensive.</p>
+
+<p>Spark provides the <strong><em>aggregate</em></strong> operator, which reduces the shuffle size: it first runs a map-reduce phase locally and then runs another round of reduce to merge the data from each node.</p>
+
+<hr />
+
+<p><strong>build &amp; output cuboid data</strong></p>
+
+<p>To build a cube, Kylin requires a small batch of data that fits into memory all at once.</p>
+
+<p>Previously, in the MapReduce implementation, Kylin leveraged the life-cycle callback <strong>cleanup</strong> to gather all the input together as a batch. This cannot be applied directly to the map and reduce operators in Spark, which do not have such a life-cycle callback.</p>
+
+<p>However, Spark provides the <strong><em>glom</em></strong> operator, which coalesces all elements within each partition into an array. This is exactly what Kylin needs to build a small batch.</p>
+
+<p>Once the batch data is ready, we can just apply the Fast Cubing algorithm.</p>
+
+<p>Then the Spark API <strong><em>saveAsNewAPIHadoopFile</em></strong> allows us to write HFiles to HDFS and bulk load them into HBase.</p>
+
+<h2 id="statistics">Statistics</h2>
+
+<p>We used the sample data that Kylin provides to build the cube; the total record count is 10000.</p>
+
+<p>Below are the results (the system environment is described above):</p>
+<table>
+    <tr>
+        <td></td>
+        <td>Spark</td>
+        <td>MR</td>
+    </tr>
+    <tr>
+        <td>Duration</td>
+        <td>5.5 min</td>
+        <td>10+ min</td>
+    </tr>
+</table>
+
+<h2 id="issues">Issues</h2>
+
+<p>Since HDP 2.2+ requires Hive 0.14.0 while Spark 1.3.0 only supports Hive 0.13.0, there are several compatibility problems in hive-site.xml that we need to fix.</p>
+
+<ol>
+  <li>
+    <p>some time-related settings</p>
+
+    <p>Several settings have default values in Hive 0.14.0 that cannot be parsed in 0.13.0. For example, the default value of <strong>hive.metastore.client.connect.retry.delay</strong> is <strong>5s</strong>, but in Hive 0.13.0 this value must be a plain Long. So you have to manually change it from <strong>5s</strong> to <strong>5</strong>.</p>
+  </li>
+  <li>
+    <p>hive.security.authorization.manager</p>
+
+    <p>If you have enabled this configuration, its default value is <strong>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory</strong>, which was newly introduced in Hive 0.14.0. This means you have to use another implementation, such as <strong>org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider</strong>.</p>
+  </li>
+  <li>
+    <p>hive.execution.engine</p>
+
+    <p>In Hive 0.14.0, the default value of <strong>hive.execution.engine</strong> is <strong>tez</strong>. Change it to <strong>mr</strong> in the Spark classpath; otherwise there will be a NoClassDefFoundError.</p>
+  </li>
+</ol>
+
+<p>NOTE: Spark 1.4.0 has a <a href="https://issues.apache.org/jira/browse/SPARK-8368">bug</a> that leads to a ClassNotFoundException; it has been fixed in Spark 1.4.1. So if you are planning to run on Spark 1.4.0, you may need to upgrade to 1.4.1.</p>
+
+<p>Last but not least, when you try to run a Spark application on YARN, make sure that hive-site.xml and hbase-site.xml are in HADOOP_CONF_DIR or YARN_CONF_DIR, since by default HDP places these configuration files in separate directories.</p>
+
+<h2 id="next-move">Next move</h2>
+
+<p>Clearly, the comparison above is not a fair one: the environments are not the same, the test data size is too small, etc.</p>
+
+<p>However, it shows that migrating from MR to Spark is practical, and some useful operators in Spark will save us quite a lot of code.</p>
+
+<p>So our next move is to set up a cluster and run the benchmark on a real data set for both MR and Spark.</p>
+
+<p>We will update the benchmark once we have finished. Please stay tuned.</p>
+
+  </article>
+
+</div>
+
+
+
+
+
+			</div>
+		</div>		
+		<!--
+* Licensed to the Apache Software Foundation (ASF) under one
+* or more contributor license agreements.  See the NOTICE file
+* distributed with this work for additional information
+* regarding copyright ownership.  The ASF licenses this file
+* to you under the Apache License, Version 2.0 (the
+* "License"); you may not use this file except in compliance
+* with the License.  You may obtain a copy of the License at
+*
+*     http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+-->
+
+<footer id="underfooter">
+  <div class="container">
+    <div class="row">
+      <div class="col-md-12 widget" >
+        <div class="widget-body" style="text-align:center">
+          <div>
+          Apache Kylin is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.
+          </div>
+        <a href="http://www.apache.org">
+            <img id="asf-logo" alt="Apache Software Foundation" src="/assets/images/feather-small.gif">
+        </a>
+        <a href="http://incubator.apache.org/">
+            <img id="incubator-logo" alt="Apache Incubator" src="/assets/images/egg-logo.png">
+        </a>
+
+        <div id="copyright">
+            <p>Copyright &#169; 2014 The Apache Software Foundation, Licensed under the <a
+                    href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.<br/>Apache, the
+                Apache feather logo, and the Apache Incubator project logo are trademarks of The Apache Software
+                Foundation.</p>
+        </div>
+        </div>
+      </div>
+    </div>
+    <!-- /row of widgets --> 
+
+  </div>
+  <div></div>
+  
+</footer>
+
+	<script src="/assets/js/jquery-1.9.1.min.js"></script> 
+	<script src="/assets/js/bootstrap.min.js"></script> 
+	<script src="/assets/js/main.js"></script>
+	</body>
+</html>
+
+
+
+

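Reviewer sketch for the post added above: the "calculate region split" and "build & output cuboid data" steps can be illustrated in plain Python. This is only a minimal simulation under stated assumptions, not Kylin's or Spark's actual code. Partitions are modelled as plain lists, exact sets stand in for the HyperLogLog counters, and every helper name here (all_cuboids, new_counters, seq_op, comb_op, aggregate) is hypothetical.

```python
from functools import reduce
from itertools import combinations


def all_cuboids(n_dims):
    """Every non-empty combination of dimension indexes (one cuboid each)."""
    return [c for k in range(1, n_dims + 1)
            for c in combinations(range(n_dims), k)]


def new_counters(n_dims):
    """One empty counter per cuboid; a set stands in for a HyperLogLog counter."""
    return {cuboid: set() for cuboid in all_cuboids(n_dims)}


def seq_op(counters, record):
    """Local phase: fold one record into the counters of its own partition."""
    for cuboid, seen in counters.items():
        seen.add(tuple(record[i] for i in cuboid))
    return counters


def comb_op(left, right):
    """Merge phase: combine the partial counters of two partitions."""
    for cuboid, seen in right.items():
        left[cuboid] |= seen
    return left


def aggregate(partitions, zero, seq_op, comb_op):
    """Simulated two-phase aggregate: fold each partition, then merge partials."""
    partials = [reduce(seq_op, part, zero()) for part in partitions]
    return reduce(comb_op, partials, zero())


# Two simulated partitions of (country, channel) records.
partitions = [
    [("US", "web"), ("US", "app")],
    [("CN", "web"), ("US", "web")],
]

counters = aggregate(partitions, lambda: new_counters(2), seq_op, comb_op)
cardinality = {cuboid: len(seen) for cuboid, seen in counters.items()}

# glom-style batching: each partition becomes one in-memory batch.
batches = [list(part) for part in partitions]

print(cardinality)
print(len(batches), "batches")
```

In this sketch only one partial counter map per partition reaches the merge phase, rather than one counter per record, which is the shuffle saving the post attributes to aggregate. The batches list mirrors what glom yields: one array per partition.
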
Modified: incubator/kylin/site/blog/index.html
URL: http://svn.apache.org/viewvc/incubator/kylin/site/blog/index.html?rev=1703083&r1=1703082&r2=1703083&view=diff
==============================================================================
--- incubator/kylin/site/blog/index.html (original)
+++ incubator/kylin/site/blog/index.html Tue Sep 15 02:29:22 2015
@@ -174,6 +174,12 @@
             
             <li>
         <h2 align="left">
+          <a class="post-link" href="/blog/2015/09/09/fast-cubing-on-spark/">Fast Cubing on Spark in Apache Kylin</a></h2><div align="left" class="post-meta">posted: Sep 9, 2015</div>
+        
+      </li>
+    
+            <li>
+        <h2 align="left">
           <a class="post-link" href="/blog/2015/09/06/release-v1.0-incubating/">Apache Kylin 1.0 (incubating) Release Announcement</a></h2><div align="left" class="post-meta">posted: Sep 6, 2015</div>
         
       </li>

Modified: incubator/kylin/site/feed.xml
URL: http://svn.apache.org/viewvc/incubator/kylin/site/feed.xml?rev=1703083&r1=1703082&r2=1703083&view=diff
==============================================================================
--- incubator/kylin/site/feed.xml (original)
+++ incubator/kylin/site/feed.xml Tue Sep 15 02:29:22 2015
@@ -19,11 +19,140 @@
     <description>Apache Kylin Home</description>
     <link>http://kylin.incubator.apache.org/</link>
     <atom:link href="http://kylin.incubator.apache.org/feed.xml" rel="self" type="application/rss+xml"/>
-    <pubDate>Mon, 07 Sep 2015 18:48:58 -0700</pubDate>
-    <lastBuildDate>Mon, 07 Sep 2015 18:48:58 -0700</lastBuildDate>
+    <pubDate>Mon, 14 Sep 2015 19:28:08 -0700</pubDate>
+    <lastBuildDate>Mon, 14 Sep 2015 19:28:08 -0700</lastBuildDate>
     <generator>Jekyll v2.5.3</generator>
     
       <item>
+        <title>Fast Cubing on Spark in Apache Kylin</title>
+        <description>&lt;h2 id=&quot;preparation&quot;&gt;Preparation&lt;/h2&gt;
+
+&lt;p&gt;To keep the POC phase as simple as possible, a standalone Spark cluster is the simplest choice.&lt;br /&gt;
+The environment is set up as follows:&lt;/p&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;
+    &lt;p&gt;Hadoop sandbox (Hortonworks HDP 2.2.0)&lt;/p&gt;
+
+    &lt;p&gt;(8 cores, 16G) * 1&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Spark (1.4.1)&lt;/p&gt;
+
+    &lt;p&gt;master:(4 cores, 8G)&lt;/p&gt;
+
+    &lt;p&gt;worker:(4 cores, 8G) * 2&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;p&gt;The Hadoop configuration files should also be placed in SPARK_HOME/conf.&lt;/p&gt;
+
+&lt;h2 id=&quot;fast-cubing-implementation-on-spark&quot;&gt;Fast Cubing Implementation on Spark&lt;/h2&gt;
+
+&lt;p&gt;As a computation framework, Spark provides a much richer set of operators than MapReduce, and some of them are well suited to the cubing algorithm, for instance &lt;strong&gt;aggregate&lt;/strong&gt;.&lt;/p&gt;
+
+&lt;p&gt;As described in &lt;a href=&quot;http://kylin.incubator.apache.org/blog/2015/08/15/fast-cubing/&quot; title=&quot;Fast Cubing Algorithm in Apache Kylin&quot;&gt;Fast cubing algorithm&lt;/a&gt;, the process contains several steps:&lt;/p&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;build dictionary&lt;/li&gt;
+  &lt;li&gt;calculate region split for HBase&lt;/li&gt;
+  &lt;li&gt;build &amp;amp; output cuboid data&lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;hr /&gt;
+
+&lt;p&gt;&lt;strong&gt;build dictionary&lt;/strong&gt;&lt;/p&gt;
+
+&lt;p&gt;Building a dictionary requires the distinct values of each column, which the &lt;strong&gt;&lt;em&gt;DataFrame&lt;/em&gt;&lt;/strong&gt; API (available since Spark 1.3.0) already provides.&lt;/p&gt;
+
+&lt;p&gt;So after loading the data from Hive through Spark SQL, it is natural to use this API directly to build the dictionary.&lt;/p&gt;
+
+&lt;hr /&gt;
+
+&lt;p&gt;&lt;strong&gt;calculate region split&lt;/strong&gt;&lt;/p&gt;
+
+&lt;p&gt;To calculate the distribution of all cuboids, Kylin uses a HyperLogLog implementation. Each record has a counter whose size is 16KB by default, so shuffling these counters across the cluster would be very expensive.&lt;/p&gt;
+
+&lt;p&gt;Spark provides the &lt;strong&gt;&lt;em&gt;aggregate&lt;/em&gt;&lt;/strong&gt; operator, which reduces the shuffle size: it first runs a map-reduce phase locally and then runs another round of reduce to merge the data from each node.&lt;/p&gt;
+
+&lt;hr /&gt;
+
+&lt;p&gt;&lt;strong&gt;build &amp;amp; output cuboid data&lt;/strong&gt;&lt;/p&gt;
+
+&lt;p&gt;To build a cube, Kylin requires a small batch of data that fits into memory all at once.&lt;/p&gt;
+
+&lt;p&gt;Previously, in the MapReduce implementation, Kylin leveraged the life-cycle callback &lt;strong&gt;cleanup&lt;/strong&gt; to gather all the input together as a batch. This cannot be applied directly to the map and reduce operators in Spark, which do not have such a life-cycle callback.&lt;/p&gt;
+
+&lt;p&gt;However, Spark provides the &lt;strong&gt;&lt;em&gt;glom&lt;/em&gt;&lt;/strong&gt; operator, which coalesces all elements within each partition into an array. This is exactly what Kylin needs to build a small batch.&lt;/p&gt;
+
+&lt;p&gt;Once the batch data is ready, we can just apply the Fast Cubing algorithm.&lt;/p&gt;
+
+&lt;p&gt;Then the Spark API &lt;strong&gt;&lt;em&gt;saveAsNewAPIHadoopFile&lt;/em&gt;&lt;/strong&gt; allows us to write HFiles to HDFS and bulk load them into HBase.&lt;/p&gt;
+
+&lt;h2 id=&quot;statistics&quot;&gt;Statistics&lt;/h2&gt;
+
+&lt;p&gt;We used the sample data that Kylin provides to build the cube; the total record count is 10000.&lt;/p&gt;
+
+&lt;p&gt;Below are the results (the system environment is described above):&lt;/p&gt;
+&lt;table&gt;
+    &lt;tr&gt;
+        &lt;td&gt;&lt;/td&gt;
+        &lt;td&gt;Spark&lt;/td&gt;
+        &lt;td&gt;MR&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+        &lt;td&gt;Duration&lt;/td&gt;
+        &lt;td&gt;5.5 min&lt;/td&gt;
+        &lt;td&gt;10+ min&lt;/td&gt;
+    &lt;/tr&gt;
+&lt;/table&gt;
+
+&lt;h2 id=&quot;issues&quot;&gt;Issues&lt;/h2&gt;
+
+&lt;p&gt;Since HDP 2.2+ requires Hive 0.14.0 while Spark 1.3.0 only supports Hive 0.13.0, there are several compatibility problems in hive-site.xml that we need to fix.&lt;/p&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;
+    &lt;p&gt;some time-related settings&lt;/p&gt;
+
+    &lt;p&gt;Several settings have default values in Hive 0.14.0 that cannot be parsed in 0.13.0. For example, the default value of &lt;strong&gt;hive.metastore.client.connect.retry.delay&lt;/strong&gt; is &lt;strong&gt;5s&lt;/strong&gt;, but in Hive 0.13.0 this value must be a plain Long. So you have to manually change it from &lt;strong&gt;5s&lt;/strong&gt; to &lt;strong&gt;5&lt;/strong&gt;.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;hive.security.authorization.manager&lt;/p&gt;
+
+    &lt;p&gt;If you have enabled this configuration, its default value is &lt;strong&gt;org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory&lt;/strong&gt;, which was newly introduced in Hive 0.14.0. This means you have to use another implementation, such as &lt;strong&gt;org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider&lt;/strong&gt;.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;hive.execution.engine&lt;/p&gt;
+
+    &lt;p&gt;In Hive 0.14.0, the default value of &lt;strong&gt;hive.execution.engine&lt;/strong&gt; is &lt;strong&gt;tez&lt;/strong&gt;. Change it to &lt;strong&gt;mr&lt;/strong&gt; in the Spark classpath; otherwise there will be a NoClassDefFoundError.&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;p&gt;NOTE: Spark 1.4.0 has a &lt;a href=&quot;https://issues.apache.org/jira/browse/SPARK-8368&quot;&gt;bug&lt;/a&gt; that leads to a ClassNotFoundException; it has been fixed in Spark 1.4.1. So if you are planning to run on Spark 1.4.0, you may need to upgrade to 1.4.1.&lt;/p&gt;
+
+&lt;p&gt;Last but not least, when you try to run a Spark application on YARN, make sure that hive-site.xml and hbase-site.xml are in HADOOP_CONF_DIR or YARN_CONF_DIR, since by default HDP places these configuration files in separate directories.&lt;/p&gt;
+
+&lt;h2 id=&quot;next-move&quot;&gt;Next move&lt;/h2&gt;
+
+&lt;p&gt;Clearly, the comparison above is not a fair one: the environments are not the same, the test data size is too small, etc.&lt;/p&gt;
+
+&lt;p&gt;However, it shows that migrating from MR to Spark is practical, and some useful operators in Spark will save us quite a lot of code.&lt;/p&gt;
+
+&lt;p&gt;So our next move is to set up a cluster and run the benchmark on a real data set for both MR and Spark.&lt;/p&gt;
+
+&lt;p&gt;We will update the benchmark once we have finished. Please stay tuned.&lt;/p&gt;
+</description>
+        <pubDate>Wed, 09 Sep 2015 08:28:00 -0700</pubDate>
+        <link>http://kylin.incubator.apache.org/blog/2015/09/09/fast-cubing-on-spark/</link>
+        <guid isPermaLink="true">http://kylin.incubator.apache.org/blog/2015/09/09/fast-cubing-on-spark/</guid>
+        
+        
+        <category>blog</category>
+        
+      </item>
+    
+      <item>
         <title>Apache Kylin 1.0 (incubating) Release Announcement</title>
         <description>&lt;p&gt;The Apache Kylin team is pleased to announce the release of Apache Kylin v1.0 (incubating). Apache Kylin is an open source Distributed Analytics Engine designed to provide SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets.&lt;/p&gt;
 
@@ -36,11 +165,11 @@
 &lt;p&gt;&lt;strong&gt;Kylin Core Improvement&lt;/strong&gt;&lt;/p&gt;
 
 &lt;ul&gt;
-  &lt;li&gt;Dynamic Data Model has been supported for new added or removed column in data model without rebuild cube from the beginning &lt;a href=&quot;https://issues.apache.org/jira/browse/KYLIN-867&quot;&gt;KYLIN-867&lt;/a&gt;&lt;/li&gt;
+  &lt;li&gt;Dynamic Data Model has been added to support adding or removing columns in the data model without rebuilding the cube from the beginning &lt;a href=&quot;https://issues.apache.org/jira/browse/KYLIN-867&quot;&gt;KYLIN-867&lt;/a&gt;&lt;/li&gt;
   &lt;li&gt;Upgraded Apache Calcite to 1.3 for more bug fixes and new SQL functions &lt;a href=&quot;https://issues.apache.org/jira/browse/KYLIN-881&quot;&gt;KYLIN-881&lt;/a&gt;&lt;/li&gt;
   &lt;li&gt;Cleanup job enhanced to make sure there’s no garbage files left in OS and HDFS/HBase after job build &lt;a href=&quot;https://issues.apache.org/jira/browse/KYLIN-926&quot;&gt;KYLIN-926&lt;/a&gt;&lt;/li&gt;
   &lt;li&gt;Added setting option for Hive intermediate tables created by Kylin &lt;a href=&quot;https://issues.apache.org/jira/browse/KYLIN-883&quot;&gt;KYLIN-883&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;HBase Corprocessor enhanced to imrpove query performance &lt;a href=&quot;https://issues.apache.org/jira/browse/KYLIN-857&quot;&gt;KYLIN-857&lt;/a&gt;&lt;/li&gt;
+  &lt;li&gt;HBase coprocessor enhanced to improve query performance &lt;a href=&quot;https://issues.apache.org/jira/browse/KYLIN-857&quot;&gt;KYLIN-857&lt;/a&gt;&lt;/li&gt;
   &lt;li&gt;Kylin System Dashboard for usage, storage, performance &lt;a href=&quot;https://issues.apache.org/jira/browse/KYLIN-792&quot;&gt;KYLIN-792&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 
@@ -57,11 +186,11 @@
 
 &lt;p&gt;&lt;strong&gt;Zeppelin Integration&lt;/strong&gt;&lt;/p&gt;
 
-&lt;p&gt;&lt;a href=&quot;http://zeppelin.incubator.apache.org/&quot;&gt;Apache Zeppelin&lt;/a&gt; is a web-based notebook that enables interactive data analytics. The Apache Kylin team has contributed Kylin Interpreter which enable Zeppelin interactive with Kylin from notebook using ANSI SQL, this interpreter could be found from Zeppelin master code repo &lt;a href=&quot;https://github.com/apache/incubator-zeppelin/tree/master/kylin&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
+&lt;p&gt;&lt;a href=&quot;http://zeppelin.incubator.apache.org/&quot;&gt;Apache Zeppelin&lt;/a&gt; is a web-based notebook that enables interactive data analytics. The Apache Kylin team has contributed a Kylin interpreter, which enables Zeppelin to interact with Kylin from a notebook using ANSI SQL. The interpreter can be found in the Zeppelin master code repo &lt;a href=&quot;https://github.com/apache/incubator-zeppelin/tree/master/kylin&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
 
 &lt;p&gt;&lt;strong&gt;Upgrade&lt;/strong&gt;&lt;/p&gt;
 
-&lt;p&gt;We recommend to upgrade to this version from v0.7.x or even more early version for better performance, stablility and more clear one (most of the intermediate files will be cleaned up automatically). Also to keep up to date with community with latest features and supports.&lt;br /&gt;
+&lt;p&gt;We recommend upgrading to this version from v0.7.x or earlier for better performance, stability, and a cleaner environment (most of the intermediate files will be cleaned up automatically), and to keep up to date with the community's latest features and support.&lt;br /&gt;
 Any issue or question during upgrade, please send to Apache Kylin dev mailing list: &lt;a href=&quot;&amp;#109;&amp;#097;&amp;#105;&amp;#108;&amp;#116;&amp;#111;:&amp;#100;&amp;#101;&amp;#118;&amp;#064;&amp;#107;&amp;#121;&amp;#108;&amp;#105;&amp;#110;&amp;#046;&amp;#105;&amp;#110;&amp;#099;&amp;#117;&amp;#098;&amp;#097;&amp;#116;&amp;#111;&amp;#114;&amp;#046;&amp;#097;&amp;#112;&amp;#097;&amp;#099;&amp;#104;&amp;#101;&amp;#046;&amp;#111;&amp;#114;&amp;#103;&quot;&gt;&amp;#100;&amp;#101;&amp;#118;&amp;#064;&amp;#107;&amp;#121;&amp;#108;&amp;#105;&amp;#110;&amp;#046;&amp;#105;&amp;#110;&amp;#099;&amp;#117;&amp;#098;&amp;#097;&amp;#116;&amp;#111;&amp;#114;&amp;#046;&amp;#097;&amp;#112;&amp;#097;&amp;#099;&amp;#104;&amp;#101;&amp;#046;&amp;#111;&amp;#114;&amp;#103;&lt;/a&gt;&lt;/p&gt;
 
 &lt;p&gt;&lt;em&gt;Great thanks to everyone who contributed!&lt;/em&gt;&lt;/p&gt;