Posted to commits@kylin.apache.org by li...@apache.org on 2015/09/19 01:47:12 UTC

[36/50] [abbrv] incubator-kylin git commit: blog for "fast cubing on spark"

blog for "fast cubing on spark"


Project: http://git-wip-us.apache.org/repos/asf/incubator-kylin/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-kylin/commit/fbe3395f
Tree: http://git-wip-us.apache.org/repos/asf/incubator-kylin/tree/fbe3395f
Diff: http://git-wip-us.apache.org/repos/asf/incubator-kylin/diff/fbe3395f

Branch: refs/heads/master
Commit: fbe3395f29b2d872d062d4a0f3b8e9e9f011f1bc
Parents: 419798d
Author: qianhao.zhou <qi...@ebay.com>
Authored: Wed Sep 9 15:45:39 2015 +0800
Committer: qianhao.zhou <qi...@ebay.com>
Committed: Wed Sep 9 15:45:39 2015 +0800

----------------------------------------------------------------------
 .../blog/2015-09-09-fast-cubing-on-spark.md     | 112 +++++++++++++++++++
 1 file changed, 112 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-kylin/blob/fbe3395f/website/_posts/blog/2015-09-09-fast-cubing-on-spark.md
----------------------------------------------------------------------
diff --git a/website/_posts/blog/2015-09-09-fast-cubing-on-spark.md b/website/_posts/blog/2015-09-09-fast-cubing-on-spark.md
new file mode 100644
index 0000000..fd9e56d
--- /dev/null
+++ b/website/_posts/blog/2015-09-09-fast-cubing-on-spark.md
@@ -0,0 +1,112 @@
+# Fast Cubing on Spark in Apache Kylin
+
+
+## Preparation
+
+In order to keep the POC phase as simple as possible, a standalone Spark cluster is the best choice.
+So the environment setup is as below:
+
+1. Hadoop sandbox (Hortonworks HDP 2.2.0)
+	
+	(8 cores, 16G) * 1
+
+2. Spark (1.4.1)
+
+	master:(4 cores, 8G)
+	
+	worker:(4 cores, 8G) * 2
+	
+The Hadoop configuration files should also be placed in SPARK_HOME/conf.
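+
+For reference, below is a minimal sketch of how a driver program might target this standalone cluster. The master host name, memory, and core settings are illustrative only, not taken from the actual POC code:
+
+```scala
+// Sketch only: point a Spark job at the standalone cluster described above.
+// The Hadoop/Hive/HBase config files under SPARK_HOME/conf are picked up by spark-submit.
+import org.apache.spark.{SparkConf, SparkContext}
+
+val conf = new SparkConf()
+  .setAppName("kylin-fast-cubing-poc")
+  .setMaster("spark://spark-master-host:7077")   // standalone master (example host name)
+  .set("spark.executor.memory", "6g")            // leave some headroom on the 8G workers
+  .set("spark.executor.cores", "4")
+
+val sc = new SparkContext(conf)
+```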
+
+
+## Fast Cubing Implementation on Spark
+
+As a computation framework, Spark provides much richer operators than MapReduce, and some of them, such as **aggregate**, are quite suitable for the cubing algorithm.
+
+As described in the [Fast Cubing algorithm](http://kylin.incubator.apache.org/blog/2015/08/15/fast-cubing/ "Fast Cubing Algorithm in Apache Kylin"), cubing consists of several steps:
+
+1. build dictionary
+2. calculate region splits for HBase
+3. build & output cuboid data
+
+----
+
+**build dictionary**
+
+In order to build the dictionary, the distinct values of each column are needed, which the new ***DataFrame*** API (available since Spark 1.3.0) already provides.
+
+So after loading the data from Hive through Spark SQL, it is quite natural to use this API directly to build the dictionary.
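+
+A minimal sketch of this step, assuming an existing SparkContext **sc** (for example, the one from the setup sketch above). The table and column names are examples only, not Kylin's actual identifiers:
+
+```scala
+// Sketch only: load the intermediate flat table from Hive and collect the
+// distinct values of each dictionary-encoded column with the DataFrame API.
+import org.apache.spark.sql.hive.HiveContext
+
+val hiveContext = new HiveContext(sc)
+
+val flatTable = hiveContext.sql("SELECT * FROM kylin_intermediate_flat_table")
+
+val dictColumns = Seq("SELLER_ID", "LSTG_FORMAT_NAME")   // example column names
+val distinctValues: Map[String, Array[String]] = dictColumns.map { col =>
+  col -> flatTable.select(col).distinct().collect().map(r => String.valueOf(r.get(0)))
+}.toMap
+// distinctValues(col) is then fed into Kylin's dictionary builder for that column
+```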
+
+----
+
+**calculate region split**
+
+In order to calculate the distribution of all cuboids, Kylin uses a HyperLogLog implementation. Each record carries a counter, which is 16KB by default, so shuffling these counters across the cluster would be very expensive.
+
+Spark provides an ***aggregate*** operator to reduce the shuffle size. It first performs a reduce locally within each partition, and then runs another round of reduce to merge the results from each node.
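+
+A minimal sketch of the ***aggregate*** pattern. Here a plain Set stands in for the HyperLogLog counter Kylin actually uses, so only the shuffle-reduction idea is shown, not Kylin's real counter class:
+
+```scala
+// Sketch only: per-partition state is built locally (seqOp) and only the
+// merged accumulators are combined across nodes (combOp), so no per-record
+// counter is ever shuffled.
+import org.apache.spark.rdd.RDD
+import scala.collection.mutable
+
+def estimateCuboidCardinality(records: RDD[(Long, String)]): Map[Long, Int] = {
+  type Acc = mutable.Map[Long, mutable.Set[String]]
+
+  val seqOp = (acc: Acc, rec: (Long, String)) => {
+    acc.getOrElseUpdate(rec._1, mutable.Set.empty[String]) += rec._2
+    acc
+  }
+
+  val combOp = (left: Acc, right: Acc) => {
+    right.foreach { case (cuboid, values) =>
+      left.getOrElseUpdate(cuboid, mutable.Set.empty[String]) ++= values
+    }
+    left
+  }
+
+  records
+    .aggregate(mutable.Map.empty[Long, mutable.Set[String]])(seqOp, combOp)
+    .mapValues(_.size)
+    .toMap
+}
+```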
+
+----
+
+**build & output cuboid data**
+
+In order to build the cube, Kylin requires the input to come in small batches, each of which can fit into memory at one time.
+
+In the previous MapReduce implementation, Kylin leveraged the life-cycle callback **cleanup** to gather all of a mapper's input together as one batch. This cannot be directly applied to the map-like operators in Spark, which have no such life-cycle callback.
+
+However, Spark provides an operator ***glom*** which coalesces all elements within each partition into an array, which is exactly what Kylin needs to build a small batch.
+
+Once the batch data is ready, we can just apply the Fast Cubing algorithm. 
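+
+A minimal sketch of the ***glom*** pattern. The **buildCuboids** function below is only a placeholder for the in-memory Fast Cubing routine, not Kylin's actual code:
+
+```scala
+// Sketch only: glom() turns each partition into one Array, which is then
+// processed as a single in-memory batch -- the same role the mapper's
+// cleanup() callback played in the MR implementation.
+import org.apache.spark.sql.Row
+
+def buildCuboids(batch: Array[Row]): Iterator[(Array[Byte], Array[Byte])] = {
+  // placeholder: run the Fast Cubing algorithm over the whole batch and emit
+  // (rowkey, encoded measures) pairs for every cuboid
+  Iterator.empty
+}
+
+val cuboidRdd = flatTable.rdd                  // RDD[Row] from the DataFrame above
+  .glom()                                      // RDD[Array[Row]]: one array per partition
+  .flatMap(batch => buildCuboids(batch))
+```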
+
+Then the Spark API ***saveAsNewAPIHadoopFile*** allows us to write HFiles to HDFS and bulk load them into HBase.
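+
+A minimal sketch of this output step, continuing from **cuboidRdd** above. The column family, qualifier, and output path are examples only, not Kylin's actual values:
+
+```scala
+// Sketch only: turn the (rowkey, value) pairs from the cubing step into
+// sorted KeyValues and write them as HFiles; the files can then be
+// bulk-loaded into HBase.
+import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue}
+import org.apache.hadoop.hbase.io.ImmutableBytesWritable
+import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
+import org.apache.hadoop.hbase.util.Bytes
+
+val family    = Bytes.toBytes("F1")            // example column family
+val qualifier = Bytes.toBytes("M")             // example qualifier
+val hbaseConf = HBaseConfiguration.create()
+
+// HFileOutputFormat2 requires the row keys to arrive in sorted order
+implicit val rowkeyOrdering: Ordering[ImmutableBytesWritable] =
+  Ordering.fromLessThan((a, b) => a.compareTo(b) < 0)
+
+cuboidRdd
+  .map { case (rowkey, value) =>
+    (new ImmutableBytesWritable(rowkey), new KeyValue(rowkey, family, qualifier, value))
+  }
+  .sortByKey()
+  .saveAsNewAPIHadoopFile(
+    "hdfs:///tmp/kylin/hfiles",                // example output path
+    classOf[ImmutableBytesWritable],
+    classOf[KeyValue],
+    classOf[HFileOutputFormat2],
+    hbaseConf)
+```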
+
+
+## Statistics
+
+We used the sample data provided by Kylin to build the cube; the total record count is 10,000.
+
+Below are the results (the system environment is described above):
+<table>
+    <tr>
+        <td></td>
+        <td>Spark</td>
+        <td>MR</td>
+    </tr>
+    <tr>
+        <td>Duration</td>
+        <td>5.5 min</td>
+        <td>10+ min</td>
+    </tr>
+</table>
+
+## Issues
+
+Since HDP 2.2+ requires Hive 0.14.0 while Spark 1.3.0 only supports Hive 0.13.0, there are several compatibility problems in hive-site.xml that we need to fix.
+
+1. some time-related settings
+
+    There are several settings whose default values in Hive 0.14.0 cannot be parsed by Hive 0.13.0. For example, the default value of **hive.metastore.client.connect.retry.delay** is **5s**, but in Hive 0.13.0 this value can only be a plain Long, so you have to manually change it from **5s** to **5**.
+
+2. hive.security.authorization.manager
+
+    If you have enabled this configuration, note that its default value, **org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory**, was newly introduced in Hive 0.14.0, which means you have to use another implementation, such as **org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider**.
+
+3. hive.execution.engine
+
+    In Hive 0.14.0, the default value of **hive.execution.engine** is **tez**; change it to **mr** in the hive-site.xml on the Spark classpath, otherwise there will be a NoClassDefFoundError.
+
+NOTE: Spark 1.4.0 has a [bug](https://issues.apache.org/jira/browse/SPARK-8368) which leads to a ClassNotFoundException; it has been fixed in Spark 1.4.1. So if you are planning to run on Spark 1.4.0, you may need to upgrade to 1.4.1.
+
+Last but not least, when you try to run a Spark application on YARN, make sure that hive-site.xml and hbase-site.xml are in HADOOP_CONF_DIR or YARN_CONF_DIR, since by default HDP places these configuration files in separate directories.
+
+
+
+
+## Next move
+
+Clearly the above is not a fair comparison: the environments are not the same, the test data size is too small, etc.
+
+However, it shows that it is practical to migrate from MR to Spark, and some useful operators in Spark will save us quite a bit of code.
+
+So our next move is to set up a cluster and benchmark both MR and Spark on a real data set.
+
+We will publish the benchmark results once we have finished; please stay tuned.
\ No newline at end of file