You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@kylin.apache.org by li...@apache.org on 2021/06/18 02:57:27 UTC

svn commit: r1890886 [8/8] - in /kylin/site: ./ blog/ blog/2021/ blog/2021/06/ blog/2021/06/17/ blog/2021/06/17/Why-did-Youzan-choose-Kylin4/ cn/blog/ cn/docs40/ cn/docs40/gettingstarted/ cn/docs40/howto/ cn/docs40/install/ cn/docs40/tutorial/ cn_blog/...

Modified: kylin/site/feed.xml
URL: http://svn.apache.org/viewvc/kylin/site/feed.xml?rev=1890886&r1=1890885&r2=1890886&view=diff
==============================================================================
--- kylin/site/feed.xml (original)
+++ kylin/site/feed.xml Fri Jun 18 02:57:25 2021
@@ -19,11 +19,319 @@
     <description>Apache Kylin Home</description>
     <link>http://kylin.apache.org/</link>
     <atom:link href="http://kylin.apache.org/feed.xml" rel="self" type="application/rss+xml"/>
-    <pubDate>Wed, 02 Jun 2021 20:18:35 -0700</pubDate>
-    <lastBuildDate>Wed, 02 Jun 2021 20:18:35 -0700</lastBuildDate>
+    <pubDate>Thu, 17 Jun 2021 19:32:15 -0700</pubDate>
+    <lastBuildDate>Thu, 17 Jun 2021 19:32:15 -0700</lastBuildDate>
     <generator>Jekyll v2.5.3</generator>
     
       <item>
+        <title>Why did Youzan choose Kylin4</title>
+        <description>&lt;p&gt;At the QCon Global Software Developers Conference held on May 29, 2021, Zheng Shengjun, head of Youzan’s data infrastructure platform, shared Youzan’s internal experience and optimization practice with Kylin 4.0 in the track on open source big data frameworks and applications. &lt;br /&gt;
+For many users of Kylin 2/3 (Kylin on HBase), this is also a chance to learn how and why to upgrade to Kylin 4.&lt;/p&gt;
+
+&lt;p&gt;This sharing is mainly divided into the following parts:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;The reason for choosing Kylin 4&lt;/li&gt;
+  &lt;li&gt;Introduction to Kylin 4&lt;/li&gt;
+  &lt;li&gt;How to optimize performance of Kylin 4&lt;/li&gt;
+  &lt;li&gt;Practice of Kylin 4 in Youzan&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h2 id=&quot;the-reason-for-choosing-kylin-4&quot;&gt;01 The reason for choosing Kylin 4&lt;/h2&gt;
+
+&lt;h3 id=&quot;introduction-to-youzan&quot;&gt;Introduction to Youzan&lt;/h3&gt;
+&lt;p&gt;China Youzan Co., Ltd (stock code 08083.HK) is an enterprise mainly engaged in retail technology services.&lt;br /&gt;
+At present, it owns several tools and solutions that provide SaaS software products and talent services to help merchants operate mobile social e-commerce and new retail channels in an all-round way. &lt;br /&gt;
+Currently Youzan has hundreds of millions of consumers and 6 million existing merchants.&lt;/p&gt;
+
+&lt;h3 id=&quot;history-of-kylin-in-youzan&quot;&gt;History of Kylin in Youzan&lt;/h3&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/1 history_of_youzan_OLAP.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;First of all, I would like to share why Youzan chose to upgrade to Kylin 4. Here, let me briefly review the history of Youzan’s OLAP infrastructure.&lt;/p&gt;
+
+&lt;p&gt;In the early days of Youzan, in order to iterate quickly, we chose pre-computation + MySQL; in 2018, Druid was introduced for its query flexibility and development efficiency, but it had problems such as a low degree of pre-aggregation and no support for precise count distinct measures. In this situation, Youzan introduced Apache Kylin and ClickHouse: Kylin supports a high degree of aggregation and precise count distinct measures with the lowest RT, while ClickHouse is quite flexible in usage (ad hoc queries).&lt;/p&gt;
+
+&lt;p&gt;From the introduction of Kylin in 2018 to now, Youzan has used Kylin for more than three years. With the continuous enrichment of business scenarios and the continuous accumulation of data volume, Youzan currently has 6 million existing merchants, its GMV in 2020 was 107.3 billion, and the daily build data volume is 10 billion+. At present, Kylin has basically covered all of Youzan’s business scenarios.&lt;/p&gt;
+
+&lt;h3 id=&quot;the-challenges-of-kylin-3&quot;&gt;The challenges of Kylin 3&lt;/h3&gt;
+&lt;p&gt;With Youzan’s rapid development and in-depth use of Kylin, we also encountered some challenges:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;First of all, the build performance of Kylin on HBase could not meet expectations, and build performance affects the user’s failure recovery time and perceived stability;&lt;/li&gt;
+  &lt;li&gt;Secondly, the onboarding of more large merchants (tens of millions of members and hundreds of thousands of goods in a single store) brought great challenges to our OLAP system. Kylin on HBase is limited by single-point query execution in the Query Server and cannot support these complex scenarios well;&lt;/li&gt;
+  &lt;li&gt;Finally, because HBase is not a cloud-native system, it is difficult to scale up and down flexibly. As data volume keeps growing and business load has peaks and valleys, the average resource utilization rate is not high enough.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;Faced with these challenges, Youzan chose to upgrade to the more cloud-native Apache Kylin 4.&lt;/p&gt;
+
+&lt;h2 id=&quot;introduction-to-kylin-4&quot;&gt;02 Introduction to Kylin 4&lt;/h2&gt;
+&lt;p&gt;First of all, let’s introduce the main advantages of Kylin 4. Apache Kylin 4 depends entirely on Spark for cubing jobs and queries. It can make full use of Spark’s parallelization, vectorization, and global dynamic code generation technologies to improve the efficiency of large queries.&lt;br /&gt;
+Here is a brief introduction to the principles of Kylin 4, namely its storage engine, build engine and query engine.&lt;/p&gt;
+
+&lt;h3 id=&quot;storage-engine&quot;&gt;Storage engine&lt;/h3&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/2 kylin4_storage.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;First of all, let’s take a look at the new storage engine, comparing Kylin on HBase with Kylin on Parquet. In Kylin on HBase, cuboid data is stored in HBase tables: a single Segment corresponds to one HBase table, and aggregation is pushed down to the HBase coprocessor.&lt;/p&gt;
+
+&lt;p&gt;But as we know, HBase is not real columnar storage and its throughput is not enough for an OLAP system. Kylin 4 replaces HBase with Parquet: all the data is stored in files, and each segment has a corresponding HDFS directory. All queries and cubing jobs read and write files without HBase. Although there is a certain performance loss for simple queries, the improvement for complex queries is considerable and worthwhile.&lt;/p&gt;
+
+&lt;h3 id=&quot;build-engine&quot;&gt;Build engine&lt;/h3&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/3 kylin4_build_engine.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;The second part is the new build engine. In our tests, the build time of Kylin on Parquet dropped from 82 minutes to 15 minutes. There are several reasons:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Kylin 4 removes dimension dictionary encoding, eliminating one encoding step from the build;&lt;/li&gt;
+  &lt;li&gt;It removes the HBase file generation step;&lt;/li&gt;
+  &lt;li&gt;Kylin on Parquet changes the cubing granularity to the cuboid level, which further improves the parallelism of the cubing job;&lt;/li&gt;
+  &lt;li&gt;It has an enhanced global dictionary implementation. In the new algorithm, dictionary entries and source data are hashed into the same buckets, making it possible to load only one dictionary bucket to encode the corresponding source data (a conceptual sketch follows this list).&lt;/li&gt;
+&lt;/ul&gt;
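+
+&lt;p&gt;To make the bucket idea concrete, here is a minimal conceptual sketch in Java. It is not Kylin’s actual implementation; the bucket count, the bucket loader and the sample value are illustrative only.&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import java.util.HashMap;
+import java.util.Map;
+
+// Conceptual sketch (not Kylin's code): dictionary entries and source values are hashed
+// into the same buckets, so encoding a value only needs its own bucket in memory.
+public class BucketDictionarySketch {
+    static final int NUM_BUCKETS = 16;                 // illustrative bucket count
+
+    static int bucketOf(String value) {
+        return Math.floorMod(value.hashCode(), NUM_BUCKETS);
+    }
+
+    // Stand-in for loading one persisted dictionary bucket from storage.
+    static Map&amp;lt;String, Integer&amp;gt; loadDictBucket(int bucket) {
+        return new HashMap&amp;lt;&amp;gt;();
+    }
+
+    public static void main(String[] args) {
+        String value = &quot;user-42&quot;;
+        Map&amp;lt;String, Integer&amp;gt; slice = loadDictBucket(bucketOf(value)); // only one bucket is loaded
+        System.out.println(&quot;bucket=&quot; + bucketOf(value) + &quot;, id=&quot; + slice.get(value));
+    }
+}
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;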
+
+&lt;p&gt;As you can see on the right, after upgrading to Kylin 4, the cubing job shrinks from ten steps to two, and the build performance improvement is very obvious.&lt;/p&gt;
+
+&lt;h3 id=&quot;query-engine&quot;&gt;Query engine&lt;/h3&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/4 kylin4_query.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Next is the new query engine of Kylin 4. As you can see, computation in Kylin on HBase depends entirely on the HBase coprocessor and the query server process: when data is read from HBase into the query server for aggregation, sorting and so on, the single query server becomes the bottleneck. Kylin 4 switches to a fully distributed query mechanism based on Spark; what’s more, it is able to tune the Spark configuration automatically for the query step!&lt;/p&gt;
+
+&lt;h2 id=&quot;how-to-optimize-performance-of-kylin-4&quot;&gt;03 How to optimize performance of Kylin 4&lt;/h2&gt;
+&lt;p&gt;Next, I’d like to share some performance optimizations made by Youzan in Kylin 4.&lt;/p&gt;
+
+&lt;h3 id=&quot;optimization-of-query-engine&quot;&gt;Optimization of query engine&lt;/h3&gt;
+&lt;h4 id=&quot;cache-calcite-physical-plan&quot;&gt;1. Cache Calcite physical plan&lt;/h4&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/5 cache_calcite_plan.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;In Kylin 4, SQL is parsed, optimized and code-generated by Calcite. This step takes about 150ms for some queries. We added support for a PreparedStatementCache in Kylin 4 to cache the Calcite plan, so that the same parameterized SQL does not have to repeat this step. This optimization saves about 150ms of time cost.&lt;/p&gt;
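+
+&lt;p&gt;As a minimal sketch of what benefits from this cache (assuming the standard Kylin JDBC driver; the host, project, credentials and the test table below are placeholders), the same parameterized statement can be executed repeatedly with different bind values:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import java.sql.Connection;
+import java.sql.DriverManager;
+import java.sql.PreparedStatement;
+import java.sql.ResultSet;
+
+public class PreparedQuerySketch {
+    public static void main(String[] args) throws Exception {
+        // Placeholder connection info; only the parameterized-query pattern matters here.
+        Connection conn = DriverManager.getConnection(
+                &quot;jdbc:kylin://kylin-host:7070/my_project&quot;, &quot;ADMIN&quot;, &quot;KYLIN&quot;);
+        PreparedStatement ps = conn.prepareStatement(
+                &quot;select count(a) from test where p &amp;gt;= ? and p &amp;lt;= ? group by a&quot;);
+        ps.setInt(1, 20200101);
+        ps.setInt(2, 20200313);
+        ResultSet rs = ps.executeQuery(); // repeated executions reuse the cached plan
+        while (rs.next()) {
+            System.out.println(rs.getLong(1));
+        }
+        conn.close();
+    }
+}
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;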
+
+&lt;h4 id=&quot;tunning-spark-configuration&quot;&gt;2. Tuning Spark configuration&lt;/h4&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/6 tuning_spark_configuration.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Kylin 4 uses Spark as its query engine. As Spark is a distributed engine designed for massive data processing, it inevitably loses some performance on small queries. We have done some tuning to catch up with the latency of Kylin on HBase for small queries.&lt;/p&gt;
+
+&lt;p&gt;Our first optimization is to make more calculations finish in memory. The key is to avoid data spills during aggregation, shuffle and sort. Tuning the following configurations is helpful (a sketch of the corresponding settings follows the list).&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Set &lt;code class=&quot;highlighter-rouge&quot;&gt;spark.sql.objectHashAggregate.sortBased.fallbackThreshold&lt;/code&gt; to a larger value to avoid HashAggregate falling back to sort-based aggregation, which really hurts performance when it happens.&lt;/li&gt;
+  &lt;li&gt;Set &lt;code class=&quot;highlighter-rouge&quot;&gt;spark.shuffle.spill.initialMemoryThreshold&lt;/code&gt; to a larger value to avoid too many spills during shuffle.&lt;/li&gt;
+&lt;/ul&gt;
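+
+&lt;p&gt;A sketch of these settings, assuming they are passed to the query engine through the kylin.query.spark-conf.* prefix in kylin.properties; the values are illustrative and need to be sized to your workload:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Illustrative values only; tune to your data size and executor memory.
+# Keep hash aggregation from falling back to sort-based aggregation.
+kylin.query.spark-conf.spark.sql.objectHashAggregate.sortBased.fallbackThreshold=10000
+# Raise the in-memory threshold before shuffle data starts spilling to disk (bytes).
+kylin.query.spark-conf.spark.shuffle.spill.initialMemoryThreshold=268435456
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;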
+
+&lt;p&gt;Secondly, we route small queries to a Query Server that runs Spark in local mode, because the overhead of task scheduling, shuffle reads and broadcast variables is amplified for small queries in YARN/Standalone mode.&lt;/p&gt;
+
+&lt;p&gt;Thirdly, we use a RAM disk to enhance shuffle performance: mount the RAM disk as tmpfs and point spark.local.dir to a directory on it.&lt;/p&gt;
+
+&lt;p&gt;Lastly, we disabled Spark’s whole-stage code generation for small queries, since it costs about 100ms~200ms but brings no benefit to a small query that is just a simple projection.&lt;/p&gt;
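+
+&lt;p&gt;A rough sketch of these three measures as configuration, again assuming the kylin.query.spark-conf.* prefix; the mount point and sizes are placeholders, and the routing of small queries to the local-mode query server is our own logic and not shown here:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Run Spark in local mode on a query server dedicated to small queries.
+kylin.query.spark-conf.spark.master=local[*]
+# Point Spark's scratch space at a RAM disk, e.g. one mounted beforehand with:
+#   mount -t tmpfs -o size=32g tmpfs /mnt/ramdisk
+kylin.query.spark-conf.spark.local.dir=/mnt/ramdisk
+# Skip whole-stage code generation, which costs ~100-200ms per small query.
+kylin.query.spark-conf.spark.sql.codegen.wholeStage=false
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;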
+
+&lt;h4 id=&quot;parquet-optimization&quot;&gt;3.Parquet optimization&lt;/h4&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/7 parquet_optimization.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Optimizing Parquet is also important for query performance.&lt;/p&gt;
+
+&lt;p&gt;The first principle is to always include the shard-by column in the filter condition: since Parquet files are sharded by the shard-by column, filtering on it reduces the number of data files to read.&lt;/p&gt;
+
+&lt;p&gt;Then, looking inside the Parquet files, data is sorted by the rowkey columns, which means prefix match in a query is as important as it was in Kylin on HBase. When a query condition satisfies a prefix match, row groups can be filtered with the column’s max/min index. Furthermore, we can reduce the row group size for finer index granularity, but be aware that the compression ratio will be lower with a smaller row group size.&lt;/p&gt;
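+
+&lt;p&gt;For example, with a hypothetical cube whose shard-by column is seller_id and whose rowkey order is (seller_id, dt, item_id), a query like the following prunes both files and row groups:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Filter on the shard-by column (seller_id) to skip whole Parquet files,
+-- and keep a rowkey-prefix match (seller_id, dt) so row groups can be skipped
+-- via the max/min column indexes.
+select item_id, sum(gmv)
+from   commodity_view
+where  seller_id = 10086
+  and  dt &amp;gt;= '2021-06-01' and dt &amp;lt;= '2021-06-17'
+group by item_id
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;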
+
+&lt;h4 id=&quot;dynamic-elimination-of-partitioning-dimensions&quot;&gt;4.Dynamic elimination of partitioning dimensions&lt;/h4&gt;
+&lt;p&gt;Kylin 4 has a new ability that older versions lack: it can reduce the data read and computed by dozens of times for some big queries. It is often the case that the partition column is used to filter data but not as a group-by dimension. In those cases Kylin used to always choose a cuboid containing the partition column, but now it is able to use different cuboids within the same query to reduce IO and computation.&lt;/p&gt;
+
+&lt;p&gt;The key of this optimization is to split a query into two parts: one part covers segments whose data is entirely within the filtered range, so the partition column does not need to be included in the cuboid; the other part, which uses only part of a segment’s data, chooses a cuboid with the partition dimension to do the filtering.&lt;/p&gt;
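+
+&lt;p&gt;For illustration, consider a cube partitioned by column p with three segments covering 2020-01-01 to 2020-02-01, 2020-02-01 to 2020-03-01 and 2020-03-01 to 2020-03-07, and the following query:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;select count(a) from test where p &amp;gt;= 20200101 and p &amp;lt;= 20200313 group by a
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
+
+&lt;p&gt;Segments fully covered by the time filter can be answered from the cuboid on a alone, while a partially covered segment still uses the cuboid on (a, p) to filter rows. When a has only one possible value, this shrinks the scan in the example from 65 rows to 8.&lt;/p&gt;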
+
+&lt;p&gt;In our tests, the response time of some queries dropped from 20s to 6s, and from 10s to 3s.&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/8 Dynamic_elimination_of_partitioning_dimensions.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h3 id=&quot;optimization-of-build-engine&quot;&gt;Optimization of build engine&lt;/h3&gt;
+&lt;h4 id=&quot;cache-parent-dataset&quot;&gt;1. Cache parent dataset&lt;/h4&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/9 cache_parent_dataset.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Kylin builds the cube layer by layer. For a parent layer whose multiple child cuboids are to be built, we can cache the parent dataset by setting kylin.engine.spark.parent-dataset.max.persist.count to a number greater than 0. Note that if this value is set too small, it will affect the parallelism of the build job, as the build granularity is at the cuboid level.&lt;/p&gt;
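+
+&lt;p&gt;A minimal sketch of the corresponding kylin.properties setting; the value is illustrative and should be chosen based on how many parent datasets you can afford to keep cached:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Persist up to one parent-layer dataset during cubing; values greater than 0 enable caching.
+kylin.engine.spark.parent-dataset.max.persist.count=1
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;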
+
+&lt;h2 id=&quot;practice-of-kylin-4-in-youzan&quot;&gt;04 Practice of Kylin 4 in Youzan&lt;/h2&gt;
+&lt;p&gt;After introducing Youzan’s performance optimizations, let’s share their effect: Kylin 4’s practice in Youzan, including the upgrade process and the performance of the online system.&lt;/p&gt;
+
+&lt;h3 id=&quot;upgrade-metadata-to-adapt-to-kylin-4&quot;&gt;Upgrade metadata to adapt to Kylin 4&lt;/h3&gt;
+&lt;p&gt;First of all, for the Kylin 3 metadata stored in HBase, we developed a tool for seamless metadata upgrade: we export the metadata from HBase into local files, and then use the tool to transform it and write the new metadata into MySQL. We also updated the operation documents and general principles in the official Apache Kylin wiki. For more details, you can refer to: &lt;a href=&quot;https://wiki.apache.org/confluence/display/KYLIN/How+to+migrate+metadata+to+Kylin+4&quot;&gt;How to migrate metadata to Kylin 4&lt;/a&gt;.&lt;/p&gt;
+
+&lt;p&gt;Here is a general overview of compatibility in this process. The project metadata, table metadata, permission-related metadata and model metadata do not need to be modified. What needs to be modified is the cube metadata, specifically the storage type and query type used by the Cube. After updating these two fields, you need to recalculate the Cube signature; the signature is an internal Kylin mechanism designed to guard against problems caused by modifying a Cube after it has been finalized.&lt;/p&gt;
+
+&lt;h3 id=&quot;performance-of-kylin-4-on-youzan-online-system&quot;&gt;Performance of Kylin 4 on Youzan online system&lt;/h3&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/10 commodity_insight.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;After migrating the metadata to Kylin 4, let’s share the qualitative changes and substantial performance improvements seen in some representative scenarios. First of all, in a scenario like Commodity Insight, there is a large store with several hundred thousand commodities, and we have to analyze its transactions, traffic and so on. There are more than a dozen precise count distinct measures in a single cube. A precise count distinct measure is very inefficient if it is not optimized through pre-calculation and Bitmap; Kylin currently uses Bitmap to support precise count distinct. In a scenario that requires complex queries to sort hundreds of thousands of commodities by various UVs (precise count distinct measures), the RT dropped from 27 seconds on Kylin 2 to less than 2 seconds on Kylin 4.&lt;/p&gt;
+
+&lt;p&gt;What I find most appealing about Kylin 4 is that it is like a manual-transmission car: you can control its query concurrency at will, whereas you cannot freely change query concurrency in Kylin on HBase, because its concurrency is completely tied to the number of HBase regions.&lt;/p&gt;
+
+&lt;h3 id=&quot;plan-for-kylin-4-in-youzan&quot;&gt;Plan for Kylin 4 in Youzan&lt;/h3&gt;
+&lt;p&gt;We have done thorough testing, fixed several bugs and improved Apache Kylin 4 over several months, and we are now migrating cubes from the older version to the new one. For the cubes already migrated to Kylin 4, small-query performance meets our expectations, and complex-query and build performance brought us a big surprise. We plan to migrate all cubes from the older version to Kylin 4.&lt;/p&gt;
+</description>
+        <pubDate>Thu, 17 Jun 2021 08:00:00 -0700</pubDate>
+        <link>http://kylin.apache.org/blog/2021/06/17/Why-did-Youzan-choose-Kylin4/</link>
+        <guid isPermaLink="true">http://kylin.apache.org/blog/2021/06/17/Why-did-Youzan-choose-Kylin4/</guid>
+        
+        
+        <category>blog</category>
+        
+      </item>
+    
+      <item>
+        <title>有赞为什么选择 Kylin4</title>
+        <description>&lt;p&gt;在 2021年5月29日举办的 QCon 全球软件开发者大会上,来自有赞的数据基础平台负责人 郑生俊 在大数据开源框架与应用专题上分享了有赞内部对 Kylin 4.0 的使用经历和优化实践,对于众多 Kylin 老用户来说,这也是升级 Kylin 4 的实用攻略。&lt;/p&gt;
+
+&lt;p&gt;本次分享主要分为以下四个部分:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;有赞选用 Kylin 4 的原因&lt;/li&gt;
+  &lt;li&gt;Kylin 4 原理介绍&lt;/li&gt;
+  &lt;li&gt;Kylin 4 性能优化&lt;/li&gt;
+  &lt;li&gt;Kylin 4 在有赞的实践&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h2 id=&quot;kylin-4-&quot;&gt;01 有赞选用 Kylin 4 的原因&lt;/h2&gt;
+&lt;p&gt;首先分享有赞为什么会选择升级为 Kylin 4,这里先简单回顾一下有赞 OLAP 的发展历程:有赞初期为了快速迭代,选择了预计算 + MySQL 的方式;2018年,因为查询灵活和开发效率引入了 Druid,但是存在预聚合度不高、不支持精确去重和明细 OLAP 等问题;在这样的背景下,有赞引入了满足聚合度高、支持精确去重和 RT 最低的 Apache Kylin 和查询非常灵活的 ROLAP ClickHouse。&lt;/p&gt;
+
+&lt;p&gt;从2018年引入 Kylin 到现在,有赞已经使用 Kylin 三年多了。随着业务场景的不断丰富和数据量的不断积累,有赞目前有 600 万的存量商家,2020年 GMV 是 1073亿,日构建量为 100 亿+,目前 Kylin 已经基本覆盖了有赞所有的业务范围。&lt;/p&gt;
+
+&lt;p&gt;随着有赞自身的迅速发展和不断深入地使用 Kylin,我们也遇到一些挑战:&lt;br /&gt;
+- 首先 Kylin on HBase 的构建性能无法满足有赞的预期,构建性能会影响到用户的故障恢复时间和稳定性的体验;&lt;br /&gt;
+- 其次,随着更多大商家(单店千万级别会员、数十万商品)的接入,对我们的查询也带来了很大的挑战。Kylin on HBase 受限于 QueryServer 单点查询的局限,无法很好地支持这些复杂的场景;&lt;br /&gt;
+- 最后,因为 HBase 不是一个云原生系统,很难做到弹性的资源伸缩,随着数据量的不断增长,这个系统对于商家而言,使用时间是存在高峰和低谷的,这就造成平均的资源使用率不够高。&lt;/p&gt;
+
+&lt;p&gt;面对这些挑战,有赞选择去向更云原生的 Apache Kylin 4 去靠拢和升级。&lt;/p&gt;
+
+&lt;h2 id=&quot;kylin-4--1&quot;&gt;02 Kylin 4 原理介绍&lt;/h2&gt;
+&lt;p&gt;首先介绍一下 Kylin 4 的主要优势。Apache Kylin 4 是完全基于 Spark 去做构建和查询的,能够充分地利用 Spark的并行化、向量化和全局动态代码生成等技术,去提高大查询的效率。&lt;br /&gt;
+这里从存储、构建和查询三个部分简单介绍一下 Kylin 4 的原理。&lt;/p&gt;
+
+&lt;h3 id=&quot;section&quot;&gt;存储&lt;/h3&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan_cn/1 kylin4_storage.png&quot; alt=&quot;&quot; /&gt;&lt;br /&gt;
+首先来看一下,Kylin on HBase 和 Kylin on Parquet 的对比。Kylin on HBase 的 Cuboid 的数据是存放在 HBase 的表里,一个 Segment 对应了一张 HBase 表,查询下推的工作由 HBase 协处理器处理,因为 HBase 不是真正的列存并且对 OLAP 而言吞吐量不高。Kylin 4 将 HBase 替换为 Parquet,也就是把所有的数据按照文件存储,每个 Segment 会存在一个对应的 HDFS 的目录,所有的查询、构建都是直接通过读写文件的方式,不用再经过 HBase。虽然对于小查询的性能会有一定损失,但对于复杂查询带来的提升是更可观的、更值得的。&lt;/p&gt;
+
+&lt;h3 id=&quot;section-1&quot;&gt;构建引擎&lt;/h3&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan_cn/2 kylin4_build_engine.png&quot; alt=&quot;&quot; /&gt;&lt;br /&gt;
+其次是 Kylin 构建引擎,基于有赞的测试,Kylin on Parquet 的构建速度已经从 82 分钟优化到了 15 分钟,有以下几个原因:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Kylin 4 去掉了维度字典的编码,省去了编码的一个构建步骤;&lt;/li&gt;
+  &lt;li&gt;去掉了 HBase File 的生成步骤;&lt;/li&gt;
+  &lt;li&gt;新版本的 Kylin 4 所有的构建步骤都转换为 Spark 进行构建;&lt;/li&gt;
+  &lt;li&gt;Kylin on Parquet 基于 Cuboid 去划分构建粒度,有利于进一步地提升并行度。&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;可以看到右侧,从十个步骤简化到了两个步骤,构建性能提升的非常明显的。&lt;/p&gt;
+
+&lt;h3 id=&quot;section-2&quot;&gt;查询引擎&lt;/h3&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan_cn/3 kylin4_query.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;接下来就是 Kylin 4 的查询,大家可以看到,左边这列 Kylin on HBase 的计算是完全依托于 Calcite 和 HBase 的协处理器,这就导致当数据从 HBase 读取后,如果想做聚合、排序等,就会局限于 QueryServer 单点的瓶颈,而 Kylin 4 则转换为基于 Spark DataFrame 的全分布式的查询机制。&lt;/p&gt;
+
+&lt;h2 id=&quot;kylin-4--2&quot;&gt;03 Kylin 4 性能优化&lt;/h2&gt;
+&lt;p&gt;接下来分享有赞在 Kylin 4 所做的一些性能优化。&lt;/p&gt;
+
+&lt;h3 id=&quot;section-3&quot;&gt;查询性能优化&lt;/h3&gt;
+&lt;h4 id=&quot;section-9&quot;&gt;1.动态消除维度分区&lt;/h4&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan_cn/4 dynamic_elimination_dimension_partition.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;首先我们来看一个场景,我们做到了动态消除分区维度,混合使用 cuboid 来对复杂查询,减少数十倍的计算量。&lt;/p&gt;
+
+&lt;p&gt;这里举一个例子,在一个 Cube 有三个 Segment 的情况下,Cube 分区字段记作 P,它有三个 Segment 分别是1月1日到2月1日、2月1日到3月1日,3月1日到3月7日。假设有一个SQL,Select count(a) from test where p &amp;gt;= 20200101 and p &amp;lt;= 20200313 group by a。&lt;/p&gt;
+
+&lt;p&gt;在这种情况下,因为需要分区过滤,Kylin 它会选择 a 和 p 预计算维度的组合,转换成执行计划就是最上层的 Aggregate 然后 Filter,最后会转换成一个 TableScan,这个 TableScan 就是选择聚合维度为 a 和 p 这样的一个维度组合。实际上这个查询计划是适合把它优化成右边这种方式的,对于某个 Segment 完全使用到的数据,我们可以选择一个 Cuboid 为 a 的 Cuboid 去做查询。对于部分用到的分区或者 Segment,我们可以选择 a 和 p 这样的一个维度组合。通过这种方式,在 a 只有一个可能值的情况下,之前可能要 scan 65 条数据,优化后只要 scan 8 条数据。假设时间跨度更长,比如说跨几个月、半年甚至一年,就会减少数十倍、几十倍的计算量和 IO。&lt;/p&gt;
+
+&lt;p&gt;在有赞某些场景,RT 可以从 10 秒优化到 3 秒、20s 提升到 6s,对于更复杂的场景(比如计算密集型的 HLL),会有更显著的优化效果。这部分优化,有赞也正打算贡献回社区。因为涉及到如何在多层嵌套和复杂的条件下进行 segment 分组,以及目前 calcite 和 spark catalyst 并存,实现上会比较复杂。到时候大家在 Kylin 4.0-GA 版本可能就可以看到这个优化了。&lt;/p&gt;
+
+&lt;h4 id=&quot;section-4&quot;&gt;2.复杂过滤条件下的分区裁剪&lt;/h4&gt;
+&lt;p&gt;接下来再介绍一下有赞所做的查询性能优化,就是支持复杂过滤条件下的分区裁剪。目前 Kylin 4.0 Beta 版本对于复杂的过滤条件比如多个过滤字段、多层嵌套的 Filter 等,不支持分区裁剪,导致全表扫描。我们做了一个优化,是将复杂的嵌套 Filter 过滤的语法树转换成基于分区字段 p 的一个等价表达式,然后再将这个表达式应用到每一个 Segment 去做过滤,通过这样的方式,去支持它做到一个非常复杂的分区过滤裁剪。&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan_cn/5 Partition clipping under complex filter.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h4 id=&quot;spark-&quot;&gt;3.Spark 参数调优&lt;/h4&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan_cn/6 tuning_spark_configuration.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;接下来是比较重要的一部分,就是关于 Spark 的调参。Spark 是一个分布式计算框架,相比 Calcite 而言,对于小查询是存在一定劣势的。&lt;/p&gt;
+
+&lt;p&gt;首先我们做了一个调整,尽量让 Spark 所有的计算操作是在内存中完成的。以下两种情况会产生 spill:&lt;br /&gt;
+- 01 在聚合时,在我们内存不够的时候,Spark 会将 HashAggregate 转换为 Sort Based Aggregate,实际上这一步是很耗性能的。我们通过调大阈值的参数,尽量让所有的聚合都在内存中完成。&lt;br /&gt;
+- 02 在 shuffle 的过程中,Spark是不可避免地会进行 Spill,会落盘,我们能做的尽量在 Shuffle 过程减少 Spill,只在最后 Shuffle 结束之后进行 Spill。&lt;/p&gt;
+
+&lt;p&gt;第二个我们做的调优是,相比 on YARN/Standalone 模式下,local 模式大部分都是在进程内通信的,也不需要产生跨网络的 Shuffle, broadcast 广播变量也不需要跨网络,所以对于小查询,我们会路由到以 Local 模式运行的 Spark Application,这对于小查询非常有意义。&lt;/p&gt;
+
+&lt;p&gt;第三个优化是 shuffle 使用内存盘。因为内存盘肯定是最快的,我们将内存盘挂载为 tmpfs 文件系统,然后将 spark.local.dir 指定为挂载的内存盘去优化 shuffle 的速度和吞吐。&lt;/p&gt;
+
+&lt;p&gt;第四个优化是我们关闭 Spark 全局动态代码生成。Spark 的全局动态代码生成是要在运行的时间内去动态拼接代码,再去动态编译代码,这个过程实际上是很耗时的。对于离线的大数据量下是很有优化意义,但是对于比较小的一些数据场景,我们关掉这个动态代码生成之后,能够节省大概 100 到 200 毫秒的耗时。&lt;/p&gt;
+
+&lt;p&gt;目前经过上述一系列的优化,我们能让小查询的 RT 稳定在大概 300 毫秒左右,尽管 HBase 可能是几十毫秒左右的 RT,但我们认为目前已经比较接近了,这种为提升大查询提升的 Tradeoff 我们认为是一个很值得的事情。&lt;/p&gt;
+
+&lt;h4 id=&quot;section-5&quot;&gt;4.小查询优化&lt;/h4&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan_cn/8 small_query_optimization.png&quot; alt=&quot;&quot; /&gt;&lt;br /&gt;
+然后,我来分享一下小查询的优化。Kylin on HBase 依托于 HBase 能够做到几十毫秒的 RT,因为 HBase 有 bucket cache 缓存。而 Kylin on Parquet 就完全基于文件的读取和计算,缓存依赖于文件系统的 page cache,那么它小查询的 RT 会比 HBase 更高一些,我们能做的就是尽量缩小 Kylin on Parquet 和 Kylin on HBase 的 RT 差距。&lt;/p&gt;
+
+&lt;p&gt;经过我们的分析,SQL 会通过 Calcite 解析成 Calcite 语法树,然后将这个语法树转化为 Spark DataFrame,最终再将整个查询交给 Spark 去执行。在这一步的过程中,SQL 转化成 Calcite 的过程中,是需要经过语法解析、优化等,这一步大概会消耗 150 毫秒左右。有赞做的是尽量使用结构化的 SQL,就是 PreparedStatement,我们在 Kylin 中支持 PreparedStatementCache,对于固定的 SQL 格式,将它的执行计划进行缓存,去重用这样的执行计划,降低该步骤的时间消耗,通过这样的优化,可以降低大概 100 毫秒左右的耗时。&lt;/p&gt;
+
+&lt;h4 id=&quot;parquet-&quot;&gt;5.Parquet 优化&lt;/h4&gt;
+
+&lt;p&gt;关于查询性能的优化,有赞还充分利用了 Parquet 索引,优化建议包括:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;
+    &lt;p&gt;Parquet 文件首先根据 Shard By Column 进行分组,过滤条件尽量包含 Shard By Column;&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Parquet 中的数据依然按照维度排序,结合 Column MetaData 中的 Max、Min 索引,在命中前缀索引时能够过滤掉大量数据;&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;调小 RowGroup Size 增大索引粒度等。&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h3 id=&quot;section-6&quot;&gt;构建性能优化&lt;/h3&gt;
+&lt;h4 id=&quot;section-10&quot;&gt;1.对 parent dataset 做缓存&lt;/h4&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan_cn/9 cache_parent_dataset.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h4 id=&quot;section-7&quot;&gt;2.处理空值导致的数据倾斜&lt;/h4&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan_cn/10 Processing data skew.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;更多关于构建优化的细节内容大家可以参考 &lt;a href=&quot;https://mp.weixin.qq.com/s/T_mK7pTAgk2PXnSJ0lbZ_w&quot;&gt;Kylin 4 最新功能预览 + 优化实践抢先看&lt;/a&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;kylin-4--3&quot;&gt;04 Kylin 4 在有赞的实践&lt;/h2&gt;
+&lt;p&gt;介绍有赞的优化之后,我们再来分享一下优化的效果,也就是 Kylin 4 在有赞的实践包括升级过程以及上线的效果。&lt;/p&gt;
+
+&lt;h3 id=&quot;section-8&quot;&gt;元数据升级&lt;/h3&gt;
+&lt;p&gt;首先是如何升级,我们开发了一个元数据无缝升级的工具,首先我们在 Kylin on HBase 的元数据是保存在 HBase 里的,我们将 HBase 里的元数据以文件的格式导出,再将文件格式的元数据写入到 MySQL,我们也在 Apache Kylin 的官方 wiki 更新了操作文档以及大致的原理,更多详情大家可以参考:&lt;a href=&quot;https://wiki.apache.org/confluence/display/KYLIN/How+to+migrate+metadata+to+Kylin+4&quot;&gt;如何升级元数据到kylin4&lt;/a&gt;.&lt;br /&gt;
+&lt;img src=&quot;/images/blog/youzan_cn/11 metadata_upgrade.png&quot; alt=&quot;&quot; /&gt;&lt;br /&gt;
+我们大致介绍一下整个过程中的一些兼容性,需要迁移的数据大概有六个:前三个是 project 元信息,tables 的元信息,包括一些 Hive 表,还有 model 模型定义的一些元信息,这些是不需要修改的。需要修改的就是 Cube 的元信息。这部分需要修改哪些东西呢?首先是 Cube 所使用的存储和查询的类型,更新完这两个字段之后,需要重新计算一下 Cube 的签名,这个签名的作用是 Kylin 内部设计的避免 Cube 确定之后我们再去修改 Cube 导致的一些问题;最后一个是权限相关,这部分也是兼容,无需修改的。&lt;/p&gt;
+
+&lt;h3 id=&quot;kylin-4--4&quot;&gt;Kylin 4 在有赞上线后的表现&lt;/h3&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan_cn/12 commodity_insight.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;元数据迁移到 Kylin 4 之后,我们来分享一下在有赞的一些场景下带来的质变和大幅度的性能提升。首先像商品洞察这样一个场景,有一个数十万商品的大店铺,我们要去分析它的交易和流量等,有十几个精确去重的计算。精确去重如果没有通过预计算和 Bitmap 去做优化实际上效率是很低的,Kylin 目前使用 Bitmap 去做精确去重的支持。在一个需要对几十万个商品的各种 UV 去做排序的复杂查询的场景,Kylin 2 的 RT 是 27 秒,而在 Kylin 4 这个场景的 RT 从 27 秒降到了 2 秒以内。&lt;/p&gt;
+
+&lt;p&gt;我觉得 Kylin 4 最吸引我的地方是它完全变成了一个手动档,而 Kylin on HBase 实际上是一个自动档,因为它的并发完全和 region 的数量绑定了。&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan_cn/13 cube_query.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h3 id=&quot;kylin-4--5&quot;&gt;Kylin 4 在有赞的未来计划&lt;/h3&gt;
+&lt;p&gt;Kylin 4 在有赞的升级大致包含以下几个步骤:&lt;br /&gt;
+&lt;img src=&quot;/images/blog/youzan_cn/14 youzan_plan.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;第一阶段就是调研和可用性测试,因为 Kylin on Parquet 实际上是基于 Spark,是有一定的学习成本的,这个我们也花了一段时间;&lt;/p&gt;
+
+&lt;p&gt;第二阶段就是语法兼容性测试,我们扩展了 Kylin 4 初期不支持的一些语法,比如说分页查询的语法等;&lt;/p&gt;
+
+&lt;p&gt;第三阶段就是流量重放,逐步地上线 Cube 等;&lt;/p&gt;
+
+&lt;p&gt;我们现在是属于第四阶段,我们已经迁移了一些数据了,未来的话,我们会逐步地下线旧集群,然后将所有的业务往新集群上去迁移。&lt;/p&gt;
+
+&lt;p&gt;关于 Kylin 4 我们未来计划开发的功能和满足的需求有赞也会在社区去同步。就不在这里做详细介绍了,大家可以关注我们社区的最新动态,以上就是我们的分享。&lt;/p&gt;
+</description>
+        <pubDate>Thu, 17 Jun 2021 08:00:00 -0700</pubDate>
+        <link>http://kylin.apache.org/cn_blog/2021/06/17/Why-did-Youzan-choose-Kylin4/</link>
+        <guid isPermaLink="true">http://kylin.apache.org/cn_blog/2021/06/17/Why-did-Youzan-choose-Kylin4/</guid>
+        
+        
+        <category>cn_blog</category>
+        
+      </item>
+    
+      <item>
         <title>你离可视化酷炫大屏只差一套 Kylin + Davinci</title>
         <description>&lt;p&gt;Kylin 提供与 BI 工具的整合能力,如 Tableau,PowerBI/Excel,MSTR,QlikSense,Hue 和 SuperSet。但就可视化工具而言,Davinci 良好的交互性和个性化的可视化大屏展现效果,使其与 Kylin 的结合能让大部分用户有更好的可视化分析体验。&lt;/p&gt;
 
@@ -1392,214 +1700,6 @@ Security: (depend on your security setti
         
         
         <category>blog</category>
-        
-      </item>
-    
-      <item>
-        <title>Apache Kylin v3.0.0-alpha 发布</title>
-        <description>&lt;p&gt;近日 Apache Kylin 社区很高兴地宣布,Apache Kylin v3.0.0-alpha 正式发布。&lt;/p&gt;
-
-&lt;p&gt;Apache Kylin 是一个开源的分布式分析引擎,旨在为极大数据集提供 SQL 接口和多维分析(OLAP)的能力。&lt;/p&gt;
-
-&lt;p&gt;这是 Kylin 下一代 v3.x 的第一个发布版本,用于早期预览,主要的功能是实时 (Real-time) OLAP。完整的改动列表请参见&lt;a href=&quot;/docs/release_notes.html&quot;&gt;release notes&lt;/a&gt;;这里挑一些主要改进做说明。&lt;/p&gt;
-
-&lt;h1 id=&quot;section&quot;&gt;重要新功能&lt;/h1&gt;
-
-&lt;h3 id=&quot;kylin-3654----olap&quot;&gt;KYLIN-3654 - 实时 OLAP&lt;/h3&gt;
-&lt;p&gt;随着引入新的 real-time receiver 和 coordinator 组件,Kylin 能够实现毫秒级别的数据准备延迟,数据源来自流式数据如 Apache Kafka。这意味着,从 v3.0 开始,Kylin 既能够支持历史批量数据的 OLAP,也支持对流式数据的准实时(Near real-time)以及完全实时(real-time)分析。用户可以使用一个 OLAP 平台来服务不同的使用场景。此方案已经在早期用户如 eBay 得到部署和验证。关于如何使用此功能,请参考&lt;a href=&quot;/docs30/tutorial/realtime_olap.html&quot;&gt;此教程&lt;/a&gt;。&lt;/p&gt;
-
-&lt;h3 id=&quot;kylin-3795----apache-livy--spark-&quot;&gt;KYLIN-3795 - 通过 Apache Livy 递交 Spark 任务&lt;/h3&gt;
-&lt;p&gt;这个功能允许管理员为 Kylin 配置使用 Apache Livy (incubating) 来完成任务的递交。Spark 作业的提交通过 Livy 的 REST API 来提交,而无需在本地启动 Spark Driver 进程,从而方便对 Spark 资源的管理监控,同时也降低对 Kylin 任务进程所在节点的压力。&lt;/p&gt;
-
-&lt;h3 id=&quot;kylin-3820----curator-&quot;&gt;KYLIN-3820 - 基于 Curator 的任务节点分配和服务发现&lt;/h3&gt;
-&lt;p&gt;新增一种基于Apache Zookeeper 和 Curator作业调度器,可以自动发现 Kylin 节点,并自动分配一个节点来进行任务的管理以及故障恢复。有了这个功能后,管理员可以更加容易地部署和扩展 Kylin 节点,而不再需要在 &lt;code class=&quot;highlighter-rouge&quot;&gt;kylin.properties&lt;/code&gt; 中配置每个 Kylin 节点的地址并重启 Kylin 以使之生效。&lt;/p&gt;
-
-&lt;h1 id=&quot;section-1&quot;&gt;其它改进&lt;/h1&gt;
-
-&lt;h3 id=&quot;kylin-3716---fastthreadlocal--threadlocal&quot;&gt;KYLIN-3716 - FastThreadLocal 替换 ThreadLocal&lt;/h3&gt;
-&lt;p&gt;使用 Netty 中的 FastThreadLocal 替代 JDK 原生的 ThreadLocal,可以一定程度上提升 Kylin 在高并发下的性能。&lt;/p&gt;
-
-&lt;h3 id=&quot;kylin-3867---enable-jdbc-to-use-key-store--trust-store-for-https-connection&quot;&gt;KYLIN-3867 - Enable JDBC to use key store &amp;amp; trust store for https connection&lt;/h3&gt;
-&lt;p&gt;通过使用HTTPS,保护了JDBC使用的身份验证信息,使得Kylin更加安全&lt;/p&gt;
-
-&lt;h3 id=&quot;kylin-3905---enable-shrunken-dictionary-default&quot;&gt;KYLIN-3905 - Enable shrunken dictionary default&lt;/h3&gt;
-&lt;p&gt;默认开启 shrunken dictionary,针对高基维进行精确去重的场景,可以显著减少构建用时。&lt;/p&gt;
-
-&lt;h3 id=&quot;kylin-3839---storage-clean-up-after-the-refreshing-and-deleting-a-segment&quot;&gt;KYLIN-3839 - Storage clean up after the refreshing and deleting a segment&lt;/h3&gt;
-&lt;p&gt;更加及时地清除不必要的数据文件&lt;/p&gt;
-
-&lt;p&gt;&lt;strong&gt;下载&lt;/strong&gt;&lt;/p&gt;
-
-&lt;p&gt;要下载Apache Kylin 源代码或二进制包,请访问&lt;a href=&quot;/download&quot;&gt;下载页面&lt;/a&gt; page.&lt;/p&gt;
-
-&lt;p&gt;&lt;strong&gt;升级&lt;/strong&gt;&lt;/p&gt;
-
-&lt;p&gt;参考&lt;a href=&quot;/docs/howto/howto_upgrade.html&quot;&gt;升级指南&lt;/a&gt;.&lt;/p&gt;
-
-&lt;p&gt;&lt;strong&gt;反馈&lt;/strong&gt;&lt;/p&gt;
-
-&lt;p&gt;如果您遇到问题或疑问,请发送邮件至 Apache Kylin dev 或 user 邮件列表:dev@kylin.apache.org,user@kylin.apache.org; 在发送之前,请确保您已通过发送电子邮件至 dev-subscribe@kylin.apache.org 或 user-subscribe@kylin.apache.org 订阅了邮件列表。&lt;/p&gt;
-
-&lt;p&gt;&lt;em&gt;非常感谢所有贡献Apache Kylin的朋友!&lt;/em&gt;&lt;/p&gt;
-</description>
-        <pubDate>Fri, 19 Apr 2019 13:00:00 -0700</pubDate>
-        <link>http://kylin.apache.org/cn_blog/2019/04/19/release-v3.0.0-alpha/</link>
-        <guid isPermaLink="true">http://kylin.apache.org/cn_blog/2019/04/19/release-v3.0.0-alpha/</guid>
-        
-        
-        <category>cn_blog</category>
-        
-      </item>
-    
-      <item>
-        <title>Real-time Streaming Design in Apache Kylin</title>
-        <description>&lt;h2 id=&quot;why-build-real-time-streaming-in-kylin&quot;&gt;Why Build Real-time Streaming in Kylin&lt;/h2&gt;
-&lt;p&gt;The real-time streaming feature is contributed by eBay big data team in Kylin 3.0, the purpose we build real-time streaming is:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;
-    &lt;p&gt;Milliseconds Data Preparation Delay  &lt;br /&gt;
-Kylin provide sub-second query latency for extremely large dataset, the underly magic is precalculation cube. But the cube building often take long time(usually hours for large data sets),  in some case, the analyst needs real-time data to do analysis, so we want to provide real-time OLAP, which means data can be queried immediately when produced to system.&lt;/p&gt;
-  &lt;/li&gt;
-  &lt;li&gt;
-    &lt;p&gt;Support Lambda Architecture  &lt;br /&gt;
-Real-time data often not reliable, that may caused by many reasons, for example, the upstream processing system has a bug, or the data need to be changed after some time, etc. So we need to support lambda architecture, which means the cube can be built from the streaming source(like Kafka), and the historical cube data can be refreshed from batch source(like Hive).&lt;/p&gt;
-  &lt;/li&gt;
-  &lt;li&gt;
-    &lt;p&gt;Less MR jobs and HBase Tables   &lt;br /&gt;
-Since Kylin 1.6, community has provided a streaming solution, it uses MR to consume Kafka data and then do batch cube building, it can provide minute-level data preparation latency, but to ensure the data latency, you need to schedule the MR very shortly(5 minutes or even less), that will cause too many hadoop jobs and small hbase tables in the system, and dramatically increase the Hadoop system’s load.&lt;/p&gt;
-  &lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;h2 id=&quot;architecture&quot;&gt;Architecture&lt;/h2&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/blog/rt_stream_architecture.png&quot; alt=&quot;Kylin RT Streaming Architecture&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;The blue rectangle is streaming components added in current Kylin’s architecture, which is responsible to ingest data from streaming source, and provide query for real-time data.&lt;/p&gt;
-
-&lt;p&gt;We divide the unbounded incoming streaming data into 3 stages, the data come into different stages are all queryable immediately.&lt;/p&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/blog/rt_stream_stages.png&quot; alt=&quot;Kylin RT Streaming stages&quot; /&gt;&lt;/p&gt;
-
-&lt;h3 id=&quot;components&quot;&gt;Components&lt;/h3&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/blog/rt_stream_components.png&quot; alt=&quot;Kylin RT Streaming Components&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;Streaming Receiver: Responsible to ingest data from stream data source, and provide real-time data query.&lt;/p&gt;
-
-&lt;p&gt;Streaming Coordinator: Responsible to do coordination works, for example, when new streaming cube is onboard, the coordinator need to decide which streaming receivers can be assigned.&lt;/p&gt;
-
-&lt;p&gt;Metadata Store:  Used to store streaming related metadata, for example, the cube assignments information, cube build state information.&lt;/p&gt;
-
-&lt;p&gt;Query Engine:  Extend the existing query engine, support to query real-time data from streaming receiver&lt;/p&gt;
-
-&lt;p&gt;Build Engine: Extend the existing build engine, support to build full cube from the real-time data&lt;/p&gt;
-
-&lt;h3 id=&quot;how-streaming-cube-engine-works&quot;&gt;How Streaming Cube Engine Works&lt;/h3&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/blog/rt_stream_how_build_work.png&quot; alt=&quot;Kylin RT Streaming How Build Works&quot; /&gt;&lt;/p&gt;
-
-&lt;ol&gt;
-  &lt;li&gt;Coordinator ask streaming source for all partitions of the cube&lt;/li&gt;
-  &lt;li&gt;Coordinator decide which streaming receivers to assign to consume streaming data, and ask streaming receivers to start consuming data.&lt;/li&gt;
-  &lt;li&gt;Streaming receiver start to consume and index streaming events&lt;/li&gt;
-  &lt;li&gt;After sometime, streaming receiver copy the immutable segments from local files to remote HDFS files&lt;/li&gt;
-  &lt;li&gt;Streaming receiver notify the coordinator that a segment has been persisted to HDFS&lt;/li&gt;
-  &lt;li&gt;Coordinator submit a cube build job to Build Engine to triger cube full building after all receivers have submitted their segments&lt;/li&gt;
-  &lt;li&gt;Build Engine build all cuboids from the streaming HDFS files&lt;/li&gt;
-  &lt;li&gt;Build Engine store cuboid data to Hbase, and then the coordinator will ask the streaming receivers to remove the related local real-time data.&lt;/li&gt;
-&lt;/ol&gt;
-
-&lt;h3 id=&quot;how-streaming-query-engine-works&quot;&gt;How Streaming Query Engine Works&lt;/h3&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/blog/rt_stream_how_query_work.png&quot; alt=&quot;Kylin RT Streaming How Query Works&quot; /&gt;&lt;/p&gt;
-
-&lt;ol&gt;
-  &lt;li&gt;If Query hits a streaming cube, Query Engine ask Streaming Coordinator what streaming receivers are assigned for the cube&lt;/li&gt;
-  &lt;li&gt;Query Engine send query request to related streaming receivers to query realtime segments&lt;/li&gt;
-  &lt;li&gt;Query Engine send query request to Hbase to query historical segments&lt;/li&gt;
-  &lt;li&gt;Query Engine aggregate the query results, and send response back to client&lt;/li&gt;
-&lt;/ol&gt;
-
-&lt;h2 id=&quot;detail-design&quot;&gt;Detail Design&lt;/h2&gt;
-
-&lt;h3 id=&quot;real-time-segment-store&quot;&gt;Real-time Segment Store&lt;/h3&gt;
-&lt;p&gt;Real-time segments are divided by event time, when new event comes, it will be calculated which segment it will be located, if the segment doesn’t exist, create a new one.&lt;/p&gt;
-
-&lt;p&gt;The new created segment is in ‘Active’ state first, if no further events coming into the segment after some preconfigured period, the segment state will be changed to ‘Immutable’, and then write to remote HDFS.&lt;/p&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/blog/rt_stream_rt_segment_state.png&quot; alt=&quot;Kylin RT Streaming Segment State&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;Each real-time segment has a memory store, new event will first goes into the memory store to do aggregation, when the memory store size reaches the configured threshold, it will be then be flushed to local disk as a fragment file.&lt;/p&gt;
-
-&lt;p&gt;Not all cuboids are built in the receiver side, only basic cuboid and some specified cuboids are built.&lt;/p&gt;
-
-&lt;p&gt;The data is stored as columnar format on disk, and when there are too many fragments on disk, the fragment files will be merged by a background thread automatically.&lt;/p&gt;
-
-&lt;p&gt;The directory structure in receiver side is like:&lt;/p&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/blog/rt_stream_dir_structure.png&quot; alt=&quot;Kylin RT Streaming Segment Directory&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;To improve the query performance, the data is stored in columnar format, the data format is like:&lt;/p&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/blog/rt_stream_columnar_format.png&quot; alt=&quot;Kylin RT Streaming Columnar Format&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;Each cuboid data is stored together, and in each cuboid the data is stored column by column, the metadata is stored in json format.&lt;/p&gt;
-
-&lt;p&gt;The dimension data is divided into 3 parts:&lt;/p&gt;
-
-&lt;p&gt;The first part is Dictionary part, this part exists when the dimension encoding is set to ‘Dict’ in cube design, by default we use &lt;a href=&quot;https://kylin.apache.org/blog/2015/08/13/kylin-dictionary/&quot;&gt;tri-tree dictionary&lt;/a&gt; to minimize the memory footprints and preserve the original order.&lt;/p&gt;
-
-&lt;p&gt;The second part is dictionary encoded values, additional compression mechanism can be applied to these values, since the values for the same column are usually similar, so the compression rate will be very good.&lt;/p&gt;
-
-&lt;p&gt;The third part is invert-index data, use Roaring Bitmap to store the invert-index info, the following picture shows how invert-index data is stored, there are two types of format, the first one is dictionary encoding dimension’s index data format, the second is other fix-len encoding dimension’s index data format.&lt;/p&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/blog/rt_stream_invertindex_format.png&quot; alt=&quot;Kylin RT Streaming InvertIndex Format&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;Real-time data is stored in compressed format, currently support two type compression: Run Length Encoding and LZ4.&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;Use RLE compression for time-related dim and first dim&lt;/li&gt;
-  &lt;li&gt;Use LZ4 for other dimensions by default&lt;/li&gt;
-  &lt;li&gt;Use LZ4 Compression for simple-type measure(long, double)&lt;/li&gt;
-  &lt;li&gt;No compression for complex measure(count distinct, topn, etc.)&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;h3 id=&quot;high-availability&quot;&gt;High Availability&lt;/h3&gt;
-
-&lt;p&gt;Streaming receivers are group into replica-sets, all receivers in the same replica-set  share the same assignments, so that when one receiver is down, the query and event consuming will not be impacted.&lt;/p&gt;
-
-&lt;p&gt;In each replica-set, there is a lead responsible to upload real-time segments to HDFS, and zookeeper is used to do leader election&lt;/p&gt;
-
-&lt;h3 id=&quot;failure-recovery&quot;&gt;Failure Recovery&lt;/h3&gt;
-
-&lt;p&gt;We do checkpoint periodically in receiver side, so that when the receiver is restarted, the data can be restored correctly.&lt;/p&gt;
-
-&lt;p&gt;There are two parts in the checkpoint: the first part is the streaming source consume info, for Kafka it is {partition:offset} pairs, the second part is disk states {segment:framentID} pairs, which means when do the checkpoint what’s the max fragmentID for each segment.&lt;/p&gt;
-
-&lt;p&gt;When receiver is restarted, it will check the latest checkpoint, set the Kafka consumer to start to consume data from specified partition offsets, and remove the fragment files that the fragmentID is larger than the checkpointed fragmentID on the disk.&lt;/p&gt;
-
-&lt;p&gt;Besides the local checkpoint, we also have remote checkpoint, to restore the state when the disk is crashed, the remote checkpoint is saved to Cube Segment metadata after HBase segment build, like:&lt;br /&gt;
-&lt;code class=&quot;highlighter-rouge&quot;&gt;
-    ”segments”:[{…,
-    	 &quot;stream_source_checkpoint&quot;: {&quot;0&quot;:8946898241, “1”: 8193859535, ...}
-                 },
-	]
-&lt;/code&gt;&lt;br /&gt;
-The checkpoint info is the smallest partition offsets on the streaming receiver when real-time segment is sent to full build.&lt;/p&gt;
-
-&lt;h2 id=&quot;future&quot;&gt;Future&lt;/h2&gt;
-&lt;ul&gt;
-  &lt;li&gt;Star Schema Support&lt;/li&gt;
-  &lt;li&gt;Streaming Receiver On Kubernetes/Yarn&lt;/li&gt;
-&lt;/ul&gt;
-</description>
-        <pubDate>Fri, 12 Apr 2019 09:30:00 -0700</pubDate>
-        <link>http://kylin.apache.org/blog/2019/04/12/rt-streaming-design/</link>
-        <guid isPermaLink="true">http://kylin.apache.org/blog/2019/04/12/rt-streaming-design/</guid>
-        
-        
-        <category>blog</category>
         
       </item>
     

Added: kylin/site/images/blog/youzan/1 history_of_youzan_OLAP.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/youzan/1%20history_of_youzan_OLAP.png?rev=1890886&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/youzan/1 history_of_youzan_OLAP.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/youzan/10 commodity_insight.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/youzan/10%20commodity_insight.png?rev=1890886&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/youzan/10 commodity_insight.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/youzan/2 kylin4_storage.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/youzan/2%20kylin4_storage.png?rev=1890886&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/youzan/2 kylin4_storage.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/youzan/3 kylin4_build_engine.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/youzan/3%20kylin4_build_engine.png?rev=1890886&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/youzan/3 kylin4_build_engine.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/youzan/4 kylin4_query.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/youzan/4%20kylin4_query.png?rev=1890886&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/youzan/4 kylin4_query.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/youzan/5 cache_calcite_plan.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/youzan/5%20cache_calcite_plan.png?rev=1890886&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/youzan/5 cache_calcite_plan.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/youzan/6 tuning_spark_configuration.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/youzan/6%20tuning_spark_configuration.png?rev=1890886&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/youzan/6 tuning_spark_configuration.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/youzan/7 parquet_optimization.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/youzan/7%20parquet_optimization.png?rev=1890886&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/youzan/7 parquet_optimization.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/youzan/8 Dynamic_elimination_of_partitioning_dimensions.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/youzan/8%20Dynamic_elimination_of_partitioning_dimensions.png?rev=1890886&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/youzan/8 Dynamic_elimination_of_partitioning_dimensions.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/youzan/9 cache_parent_dataset.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/youzan/9%20cache_parent_dataset.png?rev=1890886&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/youzan/9 cache_parent_dataset.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/youzan_cn/1 kylin4_storage.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/youzan_cn/1%20kylin4_storage.png?rev=1890886&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/youzan_cn/1 kylin4_storage.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/youzan_cn/10 Processing data skew.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/youzan_cn/10%20Processing%20data%20skew.png?rev=1890886&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/youzan_cn/10 Processing data skew.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/youzan_cn/11 metadata_upgrade.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/youzan_cn/11%20metadata_upgrade.png?rev=1890886&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/youzan_cn/11 metadata_upgrade.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/youzan_cn/12 commodity_insight.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/youzan_cn/12%20commodity_insight.png?rev=1890886&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/youzan_cn/12 commodity_insight.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/youzan_cn/13 cube_query.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/youzan_cn/13%20cube_query.png?rev=1890886&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/youzan_cn/13 cube_query.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/youzan_cn/14 youzan_plan.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/youzan_cn/14%20youzan_plan.png?rev=1890886&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/youzan_cn/14 youzan_plan.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/youzan_cn/2 kylin4_build_engine.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/youzan_cn/2%20kylin4_build_engine.png?rev=1890886&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/youzan_cn/2 kylin4_build_engine.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/youzan_cn/3 kylin4_query.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/youzan_cn/3%20kylin4_query.png?rev=1890886&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/youzan_cn/3 kylin4_query.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/youzan_cn/4 dynamic_elimination_dimension_partition.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/youzan_cn/4%20dynamic_elimination_dimension_partition.png?rev=1890886&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/youzan_cn/4 dynamic_elimination_dimension_partition.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/youzan_cn/5 Partition clipping under complex filter.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/youzan_cn/5%20Partition%20clipping%20under%20complex%20filter.png?rev=1890886&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/youzan_cn/5 Partition clipping under complex filter.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/youzan_cn/6 tuning_spark_configuration.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/youzan_cn/6%20tuning_spark_configuration.png?rev=1890886&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/youzan_cn/6 tuning_spark_configuration.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/youzan_cn/8 small_query_optimization.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/youzan_cn/8%20small_query_optimization.png?rev=1890886&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/youzan_cn/8 small_query_optimization.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/youzan_cn/9 cache_parent_dataset.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/youzan_cn/9%20cache_parent_dataset.png?rev=1890886&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/youzan_cn/9 cache_parent_dataset.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Modified: kylin/site/images/docs/quickstart/advance_setting.png
URL: http://svn.apache.org/viewvc/kylin/site/images/docs/quickstart/advance_setting.png?rev=1890886&r1=1890885&r2=1890886&view=diff
==============================================================================
Binary files - no diff available.