Posted to commits@kylin.apache.org by xx...@apache.org on 2021/06/18 02:29:18 UTC

[kylin] branch document updated (71c3ce7 -> 2153684)

This is an automated email from the ASF dual-hosted git repository.

xxyu pushed a change to branch document
in repository https://gitbox.apache.org/repos/asf/kylin.git.


    from 71c3ce7  Update docs40
     new 91142eb  Add youzan blog
     new 2153684  Update doc4

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 website/_docs40/gettingstarted/quickstart.cn.md    |  26 ++--
 website/_docs40/gettingstarted/quickstart.md       |  28 ++--
 .../howto/howto_build_cube_with_restapi.cn.md      |   3 +
 .../_docs40/howto/howto_build_cube_with_restapi.md |   3 +
 website/_docs40/howto/howto_config_spark_pool.md   |   2 +-
 .../howto/howto_optimize_build_and_query.cn.md     |   5 +-
 .../howto/howto_optimize_build_and_query.md        |   5 +-
 website/_docs40/tutorial/create_cube.cn.md         |  15 +--
 website/_docs40/tutorial/kylin_sample.cn.md        |  24 ----
 website/_docs40/tutorial/kylin_sample.md           |  24 ----
 .../2021-06-17-Why-did-Youzan-choose-Kylin4.cn.md  | 144 +++++++++++++++++++++
 .../2021-06-17-Why-did-Youzan-choose-Kylin4.md     | 136 +++++++++++++++++++
 .../blog/youzan/1 history_of_youzan_OLAP.png       | Bin 0 -> 466850 bytes
 .../images/blog/youzan/10 commodity_insight.png    | Bin 0 -> 426280 bytes
 website/images/blog/youzan/2 kylin4_storage.png    | Bin 0 -> 501891 bytes
 .../images/blog/youzan/3 kylin4_build_engine.png   | Bin 0 -> 281994 bytes
 website/images/blog/youzan/4 kylin4_query.png      | Bin 0 -> 472838 bytes
 .../images/blog/youzan/5 cache_calcite_plan.png    | Bin 0 -> 195334 bytes
 .../blog/youzan/6 tuning_spark_configuration.png   | Bin 0 -> 533283 bytes
 .../images/blog/youzan/7 parquet_optimization.png  | Bin 0 -> 521829 bytes
 ...amic_elimination_of_partitioning_dimensions.png | Bin 0 -> 211173 bytes
 .../images/blog/youzan/9 cache_parent_dataset.png  | Bin 0 -> 345515 bytes
 website/images/blog/youzan_cn/1 kylin4_storage.png | Bin 0 -> 159305 bytes
 .../blog/youzan_cn/10 Processing data skew.png     | Bin 0 -> 125832 bytes
 .../images/blog/youzan_cn/11 metadata_upgrade.png  | Bin 0 -> 129208 bytes
 .../images/blog/youzan_cn/12 commodity_insight.png | Bin 0 -> 160073 bytes
 website/images/blog/youzan_cn/13 cube_query.png    | Bin 0 -> 222507 bytes
 website/images/blog/youzan_cn/14 youzan_plan.png   | Bin 0 -> 67416 bytes
 .../blog/youzan_cn/2 kylin4_build_engine.png       | Bin 0 -> 178576 bytes
 website/images/blog/youzan_cn/3 kylin4_query.png   | Bin 0 -> 145397 bytes
 .../4 dynamic_elimination_dimension_partition.png  | Bin 0 -> 278256 bytes
 .../5 Partition clipping under complex filter.png  | Bin 0 -> 140487 bytes
 .../youzan_cn/6 tuning_spark_configuration.png     | Bin 0 -> 209973 bytes
 .../blog/youzan_cn/8 small_query_optimization.png  | Bin 0 -> 166400 bytes
 .../blog/youzan_cn/9 cache_parent_dataset.png      | Bin 0 -> 83177 bytes
 website/images/docs/quickstart/advance_setting.png | Bin 112356 -> 97288 bytes
 36 files changed, 325 insertions(+), 90 deletions(-)
 delete mode 100644 website/_docs40/tutorial/kylin_sample.cn.md
 delete mode 100644 website/_docs40/tutorial/kylin_sample.md
 create mode 100644 website/_posts/blog/2021-06-17-Why-did-Youzan-choose-Kylin4.cn.md
 create mode 100644 website/_posts/blog/2021-06-17-Why-did-Youzan-choose-Kylin4.md
 create mode 100644 website/images/blog/youzan/1 history_of_youzan_OLAP.png
 create mode 100644 website/images/blog/youzan/10 commodity_insight.png
 create mode 100644 website/images/blog/youzan/2 kylin4_storage.png
 create mode 100644 website/images/blog/youzan/3 kylin4_build_engine.png
 create mode 100644 website/images/blog/youzan/4 kylin4_query.png
 create mode 100644 website/images/blog/youzan/5 cache_calcite_plan.png
 create mode 100644 website/images/blog/youzan/6 tuning_spark_configuration.png
 create mode 100644 website/images/blog/youzan/7 parquet_optimization.png
 create mode 100644 website/images/blog/youzan/8 Dynamic_elimination_of_partitioning_dimensions.png
 create mode 100644 website/images/blog/youzan/9 cache_parent_dataset.png
 create mode 100644 website/images/blog/youzan_cn/1 kylin4_storage.png
 create mode 100644 website/images/blog/youzan_cn/10 Processing data skew.png
 create mode 100644 website/images/blog/youzan_cn/11 metadata_upgrade.png
 create mode 100644 website/images/blog/youzan_cn/12 commodity_insight.png
 create mode 100644 website/images/blog/youzan_cn/13 cube_query.png
 create mode 100644 website/images/blog/youzan_cn/14 youzan_plan.png
 create mode 100644 website/images/blog/youzan_cn/2 kylin4_build_engine.png
 create mode 100644 website/images/blog/youzan_cn/3 kylin4_query.png
 create mode 100644 website/images/blog/youzan_cn/4 dynamic_elimination_dimension_partition.png
 create mode 100644 website/images/blog/youzan_cn/5 Partition clipping under complex filter.png
 create mode 100644 website/images/blog/youzan_cn/6 tuning_spark_configuration.png
 create mode 100644 website/images/blog/youzan_cn/8 small_query_optimization.png
 create mode 100644 website/images/blog/youzan_cn/9 cache_parent_dataset.png

[kylin] 01/02: Add youzan blog

Posted by xx...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

xxyu pushed a commit to branch document
in repository https://gitbox.apache.org/repos/asf/kylin.git

commit 91142eb31f0db129b22599c18c0c356dd01a1805
Author: yaqian.zhang <59...@qq.com>
AuthorDate: Thu Jun 17 18:45:45 2021 +0800

    Add youzan blog
---
 .../2021-06-17-Why-did-Youzan-choose-Kylin4.cn.md  | 144 +++++++++++++++++++++
 .../2021-06-17-Why-did-Youzan-choose-Kylin4.md     | 136 +++++++++++++++++++
 .../blog/youzan/1 history_of_youzan_OLAP.png       | Bin 0 -> 466850 bytes
 .../images/blog/youzan/10 commodity_insight.png    | Bin 0 -> 426280 bytes
 website/images/blog/youzan/2 kylin4_storage.png    | Bin 0 -> 501891 bytes
 .../images/blog/youzan/3 kylin4_build_engine.png   | Bin 0 -> 281994 bytes
 website/images/blog/youzan/4 kylin4_query.png      | Bin 0 -> 472838 bytes
 .../images/blog/youzan/5 cache_calcite_plan.png    | Bin 0 -> 195334 bytes
 .../blog/youzan/6 tuning_spark_configuration.png   | Bin 0 -> 533283 bytes
 .../images/blog/youzan/7 parquet_optimization.png  | Bin 0 -> 521829 bytes
 ...amic_elimination_of_partitioning_dimensions.png | Bin 0 -> 211173 bytes
 .../images/blog/youzan/9 cache_parent_dataset.png  | Bin 0 -> 345515 bytes
 website/images/blog/youzan_cn/1 kylin4_storage.png | Bin 0 -> 159305 bytes
 .../blog/youzan_cn/10 Processing data skew.png     | Bin 0 -> 125832 bytes
 .../images/blog/youzan_cn/11 metadata_upgrade.png  | Bin 0 -> 129208 bytes
 .../images/blog/youzan_cn/12 commodity_insight.png | Bin 0 -> 160073 bytes
 website/images/blog/youzan_cn/13 cube_query.png    | Bin 0 -> 222507 bytes
 website/images/blog/youzan_cn/14 youzan_plan.png   | Bin 0 -> 67416 bytes
 .../blog/youzan_cn/2 kylin4_build_engine.png       | Bin 0 -> 178576 bytes
 website/images/blog/youzan_cn/3 kylin4_query.png   | Bin 0 -> 145397 bytes
 .../4 dynamic_elimination_dimension_partition.png  | Bin 0 -> 278256 bytes
 .../5 Partition clipping under complex filter.png  | Bin 0 -> 140487 bytes
 .../youzan_cn/6 tuning_spark_configuration.png     | Bin 0 -> 209973 bytes
 .../blog/youzan_cn/8 small_query_optimization.png  | Bin 0 -> 166400 bytes
 .../blog/youzan_cn/9 cache_parent_dataset.png      | Bin 0 -> 83177 bytes
 25 files changed, 280 insertions(+)

diff --git a/website/_posts/blog/2021-06-17-Why-did-Youzan-choose-Kylin4.cn.md b/website/_posts/blog/2021-06-17-Why-did-Youzan-choose-Kylin4.cn.md
new file mode 100644
index 0000000..11aec43
--- /dev/null
+++ b/website/_posts/blog/2021-06-17-Why-did-Youzan-choose-Kylin4.cn.md
@@ -0,0 +1,144 @@
+---
+layout: post-blog
+title:  有赞为什么选择 Kylin4
+date:   2021-06-17 15:00:00
+author: 郑生俊
+categories: cn_blog
+---
+在 2021年5月29日举办的 QCon 全球软件开发者大会上,来自有赞的数据基础平台负责人 郑生俊 在大数据开源框架与应用专题上分享了有赞内部对 Kylin 4.0 的使用经历和优化实践,对于众多 Kylin 老用户来说,这也是升级 Kylin 4 的实用攻略。
+
+本次分享主要分为以下四个部分:
+
+- 有赞选用 Kylin 4 的原因
+- Kylin 4 原理介绍
+- Kylin 4 性能优化
+- Kylin 4 在有赞的实践
+
+## 01 有赞选用 Kylin 4 的原因
+首先分享有赞为什么会选择升级为 Kylin 4,这里先简单回顾一下有赞 OLAP 的发展历程:有赞初期为了快速迭代,选择了预计算 + MySQL 的方式;2018年,因为查询灵活和开发效率引入了 Druid,但是存在预聚合度不高、不支持精确去重和明细 OLAP 等问题;在这样的背景下,有赞引入了满足聚合度高、支持精确去重和 RT 最低的 Apache Kylin 和查询非常灵活的 ROLAP ClickHouse。
+
+从2018年引入 Kylin 到现在,有赞已经使用 Kylin 三年多了。随着业务场景的不断丰富和数据量的不断积累,有赞目前有 600 万的存量商家,2020年 GMV 是 1073亿,日构建量为 100 亿+,目前 Kylin 已经基本覆盖了有赞所有的业务范围。
+
+随着有赞自身的迅速发展和不断深入地使用 Kylin,我们也遇到一些挑战:
+- 首先 Kylin on HBase 的构建性能无法满足有赞的预期,构建性能会影响到用户的故障恢复时间和稳定性的体验;
+- 其次,随着更多大商家(单店千万级别会员、数十万商品)的接入,对我们的查询也带来了很大的挑战。Kylin on HBase 受限于 QueryServer 单点查询的局限,无法很好地支持这些复杂的场景;
+- 最后,因为 HBase 不是一个云原生系统,很难做到弹性的资源伸缩,随着数据量的不断增长,这个系统对于商家而言,使用时间是存在高峰和低谷的,这就造成平均的资源使用率不够高。
+
+面对这些挑战,有赞选择去向更云原生的 Apache Kylin 4 去靠拢和升级。
+
+## 02 Kylin 4 原理介绍
+首先介绍一下 Kylin 4 的主要优势。Apache Kylin 4 是完全基于 Spark 去做构建和查询的,能够充分地利用 Spark的并行化、向量化和全局动态代码生成等技术,去提高大查询的效率。
+这里从存储、构建和查询三个部分简单介绍一下 Kylin 4 的原理。
+
+### 存储
+![](/images/blog/youzan_cn/1 kylin4_storage.png)
+首先来看一下,Kylin on HBase 和 Kylin on Parquet 的对比。Kylin on HBase 的 Cuboid 的数据是存放在 HBase 的表里,一个 Segment 对应了一张 HBase 表,查询下推的工作由 HBase 协理器处理,因为 HBase 不是真正的列存并且对 OLAP 而言吞吐量不高。Kylin 4 将 HBase 替换为 Parquet,也就是把所有的数据按照文件存储,每个 Segment 会存在一个对应的 HDFS 的目录,所有的查询、构建都是直接通过读写文件的方式,不用再经过 HBase。虽然对于小查询的性能会有一定损失,但对于复杂查询带来的提升是更可观的、更值得的。                                  
+
+### 构建引擎
+![](/images/blog/youzan_cn/2 kylin4_build_engine.png)
+其次是 Kylin 构建引擎,基于有赞的测试,Kylin on Parquet 的构建速度已经从 82 分钟优化到了 15 分钟,有以下几个原因:
+
+- Kylin 4 去掉了维度字典的编码,省去了编码的一个构建步骤;
+- 去掉了 HBase File 的生成步骤;
+- 新版本的 Kylin 4 所有的构建步骤都转换为 Spark 进行构建;
+- Kylin on Parquet 基于 Cuboid 去划分构建粒度,有利于进一步地提升并行度。
+
+可以看到右侧,从十个步骤简化到了两个步骤,构建性能提升的非常明显的。
+
+### 查询引擎
+![](/images/blog/youzan_cn/3 kylin4_query.png)
+
+接下来就是 Kylin 4 的查询,大家可以看到,左边这列 Kylin on HBase 的计算是完全依托于 Calcite 和 HBase 的协处理器,这就导致当数据从 HBase 读取后,如果想做聚合、排序等,就会局限于 QueryServer 单点的瓶颈,而 Kylin 4 则转换为基于 Spark DataFrame 的全分布式的查询机制。
+
+## 03 Kylin 4 性能优化
+接下来分享有赞在 Kylin 4 所做的一些性能优化。
+
+### 查询性能优化
+#### 1.动态消除维度分区
+![](/images/blog/youzan_cn/4 dynamic_elimination_dimension_partition.png)
+
+首先我们来看一个场景,我们做到了动态消除分区维度,混合使用 cuboid 来对复杂查询,减少数十倍的计算量。
+
+这里举一个例子,在一个 Cube 有三个 Segment 的情况下,Cube 分区字段记作 P,它有三个 Segment 分别是1月1日到2月1日、2月1日到3月1日,3月1日到3月7日。假设有一个SQL,Select count(a) from test where p >= 20200101 and p <= 20200313 group by a。
+
+在这种情况下,因为需要分区过滤,Kylin 它会选择 a 和 p 预计算维度的组合,转换成执行计划就是最上层的 Aggregate 然后 Filter,最后会转换成一个 TableScan,这个 TableScan 就是选择聚合维度为 a 和 p 这样的一个维度组合。实际上这个查询计划是适合把它优化成右边这种方式的,对于某个 Segment 完全使用到的数据,我们可以选择一个 Cuboid 为 a 的 Cuboid 去做查询。对于部分用到的分区或者 Segment,我们可以选择 a 和 p 这样的一个维度组合。通过这种方式,在 a 只有一个可能值的情况下,之前可能要 scan 65 条数据,优化后只要 scan 8 条数据。假设时间跨度更长,比如说跨几个月、半年甚至一年,就会减少数十倍、几十倍的计算量和 IO。
+
+在有赞某些场景,RT 可以从 10 秒优化到 3 秒、20s 提升到 6s,对于更复杂的场景(比如计算密集型的 HLL),会有更显著的优化效果。这部分优化,有赞也正打算贡献回社区。因为涉及到如何在多层嵌套和复杂的条件下进行 segment 分组,以及目前 calcite 和 spark catalyst 并存,实现上会比较复杂。到时候大家在 Kylin 4.0-GA 版本可能就可以看到这个优化了。
+
+#### 2.复杂过滤条件下的分区裁剪
+接下来再介绍一下有赞所做的查询性能优化,就是支持复杂过滤条件下的分区裁剪。目前 Kylin 4.0 Beta 版本对于复杂的过滤条件比如多个过滤字段、多层嵌套的 Filter 等,不支持分区裁剪,导致全表扫描。我们做了一个优化,是将复杂的嵌套 Filter 过滤的语法树转换成基于分区字段 p 的一个等价表达式,然后再将这个表达式应用到每一个 Segment 去做过滤,通过这样的方式,去支持它做到一个非常复杂的分区过滤裁剪。
+
+![](/images/blog/youzan_cn/5 Partition clipping under complex filter.png)
+
+#### 3.Spark 参数调优
+![](/images/blog/youzan_cn/6 tuning_spark_configuration.png)
+
+接下来是比较重要的一部分,就是关于 Spark 的调参。Spark 是一个分布式计算框架,相比 Calcite 而言,对于小查询是存在一定劣势的。
+
+首先我们做了一个调整,尽量让 Spark 所有的计算操作是在内存中完成的。以下两种情况会产生 spill:
+- 01 在聚合时,在我们内存不够的时候,Spark 会将 HashAggregate 转换为 Sort Based Aggregate,实际上这一步是很耗性能的。我们通过调大阈值的参数,尽量让所有的聚合都在内存中完成。
+- 02 在 shuffle 的过程中,Spark是不可避免地会进行 Spill,会落盘,我们能做的尽量在 Shuffle 过程减少 Spill,只在最后 Shuffle 结束之后进行 Spill。
+
+第二个我们做的调优是,相比 on YARN/Standalone 模式下,local 模式大部分都是在进程内通信的,也不需要产生跨网络的 Shuffle, broadcast 广播变量也不需要跨网络,所以对于小查询,我们会路由到以 Local 模式运行的 Spark Application,这对于小查询非常有意义。
+
+第三个优化是 shuffle 使用内存盘。因为内存盘肯定是最快的,我们将内存盘挂载为 tmpfs 文件系统,然后将 spark.local.dir 指定为挂载的内存盘去优化 shuffle 的速度和吞吐。
+
+第四个优化是我们关闭 Spark 全局动态代码生成。Spark 的全局动态代码生成是要在运行的时间内去动态拼接代码,再去动态编译代码,这个过程实际上是很耗时的。对于离线的大数据量下是很有优化意义,但是对于比较小的一些数据场景,我们关掉这个动态代码生成之后,能够节省大概 100 到 200 毫秒的耗时。
+
+目前经过上述一系列的优化,我们能让小查询的 RT 稳定在大概 300 毫秒左右,尽管 HBase 可能是几十毫秒左右的 RT,但我们认为目前已经比较接近了,这种为提升大查询提升的 Tradeoff 我们认为是一个很值得的事情。
+
+#### 4.小查询优化
+![](/images/blog/youzan_cn/8 small_query_optimization.png)
+然后,我来分享一下小查询的优化。Kylin on HBase 依托于 HBase 能够做到几十毫秒的 RT,因为 HBase 有 bucket cache 缓存。而 Kylin on Parquet 就完全基于文件的读取和计算,缓存依赖于文件系统的 page cache,那么它小查询的 RT 会比 HBase 更高一些,我们能做的就是尽量缩小 Kylin on Parquet 和 Kylin on HBase 的 RT 差距。
+
+经过我们的分析,SQL 会通过 Calcite 解析成 Calcite 语法树,然后将这个语法树转化为 Spark DataFrame,最终再将整个查询交给 Spark 去执行。在这一步的过程中,SQL 转化成 Calcite 的过程中,是需要经过语法解析、优化等,这一步大概会消耗 150 毫秒左右。有赞做的是尽量使用结构化的 SQL,就是 PreparedStatement,我们在 Kylin 中支持 PreparedStatementCache,对于固定的 SQL 格式,将它的执行计划进行缓存,去重用这样的执行计划,降低该步骤的时间消耗,通过这样的优化,可以降低大概 100 毫秒左右的耗时。
+
+#### 5.Parquet 优化
+
+关于查询性能的优化,有赞还充分利用了 Parquet 索引,优化建议包括: 
+
+- Parquet 文件首先根据 Shard By Column 进行分组,过滤条件尽量包含 Shard By Column;
+
+- Parquet 中的数据依然按照维度排序,结合 Column MetaData 中的 Max、Min 索引,在命中前缀索引时能够过滤掉大量数据;
+
+- 调小 RowGroup Size 增大索引粒度等。
+
+### 构建性能优化
+#### 1.对 parent dataset 做缓存
+![](/images/blog/youzan_cn/9 cache_parent_dataset.png)
+
+#### 2.处理空值导致的数据倾斜
+![](/images/blog/youzan_cn/10 Processing data skew.png)
+
+更多关于构建优化的细节内容大家可以参考 [Kylin 4 最新功能预览 + 优化实践抢先看](https://mp.weixin.qq.com/s/T_mK7pTAgk2PXnSJ0lbZ_w)
+
+## 04 Kylin 4 在有赞的实践
+介绍有赞的优化之后,我们再来分享一下优化的效果,也就是 Kylin 4 在有赞的实践包括升级过程以及上线的效果。
+
+### 元数据升级
+首先是如何升级,我们开发了一个元数据无缝升级的工具,首先我们在 Kylin on HBase 的元数据是保存在 HBase 里的,我们将 HBase 里的元数据以文件的格式导出,再将文件格式的元数据写入到 MySQL,我们也在 Apache Kylin 的官方 wiki 更新了操作文档以及大致的原理,更多详情大家可以参考:[如何升级元数据到kylin4](https://wiki.apache.org/confluence/display/KYLIN/How+to+migrate+metadata+to+Kylin+4).
+![](/images/blog/youzan_cn/11 metadata_upgrade.png)
+我们大致介绍一下整个过程中的一些兼容性,需要迁移的数据大概有六个:前三个是 project 元信息,tables 的元信息,包括一些 Hive 表,还有 model 模型定义的一些元信息,这些是不需要修改的。需要修改的就是 Cube 的元信息。这部分需要修改哪些东西呢?首先是 Cube 所使用的存储和查询的类型,更新完这两个字段之后,需要重新计算一下 Cube 的签名,这个签名的作用是 Kylin 内部设计的避免 Cube 确定之后我们再去修改 Cube 导致的一些问题;最后一个是权限相关,这部分也是兼容,无需修改的。
+
+### Kylin 4 在有赞上线后的表现
+![](/images/blog/youzan_cn/12 commodity_insight.png)
+
+元数据迁移到 Kylin 之后,我们来分享一下在有赞的一些场景下带来了的质变和大幅度的性能提升。首先像商品洞察这样一个场景,有一个数十万商品的大店铺,我们要去分析它的交易和流量等,有十几个精确去重的计算。精确去重如果没有通过预计算和 Bitmap 去做优化实际上效率是很低的,Kylin 目前使用 Bitmap 去做精确去重的支持。在一个需要对几十万个商品的各种 UV 去做排序的复杂查询的场景,Kylin 2 的 RT 是 27 秒,而在 Kylin 4 这个场景的 RT 从 27 秒降到了 2 秒以内。
+
+我觉得 Kylin 4 最吸引我的地方是它完全变成了一个手动档,而 Kylin on HBase 实际上是一个自动档,因为它的并发完全和 region 的数量绑定了。
+
+![](/images/blog/youzan_cn/13 cube_query.png)
+
+### Kylin 4 在有赞的未来计划
+Kylin 4 在有赞的升级大致包含以下几个步骤:
+![](/images/blog/youzan_cn/14 youzan_plan.png)
+
+第一阶段就是调研和可用性测试,因为 Kylin on Parquet 实际上是基于 Spark,是有一定的学习成本的,这个我们也花了一段时间;
+
+第二阶段就是语法兼容性测试,我们扩展了 Kylin 4 初期不支持的一些语法,比如说分页查询的语法等;
+
+第三阶段就是流量重放,逐步地上线 Cube 等;
+
+我们现在是属于第四阶段,我们已经迁移了一些数据了,未来的话,我们会逐步地下线旧集群,然后将所有的业务往新集群上去迁移。
+
+关于 Kylin 4 我们未来计划开发的功能和满足的需求有赞也会在社区去同步。就不在这里做详细介绍了,大家可以关注我们社区的最新动态,以上就是我们的分享。
\ No newline at end of file
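
The Chinese post above describes partition pruning under complex filter conditions (section 03, item 2): the nested filter tree is converted into an equivalent expression on the partition column p, which is then applied to every segment. A minimal sketch of that idea follows. The tuple-based filter tree, the helper names (`partition_ranges`, `prune_segments`) and the integer partition values are assumptions made for illustration, not Kylin's actual implementation. Only `>=` and `<` leaves on p are handled; predicates on other columns are conservatively treated as matching every value of p.

```python
import math

# Half-open ranges [lo, hi) over the partition column p.
FULL = [(-math.inf, math.inf)]


def normalize(intervals):
    """Sort the ranges, drop empty ones and merge those that touch or overlap."""
    merged = []
    for lo, hi in sorted(i for i in intervals if i[0] < i[1]):
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def intersect(a, b):
    """Intersect two lists of ranges."""
    return normalize([(max(l1, l2), min(h1, h2)) for l1, h1 in a for l2, h2 in b])


def partition_ranges(node):
    """Reduce a nested filter tree to the ranges of p that may satisfy it."""
    op = node[0]
    if op == "and":
        result = FULL
        for child in node[1:]:
            result = intersect(result, partition_ranges(child))
        return result
    if op == "or":
        return normalize([r for child in node[1:] for r in partition_ranges(child)])
    _, column, value = node  # leaf: (comparison, column, value)
    if column != "p":
        return FULL  # a predicate on another column cannot prune segments
    if op == ">=":
        return [(value, math.inf)]
    if op == "<":
        return [(-math.inf, value)]
    raise ValueError(f"unsupported comparison on p: {op}")


def prune_segments(segments, filter_tree):
    """Keep only the segments whose [start, end) range may contain matching rows."""
    ranges = partition_ranges(filter_tree)
    return [seg for seg in segments if intersect([seg], ranges)]


# (p >= 10 AND p < 20 AND a = 1) OR (p >= 50 AND p < 60)
example_filter = (
    "or",
    ("and", (">=", "p", 10), ("<", "p", 20), ("=", "a", 1)),
    ("and", (">=", "p", 50), ("<", "p", 60)),
)
kept = prune_segments([(0, 30), (30, 45), (45, 90)], example_filter)
```

In this hypothetical example the middle segment `(30, 45)` is pruned because neither branch of the filter can match a value of p inside it.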
diff --git a/website/_posts/blog/2021-06-17-Why-did-Youzan-choose-Kylin4.md b/website/_posts/blog/2021-06-17-Why-did-Youzan-choose-Kylin4.md
new file mode 100644
index 0000000..03f9ca1
--- /dev/null
+++ b/website/_posts/blog/2021-06-17-Why-did-Youzan-choose-Kylin4.md
@@ -0,0 +1,136 @@
+---
+layout: post-blog
+title:  Why did Youzan choose Kylin4
+date:   2021-06-17 15:00:00
+author: Zheng Shengjun
+categories: blog
+---
+At the QCon Global Software Developers Conference held on May 29, 2021, Zheng Shengjun, head of Youzan's data infrastructure platform, shared Youzan's experience with Kylin 4.0 and its optimization practices in the track on open source big data frameworks and applications.
+For many users of Kylin 2/3 (Kylin on HBase), this is also a chance to learn how and why to upgrade to Kylin 4.
+
+This talk is divided into the following parts:
+
+- The reason for choosing Kylin 4
+- Introduction to Kylin 4
+- How to optimize performance of Kylin 4
+- Practice of Kylin 4 in Youzan
+
+## 01 The reason for choosing Kylin 4
+
+### Introduction to Youzan
+China Youzan Co., Ltd (stock code 08083.HK) is an enterprise mainly engaged in retail technology services.
+At present, it offers several tools and solutions, providing SaaS software products and talent services to help merchants operate mobile social e-commerce and new retail channels in an all-round way.
+Currently Youzan has hundreds of millions of consumers and 6 million existing merchants.
+
+### History of Kylin in Youzan
+![](/images/blog/youzan/1 history_of_youzan_OLAP.png)
+
+First of all, I would like to share why Youzan chose to upgrade to Kylin 4. Let me briefly review the history of Youzan's OLAP infrastructure.
+
+In the early days of Youzan, in order to iterate quickly, we chose pre-computation + MySQL. In 2018, Druid was introduced for its query flexibility and development efficiency, but it had problems such as a low degree of pre-aggregation and no support for precise count distinct measures. In this situation, Youzan introduced Apache Kylin and ClickHouse. Kylin supports a high degree of aggregation, precise count distinct measures and the lowest RT, while ClickHouse is quite flex [...]
+
+Youzan has used Kylin for more than three years since introducing it in 2018. With the continuous enrichment of business scenarios and accumulation of data, Youzan currently has 6 million existing merchants, its GMV in 2020 was 107.3 billion, and the daily build data volume is more than 10 billion. At present, Kylin covers basically all of Youzan's business scenarios.
+
+### The challenges of Kylin 3
+With Youzan's rapid development and in-depth use of Kylin, we also encountered some challenges:
+
+- First of all, the build performance of Kylin on HBase cannot meet Youzan's expectations, and build performance affects failure recovery time and the stability that users experience;
+- Secondly, as more large merchants come on board (tens of millions of members and hundreds of thousands of goods in a single store), our OLAP system faces great challenges. Kylin on HBase is limited by the single-point query of Query Server, and cannot support these complex scenarios well;
+- Finally, because HBase is not a cloud-native system, it is difficult to scale resources up and down elastically. As data volume keeps growing and merchants' usage has peaks and valleys over time, the average resource utilization rate is not high enough.
+
+Faced with these challenges, Youzan chose to move closer and upgrade to the more cloud-native Apache Kylin 4.
+
+## 02 Introduction to Kylin 4
+First of all, let's introduce the main advantages of Kylin 4. Apache Kylin 4 relies entirely on Spark for cubing jobs and queries. It can make full use of Spark's parallelization, vectorization, and whole-stage code generation to improve the efficiency of large queries.
+Here is a brief introduction to how Kylin 4 works, covering the storage engine, build engine and query engine.
+
+### Storage engine
+![](/images/blog/youzan/2 kylin4_storage.png)
+
+First of all, let's take a look at the new storage engine by comparing Kylin on HBase with Kylin on Parquet. The cuboid data of Kylin on HBase is stored in HBase tables. Each segment corresponds to one HBase table. Aggregation is pushed down to the HBase coprocessor.
+
+But as we know, HBase is not a true columnar store and its throughput is not sufficient for an OLAP system. Kylin 4 replaces HBase with Parquet, so all the data is stored in files. Each segment has a corresponding HDFS directory. All queries and cubing jobs read and write files directly, without going through HBase. Although there is a certain loss of performance for simple queries, the improvement for complex queries is more considerable and worthwhile.
+
+### Build engine
+![](/images/blog/youzan/3 kylin4_build_engine.png)
+
+The second is the new build engine. Based on our test, the build speed of Kylin on Parquet has been optimized from 82 minutes to 15 minutes. There are several reasons:
+
+- Kylin 4 removes dimension encoding, eliminating one build step;
+- The HBase file generation step is removed;
+- Kylin on Parquet changes the granularity of cubing to the cuboid level, which helps further improve the parallelism of cubing jobs;
+- The global dictionary implementation is enhanced. In the new algorithm, the dictionary and the source data are hashed into the same buckets, so only the matching dictionary bucket needs to be loaded to encode a given piece of source data.
+
+As you can see on the right, after upgrading to Kylin 4, the cubing job changes from ten steps to two steps, and the build performance improvement is very obvious.
+
+### Query engine
+![](/images/blog/youzan/4 kylin4_query.png)
+
+Next is the new query engine of Kylin 4. As you can see, the calculation of Kylin on HBase depends entirely on the HBase coprocessor and the query server process. When data is read from HBase into the query server for aggregation, sorting, etc., the single query server becomes the bottleneck. Kylin 4, in contrast, uses a fully distributed query mechanism based on Spark; what's more, it is able to tune configuration automatically in the Spark query step.
+
+## 03 How to optimize performance of Kylin 4
+Next, I'd like to share some performance optimizations made by Youzan in Kylin 4.
+
+### Optimization of query engine
+#### 1.Cache Calcite physical plan
+![](/images/blog/youzan/5 cache_calcite_plan.png)
+
+In Kylin 4, SQL is parsed, optimized and goes through code generation in Calcite. This step takes about 150ms for some queries. We have added PreparedStatementCache support to Kylin 4 to cache the Calcite plan, so that structured SQL does not have to repeat this step. This optimization saves about 150ms.
+
+#### 2.Tuning spark configuration
+![](/images/blog/youzan/6 tuning_spark_configuration.png)
+
+Kylin 4 uses Spark as its query engine. As Spark is a distributed engine designed for massive data processing, it inevitably loses some performance on small queries. We have done some tuning to approach the latency of Kylin on HBase for small queries.
+
+Our first optimization is to make more calculations finish in memory. The key is to avoid data spill during aggregation, shuffle and sort. Tuning the following configuration is helpful.
+
+- 1.set `spark.sql.objectHashAggregate.sortBased.fallbackThreshold` to a larger value to keep HashAggregate from falling back to Sort Based Aggregate, which really kills performance when it happens.
+- 2.set `spark.shuffle.spill.initialMemoryThreshold` to a large value to avoid too many spills during shuffle.
+
+Secondly, we route small queries to a Query Server that runs Spark in local mode, because the overhead of task scheduling, shuffle read and variable broadcast is magnified for small queries in YARN/Standalone mode.
+
+Thirdly, we use a RAM disk to enhance shuffle performance: mount the RAM disk as tmpfs and set `spark.local.dir` to a directory on it.
+
+Lastly, we disabled Spark's whole-stage code generation for small queries, because it costs about 100ms~200ms and brings little benefit to small queries, which are usually simple.
+
+#### 3.Parquet optimization
+![](/images/blog/youzan/7 parquet_optimization.png)
+
+Optimizing parquet is also important for queries.
+
+The first principle is to always include the shard-by column in the filter condition where possible: Parquet files are sharded by the shard-by column, so filtering on it reduces the number of data files to read.
+
+Next, looking inside the Parquet files, data within a file is sorted by the rowkey columns, which means prefix match in a query is as important as it is in Kylin on HBase. When a query condition satisfies prefix match, row groups can be filtered using the columns' max/min index. Furthermore, we can reduce the row group size to get finer index granularity, but be aware that the compression rate will be lower with a smaller row group size.
+
+#### 4.Dynamic elimination of partitioning dimensions
+Kylin 4 has a new ability that older versions lack, which can reduce data reading and computing by dozens of times for some big queries. It is often the case that the partition column is used to filter data but not as a group-by dimension. In those cases Kylin would always choose a cuboid containing the partition column, but now it can use different cuboids within the same query to reduce IO and computing.
+
+The key to this optimization is to split a query into two parts: one part covers segments whose data is used in full, so the partition column does not have to be included in the cuboid; the other part covers segments whose data is only partly used, and chooses a cuboid with the partition dimension to filter the data.
+
+In our tests, the response time dropped from 20s to 6s and from 10s to 3s in some situations.
+
+![](/images/blog/youzan/8 Dynamic_elimination_of_partitioning_dimensions.png)
+
+### Optimization of build engine
+#### 1.cache parent dataset
+![](/images/blog/youzan/9 cache_parent_dataset.png)
+
+Kylin builds cubes layer by layer. For a parent layer with multiple cuboids to build, we can choose to cache the parent dataset by setting kylin.engine.spark.parent-dataset.max.persist.count to a number greater than 0. But note that if you set this value too small, it will affect the parallelism of the build job, as the build granularity is at the cuboid level.
+
+## 04 Practice of Kylin 4 in Youzan
+After introducing Youzan's experience with performance optimization, let's share the results, that is, Kylin 4 in practice at Youzan, including the upgrade process and the performance of the online system.
+
+### Upgrade metadata to adapt to Kylin 4
+First of all, we have developed a tool for seamlessly upgrading metadata. Our Kylin on HBase (Kylin 3) metadata is stored in HBase. We export the metadata from HBase into local files, and then use the tool to transform it and write the new metadata into MySQL. We also updated the operation documents and general principles in the official wiki of Apache Kylin. For more details, you can refer to: [How to migrate metadata to Kylin 4](http [...]
+
+Let's give a general introduction to compatibility in the whole process. The project metadata, table metadata, permission-related metadata, and model metadata do not need to be modified. What needs to be modified is the cube metadata, namely the storage type and query type used by the cube. After updating these two fields, you need to recalculate the cube signature. Kylin uses this signature internally to avoid problems caused by modifying a cube after it has been finalized.
+
+### Performance of Kylin 4 on Youzan online system
+![](/images/blog/youzan/10 commodity_insight.png)
+
+After migrating metadata to Kylin 4, let's share the qualitative changes and substantial performance improvements seen in some of Youzan's scenarios. First of all, in a scenario like Commodity Insight, there is a large store with several hundred thousand commodities. We have to analyze its transactions and traffic, etc. There are more than a dozen precise count distinct measures in a single cube. A precise count distinct measure is actually very inefficient [...]
+
+What I find most appealing about Kylin 4 is that it is like a manual transmission car: you can control its query concurrency at will, whereas in Kylin on HBase you cannot change query concurrency freely, because its concurrency is completely tied to the number of regions.
+
+### Plan for Kylin 4 in Youzan
+We have spent several months fully testing Apache Kylin 4, fixing several bugs and improving it. Now we are migrating cubes from the older version to the newer version. For the cubes already migrated to Kylin 4, small query performance meets our expectations, and complex query and build performance came as a big surprise. We are planning to migrate all cubes from the older version to Kylin 4.
\ No newline at end of file
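
The "Dynamic elimination of partitioning dimensions" section of the post above splits a query into a part that uses whole segments and a part that uses segments only partially. A minimal sketch of that split is shown below. The `Segment` class, the `plan_segments` helper, the half-open date ranges and the sample dates are all assumptions chosen for illustration; they are not taken from Kylin's code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Segment:
    start: date  # inclusive
    end: date    # exclusive


def plan_segments(segments, query_start, query_end):
    """Split segments by how the half-open query range [query_start, query_end) covers them.

    Fully covered segments can be answered from a cuboid without the
    partition column; partially covered ones still need a cuboid that
    contains the partition column so the filter can be applied.
    """
    fully_covered, partially_covered = [], []
    for seg in segments:
        if seg.end <= query_start or seg.start >= query_end:
            continue  # no overlap: the segment is skipped entirely
        if query_start <= seg.start and seg.end <= query_end:
            fully_covered.append(seg)
        else:
            partially_covered.append(seg)
    return fully_covered, partially_covered


# Hypothetical cube with three monthly segments.
segments = [
    Segment(date(2020, 1, 1), date(2020, 2, 1)),
    Segment(date(2020, 2, 1), date(2020, 3, 1)),
    Segment(date(2020, 3, 1), date(2020, 4, 1)),
]
full, partial = plan_segments(segments, date(2020, 1, 1), date(2020, 3, 14))
```

With these sample dates, the January and February segments are fully covered and could be served from a cuboid that omits the partition column, while the March segment is only partially covered and still needs the cuboid that includes it.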
diff --git a/website/images/blog/youzan/1 history_of_youzan_OLAP.png b/website/images/blog/youzan/1 history_of_youzan_OLAP.png
new file mode 100644
index 0000000..c6833c4
Binary files /dev/null and b/website/images/blog/youzan/1 history_of_youzan_OLAP.png differ
diff --git a/website/images/blog/youzan/10 commodity_insight.png b/website/images/blog/youzan/10 commodity_insight.png
new file mode 100644
index 0000000..c2d55cc
Binary files /dev/null and b/website/images/blog/youzan/10 commodity_insight.png differ
diff --git a/website/images/blog/youzan/2 kylin4_storage.png b/website/images/blog/youzan/2 kylin4_storage.png
new file mode 100644
index 0000000..f055682
Binary files /dev/null and b/website/images/blog/youzan/2 kylin4_storage.png differ
diff --git a/website/images/blog/youzan/3 kylin4_build_engine.png b/website/images/blog/youzan/3 kylin4_build_engine.png
new file mode 100644
index 0000000..c8562ae
Binary files /dev/null and b/website/images/blog/youzan/3 kylin4_build_engine.png differ
diff --git a/website/images/blog/youzan/4 kylin4_query.png b/website/images/blog/youzan/4 kylin4_query.png
new file mode 100644
index 0000000..847e8c0
Binary files /dev/null and b/website/images/blog/youzan/4 kylin4_query.png differ
diff --git a/website/images/blog/youzan/5 cache_calcite_plan.png b/website/images/blog/youzan/5 cache_calcite_plan.png
new file mode 100644
index 0000000..7423a25
Binary files /dev/null and b/website/images/blog/youzan/5 cache_calcite_plan.png differ
diff --git a/website/images/blog/youzan/6 tuning_spark_configuration.png b/website/images/blog/youzan/6 tuning_spark_configuration.png
new file mode 100644
index 0000000..df631bd
Binary files /dev/null and b/website/images/blog/youzan/6 tuning_spark_configuration.png differ
diff --git a/website/images/blog/youzan/7 parquet_optimization.png b/website/images/blog/youzan/7 parquet_optimization.png
new file mode 100644
index 0000000..ad323ed
Binary files /dev/null and b/website/images/blog/youzan/7 parquet_optimization.png differ
diff --git a/website/images/blog/youzan/8 Dynamic_elimination_of_partitioning_dimensions.png b/website/images/blog/youzan/8 Dynamic_elimination_of_partitioning_dimensions.png
new file mode 100644
index 0000000..5d7ba4f
Binary files /dev/null and b/website/images/blog/youzan/8 Dynamic_elimination_of_partitioning_dimensions.png differ
diff --git a/website/images/blog/youzan/9 cache_parent_dataset.png b/website/images/blog/youzan/9 cache_parent_dataset.png
new file mode 100644
index 0000000..e37f2e3
Binary files /dev/null and b/website/images/blog/youzan/9 cache_parent_dataset.png differ
diff --git a/website/images/blog/youzan_cn/1 kylin4_storage.png b/website/images/blog/youzan_cn/1 kylin4_storage.png
new file mode 100644
index 0000000..f78bb80
Binary files /dev/null and b/website/images/blog/youzan_cn/1 kylin4_storage.png differ
diff --git a/website/images/blog/youzan_cn/10 Processing data skew.png b/website/images/blog/youzan_cn/10 Processing data skew.png
new file mode 100644
index 0000000..006805e
Binary files /dev/null and b/website/images/blog/youzan_cn/10 Processing data skew.png differ
diff --git a/website/images/blog/youzan_cn/11 metadata_upgrade.png b/website/images/blog/youzan_cn/11 metadata_upgrade.png
new file mode 100644
index 0000000..1c2064b
Binary files /dev/null and b/website/images/blog/youzan_cn/11 metadata_upgrade.png differ
diff --git a/website/images/blog/youzan_cn/12 commodity_insight.png b/website/images/blog/youzan_cn/12 commodity_insight.png
new file mode 100644
index 0000000..c834b52
Binary files /dev/null and b/website/images/blog/youzan_cn/12 commodity_insight.png differ
diff --git a/website/images/blog/youzan_cn/13 cube_query.png b/website/images/blog/youzan_cn/13 cube_query.png
new file mode 100644
index 0000000..990ed5c
Binary files /dev/null and b/website/images/blog/youzan_cn/13 cube_query.png differ
diff --git a/website/images/blog/youzan_cn/14 youzan_plan.png b/website/images/blog/youzan_cn/14 youzan_plan.png
new file mode 100644
index 0000000..9829b36
Binary files /dev/null and b/website/images/blog/youzan_cn/14 youzan_plan.png differ
diff --git a/website/images/blog/youzan_cn/2 kylin4_build_engine.png b/website/images/blog/youzan_cn/2 kylin4_build_engine.png
new file mode 100644
index 0000000..3115424
Binary files /dev/null and b/website/images/blog/youzan_cn/2 kylin4_build_engine.png differ
diff --git a/website/images/blog/youzan_cn/3 kylin4_query.png b/website/images/blog/youzan_cn/3 kylin4_query.png
new file mode 100644
index 0000000..db1f419
Binary files /dev/null and b/website/images/blog/youzan_cn/3 kylin4_query.png differ
diff --git a/website/images/blog/youzan_cn/4 dynamic_elimination_dimension_partition.png b/website/images/blog/youzan_cn/4 dynamic_elimination_dimension_partition.png
new file mode 100644
index 0000000..cdc3f79
Binary files /dev/null and b/website/images/blog/youzan_cn/4 dynamic_elimination_dimension_partition.png differ
diff --git a/website/images/blog/youzan_cn/5 Partition clipping under complex filter.png b/website/images/blog/youzan_cn/5 Partition clipping under complex filter.png
new file mode 100644
index 0000000..e69017c
Binary files /dev/null and b/website/images/blog/youzan_cn/5 Partition clipping under complex filter.png differ
diff --git a/website/images/blog/youzan_cn/6 tuning_spark_configuration.png b/website/images/blog/youzan_cn/6 tuning_spark_configuration.png
new file mode 100644
index 0000000..9433326
Binary files /dev/null and b/website/images/blog/youzan_cn/6 tuning_spark_configuration.png differ
diff --git a/website/images/blog/youzan_cn/8 small_query_optimization.png b/website/images/blog/youzan_cn/8 small_query_optimization.png
new file mode 100644
index 0000000..ce56af9
Binary files /dev/null and b/website/images/blog/youzan_cn/8 small_query_optimization.png differ
diff --git a/website/images/blog/youzan_cn/9 cache_parent_dataset.png b/website/images/blog/youzan_cn/9 cache_parent_dataset.png
new file mode 100644
index 0000000..e14c657
Binary files /dev/null and b/website/images/blog/youzan_cn/9 cache_parent_dataset.png differ

[kylin] 02/02: Update doc4

Posted by xx...@apache.org.

xxyu pushed a commit to branch document
in repository https://gitbox.apache.org/repos/asf/kylin.git

commit 2153684792feaa1c4794f24cbca7e94f02b0f269
Author: yaqian.zhang <59...@qq.com>
AuthorDate: Thu Jun 17 18:50:05 2021 +0800

    Update doc4
---
 website/_docs40/gettingstarted/quickstart.cn.md    |  26 ++++++++++---------
 website/_docs40/gettingstarted/quickstart.md       |  28 ++++++++++-----------
 .../howto/howto_build_cube_with_restapi.cn.md      |   3 +++
 .../_docs40/howto/howto_build_cube_with_restapi.md |   3 +++
 website/_docs40/howto/howto_config_spark_pool.md   |   2 +-
 .../howto/howto_optimize_build_and_query.cn.md     |   5 +++-
 .../howto/howto_optimize_build_and_query.md        |   5 ++--
 website/_docs40/tutorial/create_cube.cn.md         |  15 +++--------
 website/_docs40/tutorial/kylin_sample.cn.md        |  24 ------------------
 website/_docs40/tutorial/kylin_sample.md           |  24 ------------------
 website/images/docs/quickstart/advance_setting.png | Bin 112356 -> 97288 bytes
 11 files changed, 45 insertions(+), 90 deletions(-)

diff --git a/website/_docs40/gettingstarted/quickstart.cn.md b/website/_docs40/gettingstarted/quickstart.cn.md
index 4ff0585..e840919 100644
--- a/website/_docs40/gettingstarted/quickstart.cn.md
+++ b/website/_docs40/gettingstarted/quickstart.cn.md
@@ -26,19 +26,19 @@ CentOS 6.5+ 或Ubuntu 16.0.4+
 - 软件要求:
   - Hadoop 2.7+,3.0
   - Hive 0.13+,1.2.1+
+  - Spark 2.4.6
   - JDK: 1.8+
 
-建议使用集成的Hadoop环境进行kylin的安装与测试,比如Hortonworks HDP 或Cloudera CDH ,kylin发布前在 Hortonworks HDP 2.2-2.6 and 3.0, Cloudera CDH 5.7-5.11 and 6.0, AWS EMR 5.7-5.10, Azure HDInsight 3.5-3.6上测试通过。 
+建议使用集成的Hadoop环境进行kylin的安装与测试,比如Hortonworks HDP 或Cloudera CDH ,kylin发布前在 Hortonworks HDP 2.4, Cloudera CDH 5.7 and 6.0, AWS EMR 5.31 and 6.0, Azure HDInsight 4.0 上测试通过。 
 
 当你的环境满足上述前置条件时 ,你可以开始安装使用kylin。
 
 #### step1、下载kylin压缩包
 
-从[Apache Kylin Download Site](https://kylin.apache.org/download/)下载一个适用于你的Hadoop版本的二进制文件。目前最新Release版本是kylin 3.1.0和kylin 2.6.6,其中3.0版本支持实时摄入数据进行预计算的功能。以CDH 5.的hadoop环境为例,可以使用如下命令行下载kylin 3.1.0:
-
+从[Apache Kylin Download Site](https://kylin.apache.org/download/)下载 kylin4.0 的二进制文件。
 ```
 cd /usr/local/
-wget http://apache.website-solution.net/kylin/apache-kylin-3.1.0/apache-kylin-3.1.0-bin-cdh57.tar.gz
+wget http://apache.website-solution.net/kylin/apache-kylin-4.0.0/apache-kylin-4.0.0-bin.tar.gz
 ```
 
 #### step2、解压kylin
@@ -46,14 +46,14 @@ wget http://apache.website-solution.net/kylin/apache-kylin-3.1.0/apache-kylin-3.
 解压下载得到的kylin压缩包,并配置环境变量KYLIN_HOME指向解压目录:
 
 ```
-tar -zxvf  apache-kylin-3.1.0-bin-cdh57.tar.gz
-cd apache-kylin-3.1.0-bin-cdh57
+tar -zxvf  apache-kylin-4.0.0-bin.tar.gz
+cd apache-kylin-4.0.0-bin
 export KYLIN_HOME=`pwd`
 ```
 
 #### step3、下载SPARK
 
-由于kylin启动时会对SPARK环境进行检查,所以你需要设置SPARK_HOME指向自己的spark安装路径:
+Kylin4.0 使用 Spark 作为查询和构建引擎,所以你需要设置SPARK_HOME指向自己的spark安装路径:
 
 ```
 export SPARK_HOME=/path/to/spark
@@ -100,7 +100,7 @@ $KYLIN_HOME/bin/kylin.sh start
 
 ```
 A new Kylin instance is started by root. To stop it, run 'kylin.sh stop'
-Check the log at /usr/local/apache-kylin-3.1.0-bin-cdh57/logs/kylin.log
+Check the log at /usr/local/apache-kylin-4.0.0-bin/logs/kylin.log
 Web UI is at http://<hostname>:7070/kylin
 ```
 
@@ -121,9 +121,7 @@ $KYLIN_HOME/bin/sample.sh
 ```
 
 完成后登陆kylin,点击System->Configuration->Reload Metadata来重载元数据
-元数据重载完成后你可以在左上角的Project中看到一个名为learn_kylin的项目,它包含kylin_sales_cube和kylin_streaming_cube, 它们分别为batch cube和streaming cube,你可以直接对kylin_sales_cube进行构建,构建完成后就可以查询。
-
-关于sample cube,可以参考[Sample Cube](/cn/docs/tutorial/kylin_sample.html)。
+元数据重载完成后你可以在左上角的Project中看到一个名为learn_kylin的项目,它包含kylin_sales_cube和kylin_streaming_cube, 它们分别为batch cube和streaming cube,不过 kylin4.0 暂时还不支持 streaming cube,你可以直接对kylin_sales_cube进行构建,构建完成后就可以查询。
 
 当然,你也可以根据下面的教程来尝试创建自己的Cube。
 
@@ -138,6 +136,10 @@ $KYLIN_HOME/bin/sample.sh
 点击Model->Data Source->Load Table From Tree,
 Kylin会读取到Hive数据源中的表并以树状方式显示出来,你可以选择自己要使用的表,然后点击sync进行将其加载到kylin。
 
+此外,Kylin4.0 还支持 CSV 格式文件作为数据源,你也可以点击 Model->Data Source->Load CSV File as Table 来加载 CSV 数据源。
+
+本例中仍然使用 Hive 数据源进行讲解与演示。 
+
 ![](/images/docs/quickstart/load_hive_table.png)
 
 #### step11、创建模型
@@ -178,7 +180,7 @@ Kylin会读取到Hive数据源中的表并以树状方式显示出来,你可
 
 ![](/images/docs/quickstart/cube_add_measure.png)
 
-添加完所有Measure后点击Next进行下一步,这一页是关于Cube数据刷新的设置。在这里可以设施自动合并的阈值(Auto Merge Thresholds)、数据保留的最短时间(Retention Threshold)以及第一个Segment的起点时间。
+添加完所有Measure后点击Next进行下一步,这一页是关于Cube数据刷新的设置。在这里可以设置自动合并的阈值(Auto Merge Thresholds)、数据保留的最短时间(Retention Threshold)以及第一个Segment的起点时间。
 
 ![](/images/docs/quickstart/segment_auto_merge.png)
 
diff --git a/website/_docs40/gettingstarted/quickstart.md b/website/_docs40/gettingstarted/quickstart.md
index a45920d..66d0558 100644
--- a/website/_docs40/gettingstarted/quickstart.md
+++ b/website/_docs40/gettingstarted/quickstart.md
@@ -32,34 +32,32 @@ The Linux account running Kylin must have access to the Hadoop cluster, includin
 
  
 
-(4) Software Requirements: Hadoop 2.7+, 3.0-3.1; Hive 0.13+, 1.2.1+; JDK: 1.8+
+(4) Software Requirements: Hadoop 2.7+, 3.0-3.1; Hive 0.13+, 1.2.1+; Spark 2.4.6; JDK: 1.8+
 
  
 
-It is recommended to use an integrated Hadoop environment for Kylin installation and testing, such as Hortonworks HDP or Cloudera CDH. Before Kylin was released, Hortonworks HDP 2.2-2.6 and 3.0, Cloudera CDH 5.7-5.11 and 6.0, AWS EMR 5.7-5.10, and Azure HDInsight 3.5-3.6 passed the test.
+It is recommended to use an integrated Hadoop environment for Kylin installation and testing, such as Hortonworks HDP or Cloudera CDH. Before release, Kylin was tested on Hortonworks HDP 2.4, Cloudera CDH 5.7 and 6.0, AWS EMR 5.31 and 6.0, and Azure HDInsight 4.0.
 
 #### Install and Use
 When your environment meets the above prerequisites, you can install and start using Kylin.
 
 #### Step1. Download the Kylin Archive
-Download a binary for your version of Hadoop from [Apache Kylin Download Site](https://kylin.apache.org/download/). Currently, the latest versions are Kylin 3.1.0 and Kylin 2.6.6, of which, version 3.0 supports the function of ingesting data in real time for pre-calculation. If your Hadoop environment is CDH 5.7, you can download Kylin 3.1.0 using the following command line:
-
-```
+Download a Kylin 4.0 binary package from the [Apache Kylin Download Site](https://kylin.apache.org/download/).
 cd /usr/local/
-wget http://apache.website-solution.net/kylin/apache-kylin-3.1.0/apache-kylin-3.1.0-bin-cdh57.tar.gz
+wget http://apache.website-solution.net/kylin/apache-kylin-4.0.0/apache-kylin-4.0.0-bin.tar.gz
 ```
 
 #### Step2. Extract Kylin
 Extract the downloaded Kylin archive and configure the environment variable KYLIN_HOME to point to the extracted directory:
 
 ```
-tar -zxvf  apache-kylin-3.1.0-bin-cdh57.tar.gz
-cd apache-kylin-3.1.0-bin-cdh57
+tar -zxvf  apache-kylin-4.0.0-bin.tar.gz
+cd apache-kylin-4.0.0-bin
 export KYLIN_HOME=`pwd`
 ```
 
 #### Step3. Download Spark
-Since Kylin checks the Spark environment when it starts, you need to set SPARK_HOME:
+Kylin 4.0 uses Spark as both its query engine and its build engine, so you need to set SPARK_HOME:
 
 ```
 export SPARK_HOME=/path/to/spark
@@ -101,7 +99,7 @@ Start script to start Kylin. If the startup is successful, the following will be
 
 ```
 A new Kylin instance is started by root. To stop it, run 'kylin.sh stop'
-Check the log at /usr/local/apache-kylin-3.1.0-bin-cdh57/logs/kylin.log
+Check the log at /usr/local/apache-kylin-4.0.0-bin/logs/kylin.log
 Web UI is at http://<hostname>:7070/kylin
 ```
 
@@ -121,11 +119,9 @@ $KYLIN_HOME/bin/sample.sh
 After completing, log in to Kylin, click System -> Configuration -> Reload Metadata to reload the metadata.
 
 After the metadata is reloaded, you can see a project named learn_kylin in Project in the upper left corner. 
-This contains kylin_sales_cube and kylin_streaming_cube, which are a batch cube and a streaming cube, respectively. 
+This contains kylin_sales_cube and kylin_streaming_cube, which are a batch cube and a streaming cube, respectively. However, Kylin 4.0 does not support streaming cubes yet.
 You can build the kylin_sales_cube directly and you can query it after the build is completed. 
 
-For sample cube, you can refer to:[Sample Cube](/docs/tutorial/kylin_sample.html)
-
 Of course, you can also try to create your own cube based on the following tutorial.
 
 #### Step9. Create Project 
@@ -134,13 +130,17 @@ After logging in to Kylin, click the + in the upper left corner to create a Proj
 ![](/images/docs/quickstart/create_project.png)
 
 #### Step10. Load Hive Table
-Click Model -> the Data Source -> the Load the From the Table Tree. 
+Click `Model -> Data Source -> Load Table From Tree`.
 Kylin reads the Hive data source table and displays it in a tree. You can choose the tables you would like to add to models and then click Sync. The selected tables will then be loaded into Kylin.
 
 ![](/images/docs/quickstart/load_hive_table.png)
 
 They then appear in the Tables directory of the data source.
 
+In addition, Kylin 4.0 supports CSV files as a data source. You can click `Model -> Data Source -> Load CSV File as Table` to load a CSV data source.
+
+This example still uses the Hive data source for explanation and demonstration.
+
 #### Step11. Create the Model
 Click Model -> New -> New Model:
 
diff --git a/website/_docs40/howto/howto_build_cube_with_restapi.cn.md b/website/_docs40/howto/howto_build_cube_with_restapi.cn.md
index 3fa3185..898ee95 100644
--- a/website/_docs40/howto/howto_build_cube_with_restapi.cn.md
+++ b/website/_docs40/howto/howto_build_cube_with_restapi.cn.md
@@ -52,3 +52,6 @@ Content-Type: application/json;charset=UTF-8
 
 ### 5.	如果构建任务出现错误,可以重新开始它
 *   `PUT http://localhost:7070/kylin/api/jobs/{job_uuid}/resume`
+
+### 6.  调整某个 cube 中的 cuboid list,触发 optimize segment 任务
+*   `PUT http://localhost:7070/kylin/api/cubes/{cube_name}/optimize2`
diff --git a/website/_docs40/howto/howto_build_cube_with_restapi.md b/website/_docs40/howto/howto_build_cube_with_restapi.md
index a3cd61c..c9b92cf 100644
--- a/website/_docs40/howto/howto_build_cube_with_restapi.md
+++ b/website/_docs40/howto/howto_build_cube_with_restapi.md
@@ -51,3 +51,6 @@ Content-Type: application/json;charset=UTF-8
 
 ### 5.	If the job got errors, you can resume it. 
 *   `PUT http://localhost:7070/kylin/api/jobs/{job_uuid}/resume`
+
+### 6.	Adjust the cuboid list of a cube and trigger an optimize segment job
+*   `PUT http://localhost:7070/kylin/api/cubes/{cube_name}/optimize2`
diff --git a/website/_docs40/howto/howto_config_spark_pool.md b/website/_docs40/howto/howto_config_spark_pool.md
index e21cae7..e87abc3 100644
--- a/website/_docs40/howto/howto_config_spark_pool.md
+++ b/website/_docs40/howto/howto_config_spark_pool.md
@@ -1,6 +1,6 @@
 ---
 layout: docs40
-title:  Config Spark Pool
+title:  Configure different Spark pools for different types of SQL
 categories: howto
 permalink: /docs40/howto/howto_config_spark_pool.html
 ---
diff --git a/website/_docs40/howto/howto_optimize_build_and_query.cn.md b/website/_docs40/howto/howto_optimize_build_and_query.cn.md
index 93a2233..2e11099 100644
--- a/website/_docs40/howto/howto_optimize_build_and_query.cn.md
+++ b/website/_docs40/howto/howto_optimize_build_and_query.cn.md
@@ -13,4 +13,7 @@ Apache kylin4.0 是继 Kylin3.0 之后一个重大的的架构升级版本,cub
 
 同时提供视频讲解:
 [How to optimize build performance in kylin 4.0](https://www.bilibili.com/video/BV1ry4y1z7Nt) 
-[How to optimize query performance in kylin 4.0](https://www.bilibili.com/video/BV18K411G7k3)
\ No newline at end of file
+[How to optimize query performance in kylin 4.0](https://www.bilibili.com/video/BV18K411G7k3)
+
+以及 Kylin4.0 用户有赞的最佳实践博客:
+[有赞为什么选择 kylin4.0](/cn_blog/2021/06/17/Why-did-Youzan-choose-Kylin4/) 
\ No newline at end of file
diff --git a/website/_docs40/howto/howto_optimize_build_and_query.md b/website/_docs40/howto/howto_optimize_build_and_query.md
index 2891dab..bc3147f 100644
--- a/website/_docs40/howto/howto_optimize_build_and_query.md
+++ b/website/_docs40/howto/howto_optimize_build_and_query.md
@@ -11,6 +11,5 @@ Kylin 4 is a major architecture upgrade version, both cube building engine and q
 About the build/query performance tuning of Apache Kylin4.0, Please refer to: 
 [How to improve cube building and query performance of Apache Kylin4.0](https://cwiki.apache.org/confluence/display/KYLIN/How+to+improve+cube+building+and+query+performance).
 
-At the same time, video version explanation is provided:
-[How to optimize build performance in kylin 4.0](https://www.bilibili.com/video/BV1ry4y1z7Nt) 
-[How to optimize query performance in kylin 4.0](https://www.bilibili.com/video/BV18K411G7k3)
\ No newline at end of file
+You can also refer to this optimization practice blog from a Kylin 4.0 user:
+[Why did Youzan choose Kylin4](/blog/2021/06/17/Why-did-Youzan-choose-Kylin4/)
\ No newline at end of file
diff --git a/website/_docs40/tutorial/create_cube.cn.md b/website/_docs40/tutorial/create_cube.cn.md
index c27532b..a64fd3e 100644
--- a/website/_docs40/tutorial/create_cube.cn.md
+++ b/website/_docs40/tutorial/create_cube.cn.md
@@ -46,10 +46,6 @@ since: v0.7.1
 
    ![]( /images/tutorial/1.5/Kylin-Cube-Creation-Tutorial/5 hive-table-info.png)
 
-6. 在后台,Kylin 将会执行 MapReduce 任务计算新同步表的基数(cardinality),任务完成后,刷新页面并点击表名,基数值将会显示在表信息中。
-
-   ![]( /images/tutorial/1.5/Kylin-Cube-Creation-Tutorial/5 hive-table-cardinality.png)
-
 ### III. 新建 Data Model
 创建 cube 前,需定义一个数据模型。数据模型定义了一个星型(star schema)或雪花(snowflake schema)模型。一个模型可以被多个 cube 使用。
 
@@ -121,7 +117,7 @@ cube 名字可以使用字母,数字和下划线(空格不允许)。`Notif
 
    ![]( /images/tutorial/1.5/Kylin-Cube-Creation-Tutorial/8 meas-+meas.png)
 
-2. 根据它的表达式共有7种不同类型的度量:`SUM`、`MAX`、`MIN`、`COUNT`、`COUNT_DISTINCT` `TOP_N`, `EXTENDED_COLUMN` 和 `PERCENTILE`。请合理选择 `COUNT_DISTINCT` 和 `TOP_N` 返回类型,它与 cube 的大小相关。
+2. 根据它的表达式共有7种不同类型的度量:`SUM`、`MAX`、`MIN`、`COUNT`、`COUNT_DISTINCT` `TOP_N` 和 `PERCENTILE`。请合理选择 `COUNT_DISTINCT` 和 `TOP_N` 返回类型,它与 cube 的大小相关。
    * SUM
 
      ![]( /images/tutorial/1.5/Kylin-Cube-Creation-Tutorial/8 measure-sum.png)
@@ -141,7 +137,7 @@ cube 名字可以使用字母,数字和下划线(空格不允许)。`Notif
    * DISTINCT_COUNT
    这个度量有两个实现:
    1)近似实现 HyperLogLog,选择可接受的错误率,低错误率需要更多存储;
-   2)精确实现 bitmap(具体限制请看 https://issues.apache.org/jira/browse/KYLIN-1186)
+   2)精确实现 bitmap(具体实现请看 [Global Dictionary on Kylin 4](https://cwiki.apache.org/confluence/display/KYLIN/Global+Dictionary+on+Spark))
 
      ![]( /images/tutorial/1.5/Kylin-Cube-Creation-Tutorial/8 measure-distinct.png)
    
@@ -155,11 +151,6 @@ cube 名字可以使用字母,数字和下划线(空格不允许)。`Notif
 
      ![]( /images/tutorial/1.5/Kylin-Cube-Creation-Tutorial/8 measure-topn.png)
 
-   * EXTENDED_COLUMN
-   Extended_Column 作为度量比作为维度更节省空间。一列和另一列可以生成新的列。
-   
-     ![]( /images/tutorial/1.5/Kylin-Cube-Creation-Tutorial/8 measure-extended_column.PNG)
-
    * PERCENTILE
    Percentile 代表了百分比。值越大,错误就越少。100为最合适的值。
 
@@ -195,6 +186,8 @@ cube 名字可以使用字母,数字和下划线(空格不允许)。`Notif
 
 你可以拖拽维度列去调整其在 rowkey 中位置; 位于rowkey前面的列,将可以用来大幅缩小查询的范围。通常建议将 mandantory 维度放在开头, 然后是在过滤 ( where 条件)中起到很大作用的维度;如果多个列都会被用于过滤,将高基数的维度(如 user_id)放在低基数的维度(如 age)的前面。
 
+此外,你还可以在这里指定使用某一列作为 shardBy 列,kylin4.0 会根据 shardBy 列对存储文件进行分片,分片能够使查询引擎跳过不必要的文件,提高查询性能,最好选择高基列并且会在多个 cuboid 中出现的列作为 shardBy 列。
+
 `Mandatory Cuboids`: 维度组合白名单。确保你想要构建的 cuboid 能被构建。
 
 `Cube Engine`: cube 构建引擎。Spark构建。
diff --git a/website/_docs40/tutorial/kylin_sample.cn.md b/website/_docs40/tutorial/kylin_sample.cn.md
deleted file mode 100644
index ef0b8b5..0000000
--- a/website/_docs40/tutorial/kylin_sample.cn.md
+++ /dev/null
@@ -1,24 +0,0 @@
----
-layout: docs40-cn
-title:  "样例 Cube 快速入门"
-categories: tutorial
-permalink: /cn/docs40/tutorial/kylin_sample.html
----
-
-Kylin 提供了一个创建样例 Cube 脚本;脚本会创建五个样例 Hive 表:
-
-1. 运行 `${KYLIN_HOME}/bin/sample.sh`;重启 Kylin 服务器刷新缓存;
-2. 用默认的用户名和密码 ADMIN/KYLIN 登陆 Kylin 网站,选择 project 下拉框(左上角)中的 `learn_kylin` 工程;
-3. 选择名为 `kylin_sales_cube` 的样例 Cube,点击 "Actions" -> "Build",选择一个在 2014-01-01 之后的日期(覆盖所有的 10000 样例记录);
-4. 点击 "Monitor" 标签,查看 build 进度直至 100%;
-5. 点击 "Insight" 标签,执行 SQLs,例如:
-
-```
-select part_dt, sum(price) as total_sold, count(distinct seller_id) as sellers from kylin_sales group by part_dt order by part_dt
-```
-
- 6.您可以验证查询结果且与 Hive 的响应时间进行比较;
- 
-## 下一步干什么
-
-您可以通过接下来的教程用同一张表创建另一个 Cube。
diff --git a/website/_docs40/tutorial/kylin_sample.md b/website/_docs40/tutorial/kylin_sample.md
deleted file mode 100644
index 9f7565c..0000000
--- a/website/_docs40/tutorial/kylin_sample.md
+++ /dev/null
@@ -1,24 +0,0 @@
----
-layout: docs40
-title:  Quick Start with Sample Cube
-categories: tutorial
-permalink: /docs40/tutorial/kylin_sample.html
----
-
-Kylin provides a script for you to create a sample Cube; the script will also create five sample Hive tables:
-
-1. Run `${KYLIN_HOME}/bin/sample.sh`; Restart Kylin server to flush the caches;
-2. Logon Kylin web with default user and password ADMIN/KYLIN, select project `learn_kylin` in the project dropdown list (left upper corner);
-3. Select the sample Cube `kylin_sales_cube`, click "Actions" -> "Build", pick up a date later than 2014-01-01 (to cover all 10000 sample records);
-4. Check the build progress in the "Monitor" tab, until 100%;
-5. Execute SQLs in the "Insight" tab, for example:
-
-```
-select part_dt, sum(price) as total_sold, count(distinct seller_id) as sellers from kylin_sales group by part_dt order by part_dt
-```
-
- 6.You can verify the query result and compare the response time with Hive;
- 
-## What's next
-
-You can create another Cube with the sample tables, by following the tutorials.
diff --git a/website/images/docs/quickstart/advance_setting.png b/website/images/docs/quickstart/advance_setting.png
index 265a3be..d21ccc8 100644
Binary files a/website/images/docs/quickstart/advance_setting.png and b/website/images/docs/quickstart/advance_setting.png differ