Posted to commits@kylin.apache.org by ya...@apache.org on 2021/10/22 04:58:05 UTC
[kylin] branch document updated: Update kylin diagram and add new blog

This is an automated email from the ASF dual-hosted git repository.

yaqian pushed a commit to branch document
in repository https://gitbox.apache.org/repos/asf/kylin.git


The following commit(s) were added to refs/heads/document by this push:
     new 9a814e7  Update kylin diagram and add new blog
9a814e7 is described below

commit 9a814e736165f1d4f839d3bf8f92399cb007ad39
Author: yaqian.zhang <59...@qq.com>
AuthorDate: Thu Oct 21 10:45:33 2021 +0800

    Update kylin diagram and add new blog
---
 ...-Local-Cache-and-Soft-Affinity-Scheduling.cn.md |  63 ++++++++++++++++++
 ...-21-Local-Cache-and-Soft-Affinity-Scheduling.md |  72 +++++++++++++++++++++
 website/assets/images/Kylin_diagram.pptx           | Bin 0 -> 172262 bytes
 website/assets/images/kylin_diagram.png            | Bin 61344 -> 195017 bytes
 .../images/blog/local-cache/Local_cache_stage.png  | Bin 0 -> 103528 bytes
 .../images/blog/local-cache/kylin4_local_cache.png | Bin 0 -> 145137 bytes
 .../local_cache_benchmark_result_ssb.png           | Bin 0 -> 14438 bytes
 .../local_cache_benchmark_result_tpch1.png         | Bin 0 -> 17214 bytes
 .../local_cache_benchmark_result_tpch4.png         | Bin 0 -> 16531 bytes
 9 files changed, 135 insertions(+)

diff --git a/website/_posts/blog/2021-10-21-Local-Cache-and-Soft-Affinity-Scheduling.cn.md b/website/_posts/blog/2021-10-21-Local-Cache-and-Soft-Affinity-Scheduling.cn.md
new file mode 100644
index 0000000..cae83bb
--- /dev/null
+++ b/website/_posts/blog/2021-10-21-Local-Cache-and-Soft-Affinity-Scheduling.cn.md
@@ -0,0 +1,63 @@
+---
+layout: post-blog
+title:  Kylin4 performance optimization in the cloud -- local cache and soft affinity scheduling
+date:   2021-10-21 11:00:00
+author: 张亚倩
+categories: cn_blog
+---
+
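+## 01 Background Introduction
+Recently, the Apache Kylin community released Kylin 4.0 with a brand-new architecture. The Kylin 4.0 architecture separates storage from compute, which lets Kylin users run Kylin 4.0 in a more flexible cloud deployment with elastically scalable compute resources. With cloud infrastructure, users can store cube data in cheap and reliable object storage such as S3. Under a storage-compute separated architecture, however, reading data from remote storage over the network remains a costly operation for compute nodes and often degrades performance.
+To improve the query performance of Kylin 4.0 when cloud object storage is used as storage, we tried introducing a local cache mechanism into the Kylin 4.0 query engine. During query execution, frequently used data is cached on local disk, which reduces the latency of pulling data from remote object storage and yields faster query responses. In addition, to avoid wasting disk space by caching the same data on many Spark executors at once, and to let compute nodes read more of the data they need from the local cache, we introduced a soft affinity scheduling strategy. Soft affinity establishes a mapping between Spark executors and data files so that, in most cases, the same data is always read on the same executor, which raises the cache hit rate.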
+
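+## 02 Implementation Principle
+
+#### 1. Local Cache
+When Kylin 4.0 executes a query, it goes through the following main stages; the stages where a local cache can improve performance are marked with dotted lines: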
+
+![](/images/blog/local-cache/Local_cache_stage.png)
+
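+- File list cache: caches file status on the Spark driver side. When executing a query, the Spark driver needs to read the file list and obtain some file information for subsequent scheduling and execution. The file status information is cached locally to avoid frequent reads of remote file directories.
+- Data cache: caches data on the Spark executor side. Users can choose to cache data in memory or on disk. For in-memory caching, executor memory should be increased appropriately so that the executor has enough memory for the data cache; for disk caching, users need to set the data cache directory, preferably on an SSD. In addition, the maximum cache capacity, the number of replications, and similar settings can all be configured by the user.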
+
+Based on this design, the driver side and the executor side of Sparder, the Kylin 4.0 query engine, maintain different types of cache. The basic architecture is as follows:
+
+![](/images/blog/local-cache/kylin4_local_cache.png)
+
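+#### 2. Soft Affinity Scheduling
+When caching data on the executor side, if every executor cached all of the data, the total cached volume would be very large, wasting a great deal of disk space and making frequent eviction of cached data likely. To maximize the cache hit rate of the Spark executors, the Spark driver should, when resources permit, schedule tasks for the same file to the same executor as far as possible. This ensures that the data of a given file is cached on one specific executor or a few specific executors, so that later reads can be served from the cache.
+To this end, we compute the target executor list by hashing the file name and taking the result modulo the number of executors. The number of executors that cache a file is determined by the user-configured number of cache replications; in general, the more replications, the higher the probability of a cache hit. When none of the target executors is reachable or has resources available for scheduling, the scheduler falls back to Spark's random scheduling mechanism. This approach is called soft affinity scheduling. It cannot guarantee a 100% cache hit rate, but it effectively raises the hit rate and avoids the heavy disk usage of a full cache while sacrificing as little performance as possible.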
+
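+## 03 Related Configuration
+Based on the principles above, we implemented the basic functionality of local cache + soft affinity scheduling in Kylin 4.0 and ran query performance tests on the SSB and TPCH datasets.
+A few of the more important configuration items are listed here for reference; the configuration actually used is given in the link at the end:
+- Whether to enable soft affinity scheduling: kylin.query.spark-conf.spark.kylin.soft-affinity.enabled
+- Whether to enable the local cache: kylin.query.spark-conf.spark.hadoop.spark.kylin.local-cache.enabled
+- Number of data cache replications, that is, how many executors cache the same data file: kylin.query.spark-conf.spark.kylin.soft-affinity.replications.num
+- Whether to cache in memory or in a local directory; set to BUFF for memory and LOCAL for a local directory: kylin.query.spark-conf.spark.hadoop.alluxio.user.client.cache.store.type
+- Maximum cache capacity: kylin.query.spark-conf.spark.hadoop.alluxio.user.client.cache.size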
+
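+## 04 Performance Comparison
+We ran performance tests for three scenarios in an AWS EMR environment. At scale factor = 10, we ran a single-concurrency query test on the SSB dataset, and a single-concurrency query test and a 4-concurrency query test on the TPCH dataset. Both the experimental group and the control group used S3 as storage; local cache and soft affinity scheduling were enabled in the experimental group and disabled in the control group. In addition, we compared the experimental group's results with results obtained in the same environment using HDFS as storage, so that users can see directly how much local cache + soft affinity scheduling improves a cloud deployment of Kylin 4.0 that uses object storage.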
+
+![](/images/blog/local-cache/local_cache_benchmark_result_ssb.png)
+
+![](/images/blog/local-cache/local_cache_benchmark_result_tpch1.png)
+
+![](/images/blog/local-cache/local_cache_benchmark_result_tpch4.png)
+
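+The results above show that:
+1. In the single-concurrency scenario on the SSB 10 dataset with S3 as storage, enabling local cache and soft affinity scheduling yields roughly a 3x performance improvement, matching the performance with HDFS as storage and even improving on it by about 5%.
+2. On the TPCH 10 dataset with S3 as storage, enabling local cache and soft affinity scheduling yields a substantial performance improvement for nearly all queries, in both single-concurrency and multi-concurrency tests.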
+
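+However, for Q21 in the 4-concurrency test on the TPCH 10 dataset, we observed that performance with local cache and soft affinity scheduling enabled was actually worse than with S3 alone. A possible explanation is that, for some reason, the data was not read through the cache; the underlying cause was not analyzed further in this test, and we will improve this step by step in later optimization work. Because TPCH queries are relatively complex and the SQL types vary, some queries are still slightly slower than with HDFS as storage, but overall the results are already close to those with HDFS.
+This performance test is a preliminary validation of the performance gains from local cache + soft affinity scheduling. Overall, local cache + soft affinity scheduling delivers a clear performance improvement for both simple and complex queries, though there is some performance loss under highly concurrent queries.
+Users who use cloud object storage as the storage for Kylin 4.0 can get a good performance experience with local cache + soft affinity scheduling enabled, which provides a performance guarantee for running Kylin 4.0 in the cloud with a compute-storage separated architecture.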
+
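+## 05 Code Implementation
+The current implementation is still at a fairly basic stage, and many details remain to be worked out, such as implementing consistent hashing and deciding how to handle the existing cache when the number of executors changes, so the author has not yet submitted a PR to the community code base. Developers who want an early preview can view the source code through the link below:
+[Kylin4.0 local cache + soft affinity scheduling code implementation](https://github.com/zzcclp/kylin/commit/4e75b7fa4059dd2eaed24061fda7797fecaf2e35)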
+
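+## 06 Related Links
+The performance test result data and the specific configuration are available at the link:
+[Kylin4.0 local cache + soft affinity scheduling benchmark](https://github.com/Kyligence/kylin-tpch/issues/9)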
\ No newline at end of file
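
As an illustration of the configuration items listed in section 03 of the post above, the following Python sketch renders them as `key=value` lines of the kind that would go into a Kylin properties file. The property names are taken from the post; every value (`true`, `2`, `LOCAL`, `20GB`) and the helper name `render_properties` are placeholder assumptions, not the settings used in the benchmark, which are available through the link in section 06.

```python
from typing import Dict

STORE_TYPE_KEY = "kylin.query.spark-conf.spark.hadoop.alluxio.user.client.cache.store.type"

# Property names come from the post; all values are placeholder assumptions.
LOCAL_CACHE_SETTINGS: Dict[str, str] = {
    "kylin.query.spark-conf.spark.kylin.soft-affinity.enabled": "true",
    "kylin.query.spark-conf.spark.hadoop.spark.kylin.local-cache.enabled": "true",
    "kylin.query.spark-conf.spark.kylin.soft-affinity.replications.num": "2",
    STORE_TYPE_KEY: "LOCAL",
    "kylin.query.spark-conf.spark.hadoop.alluxio.user.client.cache.size": "20GB",
}


def render_properties(settings: Dict[str, str]) -> str:
    """Render settings as key=value lines (hypothetical helper)."""
    store_type = settings.get(STORE_TYPE_KEY)
    # The post names BUFF (memory) and LOCAL (local directory) as the two options.
    if store_type not in (None, "BUFF", "LOCAL"):
        raise ValueError(f"unexpected cache store type: {store_type}")
    return "\n".join(f"{key}={value}" for key, value in settings.items())


if __name__ == "__main__":
    print(render_properties(LOCAL_CACHE_SETTINGS))
```

Keeping the settings in one mapping makes it easy to toggle the two `enabled` flags together when comparing an experimental run against a control run, as the benchmark in the post does.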
diff --git a/website/_posts/blog/2021-10-21-Local-Cache-and-Soft-Affinity-Scheduling.md b/website/_posts/blog/2021-10-21-Local-Cache-and-Soft-Affinity-Scheduling.md
new file mode 100644
index 0000000..eb5e62e
--- /dev/null
+++ b/website/_posts/blog/2021-10-21-Local-Cache-and-Soft-Affinity-Scheduling.md
@@ -0,0 +1,72 @@
+---
+layout: post-blog
+title:  Performance optimization of Kylin 4.0 in cloud -- local cache and soft affinity scheduling
+date:   2021-10-21 11:00:00
+author: Yaqian Zhang
+categories: blog
+---
+
+## 01 Background Introduction
+Recently, the Apache Kylin community released Kylin 4.0.0 with a new architecture. The Kylin 4.0 architecture separates storage from computing, which lets Kylin users run Kylin 4.0 in a more flexible cloud deployment with elastically scalable computing resources. With cloud infrastructure, users can store cube data in cheap and reliable object storage such as S3. However, in an architecture that separates storage and computing, we need to consider  [...]
+To improve the query performance of Kylin 4.0 when cloud object storage is used as storage, we tried introducing a local cache mechanism into the Kylin 4.0 query engine. When a query is executed, frequently used data is cached on the local disk to reduce the delay caused by pulling data from remote object storage and to achieve faster query responses. In addition, in order to avoid wasting disk space when the same data is cached on a large number of Spark executors at the [...]
+
+## 02 Implementation Principle
+
+#### 1. Local Cache
+
+When Kylin 4.0 executes a query, it goes through the following main stages. The stages where a local cache can improve performance are marked with dotted lines:
+
+![](/images/blog/local-cache/Local_cache_stage.png)
+
+- File list cache: caches the file status on the Spark driver side. When executing a query, the Spark driver needs to read the file list and obtain some file information for subsequent scheduling and execution. The file status information is cached locally to avoid frequent reads of remote file directories.
+- Data cache: caches the data on the Spark executor side. You can cache data in memory or on disk. If you cache in memory, increase the executor memory appropriately so that the executor has enough memory for the data cache; if you cache on disk, set the data cache directory, preferably on an SSD.
+
+Based on the above design, the driver side and the executor side of the Kylin 4.0 query engine maintain different types of cache. The basic architecture is as follows:
+
+![](/images/blog/local-cache/kylin4_local_cache.png)
+
+#### 2. Soft Affinity Scheduling
+
+When caching data on the executor side, if all data were cached on every executor, the cached data would be very large and waste a great deal of disk space, and cached data would likely be evicted frequently. To maximize the cache hit rate of the Spark executors, the Spark driver needs to schedule the tasks for the same file to the same executor as far as possible when the resource conditions are met, so as to ensure that the data of the same file can be cached on a specif [...]
+To this end, we calculate the target executor list by hashing the file name and taking the result modulo the number of executors. The number of executors that cache a file is determined by the number of data cache replications configured by the user. Generally, the larger the number of cache replications, the higher the probability of hitting the cache. When the target executors are unreachable or have no resources for scheduling, the scheduler will fall back to the random scheduling m [...]
+
+## 03 Related Configuration
+
+Based on the above principles, we implemented the basic functionality of local cache + soft affinity scheduling in Kylin 4.0 and tested query performance on the SSB and TPCH data sets.
+Several important configuration items are listed here for reference. The configuration actually used is given in the link at the end:
+
+- Enable soft affinity scheduling: kylin.query.spark-conf.spark.kylin.soft-affinity.enabled
+- Enable local cache: kylin.query.spark-conf.spark.hadoop.spark.kylin.local-cache.enabled
+- The number of data cache replications, that is, how many executors cache the same data file: kylin.query.spark-conf.spark.kylin.soft-affinity.replications.num
+- Cache to memory or to a local directory; set to BUFF for memory and LOCAL for a local directory: kylin.query.spark-conf.spark.hadoop.alluxio.user.client.cache.store.type
+- Maximum cache capacity: kylin.query.spark-conf.spark.hadoop.alluxio.user.client.cache.size
+
+## 04 Performance Benchmark
+
+We ran performance tests in three scenarios in an AWS EMR environment. With scale factor = 10, we ran a single-concurrency query test on the SSB dataset, and a single-concurrency query test and a 4-concurrency query test on the TPCH dataset. S3 was configured as storage in both the experimental group and the control group. Local cache and soft affinity scheduling were enabled in the experimental group, but not in the control group. In addition, we also compare the results of the experimental group wit [...]
+
+![](/images/blog/local-cache/local_cache_benchmark_result_ssb.png)
+
+![](/images/blog/local-cache/local_cache_benchmark_result_tpch1.png)
+
+![](/images/blog/local-cache/local_cache_benchmark_result_tpch4.png)
+
+As can be seen from the above results:
+
+1. In the single-concurrency scenario on the SSB data set with S3 as storage, enabling local cache and soft affinity scheduling yields about a threefold performance improvement, matching or even exceeding the performance with HDFS as storage.
+2. On the TPCH data set with S3 as storage, enabling local cache and soft affinity scheduling greatly improves the performance of almost all queries, for both single-concurrency and multi-concurrency tests.
+
+However, for Q21 in the 4-concurrency test on the TPCH dataset, we observed that performance with local cache and soft affinity scheduling enabled was lower than with S3 alone as storage. It may be that the data was not read through the cache for some reason. The underlying cause was not analyzed further in this test, and we will improve this gradually in subsequent optimization work. Moreover, because the query of TPCH is complex and the SQL types  [...]
+This performance test is a preliminary verification of the performance gains from local cache + soft affinity scheduling. Overall, local cache + soft affinity scheduling achieves a significant performance improvement for both simple and complex queries, but there is some performance loss under highly concurrent queries.
+Users who use cloud object storage as Kylin 4.0 storage can get a good performance experience when local cache + soft affinity scheduling is enabled, which provides a performance guarantee for running Kylin 4.0 in the cloud with separated computing and storage.
+
+## 05 Code Implementation
+
+The current code implementation is still at a basic stage, and many details remain to be improved, such as implementing consistent hashing and handling the existing cache when the number of executors changes, so the author has not yet submitted a PR to the community code base. Developers who want an early preview can view the source code through the following link:
+
+[The code implementation of local cache and soft affinity scheduling](https://github.com/zzcclp/kylin/commit/4e75b7fa4059dd2eaed24061fda7797fecaf2e35)
+
+## 06 Related Link
+
+You can view the performance test result data and specific configuration through the link:
+[The benchmark of Kylin4.0 with local cache and soft affinity scheduling](https://github.com/Kyligence/kylin-tpch/issues/9)
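
The soft affinity rule from section 02 of the post above (hash the file name, take the result modulo the number of executors, cache on as many executors as the configured replication number, and fall back to default scheduling when no target is usable) can be illustrated with a minimal Python sketch. It is not the Kylin implementation: the function names, the use of MD5 as the hash, and the choice of consecutive executor slots for the additional replications are assumptions made for illustration only.

```python
import hashlib
from typing import List, Optional, Set


def target_executors(file_name: str, executors: List[str], replications: int) -> List[str]:
    """Return the preferred executors for a file (hypothetical helper)."""
    if not executors:
        return []
    # A stable hash maps a given file name to the same slot on every run.
    # MD5 is an arbitrary choice here, not necessarily what Kylin uses.
    digest = hashlib.md5(file_name.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(executors)
    count = min(max(replications, 1), len(executors))
    # Assumption: additional replications take the next slots in the list.
    return [executors[(start + i) % len(executors)] for i in range(count)]


def pick_executor(file_name: str, executors: List[str], replications: int,
                  available: Set[str]) -> Optional[str]:
    """Pick the first preferred executor that currently has free resources.

    Returning None stands for "no preference", in which case the caller
    would leave placement to the scheduler's default behavior.
    """
    for candidate in target_executors(file_name, executors, replications):
        if candidate in available:
            return candidate
    return None
```

With plain modulo hashing, changing the number of executors remaps most files to different executors, which is one reason the post lists consistent hashing among the details still to be improved.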
diff --git a/website/assets/images/Kylin_diagram.pptx b/website/assets/images/Kylin_diagram.pptx
new file mode 100644
index 0000000..b9aad8d
Binary files /dev/null and b/website/assets/images/Kylin_diagram.pptx differ
diff --git a/website/assets/images/kylin_diagram.png b/website/assets/images/kylin_diagram.png
index f484778..07e4f05 100644
Binary files a/website/assets/images/kylin_diagram.png and b/website/assets/images/kylin_diagram.png differ
diff --git a/website/images/blog/local-cache/Local_cache_stage.png b/website/images/blog/local-cache/Local_cache_stage.png
new file mode 100644
index 0000000..b895540
Binary files /dev/null and b/website/images/blog/local-cache/Local_cache_stage.png differ
diff --git a/website/images/blog/local-cache/kylin4_local_cache.png b/website/images/blog/local-cache/kylin4_local_cache.png
new file mode 100644
index 0000000..3dc7fe2
Binary files /dev/null and b/website/images/blog/local-cache/kylin4_local_cache.png differ
diff --git a/website/images/blog/local-cache/local_cache_benchmark_result_ssb.png b/website/images/blog/local-cache/local_cache_benchmark_result_ssb.png
new file mode 100644
index 0000000..4bb861b
Binary files /dev/null and b/website/images/blog/local-cache/local_cache_benchmark_result_ssb.png differ
diff --git a/website/images/blog/local-cache/local_cache_benchmark_result_tpch1.png b/website/images/blog/local-cache/local_cache_benchmark_result_tpch1.png
new file mode 100644
index 0000000..2c71d5c
Binary files /dev/null and b/website/images/blog/local-cache/local_cache_benchmark_result_tpch1.png differ
diff --git a/website/images/blog/local-cache/local_cache_benchmark_result_tpch4.png b/website/images/blog/local-cache/local_cache_benchmark_result_tpch4.png
new file mode 100644
index 0000000..715a287
Binary files /dev/null and b/website/images/blog/local-cache/local_cache_benchmark_result_tpch4.png differ