Posted to commits@kylin.apache.org by sh...@apache.org on 2018/09/21 03:19:53 UTC
[kylin] branch document updated: Add blog for v2.5.0 release

This is an automated email from the ASF dual-hosted git repository.

shaofengshi pushed a commit to branch document
in repository https://gitbox.apache.org/repos/asf/kylin.git


The following commit(s) were added to refs/heads/document by this push:
     new e12c010  Add blog for v2.5.0 release
e12c010 is described below

commit e12c0108bf52758691ea4b7a46d0635a0a61bfd0
Author: shaofengshi <sh...@apache.org>
AuthorDate: Fri Sep 21 11:19:40 2018 +0800

    Add blog for v2.5.0 release
---
 website/_dev/howto_release.md                      |  4 +-
 .../_posts/blog/2018-09-20-release-v2.5.0.cn.md    | 63 ++++++++++++++++++++
 website/_posts/blog/2018-09-20-release-v2.5.0.md   | 68 ++++++++++++++++++++++
 website/download/index.cn.md                       |  2 +-
 website/download/index.md                          |  2 +-
 5 files changed, 135 insertions(+), 4 deletions(-)

diff --git a/website/_dev/howto_release.md b/website/_dev/howto_release.md
index 9821f25..5c0477b 100644
--- a/website/_dev/howto_release.md
+++ b/website/_dev/howto_release.md
@@ -464,12 +464,12 @@ You can download the source release and binary packages from Apache Kylin's down
 
 Apache Kylin is an open source Distributed Analytics Engine designed to provide SQL interface and multi-dimensional analysis (OLAP) on Apache Hadoop, supporting extremely large datasets.
 
-Apache Kylin lets you query massive data set at sub-second latency in 3 steps:
+Apache Kylin lets you query massive datasets at sub-second latency in 3 steps:
 1. Identify a star schema or snowflake schema data set on Hadoop.
 2. Build Cube on Hadoop.
 3. Query data with ANSI-SQL and get results in sub-second, via ODBC, JDBC or RESTful API.
 
-Thanks everyone who have contributed to the 2.5.0 release.
+Thanks to everyone who has contributed to the 2.5.0 release.
 
 We welcome your help and feedback. For more information on how to
 report problems, and to get involved, visit the project website at
diff --git a/website/_posts/blog/2018-09-20-release-v2.5.0.cn.md b/website/_posts/blog/2018-09-20-release-v2.5.0.cn.md
new file mode 100644
index 0000000..ff3cb33
--- /dev/null
+++ b/website/_posts/blog/2018-09-20-release-v2.5.0.cn.md
@@ -0,0 +1,63 @@
+---
+layout: post-blog
+title:  Apache Kylin v2.5.0 Officially Released
+date:   2018-09-20 20:00:00
+author: Shaofeng Shi
+categories: blog
+---
+
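+The Apache Kylin community is pleased to announce the official release of Apache Kylin 2.5.0.
+
+Apache Kylin is an open source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on extremely large datasets.
+
+This is a new feature release following 2.4.0. It introduces many valuable improvements; for the complete list of changes, see the [release notes](https://kylin.apache.org/docs/release_notes.html). Some of the major improvements are described here: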
+
+
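+### All-in-Spark cubing engine
+Kylin's Spark engine now uses Spark to run all distributed jobs in cube computation, including fetching the distinct values of each dimension, converting cuboid files to HBase HFiles, merging segments, and merging dictionaries. The default Spark configuration has also been tuned so that users get an out-of-the-box experience. The related development tasks are KYLIN-3427, KYLIN-3441, and KYLIN-3442.
+
+Spark job management has also improved: once a Spark job starts running, you can get the job link on the web console; if you discard the job, Kylin immediately terminates the Spark job to release resources promptly; if Kylin is restarted, it can resume from the previous job instead of resubmitting a new one.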
+
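+### MySQL as the Kylin metadata store
+In the past, HBase was the only option for Kylin metadata storage. In some cases HBase is not suitable, for example when multiple HBase clusters are used to give Kylin cross-region high availability: the replicated HBase cluster is read-only, so it cannot serve as the metadata store. We have now introduced the MySQL Metastore to meet this need. This feature is currently in beta. See KYLIN-3488 for more.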
+
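+### Hybrid model GUI
+Hybrid is an advanced model for assembling multiple cubes. It can be used when a cube's schema needs to change. This feature had no GUI in the past, so only a small portion of users knew about it. We have now enabled it in the web UI so that more users can try it.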
+
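+### Cube planner enabled by default
+The Cube planner can greatly optimize the cube structure and reduce the number of cuboids built, saving computing/storage resources and improving query performance. It was introduced in v2.3 but was not enabled by default. To let more users see and try it, we enable it by default in v2.5. When a segment is built for the first time, the algorithm automatically optimizes the cuboid set based on data statistics.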
+
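+### Improved segment pruning
+Segment (partition) pruning can effectively reduce disk and network I/O and thus greatly improve query performance. In the past, Kylin pruned segments only by the value of the partition column (partition date column). If a query did not use the partition column as a filter condition, pruning had no effect and all segments were scanned.
+Starting with v2.5, Kylin records the min/max value of every dimension at the segment level. Before scanning a segment, it compares the query's conditions with the min/max index. If they do not match, the segment is skipped. See KYLIN-3370 for more.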
+
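+### Merging dictionaries on YARN
+When segments are merged, their dictionaries also need to be merged. In the past, dictionary merging happened in Kylin's JVM, which required a lot of local memory and CPU resources. In extreme cases (with several concurrent jobs), it could crash the Kylin process. As a result, some users had to allocate more memory to the Kylin job node or run multiple job nodes to balance the workload.
+Starting with v2.5, Kylin submits this task to Hadoop MapReduce or Spark, which removes this bottleneck. See KYLIN-3471 for more.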
+
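+### Improved cube build performance with the Global Dictionary
+The Global Dictionary (GD) is required for bitmap-based exact count distinct. If the count-distinct column has very high cardinality, the GD can be very large. During the cube build phase, Kylin needs to translate non-integer values to integers through the GD. Although the GD is split into multiple slices that can be loaded into memory separately, the values of the count-distinct column are unordered, so Kylin has to swap slices in and out repeatedly, which makes the build job very slow.
+This enhancement introduces a new step that builds a shrunken dictionary from the Global Dictionary for each data block. Each task then only needs to load the shrunken dictionary, avoiding frequent swapping in and out. Performance can be 3x faster than before. See KYLIN-3491 for more.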
+
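+### Improved cube size estimation for TOPN and COUNT DISTINCT
+The cube size is estimated in advance at build time and used by several later steps, such as deciding the number of partitions for the MR/Spark job and calculating the HBase region splits. Its accuracy has a large impact on build performance. When there are COUNT DISTINCT or TOPN measures, the estimate can deviate substantially from the real value because their sizes are flexible. In the past, users had to tune several parameters to bring the size estimate closer to the actual size, which was difficult for ordinary users.
+Kylin now adjusts the size estimate automatically based on the collected statistics, which brings the estimate closer to the actual size. See KYLIN-3453 for more.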
+
+### Hadoop 3.0/HBase 2.0 support
+Hadoop 3 and HBase 2 are starting to be adopted by many users. Kylin now provides new binary packages compiled with the new Hadoop and HBase APIs. We have tested them on Hortonworks HDP 3.0 and Cloudera CDH 6.0.
+
+
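+__Download__
+
+To download the Apache Kylin v2.5.0 source code or binary packages, visit the [download page](http://kylin.apache.org/download).
+
+__Upgrade__
+ 
+See the [upgrade guide](/docs/howto/howto_upgrade.html).
+
+__Feedback__
+
+If you run into problems or have questions, please send an email to the Apache Kylin dev or user mailing list: dev@kylin.apache.org, user@kylin.apache.org. Before sending, please make sure you have subscribed to the mailing list by emailing dev-subscribe@kylin.apache.org or user-subscribe@kylin.apache.org.
+
+
+_Many thanks to everyone who has contributed to Apache Kylin!_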
diff --git a/website/_posts/blog/2018-09-20-release-v2.5.0.md b/website/_posts/blog/2018-09-20-release-v2.5.0.md
new file mode 100644
index 0000000..a8a18cb
--- /dev/null
+++ b/website/_posts/blog/2018-09-20-release-v2.5.0.md
@@ -0,0 +1,68 @@
+---
+layout: post-blog
+title:  Apache Kylin v2.5.0 Release Announcement
+date:   2018-09-20 20:00:00
+author: Shaofeng Shi
+categories: blog
+---
+
+The Apache Kylin community is pleased to announce the release of Apache Kylin v2.5.0.
+
+Apache Kylin is an open source Distributed Analytics Engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on big data, supporting extremely large datasets.
+
+This is a major release after 2.4.0, including many enhancements. All of the changes can be found in the [release notes](https://kylin.apache.org/docs/release_notes.html). This post highlights the major ones:
+
+### The all-in-Spark cubing engine
+Kylin's Spark engine now runs all distributed jobs in Spark, including fetching distinct dimension values, converting cuboid files to HBase HFiles, merging segments, and merging dictionaries. The default configurations are tuned so that users get an out-of-the-box experience. Overall performance is close to that of the previous version, but we expect Spark to have more room for improvement. The related tasks are KYLIN-3427, KYLIN-3441, and KYLIN-3442.
+
+There are also improvements in job management. You can now get the job link on the web console once the Spark job starts to run. If you discard the job, Kylin kills the Spark job to release resources promptly. If Kylin is restarted, it can resume from the previous job instead of resubmitting a new job.
+### MySQL as Kylin metastore
+In the past, HBase was the only option for Kylin metadata. In some cases HBase is not suitable, for example when using a replicated HBase cluster for Kylin's HA (the replicated HBase is read-only). We now introduce the MySQL metastore to meet this need. This feature is currently in beta. Check KYLIN-3488 for more.
+
+### Hybrid model web GUI
+Hybrid is an advanced model for combining multiple Cubes. It can be used when a Cube's schema needs to change. This feature had no GUI in the past, so only a small portion of Kylin users knew about it. We have now added a web GUI for it so everyone can try it.
+
+### Enable Cube planner by default
+The Cube planner can greatly optimize the cube structure, save computing/storage resources, and improve query performance. It was introduced in v2.3 but was disabled by default. To let more users see and try it, we enable it by default in v2.5. The algorithm automatically optimizes the cube based on your data statistics on the first build.
+
+### Advanced segment pruning
+Segment (partition) pruning can efficiently reduce disk and network I/O and thus greatly improve query performance. In the past, Kylin pruned segments only by the partition column's value. If a query did not use the partition column as a filter condition, pruning did not apply and all segments were scanned.
+
+Starting with v2.5, Kylin records the min/max value of every dimension at the segment level. Before scanning a segment, it compares the query's conditions with the min/max index. If they do not match, the segment is skipped. Check KYLIN-3370 for more.
+
+### Merge dictionary on YARN
+
+When segments are merged, their dictionaries also need to be merged. In the past, the merging happened in Kylin's JVM, which took a lot of memory and CPU resources. In extreme cases (for example, with several concurrent jobs), it could crash the Kylin process. Because of this, some users had to allocate much more memory to the Kylin job node or run multiple job nodes to balance the workload.
+
+Starting with v2.5, Kylin submits this task to Hadoop MR or Spark, which removes this bottleneck. Check KYLIN-3471 for more.
+
+### Improve building performance for reading Global Dictionary
+
+The Global Dictionary (GD) is required for bitmap count distinct. The GD can be very large if the column has very high cardinality. In the cube building phase, Kylin needs to translate non-integer values to integers through the GD. Although the GD is split into several slices, the column values usually arrive unordered, so Kylin has to swap slices in and out of memory repeatedly, which makes the build slow.
+
+The enhancement introduces a new step to build a shrunken dictionary for each data block. Each task then loads only the shrunken dictionary, which is quite small, so no swapping in and out is needed in the cubing step. Performance can be 3x faster than before. Check KYLIN-3491 for more.
+
+### Improved cube size estimation for TOPN, COUNT DISTINCT
+
+Cube size estimation is used in several steps, such as deciding the number of partitions for the MR/Spark job and calculating the number of HBase regions, so it strongly affects build performance. The estimate can be far off when there are COUNT DISTINCT or TOPN measures, because their size is flexible. An incorrect estimate may produce too many data partitions and therefore too many tasks. In the past, users had to tune several parameters to bring the size estimate closer to the real size, which was hard to do.
+
+Kylin now corrects the size estimate automatically based on the collected data statistics. This makes the estimate much closer to the real size than before. Check KYLIN-3453 for more.
+
+### Hadoop 3.0/HBase 2.0 support
+
+Hadoop 3 and HBase 2 are starting to be adopted by many users. We now provide new binary packages compiled with the new Hadoop and HBase APIs. We tested them on Hortonworks HDP 3.0 and Cloudera CDH 6.0.
+
+
+__Download__
+
+To download Apache Kylin v2.5.0 source code or binary package, visit the [download](http://kylin.apache.org/download) page.
+
+__Upgrade__
+ 
+Follow the [upgrade guide](/docs/howto/howto_upgrade.html).
+
+__Feedback__
+
+If you run into an issue or have a question, please send an email to the Apache Kylin dev or user mailing list: dev@kylin.apache.org, user@kylin.apache.org. Before sending, please make sure you have subscribed to the mailing list by sending an email to dev-subscribe@kylin.apache.org or user-subscribe@kylin.apache.org.
+
+_Great thanks to everyone who contributed!_
diff --git a/website/download/index.cn.md b/website/download/index.cn.md
index 8db4150..20b77e9 100644
--- a/website/download/index.cn.md
+++ b/website/download/index.cn.md
@@ -6,7 +6,7 @@ title: Download
 You can verify the validity of the downloaded files by following these [steps](https://www.apache.org/info/verification.html) and using these [KEYS](https://www.apache.org/dist/kylin/KEYS).
 
 #### v2.5.0
-- This is a major release after 2.4, with 96 bug fixes and various improvements. For details, please see the release notes. 
+- This is a major release after 2.4, with 96 bug fixes and various improvements. For details, please see the [v2.5.0 release announcement](/blog/2018/09/20/release-v2.5.0/). 
 - [Release notes](/docs/release_notes.html) and [upgrade guide](/docs/howto/howto_upgrade.html)
 - Source download: [apache-kylin-2.5.0-source-release.zip](https://www.apache.org/dyn/closer.cgi/kylin/apache-kylin-2.5.0/apache-kylin-2.5.0-source-release.zip) \[[asc](https://www.apache.org/dist/kylin/apache-kylin-2.5.0/apache-kylin-2.5.0-source-release.zip.asc)\] \[[sha256](https://www.apache.org/dist/kylin/apache-kylin-2.5.0/apache-kylin-2.5.0-source-release.zip.sha256)\]
 - Binary download:
diff --git a/website/download/index.md b/website/download/index.md
index dcac201..fc4cc5f 100644
--- a/website/download/index.md
+++ b/website/download/index.md
@@ -7,7 +7,7 @@ permalink: /download/index.html
 You can verify the download by following these [procedures](https://www.apache.org/info/verification.html) and using these [KEYS](https://www.apache.org/dist/kylin/KEYS).
 
 #### v2.5.0
-- This is a major release after 2.4, with 96 bug fixes and enhancement. For the detail list please check release notes. 
+- This is a major release after 2.4, with 96 bug fixes and enhancements. Check the [v2.5.0 release announcement](/blog/2018/09/20/release-v2.5.0/) and the release notes. 
 - [Release notes](/docs/release_notes.html) and [upgrade guide](/docs/howto/howto_upgrade.html)
 - Source download: [apache-kylin-2.5.0-source-release.zip](https://www.apache.org/dyn/closer.cgi/kylin/apache-kylin-2.5.0/apache-kylin-2.5.0-source-release.zip) \[[asc](https://www.apache.org/dist/kylin/apache-kylin-2.5.0/apache-kylin-2.5.0-source-release.zip.asc)\] \[[sha256](https://www.apache.org/dist/kylin/apache-kylin-2.5.0/apache-kylin-2.5.0-source-release.zip.sha256)\]
 - Binary download: