Posted to dev@kylin.apache.org by Chao Long <wa...@qq.com> on 2018/12/17 05:01:48 UTC

Re: Evaluate Kylin on Parquet

In this PoC, we verified that Kylin on Parquet is viable, but query performance still has room to improve. We can improve it in the following areas:

1, Minimize result set serialization time
Since Kylin needs Object[] data to process, we convert the Dataset to an RDD and then convert each "Row" to Object[], so Spark has to serialize the Object[] before returning it to the driver. This time should be avoided.

2, Query without dictionary
In this PoC, to reduce storage use, we keep the dictionary-encoded values in the Parquet file for dict-encoded dimensions, so Kylin must load the dictionary to convert those values back at query time. If we keep the original values for dict-encoded dimensions, the dictionary is unnecessary. And we don't have to worry about storage use, because Parquet will encode the values itself. We should remove the dictionary from the query path.
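
To make the trade-off concrete, here is a minimal sketch of dictionary encoding in plain Python. It is a toy illustration only, not Kylin's or Parquet's implementation; the column name, the sample values, and the helper functions are all hypothetical.

```python
# Illustrative only: a toy dictionary encoder, not Kylin's or Parquet's code.

def dict_encode(values):
    """Return (dictionary, ids): sorted distinct values and one integer id per row."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]


def dict_decode(dictionary, ids):
    """Map ids back to the original values; this needs the dictionary."""
    return [dictionary[i] for i in ids]


# Hypothetical dimension column.
city = ["Beijing", "Shanghai", "Beijing", "Shenzhen", "Shanghai", "Beijing"]

# PoC layout: the file holds only ids, so the query side must load the
# dictionary before it can return readable values.
dictionary, ids = dict_encode(city)
decoded = dict_decode(dictionary, ids)

# Proposed layout: the file holds the original values. A columnar format can
# still dictionary-encode them internally, so the reader gets values directly
# and no external dictionary has to be loaded at query time.
stored_values = list(city)
```

In the first layout every query pays for the dictionary load and the decode step; in the second, the same compression idea is applied inside the storage format and stays invisible to the query engine.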

3, Remove query single-point issue
In this PoC, we use Spark to read and process Cube data, which is distributed, but Kylin also needs to process the result data that Spark returns in a single JVM. We can try to make that distributed too.
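
One possible direction, sketched here in plain Python as an assumption rather than a description of the PoC code, is two-phase aggregation: each partition is aggregated where it lives, and the single JVM only merges the small partial results. The partitions, keys, and function names are hypothetical.

```python
from collections import Counter

# Illustrative only: simulates executors with a plain loop, no Spark involved.

def partial_sum(partition):
    """Executor side: aggregate one partition into per-key sums."""
    acc = Counter()
    for key, measure in partition:
        acc[key] += measure
    return acc


def merge(partials):
    """Driver side: combine the small per-partition results."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values for matching keys
    return dict(total)


# Hypothetical (dimension, measure) rows spread over two partitions.
partitions = [
    [("A", 1), ("B", 2), ("A", 3)],
    [("B", 4), ("C", 5)],
]
partials = [partial_sum(p) for p in partitions]
result = merge(partials)
```

With this shape the single node handles one entry per distinct key per partition instead of one entry per row.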

4, Upgrade Parquet to 1.11 for page index
In this PoC, Parquet doesn't have a page index, so we get poor filter performance. We need to upgrade Parquet to version 1.11, which has a page index, to improve filter performance.
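
As a rough illustration of why a page index helps filtering, the sketch below keeps min/max statistics per page and skips pages whose range cannot contain the filter value. It is a toy model in plain Python, not Parquet's on-disk format or API; the page size, the column, and the function names are hypothetical.

```python
# Illustrative only: toy pages with min/max statistics, not Parquet's format.

def build_pages(values, page_size):
    """Split a column into pages and record each page's min and max."""
    pages = []
    for start in range(0, len(values), page_size):
        chunk = values[start:start + page_size]
        pages.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return pages


def scan_equals(pages, target):
    """Return (matches, pages_read), skipping pages whose range excludes target."""
    matches, pages_read = [], 0
    for page in pages:
        if target < page["min"] or target > page["max"]:
            continue  # the statistics prove this page holds no match
        pages_read += 1
        matches.extend(v for v in page["values"] if v == target)
    return matches, pages_read


# Hypothetical sorted column; sorting makes the min/max ranges selective.
column = list(range(100))
pages = build_pages(column, page_size=10)
matches, pages_read = scan_equals(pages, 42)
```

Without per-page statistics the scan would have to read all ten pages; with them it reads one. The benefit depends on how well the data is clustered on the filtered column.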

------------------
Best Regards,
Chao Long

------------------ Original Message ------------------
From: "ShaoFeng Shi"<sh...@apache.org>;
Sent: Friday, December 14, 2018, 4:39 PM
To: "dev"<de...@kylin.apache.org>;

Subject: Evaluate Kylin on Parquet

Hello Kylin users,

The first version of the Kylin on Parquet [1] feature has been staged in
the Kylin code repository for public review and evaluation. You can check
out the "kylin-on-parquet" branch [2] to read the code, and can also make
a binary build to run an example. When creating a cube, you can select
"Parquet" as the storage in the "Advanced setting" page. Both MapReduce
and Spark engines support this new storage. A tech blog on the design and
implementation is being drafted.

Many thanks to the engineers for their hard work: Chao Long and Yichen Zhou!

This is not the final version; there is room to improve in many aspects:
Parquet, Spark, and Kylin. It can be used for PoC at this moment. Your
comments are welcome. Let's improve it together.

[1] https://issues.apache.org/jira/browse/KYLIN-3621
[2] https://github.com/apache/kylin/tree/kylin-on-parquet

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Work email: shaofeng.shi@kyligence.io
Kyligence Inc: https://kyligence.io/

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org