Posted to user@kylin.apache.org by Chao Long <wa...@qq.com> on 2018/12/17 04:59:01 UTC

Re: Evaluate Kylin on Parquet

In this PoC, we verified that Kylin on Parquet is viable, but the query performance still has room to improve. We can improve it in the following aspects:


 1. Minimize result set serialization time
 Since Kylin needs Object[] data for processing, we convert the Dataset to an RDD and then convert each "Row" to Object[], so Spark has to serialize the Object[] records before returning them to the driver. This time should be avoided.
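
 For illustration, a minimal Scala sketch of the conversion path described above (the helper name is hypothetical; only the Dataset-to-RDD step and the Row-to-Object[] mapping come from the PoC description):

    import org.apache.spark.sql.{DataFrame, Row}

    // Each Row is mapped to an Object[] on the executors; collect() then
    // forces Spark to serialize every Object[] and ship it to the driver,
    // which is exactly the cost described above.
    def collectAsObjects(df: DataFrame): Array[Array[AnyRef]] =
      df.rdd
        .map((row: Row) => row.toSeq.map(_.asInstanceOf[AnyRef]).toArray)
        .collect()

 Processing rows where they live on the executors, or returning a more compact representation, would remove this serialization cost.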


 2. Query without dictionary
 In this PoC, to save storage, we keep the dictionary-encoded values in the Parquet files for dict-encoded dimensions, so Kylin must load the dictionary to decode those values at query time. If we kept the original values for dict-encoded dimensions instead, the dictionary would be unnecessary, and we don't have to worry about storage use, because Parquet will encode the values itself. We should remove the dictionary from the query path.
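
 To make the trade-off concrete, a tiny hypothetical sketch (real Kylin dictionary classes differ): when dictionary ids are stored, every query must carry a decode step such as

    // Query-side decode forced by storing dictionary ids in Parquet;
    // the whole dictionary must sit in memory on the query node to run it.
    def decodeDims(dict: Map[Int, String], ids: Array[Int]): Array[String] =
      ids.map(dict)

 whereas storing the original values lets Parquet apply its own page-level dictionary encoding transparently, with no query-side lookup and no dictionary held by Kylin.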


 3. Remove the query single-point issue
 In this PoC, we use Spark to read and process cube data, which is distributed, but Kylin still needs to process the result data Spark returns in a single JVM. We can try to make that step distributed too.


 4. Upgrade Parquet to 1.11 for page indexes
 In this PoC, Parquet doesn't have page indexes, so we get poor filter performance. We need to upgrade Parquet to version 1.11, which adds page indexes, to improve filter performance.
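
 As a hedged sketch of what the upgrade enables (the property names are from parquet-mr and should be verified against the actual 1.11 release):

    import org.apache.hadoop.conf.Configuration

    val conf = new Configuration()
    // Row-group level min/max filtering (available before 1.11).
    conf.setBoolean("parquet.filter.stats.enabled", true)
    // Page-level filtering via the column indexes added in parquet-mr 1.11.
    conf.setBoolean("parquet.filter.columnindex.enabled", true)

 With a predicate pushed down, the reader can then skip whole pages whose min/max range cannot match the filter.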



------------------
Best Regards,
Chao Long


 
------------------ Original Message ------------------
From: "ShaoFeng Shi"<sh...@apache.org>;
Sent: Friday, December 14, 2018, 4:39 PM
To: "dev"<de...@kylin.apache.org>;

Subject: Evaluate Kylin on Parquet



Hello Kylin users,


The first version of the Kylin on Parquet [1] feature has been staged in the Kylin code repository for public review and evaluation. You can check out the "kylin-on-parquet" branch [2] to read the code, and can also make a binary build to run an example. When creating a cube, you can select "Parquet" as the storage in the "Advanced Setting" page. Both the MapReduce and Spark engines support this new storage. A tech blog on the design and implementation is being drafted.



Thanks so much for the hard work of the engineers Chao Long and Yichen Zhou!


This is not the final version; there is room for improvement in many aspects: Parquet, Spark, and Kylin. It can be used for PoC at this moment. Your comments are welcome. Let's improve it together.


[1] https://issues.apache.org/jira/browse/KYLIN-3621
[2] https://github.com/apache/kylin/tree/kylin-on-parquet

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Work email: shaofeng.shi@kyligence.io

Kyligence Inc: https://kyligence.io/


Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org

Re: Re: Evaluate Kylin on Parquet

Posted by ShaoFeng Shi <sh...@apache.org>.
Hi Yang,

The real-time streaming feature is also under review and testing now. I
think when both (the new storage and real-time) are ready to be merged, we
can propose bumping the version to 3.0.

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Work email: shaofeng.shi@kyligence.io
Kyligence Inc: https://kyligence.io/

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org




Li Yang <li...@apache.org> wrote on Tue, Jan 1, 2019, 12:40 PM:

> From the discussion, apparently a new storage will be added sooner or later.
>
> Will it be a new big version of Kylin? Like Apache Kylin 3.0? Also how
> about the migration from old storage? I assume old cube data has to be
> transformed and loaded into the new storage.
>
> Yang
>
> On Sat, Dec 29, 2018 at 5:52 PM ShaoFeng Shi <sh...@apache.org>
> wrote:
>
>> Thanks very much for Yiming and Jiatao's comments; they're very valuable.
>> There are many improvements we can make to this new storage. We welcome all
>> kinds of contributions and would like to improve it together with the
>> community in 2019!
>>
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>> Apache Kylin PMC
>> Work email: shaofeng.shi@kyligence.io
>> Kyligence Inc: https://kyligence.io/
>>
>> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
>> Join Kylin user mail group: user-subscribe@kylin.apache.org
>> Join Kylin dev mail group: dev-subscribe@kylin.apache.org
>>
>>
>>
>>
>> JiaTao Tao <ta...@gmail.com> wrote on Wed, Dec 19, 2018, 8:44 PM:
>>
>> > Hi all,
>> >
>> > I truly agree with Yiming, and here I'll expand a little more on
>> > "Distributed computing".
>> >
>> > As Yiming mentioned, Kylin parses the query into an execution plan using
>> > Calcite (Kylin changes the execution plan because the data in cubes is
>> > already aggregated, so we cannot use the original plan directly). It's a
>> > tree structure: each node represents a specific calculation, and data
>> > goes from bottom to top, applying all these calculations.
>> > [image: image.png]
>> > (Pic from https://blog.csdn.net/yu616568/article/details/50838504, a
>> > really good blog.)
>> >
>> > At present, Kylin does almost all these calculations in its own node; in
>> > other words, we cannot fully use the power of the cluster, and it's a
>> > SPOF. Hence this design: we can visit this tree *and transform each node
>> > into operations on Spark's DataFrames (i.e. "DF")*.
>> >
>> > More specifically, we visit the nodes recursively until we meet the
>> > "TableScan" node (like pushing onto a stack). E.g. in the above diagram,
>> > the first node we meet is a "Sort" node; we just visit its child(ren),
>> > and we keep visiting each node's child(ren) until we meet a "TableScan"
>> > node.
>> >
>> > In the "TableScan" node, we generate the initial DF; the DF is then
>> > popped back up to the "Filter" node, which applies its own operation,
>> > like "df.filter(xxx)". Finally, we apply each node's operation to this
>> > DF, and the final call chain will look like:
>> > "df.filter(xxx).select(xxx).agg(xxx).sort(xxx)". A sketch of this
>> > visitor follows below.
>> >
>> > Once we have the final DataFrame and trigger the calculation, all the
>> > rest is handled by Spark, and we can gain tremendous benefits at the
>> > computation level; more details can be seen in my previous post:
>> > http://apache-kylin.74782.x6.nabble.com/Re-DISCUSS-Columnar-storage-engine-for-Apache-Kylin-tc12113.html
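
A minimal, hypothetical Scala sketch of the visitor described above (the node classes are simplified stand-ins; the real Calcite/Kylin rel node types differ):

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Simplified stand-in for a Calcite relational operator tree.
    sealed trait PlanNode
    case class TableScan(path: String) extends PlanNode
    case class Filter(cond: String, child: PlanNode) extends PlanNode
    case class Project(cols: Seq[String], child: PlanNode) extends PlanNode
    case class Sort(col: String, child: PlanNode) extends PlanNode

    // Recurse down to TableScan (the "push"), then apply each node's
    // operation on the way back up (the "pop"), building one DataFrame.
    def toDF(node: PlanNode)(implicit spark: SparkSession): DataFrame =
      node match {
        case TableScan(path)  => spark.read.parquet(path) // initial DF
        case Filter(cond, c)  => toDF(c).filter(cond)
        case Project(cols, c) => toDF(c).selectExpr(cols: _*)
        case Sort(col, c)     => toDF(c).sort(col)
      }

Triggering an action on the returned DataFrame leaves all the distributed execution to Spark, as described above.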
>> >
>> >
>> > --
>> >
>> >
>> > Regards!
>> >
>> > Aron Tao
>> >
>> >
>> > 许益铭 <x1...@gmail.com> wrote on Wed, Dec 19, 2018, 11:40 AM:
>> >
>> >> Hi all!
>> >>
>> >> Regarding the issues Chao Long raised, I have the following views:
>> >>
>> >> 1. At present, our architecture is divided into two layers: a storage
>> >> layer and a computing layer. In the storage layer we have already made
>> >> some optimizations, doing pre-aggregation to reduce the amount of data
>> >> returned. However, runtime aggregation and joins happen on the Kylin
>> >> server side, so serialization is unavoidable, and this architecture
>> >> easily becomes a single-point bottleneck: if the runtime agg or join
>> >> data volume is large, query performance drops sharply and the Kylin
>> >> server suffers heavy GC.
>> >>
>> >> 2. As for the dictionary: it was originally designed to align rowkeys
>> >> in HBase and also to reduce some storage. But it introduces another
>> >> problem: HBase has difficulty handling variable-length string
>> >> dimensions. For a high-cardinality variable-length dimension, you can
>> >> only build a very large dictionary or set a fairly large fixed length,
>> >> which doubles the storage, and because the dictionary is large, query
>> >> performance suffers greatly (GC). If we use columnar storage, we don't
>> >> need to worry about this.
>> >>
>> >> 3. To use Parquet's page index, we must convert TupleFilter into
>> >> Parquet's filter, which is no small amount of work. Moreover, our data
>> >> is all encoded, and Parquet's page index only filters by the min/max on
>> >> each page, so binary data cannot be filtered at all.
>> >>
>> >> I think using Spark as our computation engine can solve all of the
>> >> above problems:
>> >>
>> >> 1. Distributed computing
>> >> After SQL is parsed and optimized by Calcite, it becomes a tree of OLAP
>> >> rels; Spark's Catalyst likewise parses SQL into a tree and
>> >> automatically optimizes it into a DataFrame for computation. If
>> >> Calcite's plan can be converted into a Spark plan, we achieve
>> >> distributed computing: Calcite is only responsible for parsing SQL and
>> >> returning the result set, reducing the pressure on the Kylin server
>> >> side.
>> >>
>> >> 2. Remove the dictionary
>> >> A dictionary is good at reducing storage for low- and
>> >> medium-cardinality columns, but the drawback is that the data files
>> >> cannot be used independently of the dictionary. I suggest we skip
>> >> dictionary encoding at first to keep the system as simple as possible;
>> >> Parquet's page-level dictionary is enough by default.
>> >>
>> >> 3. Store columns in Parquet with their real types instead of binary
>> >> As above, Parquet's filtering on binary is extremely weak, while
>> >> primitive types can use Spark's vectorized read directly, speeding up
>> >> both data reading and computation. (A sketch of the relevant Spark
>> >> settings follows below.)
>> >>
>> >> 4. Use Spark to integrate with Parquet
>> >> Spark is already adapted to Parquet: Spark's pushed filters are
>> >> converted into filters Parquet can use. Here we only need to upgrade
>> >> the Parquet version and make minor changes to get Parquet's page index
>> >> capability.
>> >>
>> >> 5. Index server
>> >> As JiaTao Tao said, indexes divide into file indexes and page indexes;
>> >> dictionary-based filtering is just one kind of file index, so we could
>> >> plug an index server in here.
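
For point 3 above, a hedged sketch of the Spark settings involved (these configuration keys exist in Spark 2.x; both may already default to true in the version actually used):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("kylin-on-parquet-poc")
      // Vectorized Parquet reader: applies only to primitive column types,
      // which is why storing real types instead of binary matters.
      .config("spark.sql.parquet.enableVectorizedReader", "true")
      // Push filters down to Parquet so row groups (and, with page
      // indexes, individual pages) can be skipped.
      .config("spark.sql.parquet.filterPushdown", "true")
      .getOrCreate()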
>> >>
>> >> JiaTao Tao <ta...@gmail.com> wrote on Wed, Dec 19, 2018, 4:45 PM:
>> >>
>> >> > Hi Gang,
>> >> >
>> >> > In my opinion, segment/partition pruning is actually in the scope of
>> >> > an "index system": we can have an index system at the storage level,
>> >> > including a file index (for segment/partition pruning), a page index
>> >> > (for page pruning), etc. We can put all this stuff in such a system
>> >> > and make the separation of duties cleaner.
>> >> >
>> >> >
>> >> > Ma Gang <mg...@163.com> wrote on Wed, Dec 19, 2018, 6:31 AM:
>> >> >
>> >> > > Awesome! Looking forward to the improvement. As for the dictionary:
>> >> > > keeping the dictionary in the query engine is usually not good,
>> >> > > since it puts a lot of pressure on the Kylin server, but sometimes
>> >> > > it has benefits. For example, some segments can be pruned very
>> >> > > early when a filter value is not in the dictionary (a sketch of
>> >> > > this follows below), and some queries can be answered directly from
>> >> > > the dictionary, as described in:
>> >> > > https://issues.apache.org/jira/browse/KYLIN-3490
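
A tiny hypothetical sketch of that early-pruning idea (real Kylin segment and dictionary types differ):

    // If an equality-filter literal was never added to the segment's
    // dictionary when the segment was built, the segment cannot contain
    // matching rows and can be skipped without reading any data.
    def canPruneSegment(segmentDict: Set[String], filterValue: String): Boolean =
      !segmentDict.contains(filterValue)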
>> >> > >
>> >> > > At 2018-12-17 15:36:01, "ShaoFeng Shi" <sh...@apache.org> wrote:
>> >> > >
>> >> > > The dimension dictionary is a legacy design for HBase storage, I
>> >> > > think; because HBase has no data types (everything is a byte
>> >> > > array), Kylin has to encode STRING and other types with some
>> >> > > encoding method like the dictionary.
>> >> > >
>> >> > > Now with a storage like Parquet, the storage itself decides how to
>> >> > > encode the data at the page or block level. Then we can drop the
>> >> > > dictionary after the cube is built. This will relieve the memory
>> >> > > pressure on Kylin query nodes and also benefit the UHC
>> >> > > (ultra-high-cardinality) case.
>> >> > >
>> >> > > Best regards,
>> >> > >
>> >> > > Shaofeng Shi 史少锋
>> >> > > Apache Kylin PMC
>> >> > > Work email: shaofeng.shi@kyligence.io
>> >> > > Kyligence Inc: https://kyligence.io/
>> >> > >
>> >> > > Apache Kylin FAQ:
>> >> https://kylin.apache.org/docs/gettingstarted/faq.html
>> >> > > Join Kylin user mail group: user-subscribe@kylin.apache.org
>> >> > > Join Kylin dev mail group: dev-subscribe@kylin.apache.org
>> >> > >
>> >> > >
>> >> > >
>> >> > >
>> >> > > Chao Long <wa...@qq.com> wrote on Mon, Dec 17, 2018, 1:23 PM:
>> >> > >
>> >> > >>  In this PoC, we verified that Kylin on Parquet is viable, but the
>> >> > >> query performance still has room to improve. We can improve it in
>> >> > >> the following aspects:
>> >> > >>
>> >> > >>  1. Minimize result set serialization time
>> >> > >>  Since Kylin needs Object[] data for processing, we convert the
>> >> > >> Dataset to an RDD and then convert each "Row" to Object[], so
>> >> > >> Spark has to serialize the Object[] records before returning them
>> >> > >> to the driver. This time should be avoided.
>> >> > >>
>> >> > >>  2. Query without dictionary
>> >> > >>  In this PoC, to save storage, we keep the dictionary-encoded
>> >> > >> values in the Parquet files for dict-encoded dimensions, so Kylin
>> >> > >> must load the dictionary to decode those values at query time. If
>> >> > >> we kept the original values for dict-encoded dimensions instead,
>> >> > >> the dictionary would be unnecessary, and we don't have to worry
>> >> > >> about storage use, because Parquet will encode the values itself.
>> >> > >> We should remove the dictionary from the query path.
>> >> > >>
>> >> > >>  3. Remove the query single-point issue
>> >> > >>  In this PoC, we use Spark to read and process cube data, which is
>> >> > >> distributed, but Kylin still needs to process the result data
>> >> > >> Spark returns in a single JVM. We can try to make that step
>> >> > >> distributed too.
>> >> > >>
>> >> > >>  4. Upgrade Parquet to 1.11 for page indexes
>> >> > >>  In this PoC, Parquet doesn't have page indexes, so we get poor
>> >> > >> filter performance. We need to upgrade Parquet to version 1.11,
>> >> > >> which adds page indexes, to improve filter performance.
>> >> > >>
>> >> > >> ------------------
>> >> > >> Best Regards,
>> >> > >> Chao Long
>> >> > >>
>> >> > >> ------------------ Original Message ------------------
>> >> > >> *From:* "ShaoFeng Shi"<sh...@apache.org>;
>> >> > >> *Sent:* Friday, December 14, 2018, 4:39 PM
>> >> > >> *To:* "dev"<de...@kylin.apache.org>;
>> >> > >> *Subject:* Evaluate Kylin on Parquet
>> >> > >>
>> >> > >> Hello Kylin users,
>> >> > >>
>> >> > >> The first version of the Kylin on Parquet [1] feature has been
>> >> > >> staged in the Kylin code repository for public review and
>> >> > >> evaluation. You can check out the "kylin-on-parquet" branch [2] to
>> >> > >> read the code, and can also make a binary build to run an example.
>> >> > >> When creating a cube, you can select "Parquet" as the storage in
>> >> > >> the "Advanced Setting" page. Both the MapReduce and Spark engines
>> >> > >> support this new storage. A tech blog on the design and
>> >> > >> implementation is being drafted.
>> >> > >>
>> >> > >> Thanks so much for the hard work of the engineers Chao Long and
>> >> > >> Yichen Zhou!
>> >> > >>
>> >> > >> This is not the final version; there is room for improvement in
>> >> > >> many aspects: Parquet, Spark, and Kylin. It can be used for PoC at
>> >> > >> this moment. Your comments are welcome. Let's improve it together.
>> >> > >>
>> >> > >> [1] https://issues.apache.org/jira/browse/KYLIN-3621
>> >> > >> [2] https://github.com/apache/kylin/tree/kylin-on-parquet
>> >> > >>
>> >> > >> Best regards,
>> >> > >>
>> >> > >> Shaofeng Shi 史少锋
>> >> > >> Apache Kylin PMC
>> >> > >> Work email: shaofeng.shi@kyligence.io
>> >> > >> Kyligence Inc: https://kyligence.io/
>> >> > >>
>> >> > >> Apache Kylin FAQ:
>> >> https://kylin.apache.org/docs/gettingstarted/faq.html
>> >> > >> Join Kylin user mail group: user-subscribe@kylin.apache.org
>> >> > >> Join Kylin dev mail group: dev-subscribe@kylin.apache.org
>> >> > >>
>> >> > >>
>> >> > >>
>> >> > >
>> >> > >
>> >> > >
>> >> >
>> >> >
>> >> > --
>> >> >
>> >> >
>> >> > Regards!
>> >> >
>> >> > Aron Tao
>> >> >
>> >>
>> >
>> >
>>
>


Re: Re: Evaluate Kylin on Parquet

Posted by Li Yang <li...@apache.org>.
From the discussion, apparently a new storage will be added sooner or later.

Will it be a new big version of Kylin? Like Apache Kylin 3.0? Also how
about the migration from old storage? I assume old cube data has to be
transformed and loaded into the new storage.

Yang


Re: Re: Evaluate Kylin on Parquet

Posted by Li Yang <li...@apache.org>.
From the discussion, apparently a new storage will be added sooner or late.

Will it be a new big version of Kylin? Like Apache Kylin 3.0? Also how
about the migration from old storage? I assume old cube data has to be
transformed and loaded into the new storage.

Yang

On Sat, Dec 29, 2018 at 5:52 PM ShaoFeng Shi <sh...@apache.org> wrote:

> Thanks very much for Yiming and Jiatao's comments, they're very valueable.
> There are many improvements can do for this new storage. We welcome all
> kinds of contribution and would like to improve it together with the
> community in the year of 2019!
>
> Best regards,
>
> Shaofeng Shi 史少锋
> Apache Kylin PMC
> Work email: shaofeng.shi@kyligence.io
> Kyligence Inc: https://kyligence.io/
>
> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
> Join Kylin user mail group: user-subscribe@kylin.apache.org
> Join Kylin dev mail group: dev-subscribe@kylin.apache.org
>
>
>
>
> JiaTao Tao <ta...@gmail.com> 于2018年12月19日周三 下午8:44写道:
>
> > Hi all,
> >
> > Truly agreed with Yiming, and here I expand a little more about
> > "Distributed computing".
> >
> > As Yiming mentioned, Kylin will parse the query into an execution plan
> > using Calcite(Kylin will change the execution plan cuz the data in cubes
> is
> > already aggregated, we cannot use the origin plan directly). It's a tree
> > structure, a node represents a specific calculation and data goes from
> > bottom to top applying all these calculations.
> > [image: image.png]
> > (Pic from https://blog.csdn.net/yu616568/article/details/50838504, a
> > really good blog.)
> >
> > At present, Kylin will do almost all these calculations only in its own
> > node, in other words, we cannot fully use the power of the cluster, and
> > it's a SPOF. And here comes a design that we can visit this tree, *and
> > transform each node into operations to Spark's Dataframes(i.e. "DF").*
> >
> > More specifically, we will visit the nodes recursively until we met the
> > "TableScan" node(like a stack pushing operation). e.g. In the above
> > diagram, the first node we met is a "Sort" node, we just visit its
> > child(ren), and we'll not stop visiting each node's child(ren) until we
> met
> > a "TableScan" node.
> >
> > In the "TableScan" node, we will generate the initial DF, then the DF
> will
> > be poped to the "Filter" node, and the "Filter" node will apply its own
> > operation like "df.filter(xxx)". Finally, we will apply each node's
> > operation to this DF, and the final call chain will like:
> > "df.filter(xxx).select(xxx).agg(xxx).sort(xxx)".
> >
> > After we got the final Dataframe and triggered the calculation, all the
> > rest were handled by Spark. And we can gain tremendous benefits in
> > computation level, more details can be seen in my previous post:
> >
> http://apache-kylin.74782.x6.nabble.com/Re-DISCUSS-Columnar-storage-engine-for-Apache-Kylin-tc12113.html
> > .
> >
> >
> > --
> >
> >
> > Regards!
> >
> > Aron Tao
> >
> >
> > 许益铭 <x1...@gmail.com> 于2018年12月19日周三 上午11:40写道:
> >
> >> hi All!
> >> 关于CHAO LONG提到的几个问题,我有以下几个看法:
> >>
> >>
> 1.当前我们的架构是分为两层的,一层是storage层,一层是计算层.在storage层,我们已经做了一些优化,在storage层做了预聚合来减少返回的数据量,但是runtime的聚合和连接发生在kylin
> >> server端,序列化无可避免,且这个架构容易导致单点瓶颈,如果runtime
> >> 的agg或join数据量比较大的话,会导致查询性能直线下降,kylin
> >> server GC严重
> >>
> >>
> >>
> 2.关于字典问题,字典是当初为了在hbase中对齐rowkey,同时也为了减少一部分的存储而引入的设计.但这也引入另外一个问题,hbase很难处理非定长的string类型的dimension,如果遇到高基的非定长dimension,往往只能去建立一个很大的字典或者给一个比较大的fixlength,导致存储翻倍,同时因为字典比较大,查询性能会受到很大影响(gc).如果我们使用列式存储,是可以不需要考虑这个问题的.
> >>
> >> 3.我们要使用parquet的page
> >>
> index,必须把tuplefilter转换成parquet的filter,这个工作量不小.而且我们的数据都是被编码过的,parquet的page
> >> index只会根据page上的min max来进行过滤,因此对于binary的数据,是无法做filter的.
> >>
> >> 我觉得使用spark来做我们的计算引擎能解决上述所有问题:
> >>
> >> 1.分布式计算
> >> sql通过calcite解析优化之后会生成olap
> >>
> >>
> rel的一颗树,而spark的catalyst也是通过解析sql生成一棵树后,自动优化成为dataframe来计算,如果calcite的plan能够转换成spark的plan,那么我们将实现分布式计算,calcite只负责解析sql和返回结果集,减少kylin
> >> server端的压力.
> >>
> >> 2.去掉字典
> >>
> >>
> 字典有个很好的作用就是在中低基数下减少储存压力,但是也有一个坏处就是其数据文件无法脱离字典单独使用,我建议刚开始可以不考虑字典类型的encoding,让系统尽可能的简单,默认使用parquet的page级别的dictionary即可.
> >>
> >> 3.parquet存储使用列的真实类型,而不是使用binary
> >>
> >>
> 如上,parquet对于binary的filter能力极弱,而使用基本类型能够直接使用spark的Vectorizedread,加速数据读取速度和计算.
> >>
> >> 4.使用spark适配parquet
> >> 当前的spark已经适配了parquet,spark的pushed
> >> filter已经被转换成为了parquet能用的filter,这里只需要升级parquet版本后稍加修改就能提供parquet的page
> >> index能力.
> >>
> >> 5.index server
> >> 就如JiaTao Tao所述,index server分为file index 和 page index ,字典的过滤无非就是file
> >> index的一种,因为我们可以在这里插入一个index server.
> >>
> >>
> >> hi,all!
> >> I have the following views:
> >> 1. At present, our architecture is divided into two layers, one is the
> >> storage layer, and the other is the computing layer. In the storage
> layer,
> >> we have made some optimizations and do pre-aggregation in the storage
> >> layer
> >> to reduce the amount of data returned. However, the aggregation and
> >> connection of the runtime occurs on the kylin server side. Serialization
> >> is
> >> inevitable, and this architecture is easy to cause a single point
> >> bottleneck. If the agg or join data of the runtime is relatively large,
> >> the
> >> query performance will drop linearly, and the kylin server GC will be
> >> severe.
> >>
> >> 2. As for the dictionary problem, canceling dictionary encoding is a
> good
> >> choice. The dictionary was originally designed to align rowkey in hbase
> >> and
> >> also to reduce part of the storage. But this also introduces another
> >> problem, it is difficult to handle non-fixed string type dimension If
> you
> >> encounter a UHC dimension, you can only create a large dictionary or
> give
> >> a
> >> larger fix-length, which causes the storage to double, and because the
> >> dictionary is large, the query performance will be greatly affected. We
> >> use
> >> columnar storage, we don't need to consider this problem.
> >>
> >> 3. We need to use the page index of the parquet, we must convert the
> tuple
> >> filter into the filter of the parquet. This workload is not small. And
> our
> >> data is encoded. The page index of the parquet will only be based on the
> >> min and max value on the page. Filtering, so for binary data, it is
> >> impossible to do filter.
> >>
> >> I think using spark to do our calculation engine solves all of the above
> >> problems:
> >>
> >> Distributed computing
> >> Sql through calcite analysis optimization will generate a tree of OLAP
> >> rel,
> >> and spark's catalyst is also generated by parsing SQL after a tree,
> >> automatically optimized to become a dataframe to calculate, if the plan
> of
> >> calcite can be converted into a spark plan, then we will achieve
> >> distributed computing, calcite is only responsible for parsing SQL and
> >> returning result sets, reducing the pressure on the kylin server side.
> >>
> >> 2. Remove the dictionary
> >> The dictionary has a very good effect to reduce the storage pressure in
> >> the
> >> low and medium base, but there is also a disadvantage that its data
> files
> >> can not be used separately from the dictionary. I suggest that you can
> use
> >> the page level of the dictionary without considering the dictionary type
> >> encoding.
> >>
> >> 3.parquet storage uses the true type of the column instead of using
> binary
> >> As above, parquet has a very weak filter capability for binary, and the
> >> basic type can directly use spark's Vectorizedread to speed up data
> >> reading
> >> speed and calculation.
> >>
> >> 4. Use spark to match the parquet
> >> The current spark has been adapted to the parquet. The sparked filter of
> >> the spark has been converted into a filter that can be used by the
> >> parquet.
> >> Here, you only need to upgrade the version of the parcel and modify it
> to
> >> provide the page index of the parquet.
> >>
> >> 5.index server
> >> As described by JiaTao Tao, the index server is divided into file index
> >> and
> >> page index. The filtering of the dictionary is nothing but a file index,
> >> because we can insert an index server here.
> >>
> >> JiaTao Tao <ta...@gmail.com> 于2018年12月19日周三 下午4:45写道:
> >>
> >> > Hi Gang
> >> >
> >> > In my opinion, segments/partition pruning is actually in the scope of
> >> > "Index system", we can have an "Index system" in storage level
> including
> >> > File index(for segment/partition pruning), page index(for page
> pruning)
> >> > etc. We can put all these stuff in such a system and make the
> >> separation of
> >> > duties cleaner.
> >> >
> >> >
> >> > Ma Gang <mg...@163.com> 于2018年12月19日周三 上午6:31写道:
> >> >
> >> > > Awesome! Looking forward to the improvement. For dictionary, keep
> the
> >> > > dictionary in query engine, most time is not good since it brings
> >> lots of
> >> > > pressure to Kylin server, but sometimes it has benefit, for example,
> >> some
> >> > > segments can be pruned very early when filter value is not in the
> >> > > dictionary, and some queries can be answer directly using dictionary
> >> as
> >> > > described in: https://issues.apache.org/jira/browse/KYLIN-3490
> >> > >
> >> > > At 2018-12-17 15:36:01, "ShaoFeng Shi" <sh...@apache.org>
> >> wrote:
> >> > >
> >> > > The dimension dictionary is a legacy design for HBase storage I
> think;
> >> > > because HBase has no data type, everything is a byte array, this
> makes
> >> > > Kylin has to encode STRING and other types with some encoding method
> >> like
> >> > > the dictionary.
> >> > >
> >> > > Now with the storage like Parquet, it would decide how to encode the
> >> data
> >> > > at the page or block level. Then we can drop the dictionary after
> the
> >> > cube
> >> > > is built. This will release the memory pressure of Kylin query nodes
> >> and
> >> > > also benefit the UHC case.
> >> > >
> >> > > Best regards,
> >> > >
> >> > > Shaofeng Shi 史少锋
> >> > > Apache Kylin PMC
> >> > > Work email: shaofeng.shi@kyligence.io
> >> > > Kyligence Inc: https://kyligence.io/
> >> > >
> >> > > Apache Kylin FAQ:
> >> https://kylin.apache.org/docs/gettingstarted/faq.html
> >> > > Join Kylin user mail group: user-subscribe@kylin.apache.org
> >> > > Join Kylin dev mail group: dev-subscribe@kylin.apache.org
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > Chao Long <wa...@qq.com> 于2018年12月17日周一 下午1:23写道:
> >> > >
> >> > >>  In this PoC, we verified Kylin on Parquet is viable, but the query
> >> > >> performance still has room for improvement. We can improve it in
> >> > >> the following aspects:
> >> > >>
> >> > >>  1, Minimize result set serialization time
> >> > >>  Since Kylin needs Object[] data for processing, we convert the
> >> > >> Dataset to an RDD and then convert the "Row" type to Object[], so
> >> > >> Spark needs to serialize Object[] before returning it to the
> >> > >> driver. This time cost should be avoided.
> >> > >>
> >> > >>  2, Query without dictionary
> >> > >>  In this PoC, to reduce storage use, we keep the dict-encoded
> >> > >> values in the Parquet files for dict-encoded dimensions, so Kylin
> >> > >> must load the dictionary to convert the dict values at query time.
> >> > >> If we keep the original values for dict-encoded dimensions, the
> >> > >> dictionary is unnecessary. And we don't have to worry about storage
> >> > >> use, because Parquet will encode the values itself. We should
> >> > >> remove the dictionary from the query path.
> >> > >>
> >> > >>  3, Remove the query single-point issue
> >> > >>  In this PoC, we use Spark to read and process Cube data, which is
> >> > >> distributed, but Kylin also needs to process the result data Spark
> >> > >> returns in a single JVM. We can try to make that distributed too.
> >> > >>
> >> > >>  4, Upgrade Parquet to 1.11 for the page index
> >> > >>  In this PoC, Parquet doesn't have a page index, so we get poor
> >> > >> filter performance. We need to upgrade Parquet to version 1.11,
> >> > >> which has a page index, to improve filter performance.
> >> > >>
> >> > >> ------------------
> >> > >> Best Regards,
> >> > >> Chao Long
> >> > >>
> >> > >> ------------------ Original Message ------------------
> >> > >> *From:* "ShaoFeng Shi"<sh...@apache.org>;
> >> > >> *Sent:* Friday, December 14, 2018, 4:39 PM
> >> > >> *To:* "dev"<de...@kylin.apache.org>;
> >> > >> *Subject:* Evaluate Kylin on Parquet
> >> > >>
> >> > >> Hello Kylin users,
> >> > >>
> >> > >> The first version of Kylin on Parquet [1] feature has been staged
> in
> >> > >> Kylin code repository for public review and evaluation. You can
> check
> >> > out
> >> > >> the "kylin-on-parquet" branch [2] to read the code, and also can
> >> make a
> >> > >> binary build to run an example. When creating a cube, you can
> select
> >> > >> "Parquet" as the storage in the "Advanced setting" page. Both
> >> MapReduce
> >> > and
> >> > >> Spark engines support this new storage. A tech blog is under
> drafting
> >> > for
> >> > >> the design and implementation.
> >> > >>
> >> > >> Thanks so much to the engineers' hard work: Chao Long and Yichen
> >> Zhou!
> >> > >>
> >> > >> This is not the final version; there is room to improve in many
> >> aspects,
> >> > >> parquet, spark, and Kylin. It can be used for PoC at this moment.
> >> Your
> >> > >> comments are welcomed. Let's improve it together.
> >> > >>
> >> > >> [1] https://issues.apache.org/jira/browse/KYLIN-3621
> >> > >> [2] https://github.com/apache/kylin/tree/kylin-on-parquet
> >> > >>
> >> > >> Best regards,
> >> > >>
> >> > >> Shaofeng Shi 史少锋
> >> > >> Apache Kylin PMC
> >> > >> Work email: shaofeng.shi@kyligence.io
> >> > >> Kyligence Inc: https://kyligence.io/
> >> > >>
> >> > >> Apache Kylin FAQ:
> >> https://kylin.apache.org/docs/gettingstarted/faq.html
> >> > >> Join Kylin user mail group: user-subscribe@kylin.apache.org
> >> > >> Join Kylin dev mail group: dev-subscribe@kylin.apache.org
> >> > >>
> >> > >>
> >> > >>
> >> > >
> >> > >
> >> > >
> >> >
> >> >
> >> > --
> >> >
> >> >
> >> > Regards!
> >> >
> >> > Aron Tao
> >> >
> >>
> >
> >
>

Re: Re: Evaluate Kylin on Parquet

Posted by ShaoFeng Shi <sh...@apache.org>.
Thanks very much for Yiming and Jiatao's comments; they're very valuable.
There are many improvements we can make to this new storage. We welcome all
kinds of contributions and would like to improve it together with the
community in the year of 2019!

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Work email: shaofeng.shi@kyligence.io
Kyligence Inc: https://kyligence.io/

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org




JiaTao Tao <ta...@gmail.com> 于2018年12月19日周三 下午8:44写道:

> Hi all,
>
> Truly agreed with Yiming, and here I expand a little more on
> "Distributed computing".
>
> As Yiming mentioned, Kylin parses a query into an execution plan
> using Calcite (Kylin changes the execution plan because the data in cubes
> is already aggregated, so we cannot use the original plan directly). It's
> a tree structure: each node represents a specific calculation, and data
> goes from bottom to top applying all these calculations.
> [image: a Calcite execution plan tree]
> (Pic from https://blog.csdn.net/yu616568/article/details/50838504, a
> really good blog.)
>
> At present, Kylin does almost all these calculations only in its own
> node; in other words, we cannot fully use the power of the cluster, and
> it's a SPOF. So here comes a design: we can visit this tree *and
> transform each node into operations on Spark's DataFrames (i.e. "DF").*
>
> More specifically, we visit the nodes recursively until we meet the
> "TableScan" node (like a stack pushing operation). E.g., in the above
> diagram, the first node we meet is a "Sort" node; we just visit its
> child(ren), and we don't stop visiting each node's child(ren) until we
> meet a "TableScan" node.
>
> In the "TableScan" node, we generate the initial DF; then the DF is
> popped to the "Filter" node, and the "Filter" node applies its own
> operation like "df.filter(xxx)". Finally, we apply each node's
> operation to this DF, and the final call chain will look like:
> "df.filter(xxx).select(xxx).agg(xxx).sort(xxx)".
>
> After we get the final DataFrame and trigger the calculation, all the
> rest is handled by Spark. And we can gain tremendous benefits at the
> computation level; more details can be seen in my previous post:
> http://apache-kylin.74782.x6.nabble.com/Re-DISCUSS-Columnar-storage-engine-for-Apache-Kylin-tc12113.html
> .
>
>
> --
>
>
> Regards!
>
> Aron Tao
>
>
> 许益铭 <x1...@gmail.com> 于2018年12月19日周三 上午11:40写道:
>
>> hi all!
>> I have the following views:
>> 1. At present, our architecture is divided into two layers: a storage
>> layer and a computing layer. In the storage layer, we have made some
>> optimizations and do pre-aggregation to reduce the amount of data
>> returned. However, runtime aggregation and joins happen on the Kylin
>> server side, so serialization is inevitable, and this architecture easily
>> causes a single-point bottleneck. If the runtime agg or join data is
>> relatively large, query performance drops sharply and the Kylin server GC
>> becomes severe.
>>
>> 2. As for the dictionary problem, canceling dictionary encoding is a good
>> choice. The dictionary was originally designed to align rowkeys in HBase
>> and also to reduce part of the storage. But it introduces another
>> problem: it is difficult to handle non-fixed-length string dimensions. If
>> you encounter a UHC dimension, you can only build a very large dictionary
>> or give a fairly large fixed length, which doubles the storage; and
>> because the dictionary is large, query performance is greatly affected
>> (GC). If we use columnar storage, we don't need to consider this problem.
>>
>> 3. To use Parquet's page index, we must convert the TupleFilter into a
>> Parquet filter, which is no small workload. Moreover, our data is
>> encoded, and Parquet's page index only filters on the min and max values
>> of each page, so binary data cannot be filtered.
>>
>> I think using Spark as our calculation engine solves all of the above
>> problems:
>>
>> 1. Distributed computing
>> After Calcite parses and optimizes the SQL, it generates a tree of OLAP
>> RelNodes; Spark's Catalyst likewise parses SQL into a tree and
>> automatically optimizes it into a DataFrame for computation. If the
>> Calcite plan can be converted into a Spark plan, we achieve distributed
>> computing: Calcite is only responsible for parsing SQL and returning
>> result sets, reducing the pressure on the Kylin server side.
>>
>> 2. Remove the dictionary
>> The dictionary works very well to reduce storage pressure at low and
>> medium cardinality, but it has the drawback that its data files cannot be
>> used apart from the dictionary. I suggest not considering dictionary-type
>> encoding at first, keeping the system as simple as possible and relying
>> on Parquet's page-level dictionary by default.
>>
>> 3. Parquet storage should use the true type of each column instead of
>> binary
>> As above, Parquet's filter capability on binary is very weak, while
>> primitive types can directly use Spark's vectorized read to speed up data
>> reading and computation.
>>
>> 4. Use Spark to work with Parquet
>> Current Spark is already adapted to Parquet: Spark's pushed-down filters
>> are converted into filters that Parquet can use. Here we only need to
>> upgrade the Parquet version and make minor modifications to enable
>> Parquet's page index.
>>
>> 5. Index server
>> As JiaTao Tao described, the index server covers both a file index and a
>> page index. Dictionary filtering is nothing but a kind of file index, so
>> we can insert an index server here.
>>
>> JiaTao Tao <ta...@gmail.com> 于2018年12月19日周三 下午4:45写道:
>>
>> > Hi Gang
>> >
>> > In my opinion, segments/partition pruning is actually in the scope of
>> > an "Index system"; we can have an "Index system" at the storage level,
>> > including a file index (for segment/partition pruning), a page index
>> > (for page pruning), etc. We can put all this stuff in such a system and
>> > make the separation of duties cleaner.
>> >
>> >
>> > Ma Gang <mg...@163.com> 于2018年12月19日周三 上午6:31写道:
>> >
>> > > Awesome! Looking forward to the improvement. For the dictionary,
>> > > keeping the dictionary in the query engine is usually not good since
>> > > it brings a lot of pressure to the Kylin server, but sometimes it has
>> > > benefits; for example, some segments can be pruned very early when a
>> > > filter value is not in the dictionary, and some queries can be
>> > > answered directly using the dictionary, as described in:
>> > > https://issues.apache.org/jira/browse/KYLIN-3490
>> > >
>> > > At 2018-12-17 15:36:01, "ShaoFeng Shi" <sh...@apache.org> wrote:
>> > >
>> > > The dimension dictionary is a legacy design for HBase storage, I
>> > > think; because HBase has no data types and everything is a byte
>> > > array, Kylin has to encode STRING and other types with some encoding
>> > > method like the dictionary.
>> > >
>> > > Now with a storage like Parquet, the storage itself decides how to
>> > > encode the data at the page or block level. Then we can drop the
>> > > dictionary after the cube is built. This will relieve the memory
>> > > pressure on Kylin query nodes and also benefit the UHC case.
>> > >
>> > > Best regards,
>> > >
>> > > Shaofeng Shi 史少锋
>> > > Apache Kylin PMC
>> > > Work email: shaofeng.shi@kyligence.io
>> > > Kyligence Inc: https://kyligence.io/
>> > >
>> > > Apache Kylin FAQ:
>> https://kylin.apache.org/docs/gettingstarted/faq.html
>> > > Join Kylin user mail group: user-subscribe@kylin.apache.org
>> > > Join Kylin dev mail group: dev-subscribe@kylin.apache.org
>> > >
>> > >
>> > >
>> > >
>> > > Chao Long <wa...@qq.com> 于2018年12月17日周一 下午1:23写道:
>> > >
>> > >>  In this PoC, we verified Kylin on Parquet is viable, but the query
>> > >> performance still has room for improvement. We can improve it in the
>> > >> following aspects:
>> > >>
>> > >>  1, Minimize result set serialization time
>> > >>  Since Kylin needs Object[] data for processing, we convert the
>> > >> Dataset to an RDD and then convert the "Row" type to Object[], so
>> > >> Spark needs to serialize Object[] before returning it to the driver.
>> > >> This time cost should be avoided.
>> > >>
>> > >>  2, Query without dictionary
>> > >>  In this PoC, to reduce storage use, we keep the dict-encoded values
>> > >> in the Parquet files for dict-encoded dimensions, so Kylin must load
>> > >> the dictionary to convert the dict values at query time. If we keep
>> > >> the original values for dict-encoded dimensions, the dictionary is
>> > >> unnecessary. And we don't have to worry about storage use, because
>> > >> Parquet will encode the values itself. We should remove the
>> > >> dictionary from the query path.
>> > >>
>> > >>  3, Remove the query single-point issue
>> > >>  In this PoC, we use Spark to read and process Cube data, which is
>> > >> distributed, but Kylin also needs to process the result data Spark
>> > >> returns in a single JVM. We can try to make that distributed too.
>> > >>
>> > >>  4, Upgrade Parquet to 1.11 for the page index
>> > >>  In this PoC, Parquet doesn't have a page index, so we get poor
>> > >> filter performance. We need to upgrade Parquet to version 1.11, which
>> > >> has a page index, to improve filter performance.
>> > >> ------------------
>> > >> Best Regards,
>> > >> Chao Long
>> > >>
>> > >> ------------------ Original Message ------------------
>> > >> *From:* "ShaoFeng Shi"<sh...@apache.org>;
>> > >> *Sent:* Friday, December 14, 2018, 4:39 PM
>> > >> *To:* "dev"<de...@kylin.apache.org>;
>> > >> *Subject:* Evaluate Kylin on Parquet
>> > >>
>> > >> Hello Kylin users,
>> > >>
>> > >> The first version of Kylin on Parquet [1] feature has been staged in
>> > >> Kylin code repository for public review and evaluation. You can check
>> > out
>> > >> the "kylin-on-parquet" branch [2] to read the code, and also can
>> make a
>> > >> binary build to run an example. When creating a cube, you can select
>> > >> "Parquet" as the storage in the "Advanced setting" page. Both
>> MapReduce
>> > and
>> > >> Spark engines support this new storage. A tech blog is under drafting
>> > for
>> > >> the design and implementation.
>> > >>
>> > >> Thanks so much to the engineers' hard work: Chao Long and Yichen
>> Zhou!
>> > >>
>> > >> This is not the final version; there is room to improve in many
>> aspects,
>> > >> parquet, spark, and Kylin. It can be used for PoC at this moment.
>> Your
>> > >> comments are welcomed. Let's improve it together.
>> > >>
>> > >> [1] https://issues.apache.org/jira/browse/KYLIN-3621
>> > >> [2] https://github.com/apache/kylin/tree/kylin-on-parquet
>> > >>
>> > >> Best regards,
>> > >>
>> > >> Shaofeng Shi 史少锋
>> > >> Apache Kylin PMC
>> > >> Work email: shaofeng.shi@kyligence.io
>> > >> Kyligence Inc: https://kyligence.io/
>> > >>
>> > >> Apache Kylin FAQ:
>> https://kylin.apache.org/docs/gettingstarted/faq.html
>> > >> Join Kylin user mail group: user-subscribe@kylin.apache.org
>> > >> Join Kylin dev mail group: dev-subscribe@kylin.apache.org
>> > >>
>> > >>
>> > >>
>> > >
>> > >
>> > >
>> >
>> >
>> > --
>> >
>> >
>> > Regards!
>> >
>> > Aron Tao
>> >
>>
>
>

Re: Re: Evaluate Kylin on Parquet

Posted by JiaTao Tao <ta...@gmail.com>.
Hi all,

Truly agreed with Yiming, and here I expand a little more on
"Distributed computing".

As Yiming mentioned, Kylin parses a query into an execution plan using
Calcite (Kylin changes the execution plan because the data in cubes is
already aggregated, so we cannot use the original plan directly). It's a
tree structure: each node represents a specific calculation, and data goes
from bottom to top applying all these calculations.
[image: a Calcite execution plan tree]
(Pic from https://blog.csdn.net/yu616568/article/details/50838504, a really
good blog.)

At present, Kylin does almost all these calculations only in its own
node; in other words, we cannot fully use the power of the cluster, and
it's a SPOF. So here comes a design: we can visit this tree *and
transform each node into operations on Spark's DataFrames (i.e. "DF").*

More specifically, we visit the nodes recursively until we meet the
"TableScan" node (like a stack pushing operation). E.g., in the above
diagram, the first node we meet is a "Sort" node; we just visit its
child(ren), and we don't stop visiting each node's child(ren) until we
meet a "TableScan" node.

In the "TableScan" node, we generate the initial DF; then the DF is
popped to the "Filter" node, and the "Filter" node applies its own
operation like "df.filter(xxx)". Finally, we apply each node's
operation to this DF, and the final call chain will look like:
"df.filter(xxx).select(xxx).agg(xxx).sort(xxx)".

After we get the final DataFrame and trigger the calculation, all the
rest is handled by Spark. And we can gain tremendous benefits at the
computation level; more details can be seen in my previous post:
http://apache-kylin.74782.x6.nabble.com/Re-DISCUSS-Columnar-storage-engine-for-Apache-Kylin-tc12113.html
.


-- 


Regards!

Aron Tao


许益铭 <x1...@gmail.com> 于2018年12月19日周三 上午11:40写道:

> hi all!
> I have the following views:
> 1. At present, our architecture is divided into two layers: a storage
> layer and a computing layer. In the storage layer, we have made some
> optimizations and do pre-aggregation to reduce the amount of data
> returned. However, runtime aggregation and joins happen on the Kylin
> server side, so serialization is inevitable, and this architecture easily
> causes a single-point bottleneck. If the runtime agg or join data is
> relatively large, query performance drops sharply and the Kylin server GC
> becomes severe.
>
> 2. As for the dictionary problem, canceling dictionary encoding is a good
> choice. The dictionary was originally designed to align rowkeys in HBase
> and also to reduce part of the storage. But it introduces another
> problem: it is difficult to handle non-fixed-length string dimensions. If
> you encounter a UHC dimension, you can only build a very large dictionary
> or give a fairly large fixed length, which doubles the storage; and
> because the dictionary is large, query performance is greatly affected
> (GC). If we use columnar storage, we don't need to consider this problem.
>
> 3. To use Parquet's page index, we must convert the TupleFilter into a
> Parquet filter, which is no small workload. Moreover, our data is
> encoded, and Parquet's page index only filters on the min and max values
> of each page, so binary data cannot be filtered.
>
> I think using Spark as our calculation engine solves all of the above
> problems:
>
> 1. Distributed computing
> After Calcite parses and optimizes the SQL, it generates a tree of OLAP
> RelNodes; Spark's Catalyst likewise parses SQL into a tree and
> automatically optimizes it into a DataFrame for computation. If the
> Calcite plan can be converted into a Spark plan, we achieve distributed
> computing: Calcite is only responsible for parsing SQL and returning
> result sets, reducing the pressure on the Kylin server side.
>
> 2. Remove the dictionary
> The dictionary works very well to reduce storage pressure at low and
> medium cardinality, but it has the drawback that its data files cannot be
> used apart from the dictionary. I suggest not considering dictionary-type
> encoding at first, keeping the system as simple as possible and relying
> on Parquet's page-level dictionary by default.
>
> 3. Parquet storage should use the true type of each column instead of
> binary
> As above, Parquet's filter capability on binary is very weak, while
> primitive types can directly use Spark's vectorized read to speed up data
> reading and computation.
>
> 4. Use Spark to work with Parquet
> Current Spark is already adapted to Parquet: Spark's pushed-down filters
> are converted into filters that Parquet can use. Here we only need to
> upgrade the Parquet version and make minor modifications to enable
> Parquet's page index.
>
> 5. Index server
> As JiaTao Tao described, the index server covers both a file index and a
> page index. Dictionary filtering is nothing but a kind of file index, so
> we can insert an index server here.
>
> JiaTao Tao <ta...@gmail.com> 于2018年12月19日周三 下午4:45写道:
>
> > Hi Gang
> >
> > In my opinion, segments/partition pruning is actually in the scope of
> > an "Index system"; we can have an "Index system" at the storage level,
> > including a file index (for segment/partition pruning), a page index
> > (for page pruning), etc. We can put all this stuff in such a system and
> > make the separation of duties cleaner.
> >
> >
> > Ma Gang <mg...@163.com> 于2018年12月19日周三 上午6:31写道:
> >
> > > Awesome! Looking forward to the improvement. For the dictionary,
> > > keeping the dictionary in the query engine is usually not good since
> > > it brings a lot of pressure to the Kylin server, but sometimes it has
> > > benefits; for example, some segments can be pruned very early when a
> > > filter value is not in the dictionary, and some queries can be
> > > answered directly using the dictionary, as described in:
> > > https://issues.apache.org/jira/browse/KYLIN-3490
> > >
> > > At 2018-12-17 15:36:01, "ShaoFeng Shi" <sh...@apache.org> wrote:
> > >
> > > The dimension dictionary is a legacy design for HBase storage, I
> > > think; because HBase has no data types and everything is a byte
> > > array, Kylin has to encode STRING and other types with some encoding
> > > method like the dictionary.
> > >
> > > Now with a storage like Parquet, the storage itself decides how to
> > > encode the data at the page or block level. Then we can drop the
> > > dictionary after the cube is built. This will relieve the memory
> > > pressure on Kylin query nodes and also benefit the UHC case.
> > >
> > > Best regards,
> > >
> > > Shaofeng Shi 史少锋
> > > Apache Kylin PMC
> > > Work email: shaofeng.shi@kyligence.io
> > > Kyligence Inc: https://kyligence.io/
> > >
> > > Apache Kylin FAQ:
> https://kylin.apache.org/docs/gettingstarted/faq.html
> > > Join Kylin user mail group: user-subscribe@kylin.apache.org
> > > Join Kylin dev mail group: dev-subscribe@kylin.apache.org
> > >
> > >
> > >
> > >
> > > Chao Long <wa...@qq.com> 于2018年12月17日周一 下午1:23写道:
> > >
> > >>  In this PoC, we verified Kylin on Parquet is viable, but the query
> > >> performance still has room for improvement. We can improve it in the
> > >> following aspects:
> > >>
> > >>  1, Minimize result set serialization time
> > >>  Since Kylin needs Object[] data for processing, we convert the
> > >> Dataset to an RDD and then convert the "Row" type to Object[], so
> > >> Spark needs to serialize Object[] before returning it to the driver.
> > >> This time cost should be avoided.
> > >>
> > >>  2, Query without dictionary
> > >>  In this PoC, to reduce storage use, we keep the dict-encoded values
> > >> in the Parquet files for dict-encoded dimensions, so Kylin must load
> > >> the dictionary to convert the dict values at query time. If we keep
> > >> the original values for dict-encoded dimensions, the dictionary is
> > >> unnecessary. And we don't have to worry about storage use, because
> > >> Parquet will encode the values itself. We should remove the
> > >> dictionary from the query path.
> > >>
> > >>  3, Remove the query single-point issue
> > >>  In this PoC, we use Spark to read and process Cube data, which is
> > >> distributed, but Kylin also needs to process the result data Spark
> > >> returns in a single JVM. We can try to make that distributed too.
> > >>
> > >>  4, Upgrade Parquet to 1.11 for the page index
> > >>  In this PoC, Parquet doesn't have a page index, so we get poor
> > >> filter performance. We need to upgrade Parquet to version 1.11, which
> > >> has a page index, to improve filter performance.
> > >>
> > >> ------------------
> > >> Best Regards,
> > >> Chao Long
> > >>
> > >> ------------------ Original Message ------------------
> > >> *From:* "ShaoFeng Shi"<sh...@apache.org>;
> > >> *Sent:* Friday, December 14, 2018, 4:39 PM
> > >> *To:* "dev"<de...@kylin.apache.org>;
> > >> *Subject:* Evaluate Kylin on Parquet
> > >>
> > >> Hello Kylin users,
> > >>
> > >> The first version of Kylin on Parquet [1] feature has been staged in
> > >> Kylin code repository for public review and evaluation. You can check
> > out
> > >> the "kylin-on-parquet" branch [2] to read the code, and also can make
> a
> > >> binary build to run an example. When creating a cube, you can select
> > >> "Parquet" as the storage in the "Advanced setting" page. Both
> MapReduce
> > and
> > >> Spark engines support this new storage. A tech blog is under drafting
> > for
> > >> the design and implementation.
> > >>
> > >> Thanks so much to the engineers' hard work: Chao Long and Yichen Zhou!
> > >>
> > >> This is not the final version; there is room to improve in many
> aspects,
> > >> parquet, spark, and Kylin. It can be used for PoC at this moment. Your
> > >> comments are welcomed. Let's improve it together.
> > >>
> > >> [1] https://issues.apache.org/jira/browse/KYLIN-3621
> > >> [2] https://github.com/apache/kylin/tree/kylin-on-parquet
> > >>
> > >> Best regards,
> > >>
> > >> Shaofeng Shi 史少锋
> > >> Apache Kylin PMC
> > >> Work email: shaofeng.shi@kyligence.io
> > >> Kyligence Inc: https://kyligence.io/
> > >>
> > >> Apache Kylin FAQ:
> https://kylin.apache.org/docs/gettingstarted/faq.html
> > >> Join Kylin user mail group: user-subscribe@kylin.apache.org
> > >> Join Kylin dev mail group: dev-subscribe@kylin.apache.org
> > >>
> > >>
> > >>
> > >
> > >
> > >
> >
> >
> > --
> >
> >
> > Regards!
> >
> > Aron Tao
> >
>

Re: Re: Evaluate Kylin on Parquet

Posted by 许益铭 <x1...@gmail.com>.
hi all!
I have the following views:
1. At present, our architecture is divided into two layers: a storage
layer and a computing layer. In the storage layer, we have made some
optimizations and do pre-aggregation to reduce the amount of data
returned. However, runtime aggregation and joins happen on the Kylin
server side, so serialization is inevitable, and this architecture easily
causes a single-point bottleneck. If the runtime agg or join data is
relatively large, query performance drops sharply and the Kylin server GC
becomes severe.
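
To make the single-point issue concrete, here is a minimal Scala sketch of
the pattern described above (the path and the Row-to-Object[] handling are
illustrative, not Kylin's actual code):

    import org.apache.spark.sql.SparkSession

    object SinglePointSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("single-point-sketch").getOrCreate()

        // Distributed part: Spark scans the cube segment in parallel.
        val cubeData = spark.read.parquet("hdfs:///kylin/parquet/cube_segment")

        // Single-point part: every result row is serialized and collected
        // into one JVM (the Kylin server), where the runtime agg/join runs.
        val rows: Array[Array[Any]] =
          cubeData.rdd
            .map(r => Array.tabulate(r.length)(r.get)) // Row -> Object[]
            .collect()                                 // all rows to the driver

        println(s"rows materialized on a single JVM: ${rows.length}")
        spark.stop()
      }
    }

Everything after collect() runs on one node; that is exactly the bottleneck
and GC source a distributed computing layer would remove.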

2. As for the dictionary problem, canceling dictionary encoding is a good
choice. The dictionary was originally designed to align rowkeys in HBase
and also to reduce part of the storage. But it introduces another problem:
it is difficult to handle non-fixed-length string dimensions. If you
encounter a UHC dimension, you can only build a very large dictionary or
give a fairly large fixed length, which doubles the storage; and because
the dictionary is large, query performance is greatly affected (GC). If we
use columnar storage, we don't need to consider this problem.

3. To use Parquet's page index, we must convert the TupleFilter into a
Parquet filter, which is no small workload. Moreover, our data is encoded,
and Parquet's page index only filters on the min and max values of each
page, so binary data cannot be filtered.
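
For illustration, a hedged Scala sketch of what such a conversion could
produce, using parquet-mr's FilterApi (the column names and values are made
up; Kylin's real TupleFilter tree is much richer):

    import org.apache.parquet.filter2.compat.FilterCompat
    import org.apache.parquet.filter2.predicate.FilterApi
    import org.apache.parquet.io.api.Binary

    object TupleFilterSketch {
      // e.g. WHERE price > 100 AND seller_id = '10000245'
      val predicate = FilterApi.and(
        FilterApi.gt(FilterApi.longColumn("price"),
                     java.lang.Long.valueOf(100L)),
        FilterApi.eq(FilterApi.binaryColumn("seller_id"),
                     Binary.fromString("10000245"))
      )

      // FilterCompat wraps the predicate so a Parquet reader can apply it
      // (row-group min/max today; page level once the page index lands).
      val filter: FilterCompat.Filter = FilterCompat.get(predicate)
    }

Min/max pruning is only meaningful for the typed long column here; on
encoded binary values it degrades, which is the weakness described above.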

I think using Spark as our calculation engine solves all of the above
problems:

1. Distributed computing
After Calcite parses and optimizes the SQL, it generates a tree of OLAP
RelNodes; Spark's Catalyst likewise parses SQL into a tree and
automatically optimizes it into a DataFrame for computation. If the
Calcite plan can be converted into a Spark plan, we achieve distributed
computing: Calcite is only responsible for parsing SQL and returning
result sets, reducing the pressure on the Kylin server side.

2. Remove the dictionary
The dictionary works very well to reduce storage pressure at low and
medium cardinality, but it has the drawback that its data files cannot be
used apart from the dictionary. I suggest not considering dictionary-type
encoding at first, keeping the system as simple as possible and relying on
Parquet's page-level dictionary by default.
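
As a small, hedged illustration of that default: when Spark writes Parquet,
parquet-mr's page-level dictionary encoding is already on and is controlled
through the Hadoop configuration (the paths here are hypothetical):

    import org.apache.spark.sql.SparkSession

    object PageDictionarySketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("page-dict-sketch").getOrCreate()

        // parquet-mr's per-page dictionary encoding; "true" is the default,
        // set explicitly here only to make the knob visible.
        spark.sparkContext.hadoopConfiguration
          .set("parquet.enable.dictionary", "true")

        val flat = spark.read.parquet("hdfs:///kylin/flat_table")
        // Low/medium-cardinality columns get dictionary-encoded per page
        // automatically, with no Kylin-side dictionary to load at query time.
        flat.write.parquet("hdfs:///kylin/parquet/cube_segment")
        spark.stop()
      }
    }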

3. Parquet storage should use the true type of each column instead of
binary
As above, Parquet's filter capability on binary is very weak, while
primitive types can directly use Spark's vectorized read to speed up data
reading and computation.

4. Use Spark's Parquet integration
Current Spark is already adapted to Parquet: Spark's pushed filters are
converted into filters that Parquet can use. Here we only need to upgrade the
Parquet version and make small modifications to get Parquet's page index
capability.
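
Spark's side of this already sits behind a configuration switch; continuing
the sketch above:

    // With pushdown on (the default), Spark translates the predicate into a
    // Parquet filter; after an upgrade to Parquet 1.11 the same filter can
    // also be evaluated against the page index.
    spark.conf.set("spark.sql.parquet.filterPushdown", "true")
    val pruned = typed.filter(col("SELLER_ID") === 10000002L)
    pruned.explain() // look for "PushedFilters: [..., EqualTo(SELLER_ID,10000002)]"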

5. Index server
As JiaTao Tao described, the index server is divided into a file index and a
page index. Dictionary-based filtering is just one kind of file index, so we
can insert an index server here.
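
A file index of that kind could be as simple as the following sketch (all
names are hypothetical; a dictionary lookup is just another source for the
same pruning decision):

    // Hypothetical per-file index entry: min/max statistics per column.
    case class FileIndexEntry(path: String, min: Map[String, Long], max: Map[String, Long])

    // Keep only the files whose [min, max] range can contain the value.
    def pruneFiles(index: Seq[FileIndexEntry], column: String, value: Long): Seq[FileIndexEntry] =
      index.filter(e => e.min(column) <= value && value <= e.max(column))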

JiaTao Tao <ta...@gmail.com> wrote on Wed, Dec 19, 2018 at 4:45 PM:

> Hi Gang
>
> In my opinion, segments/partition pruning is actually in the scope of
> "Index system", we can have an "Index system" in storage level including
> File index(for segment/partition pruning), page index(for page pruning)
> etc. We can put all these stuff in such a system and make the separation of
> duties cleaner.
>
>
> Ma Gang <mg...@163.com> wrote on Wed, Dec 19, 2018 at 6:31 AM:
>
> > Awesome! Looking forward to the improvement. As for the dictionary:
> > keeping the dictionary in the query engine is usually not good, since it
> > puts a lot of pressure on the Kylin server, but sometimes it has
> > benefits. For example, some segments can be pruned very early when the
> > filter value is not in the dictionary, and some queries can be answered
> > directly from the dictionary, as described in:
> > https://issues.apache.org/jira/browse/KYLIN-3490
> >
> > At 2018-12-17 15:36:01, "ShaoFeng Shi" <sh...@apache.org> wrote:
> >
> > The dimension dictionary is a legacy design for HBase storage, I think;
> > because HBase has no data types and everything is a byte array, Kylin
> > has to encode STRING and other types with some encoding method like the
> > dictionary.
> >
> > Now with a storage like Parquet, the storage itself decides how to
> > encode the data at the page or block level. Then we can drop the
> > dictionary after the cube is built. This will relieve the memory
> > pressure on Kylin query nodes and also benefit the UHC case.
> >
> > Best regards,
> >
> > Shaofeng Shi 史少锋
> > Apache Kylin PMC
> > Work email: shaofeng.shi@kyligence.io
> > Kyligence Inc: https://kyligence.io/
> >
> > Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
> > Join Kylin user mail group: user-subscribe@kylin.apache.org
> > Join Kylin dev mail group: dev-subscribe@kylin.apache.org
> >
> >
> >
> >
> > Chao Long <wa...@qq.com> wrote on Mon, Dec 17, 2018 at 1:23 PM:
> >
> >>  In this PoC, we verified that Kylin on Parquet is viable, but the query
> >> performance still has room to improve. We can improve it in the
> >> following aspects:
> >>
> >>  1. Minimize result set serialization time
> >>  Since Kylin needs Object[] data to process, we convert the Dataset to
> >> an RDD and then convert the "Row" type to Object[], so Spark needs to
> >> serialize Object[] before returning it to the driver. That time needs to
> >> be avoided.
> >>
> >>  2. Query without the dictionary
> >>  In this PoC, to use less storage, we keep dictionary-encoded values in
> >> the Parquet files for dict-encoded dimensions, so Kylin must load the
> >> dictionary to decode those values at query time. If we keep the original
> >> values for dict-encoded dimensions, the dictionary becomes unnecessary.
> >> And we don't have to worry about storage use, because Parquet will
> >> encode the values itself. We should remove the dictionary from the query
> >> path.
> >>
> >>  3. Remove the query single-point issue
> >>  In this PoC, we use Spark to read and process cube data, which is
> >> distributed, but Kylin also needs to process the result data Spark
> >> returns in a single JVM. We can try to make that distributed too.
> >>
> >>  4. Upgrade Parquet to 1.11 for the page index
> >>  In this PoC, Parquet doesn't have a page index, so we get poor filter
> >> performance. We need to upgrade to Parquet 1.11, which has a page index,
> >> to improve filter performance.
> >>
> >> ------------------
> >> Best Regards,
> >> Chao Long
> >>
> >> ------------------ Original Message ------------------
> >> *From:* "ShaoFeng Shi"<sh...@apache.org>;
> >> *Sent:* Friday, December 14, 2018, 4:39 PM
> >> *To:* "dev"<de...@kylin.apache.org>;
> >> *Subject:* Evaluate Kylin on Parquet
> >>
> >> Hello Kylin users,
> >>
> >> The first version of Kylin on Parquet [1] feature has been staged in
> >> Kylin code repository for public review and evaluation. You can check
> out
> >> the "kylin-on-parquet" branch [2] to read the code, and also can make a
> >> binary build to run an example. When creating a cube, you can select
> >> "Parquet" as the storage in the "Advanced setting" page. Both MapReduce
> and
> >> Spark engines support this new storage. A tech blog is under drafting
> for
> >> the design and implementation.
> >>
> >> Thanks so much to the engineers' hard work: Chao Long and Yichen Zhou!
> >>
> >> This is not the final version; there is room to improve in many aspects,
> >> parquet, spark, and Kylin. It can be used for PoC at this moment. Your
> >> comments are welcomed. Let's improve it together.
> >>
> >> [1] https://issues.apache.org/jira/browse/KYLIN-3621
> >> [2] https://github.com/apache/kylin/tree/kylin-on-parquet
> >>
> >> Best regards,
> >>
> >> Shaofeng Shi 史少锋
> >> Apache Kylin PMC
> >> Work email: shaofeng.shi@kyligence.io
> >> Kyligence Inc: https://kyligence.io/
> >>
> >> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
> >> Join Kylin user mail group: user-subscribe@kylin.apache.org
> >> Join Kylin dev mail group: dev-subscribe@kylin.apache.org
> >>
> >>
> >>
> >
> >
> >
>
>
> --
>
>
> Regards!
>
> Aron Tao
>

Re: Re: Evaluate Kylin on Parquet

Posted by JiaTao Tao <ta...@gmail.com>.
Hi Gang

In my opinion, segment/partition pruning actually falls within the scope of
an "index system": we can have an index system at the storage level,
including a file index (for segment/partition pruning), a page index (for
page pruning), etc. We can put all of this into such a system and make the
separation of duties cleaner.


Ma Gang <mg...@163.com> wrote on Wed, Dec 19, 2018 at 6:31 AM:

> Awesome! Looking forward to the improvement. As for the dictionary:
> keeping the dictionary in the query engine is usually not good, since it
> puts a lot of pressure on the Kylin server, but sometimes it has benefits.
> For example, some segments can be pruned very early when the filter value
> is not in the dictionary, and some queries can be answered directly from
> the dictionary, as described in:
> https://issues.apache.org/jira/browse/KYLIN-3490
>
> At 2018-12-17 15:36:01, "ShaoFeng Shi" <sh...@apache.org> wrote:
>
> The dimension dictionary is a legacy design for HBase storage, I think;
> because HBase has no data types and everything is a byte array, Kylin has
> to encode STRING and other types with some encoding method like the
> dictionary.
>
> Now with a storage like Parquet, the storage itself decides how to encode
> the data at the page or block level. Then we can drop the dictionary after
> the cube is built. This will relieve the memory pressure on Kylin query
> nodes and also benefit the UHC case.
>
> Best regards,
>
> Shaofeng Shi 史少锋
> Apache Kylin PMC
> Work email: shaofeng.shi@kyligence.io
> Kyligence Inc: https://kyligence.io/
>
> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
> Join Kylin user mail group: user-subscribe@kylin.apache.org
> Join Kylin dev mail group: dev-subscribe@kylin.apache.org
>
>
>
>
> Chao Long <wa...@qq.com> wrote on Mon, Dec 17, 2018 at 1:23 PM:
>
>>  In this PoC, we verified that Kylin on Parquet is viable, but the query
>> performance still has room to improve. We can improve it in the
>> following aspects:
>>
>>  1. Minimize result set serialization time
>>  Since Kylin needs Object[] data to process, we convert the Dataset to an
>> RDD and then convert the "Row" type to Object[], so Spark needs to
>> serialize Object[] before returning it to the driver. That time needs to
>> be avoided.
>>
>>  2. Query without the dictionary
>>  In this PoC, to use less storage, we keep dictionary-encoded values in
>> the Parquet files for dict-encoded dimensions, so Kylin must load the
>> dictionary to decode those values at query time. If we keep the original
>> values for dict-encoded dimensions, the dictionary becomes unnecessary.
>> And we don't have to worry about storage use, because Parquet will encode
>> the values itself. We should remove the dictionary from the query path.
>>
>>  3. Remove the query single-point issue
>>  In this PoC, we use Spark to read and process cube data, which is
>> distributed, but Kylin also needs to process the result data Spark
>> returns in a single JVM. We can try to make that distributed too.
>>
>>  4. Upgrade Parquet to 1.11 for the page index
>>  In this PoC, Parquet doesn't have a page index, so we get poor filter
>> performance. We need to upgrade to Parquet 1.11, which has a page index,
>> to improve filter performance.
>>
>> ------------------
>> Best Regards,
>> Chao Long
>>
>> ------------------ Original Message ------------------
>> *From:* "ShaoFeng Shi"<sh...@apache.org>;
>> *Sent:* Friday, December 14, 2018, 4:39 PM
>> *To:* "dev"<de...@kylin.apache.org>;
>> *Subject:* Evaluate Kylin on Parquet
>>
>> Hello Kylin users,
>>
>> The first version of Kylin on Parquet [1] feature has been staged in
>> Kylin code repository for public review and evaluation. You can check out
>> the "kylin-on-parquet" branch [2] to read the code, and also can make a
>> binary build to run an example. When creating a cube, you can select
>> "Parquet" as the storage in the "Advanced setting" page. Both MapReduce and
>> Spark engines support this new storage. A tech blog is under drafting for
>> the design and implementation.
>>
>> Thanks so much to the engineers' hard work: Chao Long and Yichen Zhou!
>>
>> This is not the final version; there is room to improve in many aspects,
>> parquet, spark, and Kylin. It can be used for PoC at this moment. Your
>> comments are welcomed. Let's improve it together.
>>
>> [1] https://issues.apache.org/jira/browse/KYLIN-3621
>> [2] https://github.com/apache/kylin/tree/kylin-on-parquet
>>
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>> Apache Kylin PMC
>> Work email: shaofeng.shi@kyligence.io
>> Kyligence Inc: https://kyligence.io/
>>
>> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
>> Join Kylin user mail group: user-subscribe@kylin.apache.org
>> Join Kylin dev mail group: dev-subscribe@kylin.apache.org
>>
>>
>>
>
>
>


-- 


Regards!

Aron Tao

Re:Re: Evaluate Kylin on Parquet

Posted by Ma Gang <mg...@163.com>.
Awesome! Looking forward to the improvement. As for the dictionary: keeping the dictionary in the query engine is usually not good, since it puts a lot of pressure on the Kylin server, but sometimes it has benefits. For example, some segments can be pruned very early when the filter value is not in the dictionary, and some queries can be answered directly from the dictionary, as described in: https://issues.apache.org/jira/browse/KYLIN-3490
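
That early pruning could look roughly like the following sketch (the types
and names are hypothetical, not Kylin's real API):

    // A segment whose dictionary does not contain the filter literal cannot
    // match any row, so it can be skipped before touching storage. Segments
    // without a dictionary for the column are kept conservatively.
    case class SegmentDict(segment: String, dicts: Map[String, Set[String]])

    def segmentsToScan(segments: Seq[SegmentDict], column: String, literal: String): Seq[SegmentDict] =
      segments.filter(s => s.dicts.get(column).forall(_.contains(literal)))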

At 2018-12-17 15:36:01, "ShaoFeng Shi" <sh...@apache.org> wrote:

The dimension dictionary is a legacy design for HBase storage, I think; because HBase has no data types and everything is a byte array, Kylin has to encode STRING and other types with some encoding method like the dictionary.


Now with a storage like Parquet, the storage itself decides how to encode the data at the page or block level. Then we can drop the dictionary after the cube is built. This will relieve the memory pressure on Kylin query nodes and also benefit the UHC case.


Best regards,


Shaofeng Shi 史少锋
Apache Kylin PMC
Work email: shaofeng.shi@kyligence.io

Kyligence Inc: https://kyligence.io/


Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org









Chao Long <wa...@qq.com> wrote on Mon, Dec 17, 2018 at 1:23 PM:

 In this PoC, we verified that Kylin on Parquet is viable, but the query performance still has room to improve. We can improve it in the following aspects:


 1. Minimize result set serialization time
 Since Kylin needs Object[] data to process, we convert the Dataset to an RDD and then convert the "Row" type to Object[], so Spark needs to serialize Object[] before returning it to the driver. That time needs to be avoided.


 2. Query without the dictionary
 In this PoC, to use less storage, we keep dictionary-encoded values in the Parquet files for dict-encoded dimensions, so Kylin must load the dictionary to decode those values at query time. If we keep the original values for dict-encoded dimensions, the dictionary becomes unnecessary. And we don't have to worry about storage use, because Parquet will encode the values itself. We should remove the dictionary from the query path.


 3. Remove the query single-point issue
 In this PoC, we use Spark to read and process cube data, which is distributed, but Kylin also needs to process the result data Spark returns in a single JVM. We can try to make that distributed too.


 4. Upgrade Parquet to 1.11 for the page index
 In this PoC, Parquet doesn't have a page index, so we get poor filter performance. We need to upgrade to Parquet 1.11, which has a page index, to improve filter performance.


------------------
Best Regards,
Chao Long
 
------------------ Original Message ------------------
From: "ShaoFeng Shi"<sh...@apache.org>;
Sent: Friday, December 14, 2018, 4:39 PM
To: "dev"<de...@kylin.apache.org>;
Subject: Evaluate Kylin on Parquet


Hello Kylin users,


The first version of Kylin on Parquet [1] feature has been staged in Kylin code repository for public review and evaluation. You can check out the "kylin-on-parquet" branch [2] to read the code, and also can make a binary build to run an example. When creating a cube, you can select "Parquet" as the storage in the "Advanced setting" page. Both MapReduce and Spark engines support this new storage. A tech blog is under drafting for the design and implementation.



Thanks so much to the engineers' hard work: Chao Long and Yichen Zhou!


This is not the final version; there is room to improve in many aspects, parquet, spark, and Kylin. It can be used for PoC at this moment. Your comments are welcomed. Let's improve it together.


[1] https://issues.apache.org/jira/browse/KYLIN-3621
[2] https://github.com/apache/kylin/tree/kylin-on-parquet


Best regards,


Shaofeng Shi 史少锋
Apache Kylin PMC
Work email: shaofeng.shi@kyligence.io

Kyligence Inc: https://kyligence.io/


Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org





Re: Evaluate Kylin on Parquet

Posted by ShaoFeng Shi <sh...@apache.org>.
The dimension dictionary is a legacy design for HBase storage, I think;
because HBase has no data types and everything is a byte array, Kylin has
to encode STRING and other types with some encoding method like the
dictionary.

Now with a storage like Parquet, the storage itself decides how to encode
the data at the page or block level. Then we can drop the dictionary after
the cube is built. This will relieve the memory pressure on Kylin query
nodes and also benefit the UHC case.

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Work email: shaofeng.shi@kyligence.io
Kyligence Inc: https://kyligence.io/

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org




Chao Long <wa...@qq.com> wrote on Mon, Dec 17, 2018 at 1:23 PM:

>  In this PoC, we verified that Kylin on Parquet is viable, but the query
> performance still has room to improve. We can improve it in the following
> aspects:
>
>  1. Minimize result set serialization time
>  Since Kylin needs Object[] data to process, we convert the Dataset to an
> RDD and then convert the "Row" type to Object[], so Spark needs to
> serialize Object[] before returning it to the driver. That time needs to
> be avoided.
>
>  2. Query without the dictionary
>  In this PoC, to use less storage, we keep dictionary-encoded values in
> the Parquet files for dict-encoded dimensions, so Kylin must load the
> dictionary to decode those values at query time. If we keep the original
> values for dict-encoded dimensions, the dictionary becomes unnecessary.
> And we don't have to worry about storage use, because Parquet will encode
> the values itself. We should remove the dictionary from the query path.
>
>  3. Remove the query single-point issue
>  In this PoC, we use Spark to read and process cube data, which is
> distributed, but Kylin also needs to process the result data Spark returns
> in a single JVM. We can try to make that distributed too.
>
>  4. Upgrade Parquet to 1.11 for the page index
>  In this PoC, Parquet doesn't have a page index, so we get poor filter
> performance. We need to upgrade to Parquet 1.11, which has a page index,
> to improve filter performance.
>
> ------------------
> Best Regards,
> Chao Long
>
> ------------------ 原始邮件 ------------------
> *发件人:* "ShaoFeng Shi"<sh...@apache.org>;
> *发送时间:* 2018年12月14日(星期五) 下午4:39
> *收件人:* "dev"<de...@kylin.apache.org>;
> *主题:* Evaluate Kylin on Parquet
>
> Hello Kylin users,
>
> The first version of Kylin on Parquet [1] feature has been staged in Kylin
> code repository for public review and evaluation. You can check out the
> "kylin-on-parquet" branch [2] to read the code, and also can make a binary
> build to run an example. When creating a cube, you can select "Parquet" as
> the storage in the "Advanced setting" page. Both MapReduce and Spark
> engines support this new storage. A tech blog is under drafting for the
> design and implementation.
>
> Thanks so much to the engineers' hard work: Chao Long and Yichen Zhou!
>
> This is not the final version; there is room to improve in many aspects,
> parquet, spark, and Kylin. It can be used for PoC at this moment. Your
> comments are welcomed. Let's improve it together.
>
> [1] https://issues.apache.org/jira/browse/KYLIN-3621
> [2] https://github.com/apache/kylin/tree/kylin-on-parquet
>
> Best regards,
>
> Shaofeng Shi 史少锋
> Apache Kylin PMC
> Work email: shaofeng.shi@kyligence.io
> Kyligence Inc: https://kyligence.io/
>
> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
> Join Kylin user mail group: user-subscribe@kylin.apache.org
> Join Kylin dev mail group: dev-subscribe@kylin.apache.org
>
>
>