Posted to dev@kylin.apache.org by Xiaoxiang Yu <xi...@kyligence.io> on 2019/03/18 02:27:39 UTC

[Discussion] Enable shrunken dictionary by default

Dear all,
I suggest enabling "kylin.dictionary.shrunken-from-global-enabled" by default (it is currently disabled by default), because I found that enabling it speeds up the cube build process when a cube has a count distinct (bitmap) measure on a high-cardinality column. This feature was contributed in KYLIN-3491.
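
For anyone who wants to try this before the default changes, here is a minimal sketch of the setting. It assumes the property goes into the instance's conf/kylin.properties file; that location is an assumption and is not stated in this thread.

```properties
# Build a smaller per-InputSplit dictionary from the global dictionary
kylin.dictionary.shrunken-from-global-enabled=true
```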

When a count distinct (bitmap) measure is used on a high-cardinality column (which requires a global dictionary), the BuildBaseCuboid step needs frequent cache swaps, so it cannot finish within a reasonable period. KYLIN-3491 adds a new step that builds a separate dictionary for each InputSplit before the BuildBaseCuboid step. Each mapper of the BuildBaseCuboid step then only has to fetch a smaller dictionary of its own (without unused values) instead of the larger global dictionary. This reduces cache swapping and makes the BuildBaseCuboid step run much faster.
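
The idea can be illustrated with a small Python sketch. This is not the KYLIN-3491 implementation: the function name, the sample data, and the assumption that a shrunken dictionary keeps the global IDs unchanged are all illustrative.

```python
from typing import Dict, Iterable, List


def shrink_dictionary(global_dict: Dict[str, int], split_values: Iterable[str]) -> Dict[str, int]:
    """Return only the global-dictionary entries whose values occur in one split."""
    # IDs are copied unchanged (assumption) so encodings from different
    # splits stay comparable when bitmaps are merged later.
    return {value: global_dict[value] for value in set(split_values)}


# Hypothetical global dictionary: every distinct column value mapped to an integer ID.
global_dict = {f"user_{i}": i for i in range(1_000_000)}

# Two hypothetical input splits, each touching only a few distinct values.
splits: List[List[str]] = [
    ["user_3", "user_7", "user_3"],
    ["user_7", "user_999999"],
]

shrunken = [shrink_dictionary(global_dict, split) for split in splits]

for index, small_dict in enumerate(shrunken):
    print(f"split {index}: {len(small_dict)} entries instead of {len(global_dict)}")
```

Each mapper in this sketch would load only its own small mapping, which is why it no longer has to swap pieces of the full dictionary in and out of its cache.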

In my test environment, the Hadoop cluster is a CDH cluster with 56 vcores and 110 GB of memory. I created a model with a fact table (153326740 rows) and three dimension tables. There are three count distinct (bitmap) measures, and the largest cardinality of a single column is 55200325. With ShrunkenDict disabled, the BuildBaseCuboid step could not complete within 22 hours. By comparison, with ShrunkenDict enabled, the build completed in a reasonable time (the extra dictionary step took 5 minutes and Build Base Cuboid took 5 minutes).
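
As a rough reading of these figures, the short calculation below derives a lower bound on the speed-up. It is only a lower bound because the run without the shrunken dictionary had not finished after 22 hours; the variable names are illustrative.

```python
# Figures taken from the message above.
baseline_minutes_lower_bound = 22 * 60  # BuildBaseCuboid had not finished after 22 hours
extra_dictionary_minutes = 5            # new per-InputSplit dictionary step
build_base_cuboid_minutes = 5           # BuildBaseCuboid with the shrunken dictionary

with_shrunken_minutes = extra_dictionary_minutes + build_base_cuboid_minutes
min_speedup = baseline_minutes_lower_bound / with_shrunken_minutes

print(f"with shrunken dictionary: {with_shrunken_minutes} minutes")
print(f"speed-up of at least {min_speedup:.0f}x")
```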

https://user-images.githubusercontent.com/14030549/54363305-ad25e200-46a5-11e9-8bc7-fe2c385c0278.png

If you want to know more, please check https://issues.apache.org/jira/browse/KYLIN-3491. If you have any suggestions, please let me know.

----------------
Best wishes,
Xiaoxiang Yu


Re: [Discussion] Enable shrunken dictionary by default

Posted by JiaTao Tao <ta...@gmail.com>.
+1, this seems to be a big improvement.


-- 


Regards!

Aron Tao


Re: [Discussion] Enable shrunken dictionary by default

Posted by Chao Long <wa...@qq.com>.
+1
------------------
Best Regards,
Chao Long



Re: [Discussion] Enable shrunken dictionary by default

Posted by Na Zhai <na...@kyligence.io>.
+1







Re: [Discussion] Enable shrunken dictionary by default

Posted by Billy Liu <bi...@apache.org>.
22 hours to 5 minutes, incredible progress.
+1

With Warm regards

Billy Liu


Re: [Discussion] Enable shrunken dictionary by default

Posted by ShaoFeng Shi <sh...@apache.org>.
+1.

Thanks to Xiaoxiang for raising this; Kylin has some advanced but hidden
features. As these functions become stable, we should enable them by default
to benefit all users.

Please also raise similar discussions if you wish to enable other good
features.

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: shaofengshi@apache.org

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org





Re: [Discussion] Enable shrunken dictionary by default

Posted by "Zhong, Yanghong" <ya...@ebay.com.INVALID>.
+1.

Best regards,
Yanghong Zhong
