Posted to user@kylin.apache.org by hit-lacus <hi...@126.com> on 2019/06/23 05:16:14 UTC

Reply: Re: Problem with Cube

Hi,
   This looks like data skew, which often happens in big data scenarios. As far as I know, you should find the high-cardinality column and use it as a "Shard By" column (under "Advanced Setting" in the cube design stage). See "Redistribute intermediate table" in http://kylin.apache.org/docs20/howto/howto_optimize_build.html for more information.
   If anything here is wrong or I have misunderstood something, please let me know. Thank you.
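
To make "data skew" concrete, here is a minimal Python sketch (not Kylin code) that computes a simple skew factor: the record count of the busiest reducer divided by the mean across all reducers. The helper name `skew_factor` and the counts below are made up for illustration; they are not measurements from this job.

```python
def skew_factor(records_per_reducer):
    """Return max load divided by mean load; 1.0 means perfectly even."""
    if not records_per_reducer:
        raise ValueError("need at least one reducer")
    avg = sum(records_per_reducer) / len(records_per_reducer)
    if avg == 0:
        return 1.0
    return max(records_per_reducer) / avg

# Hypothetical counts for 12 reducers (illustrative only).
balanced = [100] * 12            # every reducer gets the same share
skewed = [10] * 11 + [1090]      # one hot reducer receives most records

print(skew_factor(balanced))
print(skew_factor(skewed))
```

A factor close to 1 means the work is spread evenly; a large factor means one reducer dominates the runtime of the step, which matches the "one reducer stuck, the rest complete" symptom.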






-----------------
Best wishes to you!
From: Xiaoxiang Yu

At 2019-06-22 02:33:56, "Cinto Sunny" <ci...@gmail.com> wrote:

Thanks. We actually have 12 reducers. The problem is that one reducer is stuck with a huge amount of data, while the rest complete. We have 1.8 billion dsids and are not sure whether that is the problem. If it is, how do we distribute the data?


- Cinto




On Fri, Jun 21, 2019 at 12:03 AM Chao Long <ch...@gmail.com> wrote:

Hi Cinto Sunny,
   You can try setting "kylin.engine.mr.uhc-reducer-count" to a larger value; the default is 1.


On Fri, Jun 21, 2019 at 2:44 PM Cinto Sunny <ci...@gmail.com> wrote:

Hi All,


I am building a cube with 10 dimensions and two measures. The total input size is 100 GB. 
I am trying to build using Roaring BitMap. One of the fact columns is user, and it has ~1.8B user ids.


The build is getting stuck at the "Extract Fact Table Distinct Columns" stage. One executor is stuck and is processing over 800M lines.


I am using Kylin version 2.6.


Any pointers would be appreciated. Let me know if any further information is required.


- Cinto

Re: Problem with Cube

Posted by Cinto <ci...@gmail.com>.
Thanks. We checked and confirmed that the user ids are all going to one reducer, which is causing the skew. If we increase the reducer count, does that mean the dsid count step will also use multiple reducers?

Thanks for the help. We will try this anyway.


> On Jun 25, 2019, at 2:32 AM, ShaoFeng Shi <sh...@apache.org> wrote:
> 
> Hi Cinto,
> 
> By default, Kylin uses one reducer per column to remove duplicate values (for building dimension dictionaries). This is fine in the normal case.
> 
> In your case, the user id (dsid) column is an ultra-high-cardinality (UHC) column, so a single reducer is not enough and Kylin needs to start more reducers for it. Since you have already observed that this reducer is very slow, you can adjust the configuration to increase the parallelism, e.g.:
> 
> kylin.engine.mr.uhc-reducer-count=10
> 
> For this to take effect, you need to restart Kylin, discard the current job, and re-submit the build job.
> 
> Best regards,
> 
> Shaofeng Shi 史少锋
> Apache Kylin PMC
> Email: shaofengshi@apache.org
> 
> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
> Join Kylin user mail group: user-subscribe@kylin.apache.org
> Join Kylin dev mail group: dev-subscribe@kylin.apache.org

Re: Re: Problem with Cube

Posted by ShaoFeng Shi <sh...@apache.org>.
Hi Cinto,

By default, Kylin uses one reducer per column to remove duplicate
values (for building dimension dictionaries). This is fine in the normal
case.

In your case, the user id (dsid) column is an ultra-high-cardinality (UHC)
column, so a single reducer is not enough and Kylin needs to start more
reducers for it. Since you have already observed that this reducer is very
slow, you can adjust the configuration to increase the parallelism, e.g.:

kylin.engine.mr.uhc-reducer-count=10

For this to take effect, you need to restart Kylin, discard the current job,
and re-submit the build job.
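
As a rough illustration of why more reducers help, the following Python sketch deduplicates one column's values while spreading them over several buckets by hash. It is only a toy model, not Kylin's actual MapReduce partitioner: the CRC32 hash, the helper names `assign_reducer` and `distinct_per_reducer`, and the generated user ids are all assumptions made for the example.

```python
import zlib
from collections import defaultdict

def assign_reducer(value, num_reducers):
    """Pick a reducer for a value using a stable hash (CRC32, an assumption)."""
    return zlib.crc32(value.encode("utf-8")) % num_reducers

def distinct_per_reducer(values, num_reducers):
    """Deduplicate values; each reducer only holds its own hash bucket."""
    buckets = defaultdict(set)
    for v in values:
        buckets[assign_reducer(v, num_reducers)].add(v)
    # Number of distinct values each reducer ends up holding.
    return [len(buckets[r]) for r in range(num_reducers)]

# Hypothetical user ids, each appearing twice in the input.
user_ids = [f"user-{i}" for i in range(100_000)] * 2

one = distinct_per_reducer(user_ids, 1)
ten = distinct_per_reducer(user_ids, 10)

print(one)                 # a single reducer holds every distinct id
print(max(ten), sum(ten))  # largest bucket vs. total distinct ids
```

Because equal values always hash to the same bucket, each reducer can remove duplicates on its own, and the total number of distinct values is the sum over all buckets. With one reducer that whole set lands in a single place, which is the situation described above for the user id column.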

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: shaofengshi@apache.org

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org



