Posted to user@kylin.apache.org by Sonny Heer <so...@gmail.com> on 2017/12/20 05:00:36 UTC

Extract Fact Table Distinct Columns Step

Can someone explain what step 3 does?

Specifically, how does it relate to dimensions, measures, and row keys?  Our input
fact table is about 234 million records and this step is taking forever.

We have 450 GB of memory with 25 slots per node, which is about 225
concurrently running slots, and it's still taking a while.

The doc just talks about looking at optimize cube, but that page talks
about hierarchy columns and derived columns.  We don't have any lookup
tables, so there are no derived columns, and there is no natural hierarchy.

Just trying to find out what controls whether this step takes a longer or
shorter time.

Thanks

Re: Extract Fact Table Distinct Columns Step

Posted by ShaoFeng Shi <sh...@apache.org>.

Hi Sonny,

If the mappers are similarly slow, it likely indicates there are too many
cuboids (dimension combinations) for the cube. Could you please let me know
the number of dimensions and how you distribute them among the aggregation
groups? Try to optimize the design with mandatory/joint/hierarchy dimensions
as much as possible, according to your query pattern and data characteristics.
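
To illustrate why the aggregation group design matters, here is a rough
sketch in Python (not Kylin code) of how the number of dimension
combinations in a single aggregation group shrinks under the three rules.
It assumes a simplified model: an unconstrained dimension doubles the count,
a mandatory dimension is always present and adds nothing, a hierarchy of k
dimensions allows k + 1 prefixes, and a joint group is either all in or all
out. The function name and the 20-dimension cube are hypothetical and are
not taken from this thread.

```python
def cuboid_count(total_dims, mandatory=0, hierarchies=(), joints=()):
    """Estimate dimension combinations for ONE aggregation group.

    Simplified model (an assumption, not Kylin's exact planner):
      - each unconstrained dimension is either present or absent (x2)
      - mandatory dimensions are always present (x1)
      - a hierarchy of k dimensions allows k + 1 prefixes, including none
      - a joint group is either fully present or fully absent (x2)
    """
    constrained = mandatory + sum(hierarchies) + sum(joints)
    normal = total_dims - constrained
    if normal < 0:
        raise ValueError("rules cover more dimensions than the group has")

    count = 2 ** normal
    for k in hierarchies:
        count *= k + 1
    for _ in joints:
        count *= 2
    return count


# Hypothetical 20-dimension cube with no rules at all.
unoptimized = cuboid_count(20)

# Same 20 dimensions: 2 mandatory, one 3-level hierarchy, one joint of 4.
optimized = cuboid_count(20, mandatory=2, hierarchies=[3], joints=[4])

print(unoptimized, optimized)
```

In this model every rule removes a multiplicative factor, which is why
even a few mandatory or joint dimensions can cut the combination count by
orders of magnitude.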



-- 
Best regards,

Shaofeng Shi 史少锋

Re: Extract Fact Table Distinct Columns Step

Posted by Sonny Heer <so...@gmail.com>.

Hi ShaoFeng, thanks for the quick response.  Kylin version is 1.6.

The step is #3, and it spends the longest time in the map phase.
Sort/shuffle and reduce seem to be OK.  Yes, we went through that document.
The mapper input size is set to about 1.1 million records, giving us 225
mappers for an input of 234 million records.  All mappers run at the same
time, since that is the number of mapper slots we have.  The mappers all seem
to take the same amount of time (we didn't notice any long runners at the end).

The M/R stats output for that step is below.  What is troubling is the 4.6
billion output records from the map phase.  Is there a general place we can
look for the "Extract Fact Table Distinct Columns" step?  Thanks


Map-Reduce Framework
		Map input records=234707850
		Map output records=4687531086
		Map output bytes=49568802916
		Map output materialized bytes=9852827353
		Input split bytes=965025
		Combine input records=4687531086
		Combine output records=33878243
		Reduce input groups=281301
		Reduce shuffle bytes=9852827353
		Reduce input records=33878243
		Reduce output records=0
		Spilled Records=67756486
		Shuffled Maps =5850
		Failed Shuffles=0
		Merged Map outputs=5850
		GC time elapsed (ms)=49602314
		CPU time spent (ms)=759218400
		Physical memory (bytes) snapshot=418766036992
		Virtual memory (bytes) snapshot=898566012928
		Total committed heap usage (bytes)=391907901440
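
For anyone reading these counters, a few derived ratios make them easier
to interpret. The sketch below is plain Python arithmetic on the numbers
above; the mapper count of 225 comes from the message text, and the
dictionary keys and function name are made up for illustration. It only
shows the ratios and does not explain why the mapper emits that many
records.

```python
# Counters copied from the MapReduce output above.
counters = {
    "map_input_records": 234_707_850,
    "map_output_records": 4_687_531_086,
    "map_output_bytes": 49_568_802_916,
    "combine_output_records": 33_878_243,
    "spilled_records": 67_756_486,
    "shuffled_maps": 5_850,
    "gc_time_ms": 49_602_314,
    "cpu_time_ms": 759_218_400,
}
MAPPERS = 225  # stated in the message, not part of the counters


def derived_ratios(c, mappers):
    """Return simple ratios computed from the raw job counters."""
    return {
        # Records emitted by the map function per input row.
        "map_outputs_per_input_row":
            c["map_output_records"] / c["map_input_records"],
        # How strongly the combiner collapses the map output.
        "combiner_reduction_factor":
            c["map_output_records"] / c["combine_output_records"],
        # Average size of one map output record.
        "bytes_per_map_output_record":
            c["map_output_bytes"] / c["map_output_records"],
        # Input rows handled by each mapper.
        "input_rows_per_mapper": c["map_input_records"] / mappers,
        # Shuffled Maps = mappers x reducers, so this gives the reducers.
        "reducers": c["shuffled_maps"] // mappers,
        # GC time relative to total CPU time.
        "gc_to_cpu_ratio": c["gc_time_ms"] / c["cpu_time_ms"],
    }


for name, value in derived_ratios(counters, MAPPERS).items():
    print(f"{name}: {value:,.2f}")
```

The map phase emits close to 20 records per input row, and the combiner
then shrinks that output by a factor of well over 100, so most of the
work in this step is producing and combining records on the map side.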



Re: Extract Fact Table Distinct Columns Step

Posted by ShaoFeng Shi <sh...@apache.org>.

Hi Sonny,

Did you check this document, which describes each step?
https://kylin.apache.org/docs21/howto/howto_optimize_build.html

Besides, what's your Kylin version? Did you check the MR job progress to
see which stage is the most expensive, map or reduce, and how many mappers
and reducers there are? Do all mappers/reducers take a similar time, or did
some take much longer than others?

Furthermore, for a deep dive, please provide the cube definition. We need to
know the number of dimensions, the aggregation groups, and the encoding
methods, as well as other possible factors.




-- 
Best regards,

Shaofeng Shi 史少锋