Posted to user@kylin.apache.org by Chao Long <wa...@qq.com> on 2018/12/21 02:24:13 UTC

Re: Kylin w/ Spark - Build 626min - Steps 1/2/3 455min - Steps 4-8 - 171min

Hi,
  If the data has an even distribution, you can set "kylin.source.hive.redistribute-flat-table=false" to skip Step 2. As for Step 3, if you have many UHC (ultra-high-cardinality) dimensions, you can set "kylin.engine.mr.uhc-reducer-count" to a larger value so more reducers are used to build the dictionaries.
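A sketch of how the two settings above would look in kylin.properties; the values shown are illustrative placeholders, not recommendations:

```properties
# Skip Step 2 (Redistribute Flat Hive Table) when the source data is
# already evenly distributed across files/partitions.
kylin.source.hive.redistribute-flat-table=false

# Step 3 (Extract Fact Table Distinct Columns): number of reducers used
# per ultra-high-cardinality column when building dictionaries.
# The right value depends on the cluster; 5 here is only an example.
kylin.engine.mr.uhc-reducer-count=5
```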


------------------
Best Regards,
Chao Long


------------------ Original Message ------------------
From: "Jon Shoberg"<jo...@gmail.com>;
Sent: Thursday, December 20, 2018, 10:20 PM
To: "user"<us...@kylin.apache.org>;

Subject: Kylin w/ Spark - Build 626min - Steps 1/2/3 455min - Steps 4-8 - 171min



Question ...

  Is there a way to optimize the first three steps of a Kylin build?


  Total build time of a development cube is 626 minutes; the breakdown by step:

87  min - Create Intermediate Flat Hive Table
207 min - Redistribute Flat Hive Table
248 min - Extract Fact Table Distinct Columns
0   min
0   min
62  min - Build Cube with Spark
19  min - Convert Cuboid Data to HFile
0   min
0   min
0   min
0   min

   The data set is summary files (~35M records) and detail files (~4B records - 40GB compressed).


   There is a join needed for the final data, which is handled in a view within Hive, so I do expect a performance cost there.


   However, when staging the data other ways (loading into SequenceFile/ORC files vs. an external table over bz2 files), there is no net gain.


   This means pre-processing the data externally can make Kylin run a little faster, but the overall time from absolute start to finish is still ~600 min.


   Steps 1/2 seem redundant given how my data is structured; the HQL/SQL commands Kylin sends to Hive could be run before the build process.
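For context on what Step 1 issues to Hive: it is essentially a create-table-as-select over the model's flat view. A rough sketch, with all table/column names invented for illustration (the real intermediate table name is generated per build, and the storage format may differ by configuration):

```sql
-- Illustrative shape of the Step 1 statement Kylin sends to Hive;
-- names and columns here are hypothetical, not from an actual build log.
CREATE TABLE kylin_intermediate_my_cube
STORED AS SEQUENCEFILE AS
SELECT f.dt, f.customer_id, f.amount   -- the cube's dimensions and measures
FROM detail_view f                     -- the Hive view containing the join
WHERE f.dt >= '2018-12-01' AND f.dt < '2018-12-21';  -- segment date range
```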


   Is it possible to optimize steps 1/2/3? Is it possible to skip steps 1/2 and jump to step 3 if the data was staged as-needed/correctly beforehand?


   My guess is there are mostly 'no' answers here (which is fine), but I thought I'd ask.


   (The test lab is getting doubled in size today, so I'm not ultimately worried, but I'm seeking other improvements vs. only adding hardware and networking.)


Thanks! J

Re: Re: Kylin w/ Spark - Build 626min - Steps 1/2/3 455min - Steps 4-8 - 171min

Posted by Chao Long <wa...@qq.com>.
An even distribution means the data is not skewed. When data skew happens, some tasks' execution times can be much longer than the average, and the RedistributeFlatHiveTableStep exists to avoid such skew as far as possible. For more details, see
https://issues.apache.org/jira/browse/KYLIN-1656
https://issues.apache.org/jira/browse/KYLIN-1677
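The redistribute step described in those tickets boils down to rewriting the flat table with a random shuffle key, so output files end up roughly equal in size regardless of skew in the source. A sketch of the idea (table name illustrative; Kylin may instead distribute by a configured shard column):

```sql
-- Rows are sent to reducers by a random key, so each reducer
-- writes a roughly equal share of the flat table.
INSERT OVERWRITE TABLE kylin_intermediate_flat_table
SELECT * FROM kylin_intermediate_flat_table
DISTRIBUTE BY RAND();
```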


The parameter "kylin.engine.mr.uhc-reducer-count" works for both MapReduce and Spark; in Spark, a larger value means more tasks are allocated. As for what value it should be, check the task execution state of the "Extract Fact Table Distinct Columns" job in the Spark UI, identify the most time-consuming task, and choose a suitable value from there. I can't say what the exact right value is.
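To make "even distribution" concrete: data is skewed when one key value dominates, so the reducer receiving that key's rows runs far longer than the rest. This small Python sketch is not Kylin code; the hash-mod partitioning merely mimics how rows are shuffled to reducers:

```python
import random
from collections import Counter

def partition_sizes(keys, num_reducers):
    """Each row goes to reducer hash(key) % num_reducers;
    return the number of rows each reducer receives."""
    counts = Counter(hash(k) % num_reducers for k in keys)
    return [counts.get(r, 0) for r in range(num_reducers)]

random.seed(42)

# Skewed: 80% of rows share one hot key value (e.g. a dominant id).
skewed_keys = [0] * 8000 + [random.randrange(1, 1_000_000) for _ in range(2000)]
# Even: keys drawn uniformly at random.
even_keys = [random.randrange(1_000_000) for _ in range(10_000)]

for name, keys in (("skewed", skewed_keys), ("even", even_keys)):
    sizes = partition_sizes(keys, num_reducers=10)
    avg = sum(sizes) / len(sizes)
    print(f"{name:>6}: busiest reducer gets {max(sizes)} rows "
          f"({max(sizes) / avg:.1f}x the average)")
```

With the skewed keys, one reducer receives at least 8x the average load, so the whole step waits on that single straggler; with even keys, loads stay close to the average, which is the situation where skipping the redistribute step is safe.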



------------------
Best Regards,
Chao Long


------------------ Original Message ------------------
From: "Jon Shoberg"<jo...@gmail.com>;
Sent: Friday, December 21, 2018, 10:34 AM
To: "user"<us...@kylin.apache.org>;

Subject: Re: Re: Kylin w/ Spark - Build 626min - Steps 1/2/3 455min - Steps 4-8 - 171min



That’s great to know about step 2!

How would you define or determine an even distribution? This is a four-node HDFS cluster, and the bz2 files used as the data source (external table) have an HDFS replication factor of 2. I'd imagine the distribution would not be horrible on a small cluster.


On the reducer count: this is a Spark setup, and on YARN I see this step running as a Spark job. Does a MapReduce setting such as this apply? If so, what counts as a larger value? I think the default here is 1 ... should it be 2, 5, 10, or 100? It's a 4-node cluster with 10 CPUs and ~550 GB RAM.

Sent from my iPhoneX

