Posted to dev@kylin.apache.org by Ilamparithi M <ma...@gmail.com> on 2016/02/01 06:52:46 UTC

N Cuboids preparation MapReduce - Trying to avoid multiple stage read

Hi Kylin Team,

I am a new user of Apache Kylin and have started exploring it for our MOLAP
requirements.
Thanks very much to the Kylin community for such an offering.

I have been looking at the way Kylin prepares the offline data. The build
steps are:

Create Intermediate Flat Hive Table
Extract Fact Table Distinct Columns
Build Dimension Dictionary
Build Base Cuboid Data
Build N-Dimension Cuboid Data
Build N-Dimension Cuboid Data : N-1 Dimension
Build N-Dimension Cuboid Data : N-2 Dimension
Build N-Dimension Cuboid Data : N-3 Dimension
...
Build N-Dimension Cuboid Data : 0-Dimension
Prepare HFile and BulkLoad into HBase.

The N-dimension cuboids are prepared by a sequence of MR jobs: each job
persists its intermediate results to HDFS, and the subsequent MR job reads
them back.
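
To make the read pattern concrete, here is a minimal in-memory Python sketch
of the layer-by-layer build described above. It is not Kylin's code: the name
build_by_layer, the use of tuples of dimension indexes as cuboid ids, a single
summed measure, and the arbitrary choice of parent cuboid are all assumptions
made for illustration. Each loop iteration stands in for one MR job that reads
the previous layer's persisted output.

```python
from collections import defaultdict
from itertools import combinations


def build_by_layer(flat_rows, n_dims):
    """Return a list of layers; each layer maps a cuboid id to its aggregates.

    flat_rows: iterable of (dims, measure), dims being a tuple of n_dims values.
    A cuboid id is a sorted tuple of the dimension indexes it keeps.
    """
    # "Build Base Cuboid Data": the only step that reads the flat table.
    base_id = tuple(range(n_dims))
    base = defaultdict(int)
    for dims, measure in flat_rows:
        base[tuple(dims)] += measure
    layers = [{base_id: dict(base)}]

    # One iteration per later job: N-1 dimensions, N-2, ..., 0 dimensions.
    for size in range(n_dims - 1, -1, -1):
        prev = layers[-1]  # stands in for the previous job's output on HDFS
        layer = {}
        for child in combinations(range(n_dims), size):
            # Assumption: any parent that contains the child's dimensions will do.
            parent = next(p for p in prev if set(child) <= set(p))
            pos = [parent.index(d) for d in child]
            agg = defaultdict(int)
            for key, measure in prev[parent].items():
                agg[tuple(key[i] for i in pos)] += measure
            layer[child] = dict(agg)
        layers.append(layer)
    return layers


rows = [(("a", "x"), 1), (("a", "y"), 2), (("b", "x"), 3)]
layers = build_by_layer(rows, 2)
```

In this model every layer after the base is computed from already aggregated
rows, which is why the intermediate results have to be written out and read
back between jobs.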

I am thinking of the following approach:
Would it be better to combine all the MR jobs into one single MR job, in
which we:
Read the Hive table data only once, as we do now.
In the Mapper, emit keys for every N-dimension cuboid, each with an
  identifier field (let's say C_Id) indicating which cuboid the key belongs
  to.
In the Reducer, aggregate by key for each cuboid, using the C_Id field to
  tell the cuboids' keys apart.
Prepare the HFiles and load all the aggregated data.
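
The steps above can be sketched in plain Python as follows. This is only an
in-memory model of the proposal, not Kylin or Hadoop code: map_row, reduce_all
and build_single_pass are hypothetical names, the C_Id is modelled as a tuple
of dimension indexes, the measure is a single sum, and the shuffle is modelled
as a dictionary.

```python
from collections import defaultdict
from itertools import combinations


def all_cuboids(n_dims):
    """Every subset of dimension indexes; each tuple doubles as a C_Id."""
    return [c for size in range(n_dims, -1, -1)
            for c in combinations(range(n_dims), size)]


def map_row(dims, measure, cuboids):
    """Mapper: emit one ((C_Id, projected key), measure) pair per cuboid."""
    for c_id in cuboids:
        yield (c_id, tuple(dims[i] for i in c_id)), measure


def reduce_all(pairs):
    """Reducer: sum the measure per (C_Id, key) group."""
    out = defaultdict(int)
    for key, measure in pairs:
        out[key] += measure
    return dict(out)


def build_single_pass(flat_rows, n_dims):
    """Read the flat table once and aggregate every cuboid in one pass."""
    cuboids = all_cuboids(n_dims)
    pairs = (kv for dims, m in flat_rows for kv in map_row(dims, m, cuboids))
    return reduce_all(pairs)


rows = [(("a", "x"), 1), (("a", "y"), 2), (("b", "x"), 3)]
cube = build_single_pass(rows, 2)
emitted = sum(1 for dims, m in rows for _ in map_row(dims, m, all_cuboids(2)))
```

In this sketch each flat-table row produces one mapper record per cuboid (2^N
for a full cube of N dimensions) before any map-side combining, so the number
of emitted records is worth watching alongside the read IO.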

Would this avoid reading the N-dimension cuboid result back in the
subsequent MapReduce job that prepares the N-1 dimension cuboids, and so on?
The total amount of sort/shuffle and the number of reduce groups would be the
same, but we would save the read IO that currently feeds each mapper.

When we are operating with a larger number of dimensions, this may help us
avoid the multi-stage read IO for the mappers.

Kindly take a look, and if I am missing anything, please point it out.

Thanks & Regards,
Ilamparithi M.

--
View this message in context: http://apache-kylin.74782.x6.nabble.com/N-Cuboids-preparation-MapReduce-Trying-to-avoid-multiple-stage-read-tp3528.html
Sent from the Apache Kylin mailing list archive at Nabble.com.

Re: N Cuboids preparation MapReduce - Trying to avoid multiple stage read

Posted by ShaoFeng Shi <sh...@apache.org>.
Is your idea similar to the algorithm we call "fast cubing"?
https://kylin.apache.org/blog/2015/08/15/fast-cubing/



2016-02-01 15:11 GMT+08:00 Ilamparithi M <ma...@gmail.com>:

> One more item I haven't mentioned is the
> number of key groups sent to each subsequent MapReduce job.
> The current design of Kylin is very efficient in that it carries a
> minimal number of key groups to the next stage.
>
> But in the approach I was thinking about (emitting cuboid-based keys
> from a single-stage mapper with the C_Id approach), the combiner becomes
> key, as it groups on the mapper side and brings down the otherwise too
> large number of values to be transferred to the reducer side.
>
> -Ilamparithi M.
>
> --
> View this message in context:
> http://apache-kylin.74782.x6.nabble.com/N-Cuboids-preparation-MapReduce-Trying-to-avoid-multiple-stage-read-tp3528p3531.html
> Sent from the Apache Kylin mailing list archive at Nabble.com.
>



-- 
Best regards,

Shaofeng Shi

Re: N Cuboids preparation MapReduce - Trying to avoid multiple stage read

Posted by Ilamparithi M <ma...@gmail.com>.
One more item I haven't mentioned is the
number of key groups sent to each subsequent MapReduce job.
The current design of Kylin is very efficient in that it carries a
minimal number of key groups to the next stage.

But in the approach I was thinking about (emitting cuboid-based keys
from a single-stage mapper with the C_Id approach), the combiner becomes
key, as it groups on the mapper side and brings down the otherwise too
large number of values to be transferred to the reducer side.
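
As a rough illustration of why the combiner matters here, the following Python
sketch counts how many records would be shuffled with and without map-side
pre-aggregation. It is a toy model under stated assumptions, not Hadoop's
Combiner API: emit, combine and shuffle_sizes are hypothetical names, each
inner list stands for one mapper's input split, and the measure is a single
sum.

```python
from collections import defaultdict
from itertools import combinations


def emit(rows, n_dims):
    """Mapper output for one split: one ((C_Id, key), measure) per row per cuboid."""
    cuboids = [c for k in range(n_dims + 1)
               for c in combinations(range(n_dims), k)]
    return [((c, tuple(dims[i] for i in c)), m)
            for dims, m in rows for c in cuboids]


def combine(pairs):
    """Combiner: pre-aggregate one mapper's output before it is shuffled."""
    acc = defaultdict(int)
    for key, m in pairs:
        acc[key] += m
    return list(acc.items())


def shuffle_sizes(splits, n_dims):
    """Return (records shuffled without a combiner, records shuffled with one)."""
    raw = sum(len(emit(s, n_dims)) for s in splits)
    combined = sum(len(combine(emit(s, n_dims))) for s in splits)
    return raw, combined


splits = [
    [(("a", "x"), 1), (("a", "y"), 2)],  # input split of mapper 1
    [(("b", "x"), 3), (("b", "x"), 4)],  # input split of mapper 2
]
raw, combined = shuffle_sizes(splits, 2)
```

How much the combiner saves depends on how many rows within a split share the
same projected key; low-dimension cuboids collapse the most.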

-Ilamparithi M.

--
View this message in context: http://apache-kylin.74782.x6.nabble.com/N-Cuboids-preparation-MapReduce-Trying-to-avoid-multiple-stage-read-tp3528p3531.html
Sent from the Apache Kylin mailing list archive at Nabble.com.