Posted to dev@kylin.apache.org by Li Yang <li...@apache.org> on 2015/07/01 12:45:19 UTC

about new cube algorithm in 0.8

We've got good test results for the new cubing algorithm: ~50% reduced
build time according to initial tests. The implementation (KYLIN-607) is on
branch 0.8.
- Test 1: 128 MB input, cube size 3.3 GB, expansion rate 26; build time
reduced from 112 min to 49 min.
- Test 2: 50 GB input, cube size 1.6 GB, expansion rate 0.03; build time
reduced from 75 min to 45 min.
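(For reference, the expansion rate is simply cube size divided by input
size: 3.3 GB / 128 MB ≈ 26 for Test 1, and 1.6 GB / 50 GB ≈ 0.03 for
Test 2.)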

The new cube build has the following steps.

#1 Create Intermediate Flat Hive Table
   - Same as before
#2 Extract Fact Table Distinct Columns
   - One round of MR that scans the full input
   - Extracts the distinct values of each column that requires a dictionary
(same as before)
   - Estimates the size of each cuboid using HyperLogLog (see the first
sketch after this step list)
#3 Save Cuboid Statistics
   - Saves the stats collected in #2 to the metadata store
#4 Build Dimension Dictionary
   - Same as before
#5 Create HTable
   - Creates the HTable according to the stats collected in #2 (see the
second sketch after this step list)
#6 Build Cube
   - One round of MR that calculates the whole cube from the input.
   - Each mapper takes a split and calculates all cuboids in memory; the
output is like a cube segment built from that split (see the third sketch
after this step list).
   - Each reducer corresponds to an HTable region, gets input from the
mappers, and does a final round of aggregation where mapper key spaces
overlap.
#7 Load HFile to HBase Table
#8 Update Cube Info
#9 Garbage Collection
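To make #2 a bit more concrete, here is a minimal, self-contained
HyperLogLog sketch in Java. This is NOT Kylin's actual counter class, just
the idea: keep one such estimator per cuboid, feed it the encoded dimension
values of each input row, and the estimate gives the approximate row count
used for sizing in #3 and #5.

    import java.nio.charset.StandardCharsets;

    // Illustrative HyperLogLog estimator, one instance per cuboid.
    public class CuboidCardinalityEstimator {

        private static final int P = 14;        // 2^14 = 16384 registers
        private static final int M = 1 << P;
        private final byte[] registers = new byte[M];

        // Offer one row key: the concatenated dimension values of the cuboid.
        public void offer(String cuboidRowKey) {
            long h = hash64(cuboidRowKey.getBytes(StandardCharsets.UTF_8));
            int idx = (int) (h >>> (64 - P));   // top P bits pick a register
            long rest = h << P;                 // remaining bits
            // Rank = position of the first 1-bit, capped via a sentinel bit.
            byte rank = (byte) (Long.numberOfLeadingZeros(rest | (1L << (P - 1))) + 1);
            if (rank > registers[idx]) registers[idx] = rank;
        }

        // Raw HLL estimate: alpha * m^2 / sum(2^-register).
        // Small/large-range bias corrections are omitted for brevity.
        public double estimate() {
            double sum = 0;
            for (byte r : registers) sum += Math.pow(2, -r);
            double alpha = 0.7213 / (1 + 1.079 / M);
            return alpha * M * M / sum;
        }

        // Simple 64-bit FNV-1a hash; any well-mixed hash works.
        private static long hash64(byte[] data) {
            long h = 0xcbf29ce484222325L;
            for (byte b : data) {
                h ^= (b & 0xff);
                h *= 0x100000001b3L;
            }
            return h;
        }
    }

A real implementation would also merge estimators across mappers by taking
the max of each register, which is what lets #2 run in a single MR round.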
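For #5, a rough illustration of how the estimated sizes can drive HTable
pre-splitting, so that each region (and thus each reducer in #6) owns a
roughly equal share of the cube. The table name, column family and split
keys are placeholders invented for this example; only the
HBaseAdmin.createTable(desc, splitKeys) call is the real HBase API:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    // Illustrative only: pre-split the cube table using the sizes from #2,
    // e.g. one split key per estimated N GB of cuboid data.
    public class PreSplitHTableSketch {
        public static void createCubeTable(byte[][] splitKeys) throws Exception {
            HTableDescriptor desc =
                new HTableDescriptor(TableName.valueOf("KYLIN_CUBE_SEG")); // placeholder name
            desc.addFamily(new HColumnDescriptor("F1"));                   // placeholder family
            try (HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create())) {
                admin.createTable(desc, splitKeys); // one region per estimated slice
            }
        }
    }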
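Finally, the heart of #6. Below is a deliberately naive sketch of in-mapper
cubing, not the KYLIN-607 code: it enumerates every cuboid as a bitmask
over the dimensions and aggregates a single SUM measure in a HashMap. The
real builder walks the cuboid tree with dictionary-encoded keys, memory
budgeting and spilling.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Naive in-memory cubing over one mapper split.
    public class InMemCubingSketch {

        // rows: each row holds numDims dimension values plus one numeric
        // measure in the last field. Returns (cuboidId|dimKey) -> SUM.
        public static Map<String, Long> buildSegment(List<String[]> rows, int numDims) {
            Map<String, Long> cube = new HashMap<>();
            int numCuboids = 1 << numDims;       // every subset of dimensions
            for (String[] row : rows) {
                long measure = Long.parseLong(row[numDims]);
                for (int cuboid = 0; cuboid < numCuboids; cuboid++) {
                    StringBuilder key = new StringBuilder(cuboid + "|");
                    for (int d = 0; d < numDims; d++) {
                        if ((cuboid & (1 << d)) != 0) key.append(row[d]).append('\u0001');
                    }
                    cube.merge(key.toString(), measure, Long::sum); // aggregate in memory
                }
            }
            // A real mapper emits each entry; the reducer that owns the matching
            // HTable region re-aggregates keys arriving from several mappers.
            return cube;
        }
    }

The mapper output is therefore already aggregated, which is exactly the
shuffle reduction discussed below.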

Some discussion of why the new build is faster:

   - The new algorithm reduces shuffling a lot, because aggregation happens
first in the mapper and only the aggregated result is shuffled to the
reducer. In the current algorithm, it's the records BEFORE aggregation that
get shuffled.
   - There are only two MR jobs, which saves some MR overhead, especially
if your cluster is busy.
   - The mapper becomes a CPU-intensive task because it does the in-memory
cubing.
   - Mapper splits are cut very small, say 10 MB each, because of cube
expansion: 10 MB of input may already yield 2.5 GB of output bytes without
compression (see the job setup sketch below).
   - The new build creates HFiles directly in #6, whereas in the current
build, cuboids are saved to HDFS first and an additional step is required
to convert them to HFiles.
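On the small-splits point: in plain Hadoop MR the maximum split size can be
capped per job. The job setup below is illustrative (only the ~10 MB figure
comes from the list above); FileInputFormat.setMaxInputSplitSize is the
standard Hadoop API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    // Illustrative only: cap split size so each mapper cubes a small slice
    // of the flat table, keeping its in-memory segment manageable.
    public class SmallSplitJobSetup {
        public static Job newCubingJob(Configuration conf) throws Exception {
            Job job = Job.getInstance(conf, "build-cube-in-mem"); // placeholder job name
            FileInputFormat.setMaxInputSplitSize(job, 10L * 1024 * 1024); // ~10 MB
            return job;
        }
    }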

We will continue to do more thorough testing of the new algorithm with
larger data sets.

Cheers
Yang

Re: about new cube algorithm in 0.8

Posted by Luke Han <lu...@gmail.com>.
Awesome, very nice job! So happy to have bought you guys ice cream today ;-)

For people who are using the 0.7.x version: we are working on back-porting
this to the 0.7 branch now, and will release it as 0.7.3 or 0.7.4 after
testing it against our internal real cases, with more confidence.

Since 0.8.x is still a very early version, we do not recommend using 0.8.x
for your production cases.

Thanks.

Best Regards!
---------------------

Luke Han
