Posted to user@kylin.apache.org by Jon Shoberg <jo...@gmail.com> on 2018/12/20 14:20:50 UTC

Kylin w/ Spark - Build 626min - Steps 1/2/3 455min - Steps 4-8 - 171min

Question ...

  Is there a way to optimize the first three steps of a Kylin build?

  Total build time of a development cube is 626 minutes; here is the
breakdown by step (a few tuning guesses follow the list):

   1.  87 min - Create Intermediate Flat Hive Table
   2. 207 min - Redistribute Flat Hive Table
   3. 248 min - Extract Fact Table Distinct Columns
   4.   0 min
   5.   0 min
   6.  62 min - Build Cube with Spark
   7.  19 min - Convert Cuboid Data to HFile
   8.   0 min
   9.   0 min
  10.   0 min
  11.   0 min
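
   Most of that time is in steps 2 and 3. From reading the docs, the
following kylin.properties knobs look relevant to step 3 (property names
are my reading of the Kylin 2.x docs, not verified on this cluster):

      # kylin.properties - guesses from the docs, not verified here
      # smaller splits -> more mappers for the distinct-columns job
      kylin.engine.mr.mapper-input-rows=500000
      # more reducers for ultra-high-cardinality dictionary columns
      kylin.engine.mr.uhc-reducer-count=5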

   The data set consists of summary files (~35M records) and detail files
(~4B records, 40GB compressed).

   There is a join needed for the final data, which is handled in a view
within Hive, so I do expect some performance cost there.
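
   Roughly, the view is just the obvious join (table/column names below
are placeholders, not my real schema):

      -- Hive view joining detail facts to summary attributes
      CREATE VIEW fact_view AS
      SELECT d.*, s.summary_attr
      FROM detail d
      JOIN summary s ON d.summary_id = s.id;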

   However, staging the data in other ways (loading into sequence/ORC
files vs. an external table over bz2 files) produces no net gain.

   That is, pre-processing the data externally can make Kylin itself run
a little faster, but the overall time from absolute start to finish is
still ~600 minutes.
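
   For example, one staging variant I tried was materializing the join
into ORC ahead of time (same placeholder names as above):

      -- pre-materialize the view so Kylin reads a plain ORC table
      CREATE TABLE detail_staged STORED AS ORC AS
      SELECT * FROM fact_view;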

   Steps 1/2 seem redundant given how my data is structured; the HQL/SQL
commands Kylin sends to Hive could be run before the build process.
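
   From the job logs, steps 1/2 boil down to something like the following
(paraphrased; Kylin's actual DDL uses generated intermediate-table names,
elided here):

      -- step 1: materialize the flat table from the view
      CREATE TABLE kylin_intermediate_... AS
      SELECT ... FROM fact_view;
      -- step 2: rewrite it so rows are spread evenly across files
      INSERT OVERWRITE TABLE kylin_intermediate_...
      SELECT * FROM kylin_intermediate_... DISTRIBUTE BY RAND();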

   Is it possible to optimize steps 1/2/3? Is it possible to skip steps
1/2 and jump straight to step 3 if the data is staged correctly
beforehand?
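
   For instance, I see kylin.source.hive.redistribute-flat-table
mentioned in the docs; would setting

      # kylin.properties - would this be the supported way to drop step 2?
      kylin.source.hive.redistribute-flat-table=false

make Kylin skip the redistribute step entirely, or does the engine rely
on it?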

   My guess is the answers here are mostly 'no' (which is fine), but I
thought I'd ask.

   (The test lab is getting doubled in size today, so I'm not ultimately
worried, but I'm looking for improvements beyond just adding hardware
and networking.)

Thanks! J