Posted to user@kylin.apache.org by Jon Shoberg <jo...@gmail.com> on 2018/12/10 20:34:08 UTC

Best Practice? Build dimensions in Kylin or Hive?

Question ... is it better to build dimensions in Kylin or Hive?

Source data arrives as bzip files, ~67 of them totaling 40GB compressed and
35B records.

Previously I've been working in Hive to separate the source data into a star
schema (a rough sketch follows the list):

   - Load bzip files to HDFS
   - Connect Hive to files as external table
   - Script the creation of five dimensions
   - Script the creation of a final fact table
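
As a minimal HiveQL sketch of that flow; table and column names (raw_events,
dim_product, fact_events) are made up for illustration, not the real schema:

    -- External table over the compressed text files already on HDFS
    -- (Hive reads .bz2 text files transparently)
    CREATE EXTERNAL TABLE raw_events (
      event_ts STRING,
      product  STRING,
      region   STRING,
      amount   DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/raw/';

    -- One of the five dimensions, built with a scripted CTAS
    CREATE TABLE dim_product AS
    SELECT ROW_NUMBER() OVER (ORDER BY product) AS product_id, product
    FROM (SELECT DISTINCT product FROM raw_events) t;

    -- Final fact table joined back to the surrogate keys
    CREATE TABLE fact_events AS
    SELECT r.event_ts, p.product_id, r.region, r.amount
    FROM raw_events r
    JOIN dim_product p ON r.product = p.product;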

Within Kylin I set up the table joins to relate the dimension values to the
fact table.

However, after scaling out to the full data and seeing how much intermediate
data steps 1 - 4 create, the work above seems redundant.

Would it be more efficient to let Kylin build the star schema, for example:

   - Load bzip files to HDFS
   - Connect Hive to the files as an external table
   - Move the data to a SequenceFile table with compression (Kylin seems to
   work best with sequencefiles; see the sketch after this list)
   - In the cube build, select the dimension and fact columns from the
   source data
   - Let Kylin's intermediate processing and further steps organize the
   source data to finish the cube build
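
A rough sketch of that sequencefile staging step, reusing the illustrative
raw_events table from above; the Snappy codec and block compression are
assumptions, not something from the original setup:

    -- Compress the SequenceFile output
    SET hive.exec.compress.output=true;
    SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
    SET mapreduce.output.fileoutputformat.compress.type=BLOCK;

    -- Staging table for Kylin to read instead of the raw bzip text
    CREATE TABLE events_seq (
      event_ts STRING,
      product  STRING,
      region   STRING,
      amount   DOUBLE
    )
    STORED AS SEQUENCEFILE;

    INSERT OVERWRITE TABLE events_seq
    SELECT event_ts, product, region, amount
    FROM raw_events;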

  Is it worthwhile to create the dimension tables and a fully normalized
fact table before going into cube design?

  Or is it 'better' to do everything in the Kylin cube design, given that my
source data ultimately has all the required values (no external joins)?

  Once the data set is processed it is going to be static, with no further
updates.  Analysis will likely be done via the Kylin ODBC driver with Tableau
and/or a custom app to be developed.

Thanks! J

Re: Best Practice? Build dimensions in Kylin or Hive?

Posted by ShaoFeng Shi <sh...@apache.org>.
Hi Jon,

Dimension tables are for reuse across different scenarios and for ease of
maintenance. If you don't have those requirements, you can just keep the
values in the fact table; Kylin supports a single fact table as well.
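
As a rough sketch (reusing the illustrative column names from the question),
a single denormalized table built once in Hive can then be synced into Kylin
and used as the model's only fact table, with its columns declared as
dimensions directly:

    -- One wide table, no separate dimension tables
    CREATE TABLE fact_wide STORED AS SEQUENCEFILE AS
    SELECT event_ts, product, region, amount
    FROM raw_events;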

Kylin's first 1 or 2 steps may seem redundant in some cases, but they are
there to simplify the subsequent processing. For example, the source table
may be a virtual view, or stored in a new file format that isn't supported
by MapReduce; by materializing it into a consistent file format, the
subsequent processing can be much simpler.

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Work email: shaofeng.shi@kyligence.io
Kyligence Inc: https://kyligence.io/

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org



