Posted to issues@kylin.apache.org by "tanyao (Jira)" <ji...@apache.org> on 2021/10/15 02:49:00 UTC
[jira] [Updated] (KYLIN-5099) parquet file size is too big in kylin4 with spark3 than kylin3 with mr
[ https://issues.apache.org/jira/browse/KYLIN-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
tanyao updated KYLIN-5099:
--------------------------
Description:
Hi,
I am trying to use Spark 3.1.1 as the build engine in Kylin 4.0. The Hive table has more than 2 million rows (200W+) stored in ORC format, with 10 dimensions defined, and its original size is about 50 MB.
When I build this cube with Kylin 4.0, the resulting Parquet files total more than 1 GB; that is, a single segment is over 1 GB. However, when I build from the same Hive table with the same cube model and dimensions in Kylin 3, the HBase segment is only a little over 100 MB.
Why does this happen? The build in Kylin 4.0 is also no faster than in Kylin 3.1, if anything slower: both take about 10 minutes. I cannot see the benefit of Kylin 4.0.
!image-2021-10-15-10-43-54-830.png!
!image-2021-10-15-10-44-49-178.png!
was:
Hi,
I am trying to use Spark 3.1.1 as the build engine in Kylin 4.0. The Hive table has more than 2 million rows (200W+) stored in ORC format, with 10 dimensions defined, and its original size is about 50 MB.
When I build this cube with Kylin 4.0, the resulting Parquet files total more than 1 GB; that is, a single segment is over 1 GB. However, when I build from the same Hive table with the same cube model and dimensions in Kylin 3, the HBase segment is only a little over 100 MB.
Why does this happen?
!image-2021-10-15-10-43-54-830.png!
!image-2021-10-15-10-44-49-178.png!
> parquet file size is too big in kylin4 with spark3 than kylin3 with mr
> ----------------------------------------------------------------------
>
> Key: KYLIN-5099
> URL: https://issues.apache.org/jira/browse/KYLIN-5099
> Project: Kylin
> Issue Type: Bug
> Affects Versions: v4.0.0
> Reporter: tanyao
> Priority: Major
> Attachments: image-2021-10-15-10-43-54-830.png, image-2021-10-15-10-44-49-178.png
>
>
> Hi,
> I am trying to use Spark 3.1.1 as the build engine in Kylin 4.0. The Hive table has more than 2 million rows (200W+) stored in ORC format, with 10 dimensions defined, and its original size is about 50 MB.
> When I build this cube with Kylin 4.0, the resulting Parquet files total more than 1 GB; that is, a single segment is over 1 GB. However, when I build from the same Hive table with the same cube model and dimensions in Kylin 3, the HBase segment is only a little over 100 MB.
> Why does this happen? The build in Kylin 4.0 is also no faster than in Kylin 3.1, if anything slower: both take about 10 minutes. I cannot see the benefit of Kylin 4.0.
> !image-2021-10-15-10-43-54-830.png!
>
> !image-2021-10-15-10-44-49-178.png!
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)