Posted to issues@kylin.apache.org by "Shaofeng SHI (JIRA)" <ji...@apache.org> on 2017/12/21 09:58:00 UTC
[jira] [Commented] (KYLIN-3123) Improve Spark Cubing

    [ https://issues.apache.org/jira/browse/KYLIN-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299790#comment-16299790 ] 

Shaofeng SHI commented on KYLIN-3123:
-------------------------------------

There seems to be only 1 partition in the RDD, so there is no parallelism. Change this config to a smaller value like 50, as your cube has no "count distinct" measure:
kylin.engine.spark.rdd-partition-cut-mb=50
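
To see why the cut size matters, here is a rough sketch of how the partition count could be derived from it. It assumes the count is the estimated data size divided by the cut size, clamped between a minimum and a maximum partition count; the function name, the 800 MB size estimate and the minimum of 1 are hypothetical illustrations, not values taken from this cube or from Kylin's code.

```python
def estimate_partitions(size_mb, cut_mb, min_partition=1, max_partition=5000):
    """Sketch only: assumed formula, not Kylin's actual implementation.

    size_mb       -- estimated size of the data in MB (hypothetical input)
    cut_mb        -- value of kylin.engine.spark.rdd-partition-cut-mb
    min_partition -- assumed lower bound on the partition count
    max_partition -- value of kylin.engine.spark.max-partition
    """
    partitions = int(size_mb / cut_mb)            # truncate toward zero
    partitions = max(min_partition, partitions)   # never below the minimum
    partitions = min(max_partition, partitions)   # never above the maximum
    return partitions


# Hypothetical 800 MB estimate: a cut of 1000 MB collapses to 1 partition,
# while a cut of 50 MB yields 16 partitions that can run in parallel.
print(estimate_partitions(800, 1000))  # 1
print(estimate_partitions(800, 50))    # 16
```

Under these assumptions, any data set estimated at less than the cut size ends up in a single partition, which would match what you are seeing.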

Besides, I think "minExecutors" is too big; many executors might sit idle. You can set it to a smaller value.
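
As a back-of-the-envelope illustration of the idle executors, the sketch below counts the cores that have nothing to run. It assumes each task occupies one core (Spark's default for spark.task.cpus) and that one task runs per partition; the function name and the task counts are hypothetical.

```python
def idle_cores(executors, cores_per_executor, running_tasks, cpus_per_task=1):
    """Cores held by executors but not used by any running task (sketch only)."""
    total_cores = executors * cores_per_executor
    busy_cores = min(total_cores, running_tasks * cpus_per_task)
    return total_cores - busy_cores


# With minExecutors=100 and spark.executor.cores=4 (the values from the issue),
# a single-partition stage keeps 1 of 400 cores busy.
print(idle_cores(100, 4, 1))   # 399
# Even with a hypothetical 16 partitions, most of the reserved cores stay idle.
print(idle_cores(100, 4, 16))  # 384
```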

Give it a try.

> Improve Spark Cubing
> --------------------
>
>                 Key: KYLIN-3123
>                 URL: https://issues.apache.org/jira/browse/KYLIN-3123
>             Project: Kylin
>          Issue Type: Improvement
>          Components: Spark Engine
>    Affects Versions: v2.2.0
>         Environment: HDP , Hbase, Spark 2.6, Centos7
>            Reporter: vu thanh dat
>              Labels: beginner
>             Fix For: v2.2.0
>
>         Attachments: dimension.bmp, measures.bmp, rowkeys.bmp, spark_so_slow_2.bmp
>
>
> Hi all,
> I'm using Spark to build a Kylin cube.
> The data is about 13 million rows for one step. It is partitioned by date, with 10 dimensions and no measures.
> I set this config:
> kylin.storage.hbase.compression-codec=snappy
> kylin.engine.spark.rdd-partition-cut-mb=1000
> kylin.engine.spark.max-partition=5000
> kylin.engine.spark-conf.spark.master=yarn
> kylin.engine.spark-conf.spark.submit.deployMode=cluster
> kylin.engine.spark-conf.spark.dynamicAllocation.enabled=true
> kylin.engine.spark-conf.spark.dynamicAllocation.minExecutors=100
> kylin.engine.spark-conf.spark.dynamicAllocation.maxExecutors=10240
> kylin.engine.spark-conf.spark.dynamicAllocation.executorIdleTimeout=300
> kylin.engine.spark-conf.spark.shuffle.service.enabled=true
> kylin.engine.spark-conf.spark.shuffle.service.port=7337
> kylin.engine.spark-conf.spark.yarn.queue=default
> kylin.engine.spark-conf.spark.executor.memory=4G
> kylin.engine.spark-conf.spark.executor.cores=4
> The "Build Cube with Spark" step is very slow, about 1 hour. Can you show me how to customize the Kylin config to speed up this step? I have 30s CentOS servers, with 5.87T of storage and 448 cores.
> I have attached my config.
> Best regards and thanks!


