Posted to user@kylin.apache.org by Jon Shoberg <jo...@gmail.com> on 2018/12/18 02:34:02 UTC

Spark tuning within Kylin? Article? Resource?

Is there a good/favorite article for tuning Spark settings within Kylin?

I finally have Spark (2.1.3 as distributed with Kylin 2.5.2) running on my
systems.

My small data set (35M records) runs well with the default settings.

My medium data set (4B records, 40GB compressed source file, 5 measures, 6
dimensions with low cardinality) often dies at Step 3 (Extract Fact Table
Distinct Columns) with out-of-memory errors.

After using exceptionally large memory settings the job completed, but I'm
trying to see if there is an optimization possible.
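
For context, "exceptionally large" means settings along these lines in
conf/kylin.properties (the values below are illustrative, not the exact
ones I used):

kylin.engine.spark-conf.spark.driver.memory=8G
kylin.engine.spark-conf.spark.executor.memory=16G
kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=4096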

Any suggestions or ideas? I've read up on Spark tuning in general, but I
don't feel I'm making much progress with the settings I've tried.

Thanks!
J

Re: Spark tuning within Kylin? Article? Resource?

Posted by Chao Long <wa...@qq.com>.
Hi J,
There is a slide deck about Spark tuning in Apache Kylin (author: shaofengshi):
https://www.slideshare.net/ShiShaoFeng1/spark-tunning-in-apache-kylin


About the Step 3 (Extract Fact Table Distinct Columns) OOM: you can try setting the parameter "kylin.engine.mr.uhc-reducer-count" to a larger value (the default is 1).
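
For example, in conf/kylin.properties (a sketch; 5 is just a starting value to tune against your ultra-high-cardinality columns):

# use more reducers when extracting distinct values of UHC columns (default 1)
kylin.engine.mr.uhc-reducer-count=5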




------------------
Best Regards,
Chao Long


------------------ Original Message ------------------
From: "Jon Shoberg"<jo...@gmail.com>;
Sent: Tue, Dec 18, 2018 at 11:16 AM
To: "user"<us...@kylin.apache.org>;

Subject: Re: Spark tuning within Kylin? Article? Resource?



Greatly appreciate the response.

I started there, but after OOM errors I began working on the settings for my test lab. After minimal success, I thought I'd ask whether there was something more in-depth on tuning that other Kylin users have found successful.


Right now I've gone back to a very basic configuration with dynamic allocation to see if I can avoid the late-stage OOM errors.


J


On Mon, Dec 17, 2018 at 7:44 PM JiaTao Tao <ta...@gmail.com> wrote:

Hope this may help: http://kylin.apache.org/docs/tutorial/cube_spark.html


Jon Shoberg <jo...@gmail.com> wrote on Tue, Dec 18, 2018 at 2:34 AM:

Is there a good/favorite article for tuning Spark settings within Kylin?

I finally have Spark (2.1.3 as distributed with Kylin 2.5.2) running on my systems.


My small data set (35M records) runs well with the default settings.


My medium data set (4B records, 40GB compressed source file, 5 measures, 6 dimensions with low cardinality) often dies at Step 3 (Extract Fact Table Distinct Columns) with out-of-memory errors.


After using exceptionally large memory settings the job completed, but I'm trying to see if there is an optimization possible.


Any suggestions or ideas? I've read up on Spark tuning in general, but I don't feel I'm making much progress with the settings I've tried.


Thanks!
J

 



--

Regards!

Aron Tao

Re: Spark tuning within Kylin? Article? Resource?

Posted by Jon Shoberg <jo...@gmail.com>.
I was finally able to get a successful build using the settings below.

There is a SlideShare presentation on some performance settings:

https://www.slideshare.net/ShiShaoFeng1/spark-tunning-in-apache-kylin

Below is the #LOCAL TUNING section, which uses settings from the
presentation above.

I *think* the most meaningful one for me is max-partition=500, which came
from the presentation.

After adding it, the failing step completed, and I'm re-running
everything now.

The hardware is a 3-node cluster (old Dell R710s), dual CPU and 128GB RAM
each, and the data is ~4B records, 5 measures, 6 dimensions, and low cardinality.


------------------------------------------

## Spark conf (default is in spark/conf/spark-defaults.conf)
#kylin.engine.spark-conf.spark.master=yarn
#kylin.engine.spark-conf.spark.submit.deployMode=cluster
#kylin.engine.spark-conf.spark.yarn.queue=default
#kylin.engine.spark-conf.spark.driver.memory=2G
#kylin.engine.spark-conf.spark.executor.memory=4G
#kylin.engine.spark-conf.spark.executor.instances=40
#kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024
#kylin.engine.spark-conf.spark.shuffle.service.enabled=true
#kylin.engine.spark-conf.spark.eventLog.enabled=true
kylin.engine.spark-conf.spark.eventLog.dir=hdfs\:///kylin/spark-history
kylin.engine.spark-conf.spark.history.fs.logDirectory=hdfs\:///kylin/spark-history
#kylin.engine.spark-conf.spark.hadoop.yarn.timeline-service.enabled=false

kylin.engine.spark-conf.spark.driver.extraClassPath=/opt/spark/jars/snappy*.jar
kylin.engine.spark-conf.spark.driver.extraLibraryPath=/opt/hadoop/lib/native
kylin.engine.spark-conf.spark.executor.extraLibraryPath=/opt/hadoop/lib/native
#
#### Spark conf for specific job
#kylin.engine.spark-conf-mergedict.spark.executor.memory=6G
#kylin.engine.spark-conf-mergedict.spark.memory.fraction=0.2
#
## Manually upload the spark-assembly jar to HDFS and set this property to
## avoid repeatedly uploading the jar at runtime
##kylin.engine.spark-conf.spark.yarn.archive=hdfs://namenode:8020/kylin/spark/spark-libs.jar
##kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec

#LOCAL TUNING
kylin.engine.spark-conf.spark.submit.deployMode=cluster
kylin.engine.spark-conf.spark.dynamicAllocation.enabled=true
kylin.engine.spark-conf.spark.dynamicAllocation.minExecutors=1
kylin.engine.spark-conf.spark.dynamicAllocation.maxExecutors=1000
kylin.engine.spark-conf.spark.dynamicAllocation.executorIdleTimeout=300
kylin.engine.spark-conf.spark.max-partition=500
kylin.engine.spark-conf.spark.driver.memory=8G
kylin.engine.spark-conf.spark.executor.memory=8G
kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024
kylin.engine.spark-conf.spark.executor.cores=1
kylin.engine.spark-conf.spark.network.timeout=600
kylin.engine.spark-conf.spark.shuffle.service.enabled=true
kylin.engine.spark-conf.spark.hadoop.dfs.replication=2
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress=true
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
kylin.engine.spark-conf.spark.io.compression.codec=snappy
kylin.engine.spark-conf.spark.local.dir=/opt/volume/disk1/tmp
kylin.engine.spark-conf.spark.dynamicAllocation.schedulerBacklogTimeout=1


On Mon, Dec 17, 2018 at 8:23 PM Chao Long <wa...@qq.com> wrote:

> Hi J,
> There is a slide deck about Spark tuning in Apache Kylin (author: shaofengshi):
> https://www.slideshare.net/ShiShaoFeng1/spark-tunning-in-apache-kylin
>
> About the Step 3 (Extract Fact Table Distinct Columns) OOM: you can try setting
> the parameter "kylin.engine.mr.uhc-reducer-count" to a larger value (the
> default is 1).
>
> ------------------
> Best Regards,
> Chao Long
>
> ------------------ Original Message ------------------
> *From:* "Jon Shoberg"<jo...@gmail.com>;
> *Sent:* Tue, Dec 18, 2018 at 11:16 AM
> *To:* "user"<us...@kylin.apache.org>;
> *Subject:* Re: Spark tuning within Kylin? Article? Resource?
>
> Greatly appreciate the response.
>
> I started there, but after OOM errors I began working on the settings for
> my test lab. After minimal success, I thought I'd ask whether there was
> something more in-depth on tuning that other Kylin users have found successful.
>
> Right now I've gone back to a very basic configuration with dynamic
> allocation to see if I can avoid the late-stage OOM errors.
>
> J
>
> On Mon, Dec 17, 2018 at 7:44 PM JiaTao Tao <ta...@gmail.com> wrote:
>
>> Hope this may help: http://kylin.apache.org/docs/tutorial/cube_spark.html
>>
>> Jon Shoberg <jo...@gmail.com> wrote on Tue, Dec 18, 2018 at 2:34 AM:
>>
>>> Is there a good/favorite article for tuning Spark settings within Kylin?
>>>
>>> I finally have Spark (2.1.3 as distributed with Kylin 2.5.2) running on
>>> my systems.
>>>
>>> My small data set (35M records) runs well with the default settings.
>>>
>>> My medium data set (4B records, 40GB compressed source file, 5 measures,
>>> 6 dimensions with low cardinality) often dies at Step 3 (Extract Fact Table
>>> Distinct Columns) with out-of-memory errors.
>>>
>>> After using exceptionally large memory settings the job completed, but
>>> I'm trying to see if there is an optimization possible.
>>>
>>> Any suggestions or ideas? I've read up on Spark tuning in general, but I
>>> don't feel I'm making much progress with the settings I've tried.
>>>
>>> Thanks!
>>> J
>>>
>>
>>
>> --
>>
>>
>> Regards!
>>
>> Aron Tao
>>
>

Re: Spark tuning within Kylin? Article? Resource?

Posted by Jon Shoberg <jo...@gmail.com>.
Greatly appreciate the response.

I started there, but after OOM errors I began working on the settings for
my test lab. After minimal success, I thought I'd ask whether there was
something more in-depth on tuning that other Kylin users have found successful.

Right now I've gone back to a very basic configuration with dynamic allocation
to see if I can avoid the late-stage OOM errors.
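
A minimal sketch of that configuration in conf/kylin.properties (the values
are placeholders; note that the external shuffle service must also be enabled
on the YARN NodeManagers for dynamic allocation to work):

kylin.engine.spark-conf.spark.dynamicAllocation.enabled=true
kylin.engine.spark-conf.spark.dynamicAllocation.minExecutors=1
kylin.engine.spark-conf.spark.dynamicAllocation.maxExecutors=20
kylin.engine.spark-conf.spark.shuffle.service.enabled=true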

J

On Mon, Dec 17, 2018 at 7:44 PM JiaTao Tao <ta...@gmail.com> wrote:

> Hope this may help: http://kylin.apache.org/docs/tutorial/cube_spark.html
>
> Jon Shoberg <jo...@gmail.com> wrote on Tue, Dec 18, 2018 at 2:34 AM:
>
>> Is there a good/favorite article for tuning Spark settings within Kylin?
>>
>> I finally have Spark (2.1.3 as distributed with Kylin 2.5.2) running on
>> my systems.
>>
>> My small data set (35M records) runs well with the default settings.
>>
>> My medium data set (4B records, 40GB compressed source file, 5 measures,
>> 6 dimensions with low cardinality) often dies at Step 3 (Extract Fact Table
>> Distinct Columns) with out-of-memory errors.
>>
>> After using exceptionally large memory settings the job completed, but I'm
>> trying to see if there is an optimization possible.
>>
>> Any suggestions or ideas? I've read up on Spark tuning in general, but I
>> don't feel I'm making much progress with the settings I've tried.
>>
>> Thanks!
>> J
>>
>
>
> --
>
>
> Regards!
>
> Aron Tao
>

Re: Spark tuning within Kylin? Article? Resource?

Posted by JiaTao Tao <ta...@gmail.com>.
Hope this may help: http://kylin.apache.org/docs/tutorial/cube_spark.html

Jon Shoberg <jo...@gmail.com> wrote on Tue, Dec 18, 2018 at 2:34 AM:

> Is there a good/favorite article for tuning Spark settings within Kylin?
>
> I finally have Spark (2.1.3 as distributed with Kylin 2.5.2) running on my
> systems.
>
> My small data set (35M records) runs well with the default settings.
>
> My medium data set (4B records, 40GB compressed source file, 5 measures, 6
> dimensions with low cardinality) often dies at Step 3 (Extract Fact Table
> Distinct Columns) with out-of-memory errors.
>
> After using exceptionally large memory settings the job completed, but I'm
> trying to see if there is an optimization possible.
>
> Any suggestions or ideas? I've read up on Spark tuning in general, but I
> don't feel I'm making much progress with the settings I've tried.
>
> Thanks!
> J
>


--

Regards!

Aron Tao