Posted to user@spark.apache.org by Sriram Bhamidipati <sr...@gmail.com> on 2019/12/20 06:46:28 UTC

optimising cluster performance

Hi All,
Sorry, I forgot to set the subject line correctly earlier.

> Hello Experts,
> I am trying to maximise resource utilisation on my 3-node Spark cluster
> (2 data nodes and 1 driver) so that the job finishes as quickly as
> possible, and to build a benchmark so I can recommend an optimal POD for
> the job (128 GB x 16 cores).
> I am running standalone Spark 2.4.0.
> htop shows that only half of the memory is in use, while CPU stays at
> 100% for the allocated resources. What alternatives can I try? For
> example, could I reduce per-executor memory to 32 GB and increase the
> number of executors? (A sketch of that alternative follows the property
> list below.)
> I have the following properties:
>
> spark.driver.maxResultSize 64g
> spark.driver.memory 100g
> spark.driver.port 33631
> spark.dynamicAllocation.enabled true
> spark.dynamicAllocation.executorIdleTimeout 60s
> spark.executor.cores 8
> spark.executor.id driver
> spark.executor.instances 4
> spark.executor.memory 64g
> spark.files file://dist/xxxx-0.0.1-py3.7.egg
> spark.locality.wait 10s
>
> 100
> spark.shuffle.service.enabled true
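>
> For example, one alternative I could try (the numbers below are only an
> illustration, not values I have validated) is smaller but more numerous
> executors:
>
> spark.executor.memory 32g
> spark.executor.cores 4
> spark.executor.instances 8
>
> i.e. the same total memory and cores as the current 4 x 64g x 8-core
> setup, just split across more JVMs.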
>
> On Fri, Dec 20, 2019 at 10:56 AM zhangliyun <ke...@126.com> wrote:
>
>> Hi all:
>> I want to ask how to estimate the size of an RDD (in bytes) when it is
>> not saved to disk, because the job takes a long time when the output is
>> very large and the number of output partitions is small.
>>
>>
>> The following steps are my current approach to this problem (a rough
>> sketch in code follows the link below):
>>
>>  1. Sample 1% (fraction 0.01) of the original data.
>>
>>  2. Compute the sample data count.
>>
>>  3. If the sample count > 0, cache the sample data and compute the
>>     sample data size.
>>
>>  4. Compute the total count of the original RDD.
>>
>>  5. Estimate the RDD size as ${total count} * ${sample data size} /
>>     ${sample count}.
>>
>> The code is here
>> <https://github.com/kellyzly/sparkcode/blob/master/EstimateDataSetSize.scala#L24>
>> .
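>>
>> Roughly, the same idea in code (only a minimal sketch of the steps
>> above, with a placeholder input path, not the linked implementation):
>>
>> import org.apache.spark.sql.SparkSession
>>
>> object EstimateRddSize {
>>   def main(args: Array[String]): Unit = {
>>     val spark = SparkSession.builder().appName("estimate-rdd-size").getOrCreate()
>>     val sc = spark.sparkContext
>>
>>     val rdd = sc.textFile("hdfs:///path/to/input")  // placeholder input
>>
>>     // 1. sample 1% of the original data
>>     val sample = rdd.sample(withReplacement = false, fraction = 0.01)
>>
>>     // 2-3. cache and count the sample, then read its in-memory size
>>     //      from the storage info Spark keeps for cached RDDs
>>     sample.cache()
>>     val sampleCount = sample.count()
>>     if (sampleCount > 0) {
>>       val sampleBytes = sc.getRDDStorageInfo
>>         .find(_.id == sample.id)
>>         .map(_.memSize)
>>         .getOrElse(0L)
>>
>>       // 4. count the original rdd
>>       val totalCount = rdd.count()
>>
>>       // 5. estimate: total count * sample size / sample count
>>       val estimatedBytes = totalCount.toDouble * sampleBytes / sampleCount
>>       println(s"estimated rdd size: ${estimatedBytes.toLong} bytes")
>>     }
>>
>>     spark.stop()
>>   }
>> }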
>>
>> My questions:
>> 1. Can I use the above approach to solve the problem? If not, where does
>> it go wrong?
>> 2. Is there an existing solution (an existing API in Spark) for this
>> problem?
>>
>>
>>
>> Best Regards
>> Kelly Zhang
>>
>
>
> --
> -Sriram
>


-- 
-Sriram