Posted to user@spark.apache.org by Sriram Bhamidipati <sr...@gmail.com> on 2019/12/20 06:25:58 UTC

Re: How to estimate the rdd size before the rdd result is written to disk

Hello Experts
I am trying to maximise resource utilisation on my 3-node Spark cluster
(2 data nodes and 1 driver) so that the job finishes as quickly as possible.
I am trying to build a benchmark so I can recommend an optimal pod for the
job (128 GB x 16 cores). I am running standalone Spark 2.4.0.
htop shows that only half of the memory is in use, while CPU stays at 100%
for the allocated resources. What alternatives can I try? For example, should
I reduce per-executor memory to 32 GB and increase the number of executors
(see the sketch after the property list below)?
I have the following properties:

spark.driver.maxResultSize 64g
spark.driver.memory 100g
spark.driver.port 33631
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.executorIdleTimeout 60s
spark.executor.cores 8
spark.executor.id driver
spark.executor.instances 4
spark.executor.memory 64g
spark.files file://dist/xxxx-0.0.1-py3.7.egg
spark.locality.wait 10s

100
spark.shuffle.service.enabled true
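
For example (illustrative numbers only, not yet benchmarked), the
smaller-executor layout I have in mind would fit four executors on each
128 GB / 16 core data node instead of the current two:

spark.executor.cores 4
spark.executor.instances 8
spark.executor.memory 32g

Some headroom would still have to be left for the OS and executor memory
overhead, so the real values would be slightly lower.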

On Fri, Dec 20, 2019 at 10:56 AM zhangliyun <ke...@126.com> wrote:

> Hi all:
> I want to ask how to estimate the size of an RDD (in bytes) before it is
> saved to disk, because the job takes a long time when the output is very
> large and the number of output partitions is small.
>
>
> The following steps are how I currently approach this problem:
>
>  1. Sample a 0.01 fraction of the original data.
>
>  2. Compute the sample data count.
>
>  3. If the sample count is > 0, cache the sample data and compute the
> sample data size.
>
>  4. Compute the total count of the original RDD.
>
>  5. Estimate the RDD size as ${total count} * ${sample data size} /
> ${sample count}.
>
> The code is here:
> <https://github.com/kellyzly/sparkcode/blob/master/EstimateDataSetSize.scala#L24>
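>
> For illustration only, a rough Scala sketch of steps 1-5 (this is not the
> linked code; the helper name estimateRddBytes is made up, and it reads the
> cached sample's size from the storage layer, which reports the cached size
> rather than the eventual on-disk size):
>
> import org.apache.spark.rdd.RDD
> import org.apache.spark.storage.StorageLevel
>
> def estimateRddBytes[T](rdd: RDD[T], fraction: Double = 0.01): Option[Long] = {
>   val sc = rdd.sparkContext
>   val sample = rdd.sample(withReplacement = false, fraction)  // step 1
>   sample.persist(StorageLevel.MEMORY_AND_DISK)
>   val sampleCount = sample.count()                            // step 2
>   val estimate =
>     if (sampleCount == 0) None                                // nothing to extrapolate from
>     else {
>       // step 3: size of the cached sample as reported by the block manager
>       sc.getRDDStorageInfo.find(_.id == sample.id).map { info =>
>         val sampleBytes = info.memSize + info.diskSize
>         val totalCount  = rdd.count()                         // step 4
>         // step 5: scale the sample size up by total count / sample count
>         (sampleBytes.toDouble / sampleCount * totalCount).toLong
>       }
>     }
>   sample.unpersist()
>   estimate
> }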
>
> My questions:
> 1. Can I use the above approach to solve the problem? If not, what is
> wrong with it?
> 2. Is there an existing solution (an existing API in Spark) to this
> problem?
>
>
>
> Best Regards
> Kelly Zhang
>
>
>
>


-- 
-Sriram

optimising cluster performance

Posted by Sriram Bhamidipati <sr...@gmail.com>.
Hi All
Sorry, earlier I forgot to set the subject line correctly.


-- 
-Sriram