Posted to dev@spark.apache.org by zhangliyun <ke...@126.com> on 2019/12/20 05:26:12 UTC

How to estimate the rdd size before the rdd result is written to disk

Hi all:
 I want to ask how to estimate the size of an RDD (in bytes) before it is saved to disk, because the job takes a long time when the output is very large and the number of output partitions is small.




The following steps are what I can currently do for this problem:

 1. Sample 1% (fraction 0.01) of the original data.

 2. Compute the sample data count.

 3. If the sample data count > 0, cache the sample data and compute the sample data size.

 4. Compute the original RDD's total count.

 5. Estimate the RDD size as ${total count} * ${sample data size} / ${sample rdd count}.



The code is here <https://github.com/kellyzly/sparkcode/blob/master/EstimateDataSetSize.scala#L24>.
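A rough sketch of those five steps in Scala might look like the following (the helper name, the default fraction, and the use of Spark's SizeEstimator are illustrative choices, not necessarily what the linked code does):

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.util.SizeEstimator

def estimateRddSizeInBytes[T](rdd: RDD[T], fraction: Double = 0.01): Long = {
  // 1. Sample a small fraction of the original data.
  val sample = rdd.sample(withReplacement = false, fraction)
  sample.persist(StorageLevel.MEMORY_AND_DISK)

  // 2. Count the sampled records.
  val sampleCount = sample.count()
  if (sampleCount == 0L) {
    sample.unpersist()
    0L
  } else {
    // 3. Estimate the in-memory size of the cached sample.
    val sampleBytes = sample
      .map(record => SizeEstimator.estimate(record.asInstanceOf[AnyRef]))
      .sum()

    sample.unpersist()

    // 4. Count the original RDD.
    val totalCount = rdd.count()

    // 5. Scale the sample size by total count / sample count.
    (sampleBytes / sampleCount * totalCount).toLong
  }
}

Note that SizeEstimator.estimate reports the deserialized in-memory size of an object, which is usually larger than the serialized size that ends up on disk, so treat the result as a rough estimate rather than the exact output size.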


My questions:
1. Can I use the above approach to solve the problem? If not, where is it wrong?
2. Is there an existing solution (an existing API in Spark) for this problem?






Best Regards
Kelly Zhang

optimising cluster performance

Posted by Sriram Bhamidipati <sr...@gmail.com>.
Hi All
Sorry, earlier, I forgot to set the subject line correctly



-- 
-Sriram

Re: How to estimate the rdd size before the rdd result is written to disk

Posted by Sriram Bhamidipati <sr...@gmail.com>.
Hello Experts
I am trying to maximise resource utilisation on my 3-node Spark cluster
(2 data nodes and 1 driver) so that the job finishes as quickly as possible.
I am trying to create a benchmark so I can recommend an optimal POD for the
job (128GB x 16 cores). I am running standalone Spark 2.4.0.
htop shows that only half of the memory is in use, while CPU is always at
100% for the allocated resources. What alternatives can I try? Should I
reduce per-executor memory to 32 GB and increase the number of executors
(a sketch of one such alternative follows the property list below)?
I have the following properties:

spark.driver.maxResultSize 64g
spark.driver.memory 100g
spark.driver.port 33631
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.executorIdleTimeout 60s
spark.executor.cores 8
spark.executor.id driver
spark.executor.instances 4
spark.executor.memory 64g
spark.files file://dist/xxxx-0.0.1-py3.7.egg
spark.locality.wait 10s

100
spark.shuffle.service.enabled true
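
As a concrete starting point for the benchmark, a minimal sketch of the "smaller executors" alternative might look like this (the values, the appName, and the switch to spark.cores.max are assumptions for illustration, not a recommendation):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("executor-sizing-benchmark")                 // illustrative name
  .config("spark.executor.memory", "32g")               // was 64g
  .config("spark.executor.cores", "4")                  // was 8
  .config("spark.cores.max", "32")                      // cap total cores; standalone carves executors out of this
  .config("spark.dynamicAllocation.enabled", "false")   // pin the executor count for the benchmark
  .getOrCreate()

With 4 cores per executor and 32 cores in total, this should come out to roughly 8 executors of 32 GB each instead of the current 4 x 64 GB, keeping total resources the same while changing the executor granularity.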



-- 
-Sriram