Posted to user@spark.apache.org by JF Chen <da...@gmail.com> on 2018/11/07 07:27:48 UTC

How to increase the parallelism of Spark Streaming application?

I have a Spark Streaming application which reads data from Kafka and saves
the transformation result to HDFS.
My Kafka topic originally has 8 partitions, and I repartition the data to
100 to increase the parallelism of the Spark job.
Now I am wondering: if I increase the Kafka partition count to 100 instead
of repartitioning to 100, will performance improve? (I know the
repartition operation costs a lot of CPU resources.)
If I set the Kafka partition count to 100, does it have any negative
effects?
I have only one production environment, so it's not convenient for me to
run the test....

Thanks!

Regards,
Junfeng Chen

Re: How to increase the parallelism of Spark Streaming application?

Posted by JF Chen <da...@gmail.com>.
Hi,
I have tested it in my production environment, and I found a strange
thing. After I set the Kafka partition count to 100, some tasks execute
very fast while others are slow; the slow ones take about twice as long as
the fast ones (from the event timeline). However, I have checked the
consumer offsets, and the amount of data per task should be similar, so
there should be no data-imbalance problem.
Does anyone have a good idea?

Regards,
Junfeng Chen



Re: How to increase the parallelism of Spark Streaming application?

Posted by JF Chen <da...@gmail.com>.
Yes, now I have allocated 100 cores but still only 8 Kafka partitions, and
I then repartition to 100 to feed the 100 cores. The following stage has a
map transformation; will it also cause a slowdown?

Regards,
Junfeng Chen



Re: How to increase the parallelism of Spark Streaming application?

Posted by Shahbaz <sh...@gmail.com>.
Hi,

   - Do you have adequate CPU cores allocated to handle the increased
   partition count? Generally, having Kafka partitions >= total CPU cores
   (number of executor instances * cores per executor) gives increased
   task parallelism in the reader phase (see the sizing sketch after this
   list).
   - However, if you have too many partitions but not enough cores, it
   would eventually slow down the reader (e.g. 100 partitions but only 20
   total cores).
   - Additionally, the next set of transformations will have their own
   partitions; if a shuffle is involved, spark.sql.shuffle.partitions then
   defines the next level of parallelism. If you don't have any data skew,
   you should get good performance.
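
To make the sizing arithmetic concrete, here is a hedged sketch (the
numbers are made up for illustration, not recommendations):

    import org.apache.spark.SparkConf

    // Total cores = executor instances * cores per executor.
    // To read 100 Kafka partitions fully in parallel: 25 * 4 = 100.
    val conf = new SparkConf()
      .setAppName("sizing-sketch")
      .set("spark.executor.instances", "25")
      .set("spark.executor.cores", "4")
      // parallelism of the stages after a shuffle (Spark SQL)
      .set("spark.sql.shuffle.partitions", "100")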


Regards,
Shahbaz


Re: How to increase the parallelism of Spark Streaming application?

Posted by JF Chen <da...@gmail.com>.
Memory is not a big problem for me... So are there any other bad effects?

Regards,
Junfeng Chen



Re: How to increase the parallelism of Spark Streaming application?

Posted by vincent gromakowski <vi...@gmail.com>.
On the other hand, increasing parallelism via Kafka partitions avoids the
shuffle that Spark would otherwise perform to repartition.
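
As a hypothetical illustration, continuing the sketch from the original
post: with a 100-partition topic, each batch already yields 100 read
tasks, so the repartition (and its shuffle) can simply be dropped:

    stream.map(_.value)   // 100 read tasks straight from Kafka
      .saveAsTextFiles("hdfs:///data/out/batch") // assumed path; no shuffle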


Re: How to increase the parallelism of Spark Streaming application?

Posted by Michael Shtelma <ms...@gmail.com>.
If you configure too many Kafka partitions, you can run into memory
issues. This will increase the memory requirements of the Spark job a lot.
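
As a hedged sketch, two settings from the spark-streaming-kafka-0-10
integration that bound that memory impact (the values are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // cap the records fetched per Kafka partition per second
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")
      // cap the number of cached Kafka consumers per executor
      .set("spark.streaming.kafka.consumer.cache.maxCapacity", "128")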

Best,
Michael

