Posted to user@spark.apache.org by varun sharma <va...@gmail.com> on 2015/10/29 07:27:49 UTC

Need more tasks in KafkaDirectStream

Right now, there is a one-to-one correspondence between Kafka partitions and
Spark partitions.
I don't have a requirement of one-to-one semantics.
I need more tasks to be generated in the job so that it can be parallelised
and the batch can complete faster. In the previous Receiver-based approach,
the number of tasks created was independent of the Kafka partitions; I need
something like that here.
Is there any config available if I don't need one-to-one semantics?
Is there any way I can repartition without incurring any additional cost?

Thanks
*VARUN SHARMA*

Re: Need more tasks in KafkaDirectStream

Posted by Dibyendu Bhattacharya <di...@gmail.com>.
If you do not need one-to-one semantics and do not want a strict ordering
guarantee, you can very well use the Receiver-based approach, and this
consumer from Spark Packages (
https://github.com/dibbhatt/kafka-spark-consumer) can be a much better
alternative in terms of performance and reliability for the Receiver-based
approach.
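
For reference, even the stock receiver-based consumer in Spark's
spark-streaming-kafka module (not the Spark Packages consumer linked above,
whose API differs) decouples task count from the Kafka partition count: you
can start several receivers and union them. A minimal sketch in Scala against
the Spark 1.x API; the ZooKeeper address, group id, and topic name are
placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("receiver-based-consumer")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Start four receivers, each holding its own connection to Kafka.
    // "zkhost:2181", "my-group" and "mytopic" are placeholders.
    val streams = (1 to 4).map { _ =>
      KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("mytopic" -> 1))
    }

    // Union into one DStream; downstream parallelism no longer depends on
    // how many partitions the Kafka topic has.
    val unified = ssc.union(streams)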

Regards,
Dibyendu

On Thu, Oct 29, 2015 at 11:57 AM, varun sharma <va...@gmail.com>
wrote:

> Right now, there is a one-to-one correspondence between Kafka partitions and
> Spark partitions.
> I don't have a requirement of one-to-one semantics.
> I need more tasks to be generated in the job so that it can be
> parallelised and the batch can complete faster. In the previous
> Receiver-based approach, the number of tasks created was independent of the
> Kafka partitions; I need something like that here.
> Is there any config available if I don't need one-to-one semantics?
> Is there any way I can repartition without incurring any additional cost?
>
> Thanks
> *VARUN SHARMA*
>
>

Re: Need more tasks in KafkaDirectStream

Posted by varun sharma <va...@gmail.com>.
Cody, adding partitions to Kafka is there as a last resort; I was wondering
whether I could decrease the processing time without touching my Kafka cluster.
Adrian, repartition looks like a good option; let me check if I can gain
performance.
Dibyendu, I will surely try out this consumer.

Thanks all, will share my findings.

On Thu, Oct 29, 2015 at 7:16 PM, Cody Koeninger <co...@koeninger.org> wrote:

> Consuming from Kafka is inherently limited to using a number of consumer
> nodes less than or equal to the number of Kafka partitions. If you think
> about it, you're going to be paying some network cost to repartition that
> data from a consumer to different processing nodes, regardless of which
> Spark consumer library you use.
>
> If you really need finer-grained parallelism, and want to do it in a more
> efficient manner, you need to move that partitioning to the producer side
> (i.e. add more partitions to Kafka).
>
> On Thu, Oct 29, 2015 at 6:11 AM, Adrian Tanase <at...@adobe.com> wrote:
>
>> You can call .repartition on the DStream created by the Kafka direct
>> consumer. You take the one-time hit of a shuffle but gain the ability to
>> scale out processing beyond your number of partitions.
>>
>> We’re doing this to scale up from 36 partitions / topic to 140 partitions
>> (20 cores * 7 nodes) and it works great.
>>
>> -adrian
>>
>> From: varun sharma
>> Date: Thursday, October 29, 2015 at 8:27 AM
>> To: user
>> Subject: Need more tasks in KafkaDirectStream
>>
>> Right now, there is a one-to-one correspondence between Kafka partitions
>> and Spark partitions.
>> I don't have a requirement of one-to-one semantics.
>> I need more tasks to be generated in the job so that it can be
>> parallelised and the batch can complete faster. In the previous
>> Receiver-based approach, the number of tasks created was independent of
>> the Kafka partitions; I need something like that here.
>> Is there any config available if I don't need one-to-one semantics?
>> Is there any way I can repartition without incurring any additional cost?
>>
>> Thanks
>> *VARUN SHARMA*
>>
>>
>


-- 
*VARUN SHARMA*
*Flipkart*
*Bangalore*

Re: Need more tasks in KafkaDirectStream

Posted by Cody Koeninger <co...@koeninger.org>.
Consuming from Kafka is inherently limited to using a number of consumer
nodes less than or equal to the number of Kafka partitions. If you think
about it, you're going to be paying some network cost to repartition that
data from a consumer to different processing nodes, regardless of which
Spark consumer library you use.

If you really need finer-grained parallelism, and want to do it in a more
efficient manner, you need to move that partitioning to the producer side
(i.e. add more partitions to Kafka).
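
For reference, partitions can be added to an existing topic with the tooling
that ships with 0.8.x-era Kafka; a sketch, where the ZooKeeper address, topic
name and partition count are placeholders (note that adding partitions
changes which partition keyed messages map to):

    bin/kafka-topics.sh --zookeeper zkhost:2181 --alter \
      --topic mytopic --partitions 72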

On Thu, Oct 29, 2015 at 6:11 AM, Adrian Tanase <at...@adobe.com> wrote:

> You can call .repartition on the DStream created by the Kafka direct
> consumer. You take the one-time hit of a shuffle but gain the ability to
> scale out processing beyond your number of partitions.
>
> We’re doing this to scale up from 36 partitions / topic to 140 partitions
> (20 cores * 7 nodes) and it works great.
>
> -adrian
>
> From: varun sharma
> Date: Thursday, October 29, 2015 at 8:27 AM
> To: user
> Subject: Need more tasks in KafkaDirectStream
>
> Right now, there is a one-to-one correspondence between Kafka partitions and
> Spark partitions.
> I don't have a requirement of one-to-one semantics.
> I need more tasks to be generated in the job so that it can be
> parallelised and the batch can complete faster. In the previous
> Receiver-based approach, the number of tasks created was independent of the
> Kafka partitions; I need something like that here.
> Is there any config available if I don't need one-to-one semantics?
> Is there any way I can repartition without incurring any additional cost?
>
> Thanks
> *VARUN SHARMA*
>
>

Re: Need more tasks in KafkaDirectStream

Posted by Adrian Tanase <at...@adobe.com>.
You can call .repartition on the DStream created by the Kafka direct consumer. You take the one-time hit of a shuffle but gain the ability to scale out processing beyond your number of partitions.

We’re doing this to scale up from 36 partitions / topic to 140 partitions (20 cores * 7 nodes) and it works great.
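
A minimal sketch of that pattern in Scala against the Spark 1.x direct API;
the broker address, topic name and target partition count are placeholders,
and ssc is assumed to be an existing StreamingContext:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Direct stream: one Spark partition per Kafka partition.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic"))

    // One extra shuffle per batch, in exchange for more downstream tasks
    // (e.g. 140 = 20 cores * 7 nodes).
    val repartitioned = stream.repartition(140)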

-adrian

From: varun sharma
Date: Thursday, October 29, 2015 at 8:27 AM
To: user
Subject: Need more tasks in KafkaDirectStream

Right now, there is a one-to-one correspondence between Kafka partitions and Spark partitions.
I don't have a requirement of one-to-one semantics.
I need more tasks to be generated in the job so that it can be parallelised and the batch can complete faster. In the previous Receiver-based approach, the number of tasks created was independent of the Kafka partitions; I need something like that here.
Is there any config available if I don't need one-to-one semantics?
Is there any way I can repartition without incurring any additional cost?

Thanks
VARUN SHARMA