Posted to user@spark.apache.org by Shushant Arora <sh...@gmail.com> on 2015/07/21 05:31:35 UTC

spark streaming 1.3 coalesce on kafkadirectstream

Does Spark Streaming 1.3 launch a task for each partition offset range,
whether that range is empty or not?

If yes, how can I stop it from launching tasks for empty RDDs? I am not
able to use coalesce on the directKafkaStream.

Should we always repartition before processing the direct stream?

The use case is:

directKafkaStream.repartition(numExecutors).mapPartitions(new
FlatMapFunction<Iterator<Tuple2<byte[], byte[]>>, String>() {
    ...
});
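Since the direct stream itself offers no coalesce, one workaround (a sketch, not from the thread) is to skip empty batches by inspecting the per-partition offset ranges. The hasMessages helper and its long[] signature are hypothetical; in Spark 1.3 the real ranges come from casting the batch RDD to HasOffsetRanges inside foreachRDD:

```java
public class EmptyBatchCheck {
    // Hypothetical helper: a direct-stream batch carries one offset range per
    // Kafka partition; the batch is empty when every range has
    // untilOffset == fromOffset.
    static boolean hasMessages(long[] fromOffsets, long[] untilOffsets) {
        for (int i = 0; i < fromOffsets.length; i++) {
            if (untilOffsets[i] > fromOffsets[i]) {
                return true;  // at least one partition has new messages
            }
        }
        return false;  // every range is zero-length: skip the batch body
    }

    public static void main(String[] args) {
        // Two partitions, both with zero new messages: skip.
        System.out.println(hasMessages(new long[]{100, 200}, new long[]{100, 200}));
        // Second partition has 5 new messages: process.
        System.out.println(hasMessages(new long[]{100, 200}, new long[]{100, 205}));
    }
}
```

Note this only avoids your own per-batch work inside foreachRDD; the framework still schedules a task per partition, which is the behavior the question is about.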

Thanks

Re: spark streaming 1.3 coalesce on kafkadirectstream

Posted by Tathagata Das <td...@databricks.com>.
With DirectKafkaStream there are two approaches:
1. Increase the number of Kafka partitions; Spark will automatically
read them in parallel.
2. If that's not possible, explicitly repartition, but only if there are
more cores in the cluster than Kafka partitions, AND the first map-like
stage on the directKafkaDStream is heavy enough to warrant the cost of
repartitioning (which shuffles the data around).
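The second condition above can be phrased as a small decision helper (a sketch with hypothetical names; the thread gives no numeric threshold for "heavy", so that judgment is left to the caller):

```java
public class RepartitionDecision {
    // Sketch of the rule above: repartition only when the cluster has more
    // cores than Kafka partitions AND the first stage is expensive enough
    // that the shuffle pays for itself.
    static boolean shouldRepartition(int clusterCores,
                                     int kafkaPartitions,
                                     boolean firstStageIsHeavy) {
        return clusterCores > kafkaPartitions && firstStageIsHeavy;
    }

    public static void main(String[] args) {
        // 16 cores, 4 Kafka partitions, heavy map stage: repartition.
        System.out.println(shouldRepartition(16, 4, true));
        // 4 cores, 8 partitions: Kafka already provides enough parallelism.
        System.out.println(shouldRepartition(4, 8, true));
    }
}
```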

On Mon, Jul 20, 2015 at 8:31 PM, Shushant Arora <sh...@gmail.com>
wrote:

> Does Spark Streaming 1.3 launch a task for each partition offset range,
> whether that range is empty or not?
>
> If yes, how can I stop it from launching tasks for empty RDDs? I am not
> able to use coalesce on the directKafkaStream.
>
> Should we always repartition before processing the direct stream?
>
> The use case is:
>
> directKafkaStream.repartition(numExecutors).mapPartitions(new
> FlatMapFunction<Iterator<Tuple2<byte[], byte[]>>, String>() {
>     ...
> });
>
> Thanks
>