Posted to user@spark.apache.org by Ryan Chan <ry...@gmail.com> on 2013/10/22 16:24:47 UTC

Spark Streaming - How to control the parallelism like storm

In Storm, you can control the number of threads with setSpout/setBolt.
How do I do the same with Spark Streaming?

e.g.

val lines = ssc.socketTextStream(args(1), args(2).toInt)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()


It sounds like I cannot tell Spark how many threads to use for `flatMap`
and how many to use for `map`, etc. Is that right?

Re: Spark Streaming - How to control the parallelism like storm

Posted by Aaron Davidson <il...@gmail.com>.
As Mark said, flatMap can only parallelize into as many partitions as exist
in the incoming RDD. socketTextStream() only produces 1 RDD at a time.
However, you can use the RDD.coalesce() method to split one RDD into
multiple partitions (excuse the name; it can be used for shrinking or
growing the number of partitions), like so:

val lines = ssc.socketTextStream(args(1), args(2).toInt)
val partitionedLines = lines.transform(rdd => rdd.coalesce(10, shuffle = true))
val words = partitionedLines.flatMap(_.split(" "))
...

This splits each incoming batch into 10 partitions, so flatMap can run up
to 10x faster, assuming you have that many worker threads (and ignoring the
added latency of shuffling the RDD across your nodes).
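
For reference, here is a fuller sketch of the same idea. It assumes Spark 1.3
or later, where `DStream.repartition` is available and does the same job as
the transform/coalesce(shuffle = true) call above. The application name,
host, port, batch interval, and partition counts are illustrative
placeholders, not recommendations.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ParallelWordCount {
  def main(args: Array[String]): Unit = {
    // local[4]: the socket receiver occupies one thread, leaving three for tasks.
    val conf = new SparkConf().setMaster("local[4]").setAppName("ParallelWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Placeholder host and port.
    val lines = ssc.socketTextStream("localhost", 9999)
    // Spread each batch over 10 partitions so flatMap and map can run as up to 10 tasks.
    val partitionedLines = lines.repartition(10)
    val words = partitionedLines.flatMap(_.split(" "))
    // The second argument sets the number of partitions used after the shuffle.
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _, 4)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that flatMap and map still share one partition count here; only the
shuffle in reduceByKey gives a natural point to pick a different one.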



Re: Spark Streaming - How to control the parallelism like storm

Posted by Mark Hamstra <ma...@clearstorydata.com>.
Not separately at the level of `flatMap` and `map`.  The number of
partitions in the RDD those operations are working on determines the
potential parallelism.  The number of worker cores available determines how
much of that potential can be actualized.
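
A minimal sketch of that distinction on a plain RDD, assuming a local master
with two worker threads. The object name, method name, and partition counts
are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionDemo {
  // Returns the partition counts of an RDD, its mapped child, and a repartitioned copy.
  def partitionCounts(sc: SparkContext): (Int, Int, Int) = {
    // Eight partitions: up to eight tasks per stage, i.e. the potential parallelism.
    val numbers = sc.parallelize(1 to 1000, 8)
    // map is a narrow transformation, so it keeps its parent's partitioning.
    val doubled = numbers.map(_ * 2)
    // Changing the partition count is a separate, explicit step.
    val wider = doubled.repartition(16)
    (numbers.partitions.length, doubled.partitions.length, wider.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    // local[2] offers two worker threads, so at most two tasks run at once.
    val conf = new SparkConf().setMaster("local[2]").setAppName("PartitionDemo")
    val sc = new SparkContext(conf)
    try {
      println(partitionCounts(sc))
    } finally {
      sc.stop()
    }
  }
}
```

With two threads, the eight map tasks run two at a time; the partition
count sets the ceiling and the cores decide how much of it is used.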

