Posted to user@spark.apache.org by Bill Jay <bi...@gmail.com> on 2014/07/22 20:05:55 UTC

combineByKey at ShuffledDStream.scala

Hi all,

I am currently running a Spark Streaming program that consumes data from
Kafka and performs a group-by operation on the data. I am trying to reduce
the running time of the program because it seems slow to me. The stage
named:

* combineByKey at ShuffledDStream.scala:42 *

always takes the longest to run. When I open this stage, I see only two
executors on it. Does anyone have an idea what this stage does and how to
speed it up? Thanks!

Bill

Re: combineByKey at ShuffledDStream.scala

Posted by Bill Jay <bi...@gmail.com>.
The streaming program contains the following main stages:

1. Receive data from Kafka.
2. Preprocess the data. These are all map and filter stages.
3. Group by a field.
4. Process the groupBy results using map. Inside this processing, I use
collect and count.
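
The stages above can be modeled roughly as follows. This is only a plain-Python sketch of the data flow, not Spark code and not the actual program: the "key,value" record layout, the function names, and num_partitions are all made up for illustration. It hash-partitions the pairs by key before grouping, which is the general shape of a shuffle-based group-by.

```python
from collections import defaultdict


def preprocess(records):
    """Stage 2 (hypothetical): parse 'key,value' lines and drop malformed ones."""
    for line in records:
        parts = line.split(",")
        if len(parts) == 2 and parts[1].strip().isdigit():
            yield parts[0].strip(), int(parts[1])


def group_by_key(pairs, num_partitions):
    """Stage 3: hash-partition the pairs by key, then group values per key."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        # All values for one key land in the same partition.
        partitions[hash(key) % num_partitions][key].append(value)
    return partitions


def summarize(partitions):
    """Stage 4: map over each group (here: count and sum of its values)."""
    return {
        key: (len(values), sum(values))
        for part in partitions
        for key, values in part.items()
    }


# Stage 1 stand-in: a small in-memory batch instead of a Kafka receiver.
batch = ["a,1", "b,2", "a,3", "bad line", "c,4", "b,5"]
partitions = group_by_key(preprocess(batch), num_partitions=2)
result = summarize(partitions)
print(result)
```

In this model each partition would be handled by one worker, so the grouping step can use at most num_partitions workers at a time, however many are available.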

Thanks!

Bill


On Tue, Jul 22, 2014 at 10:05 PM, Tathagata Das <tathagata.das1565@gmail.com
> wrote:

> Can you give an idea of what the streaming program does? What are the rest
> of the transformations you are applying to the input streams?

Re: combineByKey at ShuffledDStream.scala

Posted by Tathagata Das <ta...@gmail.com>.
Can you give an idea of what the streaming program does? What are the rest
of the transformations you are applying to the input streams?

