Posted to user@spark.apache.org by Nipun Arora <ni...@gmail.com> on 2015/10/24 23:08:33 UTC

[SPARK STREAMING] Concurrent operations in spark streaming

I wanted to understand something about the internals of Spark Streaming
execution.

If I have a stream X, and in my program I send stream X to function A and
function B:

1. In function A, I do a few transform/filter operations etc. on X -> Y -> Z
to create stream Z. Then I do a foreach operation on Z and print the output
to a file.

2. Then in function B, I reduce stream X -> X2 (say, the min value of each
RDD), and print the output to a file.

Are both functions being executed for each RDD in parallel? How does it
work?

Thanks
Nipun
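
In code, the setup above looks roughly like this (a minimal Scala sketch;
the concrete transforms, the socket source, and the output paths are
illustrative assumptions, not part of the original question):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.dstream.DStream

    object ForkSketch {
      // Function A: X -> Y -> Z through narrow transformations, then an
      // output operation that writes each batch of Z to a file.
      def functionA(x: DStream[Int]): Unit = {
        val y = x.map(_ * 2)     // X -> Y (illustrative transform)
        val z = y.filter(_ > 0)  // Y -> Z (illustrative filter)
        z.foreachRDD { rdd =>
          rdd.saveAsTextFile(s"out/z-${System.currentTimeMillis}")
        }
      }

      // Function B: reduce each RDD of X to its minimum and write it out.
      def functionB(x: DStream[Int]): Unit = {
        val x2 = x.reduce(math.min)  // X -> X2: min value per RDD
        x2.saveAsTextFiles("out/x2") // writes one directory per batch
      }

      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("fork-sketch").setMaster("local[2]"),
          Seconds(10))
        val x = ssc.socketTextStream("localhost", 9999).map(_.trim.toInt)
        functionA(x) // both branches hang off the same stream X
        functionB(x)
        ssc.start()
        ssc.awaitTermination()
      }
    }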

Re: [SPARK STREAMING] Concurrent operations in spark streaming

Posted by Adrian Tanase <at...@adobe.com>.
Thinking more about it – it should be only 2 tasks, as A and B are most likely collapsed by Spark into a single task.

Again – learn to use the Spark UI, as it's really informative. The combination of the DAG visualization and the task count should answer most of your questions.

-adrian
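
For the same information without the UI, RDD.toDebugString is another option
(my suggestion, not something discussed in the thread); it prints the lineage
of a batch's RDD, with narrow transformations like map/filter grouped into
one stage and a shuffle starting a new one. A fragment to drop into functionA
from the sketch near the top of the thread:

    // Print the lineage of each batch's RDD of Z; stage boundaries
    // show up in the indentation of the output.
    z.foreachRDD { rdd =>
      println(rdd.toDebugString)
    }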

Re: [SPARK STREAMING] Concurrent operations in spark streaming

Posted by Adrian Tanase <at...@adobe.com>.
If I understand the order correctly, not really. First of all, the easiest way to make sure it works as expected is to check out the visual DAG in the Spark UI.

It should map 1:1 to your code, and since I don't see any shuffles in the operations below, it should all execute in one stage, forking after X.
That means each executor core will process a partition completely in isolation – most likely 3 tasks (A, B, X2), most likely in the order you define in code, although depending on the data some tasks may get skipped or moved around.

I’m curious – why do you ask? Do you have a particular concern or use case that relies on ordering between A, B and X2?

-adrian
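
One related point (my addition, not Adrian's): each output operation becomes
its own Spark job, so the work that produces X runs once per branch unless X
is persisted. A minimal sketch, reusing the functionA/functionB names from
the sketch near the top of the thread:

    import org.apache.spark.storage.StorageLevel

    x.persist(StorageLevel.MEMORY_ONLY_SER) // or simply x.cache()
    functionA(x) // both branches now reuse the cached partitions of X
    functionB(x)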

Re: [SPARK STREAMING] Concurrent operations in spark streaming

Posted by Nipun Arora <ni...@gmail.com>.
So essentially the driver/client program needs two explicit threads to
ensure concurrency?

What happens when the program is sequential, i.e. I execute function A
and then function B? Does this mean that each RDD first goes through
function A, and then stream X is persisted, but is processed in function B
only after the RDD has been processed by A?

Thanks
Nipun

Re: [SPARK STREAMING] Concurrent operations in spark streaming

Posted by Andy Dang <na...@gmail.com>.
If you execute the collect step (the output action - foreach in 1, possibly
reduce in 2) in two threads in the driver, then both of them will be executed
in parallel. Whichever gets submitted to Spark first gets executed first - you
can use a semaphore if you need to enforce the ordering of execution, though I
would assume that the ordering wouldn't matter.

-------
Regards,
Andy
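
A minimal sketch of that two-thread pattern on plain RDD actions (the data,
the output path, and the pool size are illustrative; keep the semaphore only
if job 2 must wait for job 1):

    import java.util.concurrent.{Executors, Semaphore}
    import org.apache.spark.SparkContext

    def runBoth(sc: SparkContext): Unit = {
      val data = sc.parallelize(1 to 1000)
      val pool = Executors.newFixedThreadPool(2)
      val firstDone = new Semaphore(0)

      pool.submit(new Runnable {
        override def run(): Unit = {
          data.map(_ * 2).saveAsTextFile("/tmp/job1") // job 1
          firstDone.release()
        }
      })
      pool.submit(new Runnable {
        override def run(): Unit = {
          firstDone.acquire()            // drop this line for full parallelism
          println(data.reduce(math.min)) // job 2
        }
      })
      pool.shutdown()
    }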
