Posted to user@spark.apache.org by David Thomas <dt...@gmail.com> on 2014/02/17 18:51:18 UTC

Interleaving stages

I have a spark application that has the below structure:

while(...) { // 10-100k iterations
  rdd.map(...).collect
}

Basically, I have an RDD and I need to query it multiple times.

Now when I run this, for each iteration, Spark creates a new stage (each
stage having multiple tasks). What I find is that the stage execution takes
about 1 second and most of the time is spent scheduling the tasks. Since a
stage is not submitted until the previous stage is completed, this loop
takes a long time to complete. So my question is, is there a way to
interleave multiple stage executions? Any other suggestions to improve the
above query pattern?

Re: Interleaving stages

Posted by Prashant Sharma <sc...@gmail.com>.
I am not sure! Maybe Mark can correct me. You could try AsyncRDDActions
(check the API docs for details). My impression is that it can submit many
jobs and then receive the results asynchronously.
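
Something like this rough, untested sketch is what I have in mind (here
"query" is just a stand-in for the per-iteration work, and the implicit
conversion to AsyncRDDActions comes from SparkContext._):

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.SparkContext._ // implicit conversion to AsyncRDDActions

// Submit all jobs up front: collectAsync returns a FutureAction,
// which is a scala.concurrent.Future, so none of these calls block.
val futures: Seq[Future[Seq[Int]]] =
  (1 to 100).map { i =>
    rdd.map(x => query(x, i)).collectAsync() // query(...) is hypothetical
  }

// Wait once at the end instead of once per iteration.
val results: Seq[Seq[Int]] = Await.result(Future.sequence(futures), 10.minutes)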


On Tue, Feb 18, 2014 at 1:14 PM, Guillaume Pitel <guillaume.pitel@exensa.com> wrote:

>  Whatever you want to do, if you really have to do it that way, don't use
> Spark. And the answer to your question is: Spark automatically
> "interleaves" stages that can be interleaved.
>
> Now, I do not believe that you really want to do that. You probably should
> just do a filter + map or a flatMap. But explain what you're trying to
> achieve so we can recommend a better way.
>
> Guillaume
>
> With so little information about what your code is actually doing, what
> you have shared looks likely to be an anti-pattern to me.  Doing many
> collect actions is something to be avoided if at all possible, since this
> forces a lot of network communication to materialize the results back
> within the driver process, and network communication severely constrains
> performance.
>
>
> On Mon, Feb 17, 2014 at 9:51 AM, David Thomas <dt...@gmail.com> wrote:
>
>>   I have a spark application that has the below structure:
>>
>>  while(...) { // 10-100k iterations
>>    rdd.map(...).collect
>> }
>>
>>  Basically, I have an RDD and I need to query it multiple times.
>>
>>  Now when I run this, for each iteration, Spark creates a new stage (each
>> stage having multiple tasks). What I find is that the stage execution takes
>> about 1 second and most of the time is spent scheduling the tasks. Since a
>> stage is not submitted until the previous stage is completed, this loop
>> takes a long time to complete. So my question is, is there a way to
>> interleave multiple stage executions? Any other suggestions to improve the
>> above query pattern?
>>
>
>
>
> --
>  *Guillaume PITEL, Président*
> +33(0)6 25 48 86 80
>
> eXenSa S.A.S. <http://www.exensa.com/>
>  41, rue Périer - 92120 Montrouge - FRANCE
> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
>



-- 
Prashant

Re: Interleaving stages

Posted by "Evan R. Sparks" <ev...@gmail.com>.
The difference between this code and the code you showed is that in this
code there's a .reduce() whose result is the size of ONE element in the
RDD, while your code calls .collect() whose result is the size of your
entire dataset.

It would help if you provided a more complete example.
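
A quick sketch of the difference (made-up data, but the shapes are the point):

val rdd = sc.parallelize(1 to 1000000)

// reduce: partial sums are combined on the executors, and the driver
// receives a single Int over the network.
val sum: Int = rdd.reduce(_ + _)

// collect: every element is shipped back, so the driver receives the
// entire million-element dataset on every call.
val all: Array[Int] = rdd.collect()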


On Tue, Feb 18, 2014 at 11:58 AM, David Thomas <dt...@gmail.com> wrote:

> Here is an example code that is bundled with Spark
>
> for (i <- 1 to ITERATIONS) {
>   println("On iteration " + i)
>   val gradient = points.map { p =>
>     (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
>   }.reduce(_ + _)
>   w -= gradient
> }
>
> As you can see, an action is called on every iteration. I guess this is
> the same pattern I have in my code. So why do you think this is not the
> right thing to do with Spark?
>
>
>
>
> On Tue, Feb 18, 2014 at 12:44 AM, Guillaume Pitel <guillaume.pitel@exensa.com> wrote:
>
>>  Whatever you want to do, if you really have to do it that way, don't
>> use Spark. And the answer to your question is: Spark automatically
>> "interleaves" stages that can be interleaved.
>>
>> Now, I do not believe that you really want to do that. You probably
>> should just do a filter + map or a flatMap. But explain what you're trying
>> to achieve so we can recommend a better way.
>>
>> Guillaume
>>
>> With so little information about what your code is actually doing, what
>> you have shared looks likely to be an anti-pattern to me.  Doing many
>> collect actions is something to be avoided if at all possible, since this
>> forces a lot of network communication to materialize the results back
>> within the driver process, and network communication severely constrains
>> performance.
>>
>>
>> On Mon, Feb 17, 2014 at 9:51 AM, David Thomas <dt...@gmail.com> wrote:
>>
>>>   I have a spark application that has the below structure:
>>>
>>>  while(...) { // 10-100k iterations
>>>    rdd.map(...).collect
>>> }
>>>
>>>  Basically, I have an RDD and I need to query it multiple times.
>>>
>>>  Now when I run this, for each iteration, Spark creates a new stage
>>> (each stage having multiple tasks). What I find is that the stage execution
>>> takes about 1 second and most of the time is spent scheduling the tasks. Since
>>> a stage is not submitted until the previous stage is completed, this loop
>>> takes a long time to complete. So my question is, is there a way to
>>> interleave multiple stage executions? Any other suggestions to improve the
>>> above query pattern?
>>>
>>
>>
>>
>> --
>>  *Guillaume PITEL, Président*
>> +33(0)6 25 48 86 80
>>
>> eXenSa S.A.S. <http://www.exensa.com/>
>>  41, rue Périer - 92120 Montrouge - FRANCE
>> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
>>
>
>

Re: Interleaving stages

Posted by David Thomas <dt...@gmail.com>.
Here is an example code that is bundled with Spark

for (i <- 1 to ITERATIONS) {
  println("On iteration " + i)
  val gradient = points.map { p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  }.reduce(_ + _)
  w -= gradient
}

As you can see, an action is called on every iteration. I guess this is the
same pattern I have in my code. So why do you think this is not the right
thing to do with Spark?




On Tue, Feb 18, 2014 at 12:44 AM, Guillaume Pitel <guillaume.pitel@exensa.com> wrote:

>  Whatever you want to do, if you really have to do it that way, don't use
> Spark. And the answer to your question is: Spark automatically
> "interleaves" stages that can be interleaved.
>
> Now, I do not believe that you really want to do that. You probably should
> just do a filter + map or a flatMap. But explain what you're trying to
> achieve so we can recommend a better way.
>
> Guillaume
>
> With so little information about what your code is actually doing, what
> you have shared looks likely to be an anti-pattern to me.  Doing many
> collect actions is something to be avoided if at all possible, since this
> forces a lot of network communication to materialize the results back
> within the driver process, and network communication severely constrains
> performance.
>
>
> On Mon, Feb 17, 2014 at 9:51 AM, David Thomas <dt...@gmail.com> wrote:
>
>>   I have a spark application that has the below structure:
>>
>>  while(...) { // 10-100k iterations
>>    rdd.map(...).collect
>> }
>>
>>  Basically, I have an RDD and I need to query it multiple times.
>>
>>  Now when I run this, for each iteration, Spark creates a new stage (each
>> stage having multiple tasks). What I find is that the stage execution takes
>> about 1 second and most of the time is spent scheduling the tasks. Since a
>> stage is not submitted until the previous stage is completed, this loop
>> takes a long time to complete. So my question is, is there a way to
>> interleave multiple stage executions? Any other suggestions to improve the
>> above query pattern?
>>
>
>
>
> --
>  *Guillaume PITEL, Président*
> +33(0)6 25 48 86 80
>
> eXenSa S.A.S. <http://www.exensa.com/>
>  41, rue Périer - 92120 Montrouge - FRANCE
> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
>

Re: Interleaving stages

Posted by Guillaume Pitel <gu...@exensa.com>.
Whatever you want to do, if you really have to do it that way, don't use Spark.
And the answer to your question is: Spark automatically "interleaves" stages
that can be interleaved.

Now, I do not believe that you really want to do that. You probably should just
do a filter + map or a flatMap. But explain what you're trying to achieve so we
can recommend a better way.
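
For example (only a sketch: I assume an RDD[String], and Query/matches stand
in for whatever your iterations really do), you can answer all the queries in
a single pass instead of one job per query:

import org.apache.spark.SparkContext._ // for reduceByKey and collectAsMap

case class Query(id: Int, key: String)                  // hypothetical query type
val queries = (1 to 1000).map(i => Query(i, "key" + i)) // made-up queries
val bcQueries = sc.broadcast(queries)                   // ship them to executors once

def matches(q: Query, elem: String): Boolean = elem.contains(q.key) // stand-in

// One flatMap emits a record per (matching query, element); one reduceByKey
// aggregates per query on the cluster. One job instead of 10-100k of them.
val hitsPerQuery: Map[Int, Long] = rdd
  .flatMap(elem => bcQueries.value.filter(q => matches(q, elem)).map(q => (q.id, 1L)))
  .reduceByKey(_ + _)
  .collectAsMap()
  .toMap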

Guillaume
> With so little information about what your code is actually doing, what you 
> have shared looks likely to be an anti-pattern to me.  Doing many collect 
> actions is something to be avoided if at all possible, since this forces a lot 
> of network communication to materialize the results back within the driver 
> process, and network communication severely constrains performance.
>
>
> On Mon, Feb 17, 2014 at 9:51 AM, David Thomas <dt5434884@gmail.com> wrote:
>
>     I have a spark application that has the below structure:
>
>     while(...) { // 10-100k iterations
>       rdd.map(...).collect
>     }
>
>     Basically, I have an RDD and I need to query it multiple times.
>
>     Now when I run this, for each iteration, Spark creates a new stage (each
>     stage having multiple tasks). What I find is that the stage execution
>     takes about 1 second and most of the time is spent scheduling the tasks. Since
>     a stage is not submitted until the previous stage is completed, this loop
>     takes a long time to complete. So my question is, is there a way to
>     interleave multiple stage executions? Any other suggestions to improve the
>     above query pattern?
>
>


-- 
*Guillaume PITEL, Président*
+33(0)6 25 48 86 80

eXenSa S.A.S. <http://www.exensa.com/>
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05


Re: Interleaving stages

Posted by David Thomas <dt...@gmail.com>.
Is there a way I can queue several stages at once?


On Mon, Feb 17, 2014 at 12:08 PM, Mark Hamstra <ma...@clearstorydata.com> wrote:

> With so little information about what your code is actually doing, what
> you have shared looks likely to be an anti-pattern to me.  Doing many
> collect actions is something to be avoided if at all possible, since this
> forces a lot of network communication to materialize the results back
> within the driver process, and network communication severely constrains
> performance.
>
>
> On Mon, Feb 17, 2014 at 9:51 AM, David Thomas <dt...@gmail.com> wrote:
>
>> I have a spark application that has the below structure:
>>
>> while(...) { // 10-100k iterations
>>   rdd.map(...).collect
>> }
>>
>> Basically, I have an RDD and I need to query it multiple times.
>>
>> Now when I run this, for each iteration, Spark creates a new stage (each
>> stage having multiple tasks). What I find is that the stage execution takes
>> about 1 second and most of the time is spent scheduling the tasks. Since a
>> stage is not submitted until the previous stage is completed, this loop
>> takes a long time to complete. So my question is, is there a way to
>> interleave multiple stage executions? Any other suggestions to improve the
>> above query pattern?
>>
>
>

Re: Interleaving stages

Posted by Mark Hamstra <ma...@clearstorydata.com>.
With so little information about what your code is actually doing, what you
have shared looks likely to be an anti-pattern to me.  Doing many collect
actions is something to be avoided if at all possible, since this forces a
lot of network communication to materialize the results back within the
driver process, and network communication severely constrains performance.
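
Concretely, and only as a sketch (step is a stand-in, and this shape applies
when one iteration feeds the next rather than needing every intermediate
result at the driver): chain the transformations and materialize once.

// Transformations are lazy, so nothing crosses the network in this loop.
var current = rdd
for (_ <- 1 to 10) {
  current = current.map(step)
}

// A single materialization at the very end; for very long chains you would
// checkpoint periodically to keep the lineage manageable.
val result = current.collect()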


On Mon, Feb 17, 2014 at 9:51 AM, David Thomas <dt...@gmail.com> wrote:

> I have a spark application that has the below structure:
>
> while(...) { // 10-100k iterations
>   rdd.map(...).collect
> }
>
> Basically, I have an RDD and I need to query it multiple times.
>
> Now when I run this, for each iteration, Spark creates a new stage (each
> stage having multiple tasks). What I find is that the stage execution takes
> about 1 second and most of the time is spent scheduling the tasks. Since a
> stage is not submitted until the previous stage is completed, this loop
> takes a long time to complete. So my question is, is there a way to
> interleave multiple stage executions? Any other suggestions to improve the
> above query pattern?
>