Posted to user@spark.apache.org by Marius Danciu <ma...@gmail.com> on 2015/07/03 09:13:20 UTC

Optimizations

Hi all,

If I have something like:

rdd.join(...).mapPartitionToPair(...)

It looks like mapPartitionToPair runs in a different stage than join. Is
there a way to piggyback this computation inside the join stage, so that
each result partition after the join is passed to the mapPartitionToPair
function, all running in the same stage without any other costs?
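
For concreteness, here is a minimal self-contained sketch of such a pipeline
with the Java API (where the method is spelled mapPartitionsToPair; the class
name, data and key types below are purely illustrative). Printing
toDebugString shows where the shuffle, i.e. stage, boundaries fall in the
lineage:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class JoinThenMapPartitions {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("join-then-mapPartitions").setMaster("local[*]"));

    JavaPairRDD<Integer, String> left = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>(1, "a"), new Tuple2<>(2, "b")));
    JavaPairRDD<Integer, String> right = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>(1, "x"), new Tuple2<>(2, "y")));

    // join(...) introduces a shuffle; the function below then sees one joined
    // partition at a time through its iterator.
    JavaPairRDD<Integer, String> result = left.join(right)
        .mapPartitionsToPair(it -> {
          List<Tuple2<Integer, String>> out = new ArrayList<>();
          while (it.hasNext()) {
            Tuple2<Integer, Tuple2<String, String>> kv = it.next();
            out.add(new Tuple2<>(kv._1(), kv._2()._1() + kv._2()._2()));
          }
          return out; // the 1.x Java API expects an Iterable here
        });

    // Indentation changes in this printout mark shuffle (stage) boundaries.
    System.out.println(result.toDebugString());
    sc.stop();
  }
}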

Best,
Marius

Re: Optimizations

Posted by Raghavendra Pandey <ra...@gmail.com>.
This is the basic design of Spark: it runs these operations in different
stages, split at shuffle boundaries.
Not sure you can achieve what you are looking for.
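
One quick way to check where that stage split actually falls for this job is
the lineage printout plus the web UI; a small sketch, assuming the pair RDD
from the original question is in scope under the illustrative name result:

// result stands for rdd.join(...).mapPartitionsToPair(f) from the question.
// Indentation changes in toDebugString mark shuffle dependencies, which is
// where the scheduler cuts the job into stages.
System.out.println(result.toDebugString());
// Run the job, then compare against the Stages tab of the web UI (port 4040
// by default) to see which stage actually executed f.
result.count();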
On Jul 3, 2015 12:43 PM, "Marius Danciu" <ma...@gmail.com> wrote:

> Hi all,
>
> If I have something like:
>
> rdd.join(...).mapPartitionToPair(...)
>
> It looks like mapPartitionToPair runs in a different stage than join. Is
> there a way to piggyback this computation inside the join stage, so that
> each result partition after the join is passed to the mapPartitionToPair
> function, all running in the same stage without any other costs?
>
> Best,
> Marius
>

Re: Optimizations

Posted by Marius Danciu <ma...@gmail.com>.
Thanks for your feedback. Yes, I am aware of the stage design, and Silvio,
what you are describing is essentially a map-side join, which is not
applicable when both RDDs are quite large.

It appears that with

rdd.join(...).mapToPair(f)

f is piggybacked inside the join stage (right in the reducers, I believe),

whereas with

rdd.join(...).mapPartitionToPair(f)

f is executed in a different stage. This is surprising because, at least
intuitively, the difference between mapToPair and mapPartitionToPair is that
the former is about the push model whereas the latter is about pulling
records out of the iterator (*I suspect there are other technical reasons*).
If anyone knows the depths of the problem, it would be of great help.
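
To make the comparison reproducible, here are the two shapes side by side as
a sketch (Java API; it assumes joined is the JavaPairRDD<Integer,
Tuple2<String, String>> produced by the join, that f simply concatenates the
two joined values, and that the usual Spark, scala.Tuple2 and java.util
imports are in scope):

// Record-at-a-time: the function is invoked once per joined record.
JavaPairRDD<Integer, String> viaMapToPair = joined.mapToPair(
    kv -> new Tuple2<>(kv._1(), kv._2()._1() + kv._2()._2()));

// Partition-at-a-time: the function receives the iterator over one joined
// partition and returns whatever records it decides to emit.
JavaPairRDD<Integer, String> viaMapPartitionsToPair = joined.mapPartitionsToPair(it -> {
  List<Tuple2<Integer, String>> out = new ArrayList<>();
  while (it.hasNext()) {
    Tuple2<Integer, Tuple2<String, String>> kv = it.next();
    out.add(new Tuple2<>(kv._1(), kv._2()._1() + kv._2()._2()));
  }
  return out; // the 1.x Java API expects an Iterable here
});

// Comparing viaMapToPair.toDebugString() with viaMapPartitionsToPair.toDebugString()
// shows whether either chain introduces an extra shuffle boundary after the join.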

Best,
Marius

On Fri, Jul 3, 2015 at 6:43 PM Silvio Fiorito <si...@granturing.com>
wrote:

>   One thing you could do is a broadcast join. You take your smaller RDD and
> save it as a broadcast variable. Then run a map operation to perform the
> join and whatever else you need to do. This will remove a shuffle stage, but
> you will still have to collect the smaller RDD and broadcast it. Whether
> it’s worth it depends on the size of your data.
>
>   From: Marius Danciu
> Date: Friday, July 3, 2015 at 3:13 AM
> To: user
> Subject: Optimizations
>
>   Hi all,
>
>  If I have something like:
>
>  rdd.join(...).mapPartitionToPair(...)
>
>  It looks like mapPartitionToPair runs in a different stage than join. Is
> there a way to piggyback this computation inside the join stage, so that
> each result partition after the join is passed to the mapPartitionToPair
> function, all running in the same stage without any other costs?
>
>  Best,
> Marius
>

Re: Optimizations

Posted by Silvio Fiorito <si...@granturing.com>.
One thing you could do is a broadcast join. You take your smaller RDD and save it as a broadcast variable. Then run a map operation to perform the join and whatever else you need to do. This will remove a shuffle stage, but you will still have to collect the smaller RDD and broadcast it. Whether it’s worth it depends on the size of your data.
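
A sketch of what that can look like with the Java API (all names, key and
value types below are illustrative, and it assumes inner-join semantics where
keys missing from the small side are simply dropped):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import scala.Tuple2;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BroadcastJoinSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("broadcast-join-sketch").setMaster("local[*]"));

    JavaPairRDD<Integer, String> big = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>(1, "a"), new Tuple2<>(2, "b"), new Tuple2<>(3, "c")));
    JavaPairRDD<Integer, String> small = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>(1, "x"), new Tuple2<>(2, "y")));

    // Collect the small side to the driver and broadcast it to every executor.
    Broadcast<Map<Integer, String>> lookup =
        sc.broadcast(new HashMap<>(small.collectAsMap()));

    // Map-side (broadcast) join: the big side is only mapped, so it is never
    // shuffled; keys absent from the small side are dropped.
    JavaPairRDD<Integer, Tuple2<String, String>> joined = big.flatMapToPair(kv -> {
      List<Tuple2<Integer, Tuple2<String, String>>> out = new ArrayList<>();
      String other = lookup.value().get(kv._1());
      if (other != null) {
        Tuple2<String, String> pair = new Tuple2<>(kv._2(), other);
        out.add(new Tuple2<>(kv._1(), pair));
      }
      return out; // the 1.x Java API expects an Iterable here
    });

    joined.foreach(t -> System.out.println(t));
    sc.stop();
  }
}

The trade-off is exactly as described above: the small side has to fit in
driver and executor memory, since it is collected and broadcast in full.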

From: Marius Danciu
Date: Friday, July 3, 2015 at 3:13 AM
To: user
Subject: Optimizations

Hi all,

If I have something like:

rdd.join(...).mapPartitionToPair(...)

It looks like mapPartitionToPair runs in a different stage than join. Is there a way to piggyback this computation inside the join stage, so that each result partition after the join is passed to the mapPartitionToPair function, all running in the same stage without any other costs?

Best,
Marius