Posted to user@spark.apache.org by Marius Danciu <ma...@gmail.com> on 2015/07/03 09:13:20 UTC
Optimizations
Hi all,
If I have something like:
rdd.join(...).mapPartitionsToPair(...)
It looks like mapPartitionsToPair runs in a different stage than join. Is
there a way to piggyback this computation inside the join stage? ... such
that each resulting partition after the join is passed to
the mapPartitionsToPair function, all running in the same stage without any
other costs.
Best,
Marius
Re: Optimizations
Posted by Raghavendra Pandey <ra...@gmail.com>.
This is the basic design of Spark: it runs all actions in different
stages...
Not sure you can achieve what you're looking for.
On Jul 3, 2015 12:43 PM, "Marius Danciu" <ma...@gmail.com> wrote:
> Hi all,
>
> If I have something like:
>
> rdd.join(...).mapPartitionsToPair(...)
>
> It looks like mapPartitionsToPair runs in a different stage than join. Is
> there a way to piggyback this computation inside the join stage? ... such
> that each resulting partition after the join is passed to
> the mapPartitionsToPair function, all running in the same stage without any
> other costs.
>
> Best,
> Marius
>
Re: Optimizations
Posted by Marius Danciu <ma...@gmail.com>.
Thanks for your feedback. Yes, I am aware of the stages design, and Silvio,
what you are describing is essentially a map-side join, which is not
applicable when both RDDs are quite large.
It appears that with
rdd.join(...).mapToPair(f)
f is piggybacked inside the join stage (right in the reducers, I believe),
whereas with
rdd.join(...).mapPartitionsToPair(f)
f is executed in a different stage. This is surprising because, at least
intuitively, the difference between mapToPair and mapPartitionsToPair is
that the former follows the push model (the framework calls f once per
record) whereas the latter pulls records out of the partition's iterator
(*I suspect there are other technical reasons*).
If anyone knows the depths of the problem, it would be of great help.
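To make the push-vs-pull distinction concrete, here is a toy sketch in plain Python (no Spark involved; the function names and the "partitions as lists of lists" representation are made up for illustration only):

```python
# Sketch of the two operator contracts, simulated on plain Python lists.
# A list of lists stands in for an RDD's partitions; this is NOT Spark code.

def map_to_pair(partitions, f):
    # Push model: the framework drives iteration and calls f once per record.
    return [[f(record) for record in part] for part in partitions]

def map_partitions_to_pair(partitions, f):
    # Pull model: f receives the whole iterator and pulls records itself.
    return [list(f(iter(part))) for part in partitions]

partitions = [[("a", 1), ("b", 2)], [("c", 3)]]

pushed = map_to_pair(partitions, lambda kv: (kv[0], kv[1] * 10))
pulled = map_partitions_to_pair(
    partitions, lambda it: ((k, v * 10) for k, v in it))

print(pushed)  # [[('a', 10), ('b', 20)], [('c', 30)]]
print(pulled)  # same result; the difference is only who drives the loop
```

Both produce the same pairs; the per-partition variant just hands control of the loop to f, which is why one might expect it to pipeline into the same stage equally well.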
Best,
Marius
On Fri, Jul 3, 2015 at 6:43 PM Silvio Fiorito <si...@granturing.com>
wrote:
> One thing you could do is a broadcast join. You take your smaller RDD and
> save it as a broadcast variable. Then run a map operation to perform the
> join and whatever else you need to do. This removes a shuffle stage, but
> you still have to collect the smaller RDD and broadcast it. Whether it's
> worth it depends on the size of your data.
>
> From: Marius Danciu
> Date: Friday, July 3, 2015 at 3:13 AM
> To: user
> Subject: Optimizations
>
> Hi all,
>
> If I have something like:
>
> rdd.join(...).mapPartitionsToPair(...)
>
> It looks like mapPartitionsToPair runs in a different stage than join. Is
> there a way to piggyback this computation inside the join stage? ... such
> that each resulting partition after the join is passed to
> the mapPartitionsToPair function, all running in the same stage without any
> other costs.
>
> Best,
> Marius
>
Re: Optimizations
Posted by Silvio Fiorito <si...@granturing.com>.
One thing you could do is a broadcast join. You take your smaller RDD and save it as a broadcast variable. Then run a map operation to perform the join and whatever else you need to do. This removes a shuffle stage, but you still have to collect the smaller RDD and broadcast it. Whether it's worth it depends on the size of your data.
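The idea can be sketched in plain Python (no Spark; a dict stands in for the broadcast variable, and the variable names are illustrative only):

```python
# Broadcast (map-side) join, simulated without Spark: the small side is
# collected into a dict and "broadcast" to every partition of the large side.

small = [("a", "x"), ("b", "y")]                    # small enough to collect
large_partitions = [[("a", 1), ("b", 2)], [("a", 3), ("d", 4)]]

broadcast = dict(small)                             # collect + broadcast step

def join_partition(part):
    # Inner join done locally within each partition: `large` is never shuffled.
    return [(k, (v, broadcast[k])) for k, v in part if k in broadcast]

joined = [join_partition(p) for p in large_partitions]
print(joined)
# [[('a', (1, 'x')), ('b', (2, 'y'))], [('a', (3, 'x'))]]
```

As noted above, this only pays off when one side fits comfortably in memory on every executor.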
From: Marius Danciu
Date: Friday, July 3, 2015 at 3:13 AM
To: user
Subject: Optimizations
Hi all,
If I have something like:
rdd.join(...).mapPartitionsToPair(...)
It looks like mapPartitionsToPair runs in a different stage than join. Is there a way to piggyback this computation inside the join stage? ... such that each resulting partition after the join is passed to the mapPartitionsToPair function, all running in the same stage without any other costs.
Best,
Marius