Posted to user@spark.apache.org by Antony Mayi <an...@yahoo.com.INVALID> on 2014/12/05 14:07:13 UTC

cartesian on pyspark not parallelised

Hi,

I am using pyspark 1.1.0 on YARN 2.5.0. All operations run nicely in parallel - I can see multiple python processes spawned on each nodemanager - but for some reason, when running cartesian, there is only a single python process running on each node. The task is reporting thousands of partitions, so I don't understand why it is not running with higher parallelism. The performance is obviously poor, even though every other operation is fast.
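
For reference, this is roughly the shape of the job (the RDD names and sizes here are just illustrative, not my real data):

    from pyspark import SparkContext

    sc = SparkContext(appName="cartesian-example")

    # two modest RDDs, 100 partitions each
    left = sc.parallelize(range(10000), 100)
    right = sc.parallelize(range(10000), 100)

    # cartesian multiplies the partition counts: 100 * 100 = 10000 partitions
    pairs = left.cartesian(right)
    print(pairs.getNumPartitions())

    # the follow-up stage is where only one python process per node shows up
    result = pairs.map(lambda p: p[0] * p[1]).sum()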

any idea how to improve this?

thank you,
Antony.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: cartesian on pyspark not parallelised

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
You could try increasing the level of parallelism
(spark.default.parallelism) when creating the SparkContext.
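
For example (the value 200 is just illustrative - set it relative to the total number of cores in your cluster):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("cartesian-job")
            .set("spark.default.parallelism", "200"))  # illustrative value
    sc = SparkContext(conf=conf)

    left = sc.parallelize(range(1000))
    right = sc.parallelize(range(1000))

    # If the cartesian output still keeps only one python worker busy per
    # node, explicitly repartitioning the result before the expensive
    # transformation may also help:
    pairs = left.cartesian(right).repartition(200)
    print(pairs.getNumPartitions())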

Thanks
Best Regards
