Posted to user@spark.apache.org by Anselme Vignon <an...@flaminem.com> on 2014/12/02 14:23:23 UTC

Parallelize independent tasks

Hi folks,


We have written a Spark job that scans multiple HDFS directories and
performs transformations on them.

For now, this is done with a simple for loop that starts one task at
each iteration. This looks like:

dirs.foreach { case (src, dest) => sc.textFile(src).process.saveAsTextFile(dest) }


However, each iteration is independent, and we would like to optimize
this by running them with Spark simultaneously (or in a chained
fashion), so that we don't have idle executors at the end of each
iteration (some directories sometimes require only one partition).


Has anyone already done such a thing? How would you suggest we could do that?

Cheers,

Anselme

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Parallelize independent tasks

Posted by Victor Tso-Guillen <vt...@paxata.com>.
dirs.par.foreach { case (src, dest) =>
  sc.textFile(src).process.saveAsTextFile(dest)
}

Is that sufficient for you?
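
For completeness, here is a minimal sketch of the same idea using Scala
futures instead of a parallel collection. The paths, the 4-thread pool
size, and the map step standing in for `process` are illustrative
assumptions, and saveAsTextFile is assumed as the concrete save method.
The point is that the Spark scheduler accepts jobs from multiple
threads, so each directory's job can be submitted from its own future:

import java.util.concurrent.Executors

import org.apache.spark.{SparkConf, SparkContext}

import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

object ParallelDirs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parallel-dirs"))

    // Hypothetical (src, dest) pairs; in the original post these come from `dirs`.
    val dirs = Seq(("hdfs:///in/a", "hdfs:///out/a"),
                   ("hdfs:///in/b", "hdfs:///out/b"))

    // Dedicated thread pool: its size bounds how many Spark jobs are
    // submitted concurrently (4 is an arbitrary choice here).
    implicit val ec = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

    // One future per directory; the Spark scheduler is thread-safe, so the
    // jobs run concurrently and a small directory no longer leaves the rest
    // of the executors idle while it finishes.
    val jobs = dirs.map { case (src, dest) =>
      Future {
        sc.textFile(src)
          .map(_.toUpperCase)          // stand-in for the real `process` step
          .saveAsTextFile(dest)
      }
    }

    Await.result(Future.sequence(jobs), Duration.Inf)  // wait for all jobs
    ec.shutdown()
    sc.stop()
  }
}

Note that with the default FIFO scheduler, later jobs can still start
whenever the jobs ahead of them do not need the whole cluster; setting
spark.scheduler.mode to FAIR gives the concurrent jobs a round-robin
share of the executors instead.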

On Tuesday, December 2, 2014, Anselme Vignon <an...@flaminem.com>
wrote:

> Hi folks,
>
>
> We have written a Spark job that scans multiple HDFS directories and
> performs transformations on them.
>
> For now, this is done with a simple for loop that starts one task at
> each iteration. This looks like:
>
> dirs.foreach { case (src, dest) =>
> sc.textFile(src).process.saveAsTextFile(dest) }
>
>
> However, each iteration is independent, and we would like to optimize
> this by running them with Spark simultaneously (or in a chained
> fashion), so that we don't have idle executors at the end of each
> iteration (some directories sometimes require only one partition).
>
>
> Has anyone already done such a thing? How would you suggest we could do
> that?
>
> Cheers,
>
> Anselme
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>