Posted to user@spark.apache.org by flyinggip <my...@hotmail.com> on 2016/05/12 13:34:13 UTC

Efficient for loops in Spark

Hi there, 

I'd like to write some iterative computation, i.e., computation that would
normally be expressed as a for loop. I understand that in Spark, foreach is
usually the better choice. However, foreach and foreachPartition seem to be
meant for self-contained computation that involves only the corresponding
row or partition, respectively. In my application, each computational task
involves not only its own partition but also the other partitions; it is
just that every task uses its own partition and the remaining partitions in
a specific way. An example is cross-validation in machine learning, where
each fold corresponds to a partition: say the whole data set is divided into
5 folds, then for task 1 I use fold 1 for testing and folds 2, 3, 4, 5 for
training; for task 2 I use fold 2 for testing and folds 1, 3, 4, 5 for
training; and so on.
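
(For concreteness, here is roughly the driver-side version of the loop I
have in mind, as a minimal sketch against the standard RDD API; the names
data, trainModel and evaluate are placeholders, not real code:)

// Sketch only: driver-side k-fold loop over a cached RDD, so the data is
// not physically duplicated per fold. data, trainModel and evaluate stand
// in for my actual input RDD, training step and evaluation step.
val k = 5

// Tag every record with a fold id once (round-robin here; a seeded random
// assignment would work too) and cache, so all k iterations reuse it.
val withFold = data.zipWithIndex.map { case (r, i) => ((i % k).toInt, r) }.cache()

for (fold <- 0 until k) {
  val test  = withFold.filter(_._1 == fold).values   // current fold held out for testing
  val train = withFold.filter(_._1 != fold).values   // the other k-1 folds for training
  val model = trainModel(train)                       // placeholder
  println(s"fold $fold: score = ${evaluate(model, test)}")
}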

In this case, if I were to use foreachPartition, it seems I would have to
duplicate the data as many times as there are folds (i.e., iterations of my
for loop). More generally, I would still have to prepare one partition for
every distributed task, and that partition would have to contain all the
data the task needs, which could be a huge waste of space.

Are there any other solutions? Thanks.

f. 



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-for-loops-in-Spark-tp26939.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Efficient for loops in Spark

Posted by Erik Erlandson <ej...@redhat.com>.
Regarding the specific problem of generating random folds in a more efficient way, this should help:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.sample.split.SplitSampleRDDFunctions

It uses a sort of multiplexing formalism on RDDs:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.rdd.multiplex.MuxRDDFunctions

I wrote a blog post to explain the idea here:
http://erikerlandson.github.io/blog/2016/02/08/efficient-multiplexing-for-spark-rdds/
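
If you don't want to pull in a library, a rough approximation of the same
idea using only stock Spark is to assign fold ids in a single pass and then
aggregate per fold. The snippet below is only an illustration (it is not the
silex API, and the name data and the RDD[Double] element type are assumed);
it computes a summary for every fold from one pass over the data:

// Stock-Spark sketch (not the silex API): one pass over the data tags each
// element with a random fold id, then a single reduceByKey produces a
// result for every fold, instead of filtering the RDD k separate times.
// data is assumed to be an existing RDD[Double]; adapt to your record type.
val k = 5

val perFoldStats = data
  .map(x => (scala.util.Random.nextInt(k), (x, 1L)))               // (fold, (value, count))
  .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) } // per-fold (sum, count)
  .collectAsMap()                                                  // small: one entry per fold

(0 until k).foreach { fold =>
  perFoldStats.get(fold).foreach { case (sum, n) =>
    println(s"fold $fold: n = $n, mean = ${sum / n}")
  }
}

The split/mux functions linked above are meant for the case where you need
the per-fold RDDs themselves rather than just per-fold aggregates.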




---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org