Posted to user@spark.apache.org by Mohit Jaggi <mo...@gmail.com> on 2014/09/04 18:36:18 UTC

efficient zipping of lots of RDDs

Folks,
I sent an email announcing
https://github.com/AyasdiOpenSource/df

This dataframe is basically a map of column RDDs (along with DSL sugar),
since column-based operations seem to be the most common. But row operations
are not uncommon either. To get rows out of columns, I currently zip the
column RDDs together: I use RDD.zip pairwise and then flatten the resulting
tuples. I realize that RDD.zipPartitions might be faster. However, I believe
an even better approach should be possible: a zip method that can combine an
arbitrary number of RDDs in one pass. Could that be added to Spark core? Or
is there an alternative, equally good or better approach? A rough sketch of
what I mean follows below.
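Roughly what I do today, as a minimal untested sketch using zipPartitions
instead of RDD.zip. The helper name zipColumns is mine, and it assumes every
column is an RDD[Any] and that all columns share the same partitioning and
the same number of elements per partition:

import org.apache.spark.rdd.RDD

// Untested sketch (assumptions above): fold over the column RDDs,
// zipping partitions pairwise and accumulating each row as a Seq[Any].
def zipColumns(columns: Seq[RDD[Any]]): RDD[Seq[Any]] = {
  require(columns.nonEmpty, "need at least one column RDD")
  // Start with single-element rows built from the first column.
  val first: RDD[Seq[Any]] = columns.head.map(Vector(_))
  // Zip in each remaining column one partition at a time, appending
  // its value to the growing row. Note Iterator.zip silently stops at
  // the shorter iterator, hence the equal-length assumption.
  columns.tail.foldLeft(first) { (rows, col) =>
    rows.zipPartitions(col) { (rowIter, colIter) =>
      rowIter.zip(colIter).map { case (row, v) => row :+ v }
    }
  }
}

Chaining pairwise zips like this still builds N-1 intermediate RDDs for N
columns, which is why a single variadic zip in Spark core seems attractive.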

Cheers,
Mohit.

Re: efficient zipping of lots of RDDs

Posted by Mohit Jaggi <mo...@gmail.com>.
Filed JIRA SPARK-3489: https://issues.apache.org/jira/browse/SPARK-3489

On Thu, Sep 4, 2014 at 9:36 AM, Mohit Jaggi <mo...@gmail.com> wrote:

> Folks,
> I sent an email announcing
> https://github.com/AyasdiOpenSource/df
>
> This dataframe is basically a map of column RDDs (along with DSL
> sugar), since column-based operations seem to be the most common. But row
> operations are not uncommon either. To get rows out of columns, I
> currently zip the column RDDs together: I use RDD.zip pairwise and then
> flatten the resulting tuples. I realize that RDD.zipPartitions might be
> faster. However, I believe an even better approach should be possible: a
> zip method that can combine an arbitrary number of RDDs in one pass. Could
> that be added to Spark core? Or is there an alternative, equally good or
> better approach?
>
> Cheers,
> Mohit.
>