Posted to user@spark.apache.org by Daniel Imberman <da...@gmail.com> on 2016/01/04 20:24:04 UTC

Comparing Subsets of an RDD

Hi,

I’m looking for a way to compare subsets of an RDD intelligently.

Let’s say I have an RDD of key/value pairs of type (Int, T). I eventually
need to say “compare all values of key 1 with all values of key 2, and
compare the values of key 3 to the values of key 5 and key 7”. How would I
go about doing this efficiently?
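
For concreteness, here is a toy version of the kind of input I mean (the
data and names below are made up purely for illustration; "sc" is an
existing SparkContext):

import org.apache.spark.rdd.RDD

// hypothetical example input: (key, value) records
val r: RDD[(Int, String)] = sc.parallelize(Seq(
  (1, "a"), (1, "b"), (2, "c"), (3, "d"), (5, "e"), (7, "f")
))
// desired comparisons: key 1 vs key 2, key 3 vs key 5, key 3 vs key 7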

The way I’m currently thinking of doing it is by creating a List of
filtered RDDs and then using RDD.cartesian():


// keep only the values that belong to a single key
def filterSubset[T](b: Int, r: RDD[(Int, T)]): RDD[(Int, T)] =
  r.filter { case (name, _) => name == b }

val keyPairs: List[(Int, Int)] = ??? // all key pairs to compare

// one cartesian product per pair of keys
val rddPairs = keyPairs.map { case (a, b) =>
  filterSubset(a, r).cartesian(filterSubset(b, r))
}

// then, on each RDD of pairs:
rddPairs.map { pairRdd =>
  pairRdd.map { case ((_, x), (_, y)) => ??? /* whatever I want to compare… */ }
}



I would then iterate over the list and perform a map on each RDD of pairs
to gather the relational data that I need.



What I can’t tell about this idea is whether it would be extremely
inefficient to set up possibly hundreds of map jobs and then iterate
through them. In that case, would Spark’s lazy evaluation optimize the
data shuffling between all of the maps? If not, can someone please
recommend a more efficient way to approach this problem?
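
To make the question concrete, this is the rough kind of single-shuffle
alternative I’ve been wondering about: tag each record with the comparison
pair(s) it belongs to and do one join, instead of one cartesian per pair
(untested sketch, using r and keyPairs from above):

// index the key pairs by their left and right members
val pairsByLeft  = keyPairs.groupBy(_._1)  // key -> pairs where it is on the left
val pairsByRight = keyPairs.groupBy(_._2)  // key -> pairs where it is on the right

// re-key every record by the comparison pair(s) it participates in
val left  = r.flatMap { case (k, v) => pairsByLeft.getOrElse(k, Nil).map(p => (p, v)) }
val right = r.flatMap { case (k, v) => pairsByRight.getOrElse(k, Nil).map(p => (p, v)) }

// single shuffle; each element is ((keyA, keyB), (valueFromKeyA, valueFromKeyB))
val compared = left.join(right)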


Thank you for your help
