Posted to user@spark.apache.org by Daniel Imberman <da...@gmail.com> on 2016/01/04 20:24:04 UTC
Comparing Subsets of an RDD
Hi,
I’m looking for a way to compare subsets of an RDD intelligently.
Let's say I have an RDD with key/value pairs of type (Int -> T). I eventually
need to say "compare all values of key 1 with all values of key 2, and
compare the values of key 3 to the values of key 5 and key 7". How would I go
about doing this efficiently?
The way I'm currently thinking of doing it is by creating a List of
filtered RDDs and then using RDD.cartesian():
def filterSubset[T](b: Int, r: RDD[(Int, T)]): RDD[(Int, T)] =
  r.filter { case (name, _) => name == b }

val keyPairs: Seq[(Int, Int)] = ??? // all key pairs
val rddPairs = keyPairs.map {
  case (a, b) =>
    filterSubset(a, r).cartesian(filterSubset(b, r))
}
rddPairs.map { /* whatever I want to compare… */ }
I would then iterate the list and perform a map on each of the RDDs of
pairs to gather the relational data that I need.
What I can't tell about this idea is whether it would be extremely
inefficient to set up possibly hundreds of map jobs and then iterate
through them. In this case, would lazy evaluation in Spark optimize the
data shuffling between all of the maps? If not, can someone please
recommend a possibly more efficient way to approach this issue?
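One alternative I've been sketching (names like compareByGroups and the comparisons parameter are my own, purely illustrative): instead of building one filtered RDD plus one cartesian per key pair, tag each record with the ids of the comparisons it participates in and run a single join, so everything happens in one shuffle rather than hundreds of separate jobs. Assumes a Spark context and an input RDD r: RDD[(Int, T)] as above.

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def compareByGroups[T: ClassTag](r: RDD[(Int, T)],
                                 comparisons: Seq[(Int, Int)]): RDD[((Int, Int), (T, T))] = {
  // Map each key to the comparison ids in which it appears on the left / right side.
  val leftIds: Map[Int, Seq[Int]] =
    comparisons.zipWithIndex.groupBy(_._1._1).mapValues(_.map(_._2)).toMap
  val rightIds: Map[Int, Seq[Int]] =
    comparisons.zipWithIndex.groupBy(_._1._2).mapValues(_.map(_._2)).toMap

  // Tag every record with the comparison ids it belongs to; a record may fan out
  // to several comparisons (e.g. key 3 compared against both key 5 and key 7).
  val lefts  = r.flatMap { case (k, v) => leftIds.getOrElse(k, Nil).map(id => (id, v)) }
  val rights = r.flatMap { case (k, v) => rightIds.getOrElse(k, Nil).map(id => (id, v)) }

  // A single join yields every (leftValue, rightValue) pair per comparison id,
  // keyed back to the original (leftKey, rightKey) pair.
  lefts.join(rights).map { case (id, pair) => (comparisons(id), pair) }
}
```

Within each key pair this produces the same value pairs as the cartesian of the two filtered subsets, but as one job over one shuffled dataset, which I'd expect to schedule better than many small cartesian jobs. No idea if this is the idiomatic way, though.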
Thank you for your help