Posted to user@spark.apache.org by bethesda <sw...@mac.com> on 2014/12/18 23:18:20 UTC

Creating a smaller, derivative RDD from an RDD

We have a very large RDD and I need to create a new RDD whose values are
derived from each record of the original RDD, and we only retain the few new
records that meet a criterion.  I want to avoid creating a second large RDD
and then filtering it, since I believe this could tax system resources
unnecessarily (tell me if that assumption is wrong).

So for example, /and this is just an example/, say we have an RDD containing
1 to 1,000,000 and we iterate through each value, compute its md5 hash, and
we only keep the results that start with 'A'.

What we've tried, which seems to work but seemed a bit ugly and perhaps
not efficient, is the following (in pseudocode). *Is this the best way to do
this?*

Thanks

bigRdd.flatMap( { i =>
  val h = md5(i)
  if (h.substring(0, 1) == "A") {   // keep only hashes whose first character is 'A'
    Array(h)
  } else {
    Array[String]()                 // emit nothing for this record
  }
})
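
For reference, the md5 helper used above isn't shown in the post; a minimal
Scala version (just an assumed implementation, producing uppercase hex so the
'A' comparison makes sense) could be:

import java.security.MessageDigest

def md5(i: Int): String =
  MessageDigest.getInstance("MD5")          // standard JDK digest
    .digest(i.toString.getBytes("UTF-8"))   // hash the decimal string form of the value
    .map("%02X".format(_))                  // uppercase hex, two chars per byte
    .mkString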




---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Creating a smaller, derivative RDD from an RDD

Posted by Sean Owen <so...@cloudera.com>.
I don't think you can avoid examining each element of the RDD, if
that's what you mean. Your approach is basically the best you can do
in general. You're not making a second RDD here, and even if you did
this in two steps, the second RDD would really be more bookkeeping than
a second huge data structure.
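
To illustrate that point, a two-step version (assuming the same md5 helper)
only builds up lazy lineage; nothing is materialized until an action runs:

val hashes  = bigRdd.map(md5)               // defines a new RDD, but computes nothing yet
val keepers = hashes.filter(_(0) == 'A')    // still just lineage metadata
val n       = keepers.count()               // the whole pipeline executes here, in one pass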

You can simplify your example a bit, although I doubt it's noticeably faster:

bigRdd.flatMap { i =>
  val h = md5(i)
  if (h(0) == 'A') {
    Some(h)   // Some contributes one element to the result
  } else {
    None      // None contributes nothing, so the record is dropped
  }
}

This is also fine, simpler still, and if it's slower, not by much:

bigRdd.map(md5).filter(_(0) == 'A')
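
As a complete, self-contained sketch of that last form (the md5 helper and
the local master are assumptions, just for illustration):

import java.security.MessageDigest
import org.apache.spark.{SparkConf, SparkContext}

object Md5Filter {
  def md5(i: Int): String =
    MessageDigest.getInstance("MD5")
      .digest(i.toString.getBytes("UTF-8"))
      .map("%02X".format(_))
      .mkString

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("md5-filter").setMaster("local[*]"))
    val bigRdd  = sc.parallelize(1 to 1000000)         // stand-in for the "very large" RDD
    val keepers = bigRdd.map(md5).filter(_(0) == 'A')  // derive, then keep hashes starting with 'A'
    println(keepers.count())
    sc.stop()
  }
}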


