Posted to user@spark.apache.org by Manoj Samel <ma...@gmail.com> on 2014/01/24 21:56:48 UTC

How to create RDD over hashmap?

Is there a way to create an RDD over a hashmap?

If I have a hash map and try sc.parallelize, it gives:

<console>:17: error: type mismatch;
 found   : scala.collection.mutable.HashMap[String,Double]
 required: Seq[?]
Error occurred in an application involving default arguments.
       val cr_rdd = sc.parallelize(cr)
                                   ^

Re: How to create RDD over hashmap?

Posted by Manoj Samel <ma...@gmail.com>.
sc.parallelize(cr.iterator) also gives an error:

scala> val cr_rdd = sc.parallelize(cr.iterator)
<console>:17: error: type mismatch;
 found   : Iterator[(String, Double)]
 required: Seq[?]
Error occurred in an application involving default arguments.
       val cr_rdd = sc.parallelize(cr.iterator)
                                      ^


Re: How to create RDD over hashmap?

Posted by Andrew Ash <an...@andrewash.com>.
In Java you'd want to convert it to an entry set, which is a set of (key,
value) pairs from the hashmap.  The closest I can see in scaladoc is the
.iterator method -- try that?



Re: How to create RDD over hashmap?

Posted by Cheng Lian <rh...@gmail.com>.
An RDD is essentially a distributed *vector*, partitioned over multiple
worker nodes at runtime.  So HashMap-like O(1) key lookup over the whole
RDD is not available, but you can turn all the key/value pairs within a
single partition into a HashMap via RDD.mapPartitions:

  someRdd.mapPartitions { iter =>
    val hashMap = iter.toMap
    ...
  }
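
For illustration, a minimal self-contained sketch of that idea; the RDD name
someRdd, its element type (String, Double), and the probed key "someKey" are
assumptions for the example, not part of the original reply:

  // someRdd: RDD[(String, Double)]
  val perPartitionHits = someRdd.mapPartitions { iter =>
    val hashMap = iter.toMap            // one in-memory map per partition
    hashMap.get("someKey").iterator     // O(1) probe within this partition
  }
  perPartitionHits.collect()            // at most one match per partition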





Re: How to create RDD over hashmap?

Posted by Manoj Samel <ma...@gmail.com>.
Thanks to all the suggestions, I was able to make progress on it.

Manoj



Re: How to create RDD over hashmap?

Posted by Mark Hamstra <ma...@clearstorydata.com>.
You usually don't need to do so explicitly, since the implicit conversions
in Spark take care of that for you.  Any RDD[(K, V)] is a PairRDD; so, e.g.,
sc.parallelize(1 to 10).map(i => (i, i.toString)) is just one of many ways
to generate a PairRDD.
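
As a rough sketch starting from the HashMap in the original question (the map
contents are made up, and the import line is only needed outside the Spark
shell of that era, which already had the implicit conversions in scope):

import org.apache.spark.SparkContext._           // brings in PairRDDFunctions
import scala.collection.mutable.HashMap

val cr = HashMap("a" -> 1.0, "b" -> 2.0, "c" -> 3.0)
val pairRdd = sc.parallelize(cr.toSeq)            // RDD[(String, Double)]
pairRdd.reduceByKey(_ + _).collect()              // pair-RDD methods now work
pairRdd.keys.collect()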



Re: How to create RDD over hashmap?

Posted by Manoj Samel <ma...@gmail.com>.
How would I create a PairRDD?



Re: How to create RDD over hashmap?

Posted by Tathagata Das <ta...@gmail.com>.
On this note, you can do something smarter than the basic lookup function.
You could convert each partition of the key-value pair RDD into a hashmap
using something like

import scala.collection.mutable.{ArrayBuffer, HashMap}

val rddOfHashmaps = pairRDD.mapPartitions(iterator => {
    val hashmap = new HashMap[String, ArrayBuffer[Double]]
    iterator.foreach { case (key, value) =>
      hashmap.getOrElseUpdate(key, new ArrayBuffer[Double]) += value
    }
    Iterator(hashmap)
  }, preservesPartitioning = true)

And then you can do a variation of the lookup function
<https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L549>
to find the right partition, and then within that partition directly look up
the hashmap and return the value (rather than scanning the whole partition).
That gives practically O(1) lookup time instead of O(N). But I doubt it will
match what a dedicated lookup system like memcached would achieve.

TD
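
As a rough, era-specific sketch of that lookup variation (the helper name
lookupOne is made up; it assumes rddOfHashmaps was built as above from a
pairRDD with a known partitioner, and uses the runJob overload with the
allowLocal flag that existed in Spark at the time):

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD
import scala.collection.mutable.{ArrayBuffer, HashMap}

def lookupOne(rddOfHashmaps: RDD[HashMap[String, ArrayBuffer[Double]]],
              partitioner: Partitioner,
              key: String): Seq[Double] = {
  val idx = partitioner.getPartition(key)       // partition that holds the key
  val perPartition = rddOfHashmaps.context.runJob(
    rddOfHashmaps,
    (it: Iterator[HashMap[String, ArrayBuffer[Double]]]) =>
      it.flatMap(_.get(key)).flatten.toSeq,     // O(1) probe of that partition's map
    Seq(idx),
    false)                                      // allowLocal
  perPartition.head                             // values found in the target partition
}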





Re: How to create RDD over hashmap?

Posted by Andrew Ash <an...@andrewash.com>.
By my reading of the code, it uses the partitioner to decide which worker
the key lands on, then does an O(N) scan of that partition.  I think we're
saying the same thing.

https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L549



Re: How to create RDD over hashmap?

Posted by Cheng Lian <rh...@gmail.com>.
PairRDDFunctions.lookup is good enough in Spark; it's just that its time
complexity is O(N).  Of course, for RDDs equipped with a partitioner, N is
the average size of a partition.



Re: How to create RDD over hashmap?

Posted by Cheng Lian <rh...@gmail.com>.
You can simply call myKeys.collectAsMap(), which is a function in
PairRDDFunctions, but note that if multiple pairs with the same key exist in
the RDD, only the last one appears in the resulting Map object.
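
A small illustration of that caveat (the sample data is made up; it assumes
the PairRDDFunctions implicits are in scope, as they are in the shell):

  val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 2.0), ("b", 3.0)))
  val local: scala.collection.Map[String, Double] = pairs.collectAsMap()
  // local("b") == 3.0, but for "a" only one of 1.0 / 2.0 survives --
  // duplicate keys are silently collapsed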



Re: How to create RDD over hashmap?

Posted by Guillaume Pitel <gu...@exensa.com>.
Related question about this kind of problem: what is the best way to get the
mappings for a list of keys?

Does this make sense?

val myKeys=sc.parallelize(List(("query1",None),("query2",None)))
val resolved = myKeys.leftJoin(dictionary)

Guillaume


-- 
Guillaume PITEL, Président
+33(0)6 25 48 86 80 / +33(0)9 70 44 67 53

eXenSa S.A.S. <http://www.exensa.com/>
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05


Re: How to create RDD over hashmap?

Posted by Andrew Ash <an...@andrewash.com>.
If you have a pair RDD (an RDD[A,B]) then you can use the .lookup() method
on it for faster access.

http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions

Spark's strength is running computations across a large set of data.  If
you're trying to do fast lookup of a few individual keys, I'd recommend
something more like memcached or Elasticsearch.
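
A quick usage sketch with made-up data (again assuming the pair-RDD implicits
are in scope, as they are in the shell):

  val fruitRdd = sc.parallelize(Seq(("apple", 1.0), ("banana", 2.0), ("apple", 3.0)))
  fruitRdd.lookup("apple")     // Seq(1.0, 3.0) -- every value stored under that key
  fruitRdd.lookup("cherry")    // empty Seq when the key is absent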



Re: How to create RDD over hashmap?

Posted by Manoj Samel <ma...@gmail.com>.
Yes, that works.

But then the hashmap functionality of fast key lookup etc. is gone, and the
search will be linear using an iterator. Not sure if Spark internally adds
optimizations for Seq, but otherwise one has to assume this becomes a
List/Array without the fast key lookup of a hashmap or b-tree.

Any thoughts?






Re: How to create RDD over hashmap?

Posted by Frank Austin Nothaft <fn...@berkeley.edu>.
Manoj,

I assume you’re trying to create an RDD[(String, Double)]? Couldn’t you just do:

val cr_rdd = sc.parallelize(cr.toSeq)

The toSeq would convert the HashMap[String,Double] into a Seq[(String, Double)] before calling the parallelize function.
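
As a quick sanity check (the map contents here are made up for illustration):

import scala.collection.mutable.HashMap

val cr = HashMap("a" -> 1.0, "b" -> 2.0)
val cr_rdd = sc.parallelize(cr.toSeq)   // RDD[(String, Double)]
cr_rdd.collect()                        // Array((a,1.0), (b,2.0)), in some order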

Regards,

Frank Austin Nothaft
fnothaft@berkeley.edu
fnothaft@eecs.berkeley.edu
202-340-0466
