Posted to dev@spark.apache.org by Nilesh <ni...@nileshc.com> on 2014/05/26 00:55:49 UTC

Re: all values for a key must fit in memory

I would like to clarify something. Matei mentioned that in Spark 1.0 groupBy
returns a (Key, Iterable[Value]) instead of a (Key, Seq[Value]). Does this
also automatically assure us that the whole Iterable[Value] is not in fact
stored in memory? That is to say, with 1.0, will it be possible to do
groupByKey().values.map { x => val it = x.iterator; while (it.hasNext) ... }
when x: Iterable[Value] is larger than the RAM on a single machine? Or will
this only become possible later, in a subsequent version?

Could you propose a workaround for the meantime? I'm out of ideas.

Thanks,
Nilesh




Re: all values for a key must fit in memory

Posted by Nilesh <ni...@nileshc.com>.
Hi Patrick,

In this particular case, at the end of my tasks I have X different types of
keys. I need to write their values to X different files respectively. For
now I'm writing everything to the driver node's local FS.

While the number of key-value pairs can grow to millions (billions?), X is
more or less fixed at 25-30. A groupByKey followed by a
map { case (key, values) => values.foreach(v => destination.write(v)) }
would be great. Then again, I'm not too sure about serialization issues, and
more likely than not this idea would fail, but I'll try it out.

So the toLocalIterator implementation works OK for me here, though it might
turn out to be slow.
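
For the curious, this is roughly what that looks like - a minimal sketch
assuming the 1.0 API, where pairs: RDD[(String, String)] and the output
path are made-up names:

    import java.io.{File, PrintWriter}

    // Stream the grouped RDD back to the driver one partition at a time;
    // toLocalIterator materializes only one partition in driver memory.
    val grouped = pairs.groupByKey()  // RDD[(String, Iterable[String])] in 1.0
    for ((key, values) <- grouped.toLocalIterator) {
      // after groupByKey each key appears exactly once, so no append needed
      val out = new PrintWriter(new File(s"/tmp/out-$key.txt"))
      try values.foreach(v => out.println(v))
      finally out.close()
    }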

Cheers,
Nilesh

PS: Can't wait for 1.0! ^_^ Looks like it's up to RC10 by now.




Re: all values for a key must fit in memory

Posted by Patrick Wendell <pw...@gmail.com>.
Nilesh - out of curiosity - what operation are you doing on the values
for the key?

On Sun, May 25, 2014 at 6:35 PM, Nilesh <ni...@nileshc.com> wrote:
> Hi Andrew,
>
> Thanks for the reply!
>
> The API part is clearer now - that's what I wanted to know.
>
> Wow, tuples, why didn't that occur to me? That's a lovely ugly hack. :) I
> also came across something that solved my real problem, though - the
> RDD.toLocalIterator method from 1.0, whose logic thankfully works with
> 0.9.1 too, with no new API changes needed.
>
> Cheers,
> Nilesh

Re: all values for a key must fit in memory

Posted by Nilesh <ni...@nileshc.com>.
Hi Andrew,

Thanks for the reply!

The API part is clearer now - that's what I wanted to know.

Wow, tuples, why didn't that occur to me? That's a lovely ugly hack. :) I
also came across something that solved my real problem, though - the
RDD.toLocalIterator method from 1.0, whose logic thankfully works with
0.9.1 too, with no new API changes needed.
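
For anyone else stuck on 0.9.x: the logic is small enough to carry over by
hand. A sketch from memory (close to, but not verbatim, the 1.0 source):

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // Run one Spark job per partition and chain the results, so the driver
    // holds only a single partition's elements in memory at a time.
    def toLocalIterator[T: ClassTag](rdd: RDD[T]): Iterator[T] = {
      def collectPartition(p: Int): Array[T] =
        rdd.sparkContext.runJob(rdd, (it: Iterator[T]) => it.toArray,
                                Seq(p), allowLocal = false).head
      (0 until rdd.partitions.length).iterator.flatMap(collectPartition)
    }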

Cheers,
Nilesh




Re: all values for a key must fit in memory

Posted by Andrew Ash <an...@andrewash.com>.
Hi Nilesh,

That change of Matei's, from (Key, Seq[Value]) to (Key, Iterable[Value]),
was made to enable that optimization in a future release without breaking
the API.  Currently, though, all values for a single key are still held in
memory on a single machine.

The way I've gotten around this is by adding another component to the key,
turning (Key) into (Key, randomValue % 10), for example.  This lets you
shard an individual key further and avoid holding as much data in memory
at once - see the sketch below.  The workaround is an ugly hack, but if it
works, it works.
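
Concretely, something like this - just a sketch, where pairs is your
RDD[(K, V)] and process() is a stand-in for whatever you do per group
(both made-up names):

    import scala.util.Random

    // Salt each key with a random shard id so no single group has to fit
    // in memory; each original key now spreads across up to 10 groups.
    val sharded = pairs.map { case (k, v) => ((k, Random.nextInt(10)), v) }
    val grouped = sharded.groupByKey()
    // Process each shard independently, then drop the salt; combine the
    // per-shard results afterwards if you need one result per original key.
    val perKey = grouped.map { case ((k, _), vs) => (k, process(vs)) }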

Hope that helps!
Andrew


On Sun, May 25, 2014 at 6:55 PM, Nilesh <ni...@nileshc.com> wrote:

> I would like to clarify something. Matei mentioned that in Spark 1.0
> groupBy returns a (Key, Iterable[Value]) instead of a (Key, Seq[Value]).
> Does this also automatically assure us that the whole Iterable[Value] is
> not in fact stored in memory? That is to say, with 1.0, will it be
> possible to do
> groupByKey().values.map { x => val it = x.iterator; while (it.hasNext) ... }
> when x: Iterable[Value] is larger than the RAM on a single machine? Or
> will this only become possible later, in a subsequent version?
>
> Could you propose a workaround for the meantime? I'm out of ideas.
>
> Thanks,
> Nilesh