Posted to user@spark.apache.org by Egor Pahomov <pa...@gmail.com> on 2014/02/12 10:07:04 UTC

Best practice for retrieving big data from RDD to local machine

Hello. I've got a big RDD (1 GB) in a YARN cluster. On the local machine that
uses this cluster I have only 512 MB of heap, and I'd like to iterate over the
values of the resulting RDD on that machine. I can't use collect(), because it
would build an array locally that is larger than my heap, so I need some
iterative way to pull the data. There is a method iterator(), but it requires
additional information that I can't provide. (
http://stackoverflow.com/questions/21698443/best-practice-for-retrieving-big-data-from-rdd-to-local-machine
)

--

Sincerely yours,
Egor Pakhomov
Scala Developer, Yandex

Re: Best practice for retrieving big data from RDD to local machine

Posted by Andrew Ash <an...@andrewash.com>.
Hi Egor,

It sounds like you should vote for
https://spark-project.atlassian.net/browse/SPARK-914, which proposes making an
RDD iterable from the driver.
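In the meantime, the usual workaround is to fetch one partition at a time to the driver and chain the partitions into a single lazy iterator, so at most one partition's worth of data is in driver memory at once (this is essentially what SPARK-914 proposes, and what later shipped in Spark as RDD.toLocalIterator). The sketch below uses plain Scala collections as a stand-in for the cluster so the memory behaviour is easy to see; the names here are illustrative, and in real Spark `fetchPartition` would be `sc.runJob(rdd, (it: Iterator[T]) => it.toArray, Seq(p)).head`:

```scala
object LocalIteratorSketch {
  // Stand-in for an RDD split into partitions on the cluster.
  val partitions: Vector[Vector[Int]] =
    Vector(Vector(1, 2, 3), Vector(4, 5), Vector(6, 7, 8, 9))

  // Fetch a single partition to the driver. In real Spark this is one
  // job (one network round trip) per partition, and only the current
  // partition is held in driver memory.
  def fetchPartition(p: Int): Vector[Int] = partitions(p)

  // Lazily chain the partitions into one driver-side iterator: the
  // next partition is only fetched when the previous one is exhausted.
  def toLocalIterator: Iterator[Int] =
    (0 until partitions.length).iterator.flatMap(p => fetchPartition(p).iterator)

  def main(args: Array[String]): Unit =
    // Iterate without ever materialising the whole dataset locally.
    toLocalIterator.foreach(println)
}
```

Note the trade-off: each partition fetch is a separate job, so this is slower than one collect(), but the driver's peak memory is bounded by the largest single partition rather than the whole RDD.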
