Posted to user@spark.apache.org by dave-anderson <da...@pobox.com> on 2014/09/19 04:58:06 UTC

paging through an RDD that's too large to collect() all at once

I have an RDD on the cluster that I'd like to iterate over and perform some
operations on each element (push data from each element to another
downstream system outside of Spark).  I'd like to do this at the driver so I
can throttle the rate that I push to the downstream system (as opposed to
submitting a job to the Spark cluster and parallelizing the work - and
likely flooding the downstream system).

The RDD is too big to collect() all at once back into the memory space of
the driver.  Ideally I'd like to be able to "page" through the dataset,
iterating through a chunk of n RDD elements at a time back at the driver. 
It doesn't have to be _exactly_ n elements, just a reasonably small set of
elements at a time.

Is there a simple way to do this?

It looks like I could use RDD.filter() or RDD.collect[U](f:
PartialFunction[T, U]).  Either of those techniques requires defining a
function that will filter the RDD.  But the shape of the data in the RDD
could be such that, for a given function (say, splitting by timestamp into
hour-of-day buckets), it won't reliably split into reasonably sized
"pages".  It also requires some analysis of the data to determine the
filtering boundaries.  The point is, it's extra logic.
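
Just to make that concrete, the sort of hour-bucket filtering I mean would
look roughly like the sketch below (the Event type, its ts field, and the
dayStart boundary are placeholders, not my real schema):

    // Assumes an existing RDD[Event] named `rdd`.
    // Hypothetical record type; `ts` is an epoch-millis timestamp.
    case class Event(ts: Long, payload: String)

    val hourMillis = 60L * 60L * 1000L
    val dayStart   = 0L   // placeholder; the real boundary has to be derived by inspecting the data

    // One "page" per hour-of-day bucket; page sizes depend entirely on the data's shape.
    (0 until 24).foreach { hour =>
      val lo = dayStart + hour * hourMillis
      val hi = lo + hourMillis
      val page = rdd.filter(e => e.ts >= lo && e.ts < hi).collect()   // a skewed hour can still be too big
      // ... push `page` to the downstream system ...
    }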

Any thoughts on if there's a simpler way to page through RDD elements back
at the driver?




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/paging-through-an-RDD-that-s-too-large-to-collect-all-at-once-tp14638.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: paging through an RDD that's too large to collect() all at once

Posted by Dave Anderson <da...@pobox.com>.
Excellent - that's exactly what I needed.  I saw iterator() but missed the
toLocalIterator() method.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/paging-through-an-RDD-that-s-too-large-to-collect-all-at-once-tp14638p14686.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: paging through an RDD that's too large to collect() all at once

Posted by Matei Zaharia <ma...@gmail.com>.
Hey Dave, try out RDD.toLocalIterator -- it gives you an iterator that reads one RDD partition at a time. Scala iterators also have methods like grouped() that let you get fixed-size groups.
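
A rough sketch of how that might be wired up, if it helps (the input path, page size, sleep interval, and pushDownstream sink below are all placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object PagedPush {
      // Placeholder sink -- swap in the real downstream client call.
      def pushDownstream(record: String): Unit = println(record)

      def main(args: Array[String]): Unit = {
        val sc  = new SparkContext(new SparkConf().setAppName("paged-push"))
        val rdd = sc.textFile("hdfs:///path/to/input")   // any RDD works; path is a placeholder

        val pageSize = 1000                // roughly n elements per "page" at the driver

        rdd.toLocalIterator                // fetches one partition at a time to the driver
           .grouped(pageSize)              // Scala Iterator.grouped => fixed-size chunks (last may be smaller)
           .foreach { page =>
             page.foreach(pushDownstream)  // push this chunk downstream
             Thread.sleep(500)             // crude throttle between pages; tune for the downstream system
           }

        sc.stop()
      }
    }

Note that toLocalIterator runs a separate Spark job per partition and needs each partition to fit in driver memory, so caching the RDD first avoids recomputing it partition by partition.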

Matei
