Posted to user@spark.apache.org by SRK <sw...@gmail.com> on 2016/02/09 22:58:18 UTC

How to collect/take arbitrary number of records in the driver?

Hi,

How do I get a fixed range of records from an RDD in the driver? Suppose I
want records 100 to 1000 and then want to save them to some external
database. I know that I can do it from the workers per partition, but I
want to avoid that for some reasons. The idea is to collect the data to
the driver and save it, even if slowly.

I am looking for something like take(100, 1000) or take(1000, 2000).

Thanks,
Swetha



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-collect-take-arbitrary-number-of-records-in-the-driver-tp26184.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: How to collect/take arbitrary number of records in the driver?

Posted by Jakob Odersky <ja...@odersky.com>.
Another alternative:

rdd.take(1000).drop(100) // this also preserves ordering
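
To spell out the mechanics: take(1000) ships the first 1000 records to the
driver as a local Array, and drop(100) then runs on that array rather than
on the RDD. A quick sketch, assuming sc is an existing SparkContext and the
sample data is illustrative:

val rdd = sc.parallelize(1 to 10000)

// take(n) returns Array[T] on the driver; drop is a local collection op
val slice: Array[Int] = rdd.take(1000).drop(100) // zero-based indices 100..999
println(slice.length) // 900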

Note, however, that this can lead to an OOM on the driver if the data
you're taking is too large. If you want to perform some operation
sequentially on the driver and don't care about performance, you can do
something similar to what Mohammed suggested:

val filteredRDD = ... // same as in Mohammed's post below

// Plain RDD.foreach would run on the executors; toLocalIterator streams
// the partitions back one at a time, so this loop runs on the driver.
filteredRDD.toLocalIterator.foreach { elem =>
  // do something with elem, e.g. save it to the database
}
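
For completeness, here is a self-contained sketch of that driver-side save;
everything in it (the app name, the sample data, the saveToDatabase stub,
the 100-until-1000 range) is illustrative rather than taken from the thread:

import org.apache.spark.{SparkConf, SparkContext}

object DriverSideSave {
  // Hypothetical sink standing in for a real database client.
  def saveToDatabase(record: String): Unit = println(s"saved: $record")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("driver-side-save").setMaster("local[*]"))
    val rdd = sc.parallelize((1 to 10000).map(i => s"record-$i"))

    // Keep only the records at zero-based positions 100 until 1000.
    val filteredRDD = rdd.zipWithIndex
      .filter { case (_, index) => index >= 100 && index < 1000 }
      .map { case (record, _) => record }

    // Stream one partition at a time to the driver and save sequentially;
    // this bounds driver memory by the largest partition, unlike collect().
    filteredRDD.toLocalIterator.foreach(saveToDatabase)

    sc.stop()
  }
}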



On Tue, Feb 9, 2016 at 2:56 PM, Mohammed Guller <mo...@glassbeam.com> wrote:
> You can do something like this:
>
> val indexedRDD = rdd.zipWithIndex
>
> val filteredRDD = indexedRDD.filter { case (element, index) =>
>   index >= 99 && index < 199
> }
>
> val result = filteredRDD.take(100)
>
> Warning: the ordering of the elements in the RDD is not guaranteed.
>
> Mohammed
> Author: Big Data Analytics with Spark
>


RE: How to collect/take arbitrary number of records in the driver?

Posted by Mohammed Guller <mo...@glassbeam.com>.
You can do something like this:

val indexedRDD = rdd.zipWithIndex

// keeps the 100 records at zero-based indices 99..198,
// i.e. the 100th through 199th records
val filteredRDD = indexedRDD.filter { case (element, index) =>
  index >= 99 && index < 199
}

val result = filteredRDD.take(100)

Warning: the ordering of the elements in the RDD is not guaranteed.

Mohammed
Author: Big Data Analytics with Spark <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>
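
If the range selection comes up repeatedly, the same idea can be factored
into a small helper. A sketch for the spark-shell; the takeRange name and
signature are mine, not a Spark API:

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Collect the records at zero-based positions [from, until) to the driver.
// Built on zipWithIndex, so the indices follow the RDD's partition order.
def takeRange[T: ClassTag](rdd: RDD[T], from: Long, until: Long): Array[T] =
  rdd.zipWithIndex
    .filter { case (_, index) => index >= from && index < until }
    .map { case (element, _) => element }
    .collect()

// e.g. takeRange(rdd, 100L, 1000L) for "the records from 100 to 1000"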


