Posted to user@spark.apache.org by Chitturi Padma <le...@gmail.com> on 2016/02/24 18:47:51 UTC

Re: rdd.collect.foreach() vs rdd.collect.map()

rdd.collect() brings the entire RDD back to the driver as an in-memory
collection. Whatever you call on that result, whether foreach() or map(),
runs on the driver only and is never distributed across the workers.
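
To make the difference concrete, here is a minimal sketch. It assumes a local
Spark setup; the object name CollectVsForeach and its two helper methods are
made up for illustration and are not part of Spark.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, for illustration only.
object CollectVsForeach {
  // collect() returns a plain Array on the driver, so this map is the
  // ordinary Scala collection method and runs in the driver JVM only.
  def doubleOnDriver(rdd: RDD[Int]): Array[Int] =
    rdd.collect().map(_ * 2)

  // rdd.map is a lazy transformation; when an action triggers it, the
  // function is applied to each partition on the workers.
  def doubleOnWorkers(rdd: RDD[Int]): RDD[Int] =
    rdd.map(_ * 2)

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("collect-vs-foreach").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 10, numSlices = 4)

      // Runs on the driver, one element after another.
      doubleOnDriver(rdd).foreach(x => println(s"driver sees $x"))

      // Runs on the workers; on a real cluster the output appears in
      // the executor logs, not on the driver console.
      doubleOnWorkers(rdd).foreach(x => println(s"worker sees $x"))
    } finally {
      sc.stop()
    }
  }
}
```

Both helpers produce the same values; only the place where the doubling
happens differs.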

On Wed, Feb 24, 2016 at 10:58 PM, Anurag [via Apache Spark User List] <
ml-node+s1001560n26320h65@n3.nabble.com> wrote:

> Hi Everyone
>
> I am new to Scala and Spark.
>
> I want to know
>
> 1. Does Rdd.collect().foreach() do processing in parallel?
>
> 2. Does Rdd.collect().map() do processing in parallel?
>
> Thanks in advance.
> Regards
> Anurag

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/rdd-collect-foreach-vs-rdd-collect-map-tp26320p26322.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: rdd.collect.foreach() vs rdd.collect.map()

Posted by Chitturi Padma <le...@gmail.com>.

If you want the processing to run in parallel, do not call collect first.
Like other actions such as count or first, it returns its result to the
driver, so anything you chain after it runs on the driver. rdd.map does its
processing in parallel on the workers. Once you have processed the RDD, save
it to the DB from the workers instead of collecting it.

rdd.foreach also executes on the workers. In fact, it is an action that
returns Unit.
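
One way to do the "process, then save" step in parallel is sketched below. It
is only a sketch: FakeDbConnection, enrich, process and save are hypothetical
names standing in for your own database client and logic. The Spark calls
used are rdd.map and rdd.foreachPartition.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ParallelSave {
  // Hypothetical stand-in for a real database client; not part of Spark.
  class FakeDbConnection {
    def write(record: String): Unit = println(s"writing $record")
    def close(): Unit = ()
  }

  // Hypothetical per-record processing step.
  def enrich(x: Int): String = s"record-$x"

  // Transformation: applied on the workers, partition by partition.
  def process(rdd: RDD[Int]): RDD[String] = rdd.map(enrich)

  // Action: each worker opens one connection per partition and writes
  // its own records, so nothing is shipped back to the driver.
  def save(rdd: RDD[String]): Unit =
    rdd.foreachPartition { records =>
      val conn = new FakeDbConnection
      try records.foreach(conn.write)
      finally conn.close()
    }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("parallel-save").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try save(process(sc.parallelize(1 to 10, numSlices = 4)))
    finally sc.stop()
  }
}
```

Opening the connection inside foreachPartition rather than on the driver
avoids having to serialize the connection object, and it costs one connection
per partition instead of one per record.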



On Wed, Feb 24, 2016 at 11:56 PM, Anurag [via Apache Spark User List] <
ml-node+s1001560n26325h20@n3.nabble.com> wrote:

> @Chitturi, thanks a lot for replying.
>
> Two follow-up questions:
>
> 1. If I am not collecting the RDD, will Rdd.foreach() and Rdd.map()
> do the processing in parallel?
>
> 2. Let's say I have to get the results first and then do something with
> them before saving them into a database, and I want to do that in parallel.
> How should I do it? I am using Rdd.collect().foreach(....), but it is not
> doing the processing in parallel.
>
> Regards
> Anurag

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/rdd-collect-foreach-vs-rdd-collect-map-tp26320p26326.html