Posted to user@spark.apache.org by kevin <ki...@gmail.com> on 2016/07/26 07:30:14 UTC

dataframe.foreach VS dataframe.collect().foreach

Hi all,
I don't quite understand the difference between dataframe.foreach and
dataframe.collect().foreach. When should I use dataframe.foreach?

I use Spark 2.0, and I want to iterate over a DataFrame to get one column's value.

This works:

      // collect() brings the rows to the driver; build an HBase Put per row
      blacklistDF.collect().foreach { x =>
        println(">>>>>>> uid: " + x.getAs[String]("uid"))
        val put = new Put(Bytes.toBytes(x.getAs[String]("uid")))
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("uid"),
          Bytes.toBytes(x.getAs[String]("uid")))
        hrecords.add(put)
      }

If I use blacklistDF.foreach { ... } instead, I get nothing.
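
Here is a minimal, self-contained repro of what I am seeing (just a sketch:
the local[*] master and the object name are illustrative, not my real job):

    import org.apache.spark.sql.SparkSession
    import scala.collection.mutable.ListBuffer

    object ForeachVsCollect {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]").appName("repro").getOrCreate()
        import spark.implicits._

        val df = Seq("u1", "u2", "u3").toDF("uid")
        val buffer = ListBuffer[String]()

        // plain foreach: the buffer stays empty afterwards
        df.foreach(row => buffer += row.getAs[String]("uid"))
        println(buffer.size) // prints 0

        // collect().foreach: the buffer is filled as expected
        df.collect().foreach(row => buffer += row.getAs[String]("uid"))
        println(buffer.size) // prints 3

        spark.stop()
      }
    }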

Re: dataframe.foreach VS dataframe.collect().foreach

Posted by Pedro Rodriguez <sk...@gmail.com>.
:)

Just realized you didn't get your original question answered though:

scala> import sqlContext.implicits._
import sqlContext.implicits._

scala> case class Person(age: Long, name: String)
defined class Person

scala> val df = Seq(Person(24, "pedro"), Person(22, "fritz")).toDF()
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.select("age")
res2: org.apache.spark.sql.DataFrame = [age: bigint]

scala> df.select("age").collect.map(_.getLong(0))
res3: Array[Long] = Array(24, 22)

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> df.collect.flatMap {
     | case Row(age: Long, name: String) => Seq(Tuple1(age))
     | case _ => Seq()
     | }
res7: Array[(Long,)] = Array((24,), (22,))

These docs are helpful:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Row
(those are the 1.6 docs, but it should be similar in 2.0)
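
In 2.0 you can also skip Row entirely via the typed Dataset API (a sketch;
assumes a SparkSession in scope as spark, which replaces sqlContext):

    import spark.implicits._

    case class Person(age: Long, name: String)
    val df = Seq(Person(24, "pedro"), Person(22, "fritz")).toDF()

    // as[Long] turns the single-column DataFrame into a Dataset[Long],
    // so no pattern matching on Row is needed
    val ages: Array[Long] = df.select($"age").as[Long].collect()
    // ages: Array[Long] = Array(24, 22)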



-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience

Re: dataframe.foreach VS dataframe.collect().foreach

Posted by Gourav Sengupta <go...@gmail.com>.
And Pedro has made sense of a world running amok, scared, and in a drunken
stupor.

Regards,
Gourav


Re: dataframe.foreach VS dataframe.collect().foreach

Posted by Pedro Rodriguez <sk...@gmail.com>.
I am not 100% sure, as I haven't tried this out, but there is a huge
difference between the two. Both foreach and collect are actions, regardless
of whether or not the data frame is empty.

Doing a collect will bring all the results back to the driver, possibly
forcing it to run out of memory. foreach will apply your function to each
element of the DataFrame, but will do so across the cluster. This behavior
is useful when you need to do something custom for each element (perhaps
saving to a database for which there is no Spark data source, or something
custom like making an HTTP request per element; be careful there, though,
because of the per-element overhead).

In your example, I am going to assume that hrecords is something like a
list buffer. The reason it ends up empty is that each worker is sent its own
copy of the empty list (it is captured in the closure passed to foreach) and
appends to that copy. The instance of the list on the driver never sees what
happened on the workers, so it stays empty.

I don't see why Chanh's comment applies here, since I am guessing the df is
not empty.
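
If you do want the values back on the driver, a sketch of one supported
route (hedged: this assumes Spark 2.0's SparkSession and a
CollectionAccumulator; the names are illustrative, not from your code):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq("u1", "u2").toDF("uid")

    // Accumulators are the supported channel for side-effect results:
    // each task adds to its own local copy, and Spark merges the copies
    // back on the driver when the action finishes.
    val acc = spark.sparkContext.collectionAccumulator[String]("uids")
    df.foreach(row => acc.add(row.getAs[String]("uid")))
    println(acc.value) // [u1, u2] (order not guaranteed)

Alternatively, if the goal is just to write each row to HBase, then
df.foreachPartition with one connection per partition on the executors
avoids shipping anything back to the driver at all.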



-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience

Re: dataframe.foreach VS dataframe.collect().foreach

Posted by kevin <ki...@gmail.com>.
Thank you, Chanh.


Re: dataframe.foreach VS dataframe.collect().foreach

Posted by Chanh Le <gi...@gmail.com>.
Hi Kevin,

blacklistDF -> just a DataFrame.
Spark is lazy: it executes the whole process, including any map or filter you did earlier, only when you call something like collect, take, or write.
That means until you call collect, Spark does nothing, so your df would not have any data -> you can't foreach over it.
Calling collect executes the process -> you get the data -> then foreach is OK.
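
A tiny sketch of the lazy part (assuming a SparkSession named spark; only
the action at the end triggers a job):

    import spark.implicits._

    val df = Seq(1, 2, 3).toDF("n")

    // transformation: builds a plan, nothing runs yet
    val filtered = df.filter($"n" > 1)

    // action: executes the plan and brings the rows back to the driver
    val rows = filtered.collect() // Array([2], [3])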


> On Jul 26, 2016, at 2:30 PM, kevin <ki...@gmail.com> wrote:
> 
>  blacklistDF.collect()