Posted to user@spark.apache.org by DB Tsai <db...@alpinenow.com> on 2014/01/21 20:32:00 UTC

Lazy evaluation of RDD data transformation

Hi,

The data is read from HDFS using textFile, and then a map is applied, as in the following code, to get the format right for feeding it into MLlib's training algorithms.

val rddFile = sc.textFile("Some file on HDFS")

// Label from the 4th column, features from columns 2-3.
val rddData = rddFile.map { line =>
  val temp = line.split(",")
  val y = temp(3) match {
    case "1" => 0.0
    case "2" => 1.0
    case _   => 2.0
  }
  val x = temp.slice(1, 3).map(_.toDouble)
  LabeledPoint(y, x)
}

My question is: when is the map function actually performed? Is it lazily
evaluated the first time we use rddData, producing a new dataset (since RDDs
are immutable)? Does that mean that the second time we use rddData, the
transformation isn't computed again?

Or is the transformation computed on the fly, so that we don't need extra
memory for it?

The motivation for this question is that I found that the MLlib library
performs a lot of extra transformations. For example, the intercept is added
by mapping every point to a new LabeledPoint with a 1.0 prepended to its
features, as sketched below.
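
A rough paraphrase of that step (not MLlib's exact code; the names here are
just for illustration) would be:

// Hypothetical paraphrase of the intercept-adding map, chained on rddData.
val withIntercept = rddData.map(point =>
  LabeledPoint(point.label, Array(1.0) ++ point.features))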

If a new dataset is generated every time such a map is performed, then for a
really big dataset this wastes a lot of memory and I/O. It is also less
efficient when we chain several map functions on an RDD, since all of them
could be done in one pass.

Thanks.

Sincerely,

DB Tsai
Machine Learning Engineer
Alpine Data Labs
--------------------------------------
Web: http://alpinenow.com/

Re: Lazy evaluation of RDD data transformation

Posted by Reynold Xin <rx...@apache.org>.
The map computation output is never fully materialized in memory.
Internally, it is simply an iterator interface that streams through the
input and produces an iterator that can be consumed in a similar streaming
fashion.

Only when .cache()/.persist() is set on an RDD will the content produced by
the iterator be fully materialized (i.e. put into an array).
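
To illustrate the streaming behavior with plain Scala iterators (just a
sketch of the idea, not Spark's actual internals):

// Sketch: chained maps compose at the iterator level, so each element flows
// through all the functions one at a time; no intermediate collection is built.
val lines: Iterator[String] = Iterator("a,1.0,2.0,1", "b,3.0,4.0,2")

val parsed = lines.map { line =>
  val temp = line.split(",")
  (temp(3).toDouble, temp.slice(1, 3).map(_.toDouble))
}
val withIntercept = parsed.map { case (y, x) => (y, 1.0 +: x) }

// Nothing has been computed up to this point; each element is produced only
// when it is consumed here, and can be discarded right afterwards.
withIntercept.foreach { case (y, x) => println(s"$y -> ${x.mkString(",")}") }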



Re: Lazy evaluation of RDD data transformation

Posted by DB Tsai <db...@alpinenow.com>.
Hi Matei,

It does make sense that the computation will happen over and over each time.
If I understand correctly, do you mean that it only computes the
transformation for one line at a time and then discards it to save memory?

Or does it create the full result of the map operation and then discard it
entirely after it has been used?

Thanks.


Sincerely,

DB Tsai
Machine Learning Engineer
Alpine Data Labs
--------------------------------------
Web: http://alpinenow.com/



Re: Lazy evaluation of RDD data transformation

Posted by Matei Zaharia <ma...@gmail.com>.
If you don’t cache the RDD, the computation will happen over and over each time we scan through it. This is done to save memory in that case and because Spark can’t know at the beginning whether you plan to access a dataset multiple times. If you’d like to prevent this, use cache(), or maybe persist(StorageLevel.DISK_ONLY) if you don’t want to keep it in memory.
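
For instance (a minimal sketch reusing the rddData RDD from your message):

import org.apache.spark.storage.StorageLevel

// Without caching, each action re-reads the file and re-runs the map.
val numPoints = rddData.count()
val firstPoint = rddData.first()   // recomputed from scratch

// Keep the materialized result in memory for subsequent actions...
rddData.cache()

// ...or on disk only, if it is too big to keep in memory.
// rddData.persist(StorageLevel.DISK_ONLY)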

Matei

