Posted to user@spark.apache.org by Wenlei Xie <we...@gmail.com> on 2013/11/19 10:01:15 UTC

Reuse the Buffer Array in the map function?

Hi,

I am trying to do some tasks with the following style map function:

rdd.map { e =>
    val a = new Array[Int](100)
    ...Some calculation...
}

But here the array a is really just used as a temporary buffer and can be
reused. I am wondering if I can avoid constructing it every time (since it
might incur some overhead for the JVM)? Would using an array outside the
closure work?

Thank you!

Best,
Wenlei

Re: Reuse the Buffer Array in the map function?

Posted by Wenlei Xie <we...@gmail.com>.

Thank you!

Best,
Wenlei



-- 
Wenlei Xie (谢文磊)

Department of Computer Science
5132 Upson Hall, Cornell University
Ithaca, NY 14853, USA
Phone: (607) 255-5577
Email: wenlei.xie@gmail.com

Re: Reuse the Buffer Array in the map function?

Posted by Mark Hamstra <ma...@clearstorydata.com>.

mapWith can make this use case even simpler.




Re: Reuse the Buffer Array in the map function?

Posted by Sebastian Schelter <ss...@googlemail.com>.

You can use mapPartitions, which allows you to apply the map function
elementwise to all elements of a partition. Here you can place custom code
around your function invocation that lets you reuse the array.

--sebastian

Re: Reuse the Buffer Array in the map function?

Posted by Harvey Feng <hy...@gmail.com>.

An example of Sebastian's suggestion:

rdd.mapPartitions { partitionIter =>
  val a = new Array[Int](100)
  partitionIter.map { e =>
    .. Some calculation that reuses `a` ...
  }
}

If the array were declared outside the map(), then the closure, the array,
and the outer block (the scope of the array's reference variable) would be
serialized and shipped once for each RDD task that gets executed. Since an
RDD task is created to materialize each RDD partition, there would still be
one array per partition. Also, the serialized outer block could be an entire
class, which increases the size of each task and, in turn, the scheduling
latency.
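
Here is a self-contained sketch of the mapPartitions pattern above that runs
without a Spark cluster. The function passed to mapPartitions receives one
partition as an Iterator, so a plain Scala Iterator stands in for a partition
here. The names BufferReuseSketch and sumOfMultiples, and the calculation
itself, are hypothetical and chosen only for illustration; they are not from
the original question.

```scala
// Stand-in for the function handed to rdd.mapPartitions: it takes one
// partition's elements as an Iterator and returns an Iterator of results.
object BufferReuseSketch {
  // Hypothetical per-element calculation: fill the scratch buffer with
  // multiples of e, then reduce the buffer to a single Long.
  def sumOfMultiples(partitionIter: Iterator[Int]): Iterator[Long] = {
    // Allocated once per call (i.e. once per partition), not once per element.
    val a = new Array[Int](100)
    partitionIter.map { e =>
      // Overwrite every slot so no state leaks from the previous element.
      var i = 0
      while (i < a.length) { a(i) = e * i; i += 1 }
      // Reduce the buffer to a plain value.
      var sum = 0L
      i = 0
      while (i < a.length) { sum += a(i); i += 1 }
      sum // return a value derived from `a`, never `a` itself
    }
  }
}
```

The iterator returned by map is lazy and `a` is overwritten for every
element, so each result should be a value computed from `a` (or a copy of
it), not `a` itself; otherwise every result would point at the same array.
Assuming an RDD[Int] named rdd, the same function should be usable as
rdd.mapPartitions(BufferReuseSketch.sumOfMultiples).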
