
Posted to user@flink.apache.org by Michael Huelfenhaus <m....@davengo.com> on 2015/08/12 11:53:40 UTC

Udf Performance and Object Creation

Hello

I have a question about programming user-defined functions: is it still the case, as in the old Stratosphere days, that object creation should be avoided at all costs? I ask because some of the examples now create Tuples and other objects before returning them.

I am going to have a streaming plan with at least 6 steps, and I am going to use POJOs. Performance-wise, is it a big improvement to define one big POJO that can be used by all the steps, or is it better to have smaller ones that send less data but create more objects?

Thanks
Michael

Re: Udf Performance and Object Creation

Posted by Hawin Jiang <ha...@gmail.com>.

Thanks Timo

That is a good interview question



Best regards
Hawin

On Thu, Aug 13, 2015 at 1:11 AM, Michael Huelfenhaus <
m.huelfenhaus@davengo.com> wrote:

> Hey Timo,
>
> yes that is what I needed to know.
>
> Thanks
> - Michael

Re: Udf Performance and Object Creation

Posted by Michael Huelfenhaus <m....@davengo.com>.

Hey Timo,

yes that is what I needed to know.

Thanks
- Michael

On 12.08.2015 at 12:44, Timo Walther <tw...@apache.org> wrote:

> Hello Michael,
> 
> Whenever you write a Java program and want it to be efficient, you should avoid object creation where you can, because every object you create has to be garbage collected later (which slows down your program).
> You can have small POJOs; just try to avoid calling "new" in your functions:
> 
> Instead of:
> 
> class Mapper implements MapFunction<String,Pojo> {
>    public Pojo map(String s) {
>       Pojo p = new Pojo();  // allocates a new object for every record
>       p.f = s;
>       return p;
>    }
> }
> 
> do:
> 
> class Mapper implements MapFunction<String,Pojo> {
>    private Pojo p = new Pojo();  // allocated once per Mapper instance
>    public Pojo map(String s) {
>       p.f = s;
>       return p;
>    }
> }
> 
> Then an object is only created once per Mapper and not per record.
> 
> Hope this helps.
> 
> Regards,
> Timo

Re: Udf Performance and Object Creation

Posted by Fabian Hueske <fh...@gmail.com>.

Oh, sorry, Flavio!
I didn't see Hawin's questions :-(

Thanks Stephan for picking up!

2015-08-14 17:43 GMT+02:00 Flavio Pompermaier <po...@okkam.it>:

> Any insight about these 2 questions..?

Re: Udf Performance and Object Creation

Posted by Stephan Ewen <se...@apache.org>.

Yes, map() is like a convenience function around mapPartition().
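
The relationship can be sketched in plain Java. This is only a conceptual illustration, not how Flink implements map(): the three interfaces below are stand-ins declared locally so the sketch runs without Flink on the classpath, and the helper name asMapPartition is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MapViaMapPartitionSketch {

    // Local stand-ins for the Flink interfaces of the same names (assumption:
    // reduced to the single method each one needs for this sketch).
    interface MapFunction<I, O> {
        O map(I value) throws Exception;
    }

    interface Collector<T> {
        void collect(T record);
    }

    interface MapPartitionFunction<I, O> {
        void mapPartition(Iterable<I> values, Collector<O> out) throws Exception;
    }

    // Hypothetical helper: wraps a per-record mapper in a per-partition function
    // that loops over the partition and emits exactly one result per input record.
    static <I, O> MapPartitionFunction<I, O> asMapPartition(MapFunction<I, O> mapper) {
        return (values, out) -> {
            for (I value : values) {
                out.collect(mapper.map(value));
            }
        };
    }

    public static void main(String[] args) throws Exception {
        MapFunction<String, Integer> length = String::length;
        List<Integer> results = new ArrayList<>();

        // One "partition" of three records, collected into a list.
        asMapPartition(length).mapPartition(Arrays.asList("a", "bb", "ccc"), results::add);

        System.out.println(results); // prints [1, 2, 3]
    }
}
```

Seen this way, map() covers the common one-in, one-out case, while mapPartition() hands the whole iterator to the function, which may then emit any number of records.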

On Fri, Aug 14, 2015 at 6:09 PM, Flavio Pompermaier <po...@okkam.it>
wrote:

> Hi Stephan, thanks for the reply!
> Now it's clearer. If I understood correctly, map and mapPartition are
> the same iff I have only one slot per task manager, right?
>
> I was convinced I had posted those questions in this thread as the 3rd or 4th
> message. Didn't I?

Re: Udf Performance and Object Creation

Posted by Flavio Pompermaier <po...@okkam.it>.

Hi Stephan, thanks for the reply!
Now it's clearer. If I understood correctly, map and mapPartition are the
same iff I have only one slot per task manager, right?

I was convinced I had posted those questions in this thread as the 3rd or 4th
message. Didn't I?
On 14 Aug 2015 17:57, "Stephan Ewen" <se...@apache.org> wrote:

> Hi!
>
> (1) A mapper is created once per parallel task. So if you create a program
> that runs a map() transformation with a parallelism of n, you will have n
> mapper instances in the cluster. Some may be on the same TaskManager, if
> the TaskManager has multiple slots.
>
> (2) I would really like that. But it means Java has to deal with both
> managed and unmanaged memory at the same time, which is quite a heavy
> addition. C# has some form of support for that.
>
> BTW: Where did you originally post these questions? I have not seen them
> before...

Re: Udf Performance and Object Creation

Posted by Stephan Ewen <se...@apache.org>.

Hi!

(1) A mapper is created once per parallel task. So if you create a program
that runs a map() transformation with a parallelism of n, you will have n
mapper instances in the cluster. Some may be on the same TaskManager, if
the TaskManager has multiple slots.

(2) I would really like that. But it means Java has to deal with both
managed and unmanaged memory at the same time, which is quite a heavy
addition. C# has some form of support for that.
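
Point (1) can be illustrated with a toy simulation in plain Java. It is a sketch under stated assumptions, not Flink code: runMap is a hypothetical stand-in for the runtime, it runs sequentially, and it hands out records round-robin only to keep the example short. What it shows is that the number of mapper instances follows the parallelism, not the number of records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

public class ParallelMapperSketch {

    // Counts how many mapper instances have been constructed so far.
    static int instancesCreated = 0;

    // Hypothetical mapper that upper-cases each record.
    static class UpperCaseMapper implements Function<String, String> {
        UpperCaseMapper() {
            instancesCreated++;
        }

        @Override
        public String apply(String s) {
            return s.toUpperCase();
        }
    }

    // Toy "runtime" (assumption): creates one mapper per parallel subtask,
    // then assigns records to subtasks round-robin.
    static List<String> runMap(List<String> records, int parallelism,
                               Supplier<Function<String, String>> factory) {
        List<Function<String, String>> subtasks = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            subtasks.add(factory.get());
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            out.add(subtasks.get(i % parallelism).apply(records.get(i)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "b", "c", "d", "e", "f");
        List<String> result = runMap(records, 3, UpperCaseMapper::new);

        System.out.println(result);           // prints [A, B, C, D, E, F]
        System.out.println(instancesCreated); // prints 3
    }
}
```

Six records pass through, but only three mappers are created, so any field a mapper keeps is shared by all the records that its own subtask processes.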

BTW: Where did you originally post these questions? I have not seen them
before...

On Fri, Aug 14, 2015 at 5:43 PM, Flavio Pompermaier <po...@okkam.it>
wrote:

> Any insight about these 2 questions..?

Re: Udf Performance and Object Creation

Posted by Fabian Hueske <fh...@gmail.com>.

I think Timo answered both questions (quoting Michael: "Hey Timo, yes that
is what I needed to know. Thanks").

Maybe one more comment: the examples are not written for the best
performance, but to showcase Flink's APIs and concepts.

Best, Fabian

2015-08-14 17:43 GMT+02:00 Flavio Pompermaier <po...@okkam.it>:

> Any insight about these 2 questions..?

Re: Udf Performance and Object Creation

Posted by Flavio Pompermaier <po...@okkam.it>.

Any insight about these 2 questions..?
On 12 Aug 2015 17:38, "Flavio Pompermaier" <po...@okkam.it> wrote:

> This is something I've never understood in depth: isn't a mapper created
> for each record? If it's created only once per task manager, then it's not so
> different from mapPartition. What am I missing here?
>
> And then a more philosophical question: all big data frameworks need
> to manage memory very efficiently somehow (Flink even reserves
> a fraction of the entire memory in order to have control over it). Wouldn't
> it be simpler if Java finally released some APIs (even if marked as unsafe,
> that doesn't change much) to allow full control of the
> memory? It would make a lot of sense for all big data platforms (at least
> for non-UDF code...).
>
> Best,
> Flavio

Re: Udf Performance and Object Creation

Posted by Timo Walther <tw...@apache.org>.

Hello Michael,

Whenever you write a Java program and want it to be efficient, you should
avoid object creation where you can, because every object you create has to be
garbage collected later (which slows down your program).
You can have small POJOs; just try to avoid calling "new" in your
functions:

Instead of:

class Mapper implements MapFunction<String,Pojo> {
    public Pojo map(String s) {
        Pojo p = new Pojo();  // allocates a new object for every record
        p.f = s;
        return p;
    }
}

do:

class Mapper implements MapFunction<String,Pojo> {
    private Pojo p = new Pojo();  // allocated once per Mapper instance
    public Pojo map(String s) {
        p.f = s;
        return p;
    }
}

Then an object is only created once per Mapper and not per record.
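
The difference between the two variants can be seen in a small self-contained sketch. The MapFunction interface and the Pojo class below are local stand-ins (assumptions) so that the code runs without Flink on the classpath, and the class names AllocatingMapper and ReusingMapper are made up for this illustration.

```java
public class ObjectReuseSketch {

    // Local stand-in for Flink's MapFunction (assumption: single map method).
    interface MapFunction<I, O> {
        O map(I value) throws Exception;
    }

    // Minimal POJO with one public field, as in the example above.
    static class Pojo {
        public String f;
    }

    // Variant 1: allocates a new Pojo for every record.
    static class AllocatingMapper implements MapFunction<String, Pojo> {
        @Override
        public Pojo map(String s) {
            Pojo p = new Pojo();
            p.f = s;
            return p;
        }
    }

    // Variant 2: allocates one Pojo per mapper instance and overwrites it per record.
    static class ReusingMapper implements MapFunction<String, Pojo> {
        private final Pojo p = new Pojo();

        @Override
        public Pojo map(String s) {
            p.f = s;
            return p;
        }
    }

    public static void main(String[] args) throws Exception {
        MapFunction<String, Pojo> allocating = new AllocatingMapper();
        MapFunction<String, Pojo> reusing = new ReusingMapper();

        Pojo a1 = allocating.map("a");
        Pojo a2 = allocating.map("b");
        System.out.println(a1 == a2); // prints false: two separate objects
        System.out.println(a1.f);     // prints a

        Pojo r1 = reusing.map("a");
        Pojo r2 = reusing.map("b");
        System.out.println(r1 == r2); // prints true: the same object both times
        System.out.println(r1.f);     // prints b: the first result was overwritten
    }
}
```

The last line is the price of reuse: any code that holds on to a returned object will see it change with the next record, so the pattern only fits when each result is consumed or copied before the next call to map().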

Hope this helps.

Regards,
Timo

On 12.08.2015 11:53, Michael Huelfenhaus wrote:
> Hello
>
> I have a question about programming user-defined functions: is it still the case, as in the old Stratosphere days, that object creation should be avoided at all costs? I ask because some of the examples now create Tuples and other objects before returning them.
>
> I am going to have a streaming plan with at least 6 steps, and I am going to use POJOs. Performance-wise, is it a big improvement to define one big POJO that can be used by all the steps, or is it better to have smaller ones that send less data but create more objects?
>
> Thanks
> Michael