Posted to mapreduce-user@hadoop.apache.org by Vyacheslav Zholudev <vy...@gmail.com> on 2011/08/03 09:18:16 UTC

How does Hadoop reuse the objects?

Hi all,

I'm using Avro as a serialization format. Assume I have a generated specific class FOO that I use as the Mapper output value type:

class FOO {
  int a;
  List<BAR> barList;
}

where BAR is another generated specific Java class.

When I iterate over "Iterable<FOO> values" in the Reducer, it is clear that the same FOO object is reused, i.e.
Iterator<FOO> it = values.iterator();
FOO foo1 = it.next();
FOO foo2 = it.next();
assertThat(foo1 == foo2, is(true));

So I have the following questions:
1) Is the list barList reused over the next() calls?
2) If yes, can the objects in barList also be reused? For example, if barList contains two BAR objects after the first next() call and three after the second, are two of those three equal by reference to the two from the first call? In other words, does Hadoop maintain some sort of "object pool"?
3) Why doesn't AvroTools generate clone() methods? It would be quite straightforward and, more importantly, useful given that objects are reused.

Thanks a lot in advance!

Vyacheslav




Re: How does Hadoop reuse the objects?

Posted by Joey Echeverria <jo...@cloudera.com>.
Wow, I didn't expect that. That's nastier than usual. I would think
that cloning by serializing/deserializing would be unnecessarily slow.
I would file a JIRA with Avro asking for a clone() or copy constructor
in generated code.
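
For illustration, a minimal sketch of what such a copy constructor could look like if written by hand. The Foo and Bar classes below are hypothetical stand-ins for the generated FOO and BAR (the "name" field is invented for the example); this is not what Avro generates. The point is only that the list elements are copied too, not just the outer object.

```java
import java.util.ArrayList;
import java.util.List;

public class CopyConstructorSketch {
    /** Hypothetical stand-in for BAR; the "name" field is an assumption. */
    static class Bar {
        String name;

        Bar(String name) { this.name = name; }

        /** Copy constructor: duplicates the fields of another Bar. */
        Bar(Bar other) { this.name = other.name; }
    }

    /** Hypothetical stand-in for FOO. */
    static class Foo {
        int a;
        List<Bar> barList = new ArrayList<>();

        Foo() {}

        /** Deep copy: a new list holding new Bar instances. */
        Foo(Foo other) {
            this.a = other.a;
            for (Bar b : other.barList) {
                this.barList.add(new Bar(b));
            }
        }
    }

    public static void main(String[] args) {
        Foo reused = new Foo();
        reused.a = 1;
        reused.barList.add(new Bar("x"));

        Foo copy = new Foo(reused);

        // Simulate the framework overwriting the reused instances in place.
        reused.a = 2;
        reused.barList.get(0).name = "y";

        System.out.println(copy.a + " " + copy.barList.get(0).name); // prints "1 x"
    }
}
```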

-Joey

On Thu, Aug 4, 2011 at 5:07 PM, Vyacheslav Zholudev
<vy...@gmail.com> wrote:
> Just sharing today's discovery:
> Hadoop also reuses objects in internal lists, in my example the BAR objects.
> That is, if the first FOO object has two BAR objects in the list, then the
> second FOO object will contain the same (equal by reference) first two BAR
> objects in the list. So in the case of Avro it would be good if the
> auto-generated code implemented a 'clone' method.
> Btw, is it a good idea to clone Avro specific objects by
> serializing/deserializing using SpecificDatum{Writer|Reader}?
> Vyacheslav



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434

Re: How does Hadoop reuse the objects?

Posted by Vyacheslav Zholudev <vy...@gmail.com>.
Just sharing today's discovery:
Hadoop also reuses objects in internal lists, in my example the BAR objects.
That is, if the first FOO object has two BAR objects in the list, then the
second FOO object will contain the same (equal by reference) first two BAR
objects in the list. So in the case of Avro it would be good if the
auto-generated code implemented a 'clone' method.
Btw, is it a good idea to clone Avro specific objects by
serializing/deserializing using SpecificDatum{Writer|Reader}?
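
To make the round-trip idea concrete, here is a rough sketch of the general technique using plain java.io serialization as a stand-in. It is not the Avro SpecificDatumWriter/SpecificDatumReader code path; Foo, Bar and the "name" field are hypothetical. With Avro the shape would presumably be the same (write to a byte buffer, read back into a fresh object), only with Avro's encoder and decoder instead of object streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class RoundTripCopy {
    /** Hypothetical stand-in for BAR; the "name" field is an assumption. */
    static class Bar implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;

        Bar(String name) { this.name = name; }
    }

    /** Hypothetical stand-in for FOO. */
    static class Foo implements Serializable {
        private static final long serialVersionUID = 1L;
        int a;
        List<Bar> barList = new ArrayList<>();
    }

    /** Deep-copies a value by serializing it to bytes and reading it back. */
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T deepCopy(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round-trip copy failed", e);
        }
    }

    public static void main(String[] args) {
        Foo foo = new Foo();
        foo.a = 7;
        foo.barList.add(new Bar("x"));

        Foo copy = deepCopy(foo);

        // The copy shares no objects with the original, including list elements.
        System.out.println(copy != foo && copy.barList.get(0) != foo.barList.get(0));
    }
}
```

Every copy pays for a full encode and decode, so this is simple to write but does extra work per record compared with copying fields directly.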

Vyacheslav


On 4 August 2011 21:35, <Mi...@emc.com> wrote:

> HADOOP-2399 has caused a lot of problems for users so far, and the saga
> still continues :-(
>
> I remember spending 18 straight hours in 2008 with a user debugging this
> issue.
>
> - milind


-- 
Best,
Vyacheslav Zholudev

Re: How does Hadoop reuse the objects?

Posted by Mi...@emc.com.
HADOOP-2399 has caused a lot of problems for users so far, and the saga
still continues :-(

I remember spending 18 straight hours in 2008 with a user debugging this
issue.

- milind

---
Milind Bhandarkar
Greenplum Labs, EMC
(Disclaimer: Opinions expressed in this email are those of the author, and do
not necessarily represent the views of any organization, past or present,
the author might be affiliated with.)




On 8/3/11 4:19 AM, "Joey Echeverria" <jo...@cloudera.com> wrote:

>Hadoop reuses objects as an optimization. If you need to keep a copy
>in memory, you need to call clone yourself. I've never used Avro, but
>my guess is that the BARs are not reused, only the FOO.
>
>-Joey


Re: How does Hadoop reuse the objects?

Posted by Joey Echeverria <jo...@cloudera.com>.
Hadoop reuses objects as an optimization. If you need to keep a copy
in memory, you need to call clone yourself. I've never used Avro, but
my guess is that the BARs are not reused, only the FOO.
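
A small self-contained sketch of why the copy matters. Nothing here is a Hadoop class: ReusingIterator is a hypothetical iterator that imitates the reuse pattern by refilling one shared instance on every next(), and Foo is a stand-in for the generated FOO.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class ReuseDemo {
    /** Hypothetical stand-in for a generated record; not an Avro class. */
    static class Foo {
        int a;

        Foo() {}

        Foo(Foo other) { this.a = other.a; }
    }

    /** Imitates object reuse: every next() refills and returns one shared instance. */
    static class ReusingIterator implements Iterator<Foo> {
        private final int[] data;
        private final Foo shared = new Foo();
        private int pos = 0;

        ReusingIterator(int... data) { this.data = data; }

        @Override
        public boolean hasNext() { return pos < data.length; }

        @Override
        public Foo next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            shared.a = data[pos++];
            return shared;
        }
    }

    /** Buggy: stores references, so every stored element shows the last value. */
    static List<Integer> keepReferences(Iterator<Foo> values) {
        List<Foo> kept = new ArrayList<>();
        while (values.hasNext()) {
            kept.add(values.next());
        }
        List<Integer> seen = new ArrayList<>();
        for (Foo f : kept) {
            seen.add(f.a);
        }
        return seen;
    }

    /** Safe: copies each value before the following next() overwrites it. */
    static List<Integer> keepCopies(Iterator<Foo> values) {
        List<Foo> kept = new ArrayList<>();
        while (values.hasNext()) {
            kept.add(new Foo(values.next()));
        }
        List<Integer> seen = new ArrayList<>();
        for (Foo f : kept) {
            seen.add(f.a);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(keepReferences(new ReusingIterator(1, 2, 3))); // prints [3, 3, 3]
        System.out.println(keepCopies(new ReusingIterator(1, 2, 3)));     // prints [1, 2, 3]
    }
}
```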

-Joey




-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434