Posted to mapreduce-user@hadoop.apache.org by Geoffry Roberts <ge...@gmail.com> on 2011/06/17 22:30:53 UTC

Mystery, A Tale of Two Reducers

All,

I have come across a situation that I don't understand.

*First Reducer:*

Behold the first of two reducers.  A fragment of its output follows.
Simple, no?  It doesn't do anything except write each value straight back
out.  I've highlighted two records from the output.  Keep them in mind.
Now let's look at the second reducer.

protected void reduce(Text key, Iterable<Text> visitors, Context ctx)
throws IOException, InterruptedException {
    for (Text visitor : visitors) {
        ctx.write(key, visitor);
    }
}

2005-09-16=33614    42340108    more==>
2005-09-16=33614    42340106    more==>
*2005-09-16=33614    42340113    more==>*
2005-09-16=44135    42324490    more==>
2005-09-16=44135    42339700    more==>
...
*2005-09-16=44135    42324489    more==>*


*Second Reducer:*

This is a variation on the reducer from above.  A fragment of its output
follows.  The difference is that I add all visitors to a list, then iterate
through the list to produce my output.  Remember the two highlighted records
from above?  They now show up in the output as duplicates, and the
other records appear to be missing.  Why?  I have never seen an ArrayList
behave like this.  It must have something to do with Hadoop.

I have reasons for using the list.  One such reason is that I must have a
full count of all visitors before I can do my output, but I'll spare you
the rest.

To my mind, this second reducer should output the same as the first.

protected void reduce(Text key, Iterable<Text> visitors, Context ctx)
throws IOException, InterruptedException {
    List<Text> list = new ArrayList<Text>();
    for (Text visitor : visitors) {
        list.add(visitor);
    }
    for (Text visitor : list) {
        ctx.write(key, visitor);
    }
}

2005-09-16=33614    42340113    more==>
2005-09-16=33614    42340113    more==>
2005-09-16=33614    42340113    more==>
2005-09-16=44135    42324489    more==>
2005-09-16=44135    42324489    more==>

Thanks in advance

-- 
Geoffry Roberts

Re: Mystery, A Tale of Two Reducers

Posted by Geoffry Roberts <ge...@gmail.com>.
This is for the edification of the group.

The clone solution worked.  Here's how I handled it.

Second Reducer (redux):

protected void reduce(Text key, Iterable<Text> visitors, Context ctx)
throws IOException, InterruptedException {
    List<Text> list = new ArrayList<Text>();
    for (Text visitor : visitors) {
        list.add(new Text(visitor));  // Copy the value; Hadoop reuses the same Text object.
    }
    for (Text visitor : list) {
        ctx.write(key, visitor);
    }
}
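
For the "full count before output" case mentioned in the original question, here is a minimal sketch that runs without Hadoop. The class CountThenEmit, the helper reusedValues, and the use of StringBuilder as a stand-in for a reused Text value are all assumptions made for illustration. The idea is to snapshot each value into an immutable String while iterating, count the snapshots, and only then emit. In a real reducer the analogous snapshot would presumably be visitor.toString() or new Text(visitor).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class CountThenEmit {

    /** Hypothetical helper: iterates over items but refills ONE shared StringBuilder each time. */
    static Iterable<StringBuilder> reusedValues(final String... items) {
        return () -> new Iterator<StringBuilder>() {
            private final StringBuilder shared = new StringBuilder();
            private int i = 0;

            @Override
            public boolean hasNext() { return i < items.length; }

            @Override
            public StringBuilder next() {
                if (!hasNext()) throw new NoSuchElementException();
                shared.setLength(0);        // overwrite the previous contents
                shared.append(items[i++]);
                return shared;              // same object on every call
            }
        };
    }

    /** Buffers immutable snapshots, counts them, then emits key, value and total per record. */
    static List<String> reduce(String key, Iterable<StringBuilder> visitors) {
        List<String> snapshots = new ArrayList<>();
        for (StringBuilder visitor : visitors) {
            snapshots.add(visitor.toString()); // copy taken before the next overwrite
        }
        int total = snapshots.size();          // full count is known only after the first pass
        List<String> out = new ArrayList<>();
        for (String s : snapshots) {
            out.add(key + "\t" + s + "\t" + total);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : reduce("2005-09-16=44135",
                reusedValues("42324490", "42339700", "42324489"))) {
            System.out.println(line);
        }
    }
}
```

Note that buffering every value for a key keeps them all in memory at once, so this only suits keys whose value lists are reasonably small.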

Life is good again.

On 17 June 2011 13:38, Harsh J <ha...@cloudera.com> wrote:

> Geoffry,
>
> The problem here is that Hadoop reuses the same container object to
> pass every key and value to the Reducer. So what your second reducer's
> list really holds are references to that one object, and by the time
> you write them out they all show whatever value was read last. That is
> why you see duplicates of the last record for each key.
>
> The solution, when you want to hold on to a particular key or value
> object, is to clone it, i.e. add a copy (for example new Text(value))
> to the list, so that the list stores real, new objects and not
> multiple references to the same object.
>
> --
> Harsh J
>



-- 
Geoffry Roberts

Re: Mystery, A Tale of Two Reducers

Posted by Harsh J <ha...@cloudera.com>.
Geoffry,

The problem here is that Hadoop reuses the same container object to
pass every key and value to the Reducer. So what your second reducer's
list really holds are references to that one object, and by the time
you write them out they all show whatever value was read last. That is
why you see duplicates of the last record for each key.

The solution, when you want to hold on to a particular key or value
object, is to clone it, i.e. add a copy (for example new Text(value))
to the list, so that the list stores real, new objects and not
multiple references to the same object.
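
The effect described above can be reproduced without Hadoop. The following is a minimal sketch, not Hadoop's actual implementation: the class ReuseDemo, the Holder type (a hypothetical stand-in for a mutable value such as Text), and the reusing helper are all invented for illustration. It shows that storing references to a reused object yields repeats of the last value, while storing copies preserves each value.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class ReuseDemo {

    /** Hypothetical mutable value type standing in for Hadoop's Text. */
    static final class Holder {
        private String value = "";

        Holder() { }

        /** Copy constructor: the new Holder gets its own snapshot of the value. */
        Holder(Holder other) { this.value = other.value; }

        void set(String v) { this.value = v; }

        @Override
        public String toString() { return value; }
    }

    /** Returns an Iterable whose iterator hands back ONE shared Holder, refilled on each next(). */
    static Iterable<Holder> reusing(final String... items) {
        return () -> new Iterator<Holder>() {
            private final Holder shared = new Holder();
            private int i = 0;

            @Override
            public boolean hasNext() { return i < items.length; }

            @Override
            public Holder next() {
                if (!hasNext()) throw new NoSuchElementException();
                shared.set(items[i++]);
                return shared;
            }
        };
    }

    /** Buggy pattern: the list ends up holding the same object many times. */
    static List<String> collectReferences(Iterable<Holder> values) {
        List<Holder> list = new ArrayList<>();
        for (Holder h : values) {
            list.add(h);
        }
        List<String> out = new ArrayList<>();
        for (Holder h : list) {
            out.add(h.toString());
        }
        return out;
    }

    /** Fixed pattern: each element is an independent copy. */
    static List<String> collectCopies(Iterable<Holder> values) {
        List<Holder> list = new ArrayList<>();
        for (Holder h : values) {
            list.add(new Holder(h));
        }
        List<String> out = new ArrayList<>();
        for (Holder h : list) {
            out.add(h.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [42340113, 42340113, 42340113]
        System.out.println(collectReferences(reusing("42340108", "42340106", "42340113")));
        // prints [42340108, 42340106, 42340113]
        System.out.println(collectCopies(reusing("42340108", "42340106", "42340113")));
    }
}
```

The first reducer in the original question is unaffected because it writes each value out before the iterator overwrites it.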




-- 
Harsh J