Posted to common-user@hadoop.apache.org by Mark Kerzner <ma...@gmail.com> on 2009/10/28 04:55:09 UTC

How to give consecutive numbers to output records?

Hi,

I need to number all output records consecutively: 1, 2, 3, ...

This is no problem with one reducer: make recordId an instance variable in
the Reducer class and call conf.setNumReduceTasks(1).
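
A minimal plain-Java sketch of that single-reducer idea, assuming nothing about the actual job. SingleReducerNumbering and its reduce method are hypothetical stand-ins for a real Hadoop Reducer, not Hadoop API, and the tab-separated output layout is invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SingleReducerNumbering {
    // Counterpart of the recordId instance variable from the post. It lives
    // as long as this object does, so it keeps counting across reduce calls.
    private long recordId = 0;

    // Hypothetical stand-in for one reduce() call: tags every value of a key
    // with the next consecutive number.
    List<String> reduce(String key, List<String> values) {
        List<String> output = new ArrayList<>();
        for (String value : values) {
            recordId++;
            output.add(recordId + "\t" + key + "\t" + value);
        }
        return output;
    }

    public static void main(String[] args) {
        SingleReducerNumbering reducer = new SingleReducerNumbering();
        List<String> all = new ArrayList<>();
        all.addAll(reducer.reduce("apple", Arrays.asList("a1", "a2")));
        all.addAll(reducer.reduce("banana", Arrays.asList("b1")));
        all.forEach(System.out::println);
    }
}
```

The numbers come out consecutive only because one object sees every record. With several reducers, each instance would keep its own counter starting at 1.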

However, a single reducer is an architectural decision forced by this
processing need, and that reducer becomes a bottleneck. Can I have a global
variable shared by all reducers that gives each one the next consecutive
recordId? In a database, this would be the unique auto-increment key. How
can I do this in MapReduce?

Thank you

Re: How to give consecutive numbers to output records?

Posted by Mark Kerzner <ma...@gmail.com>.
Aaron, although your notes are not a ready solution, they are a great
help.

Thank you,
Mark

On Tue, Oct 27, 2009 at 11:27 PM, Aaron Kimball <aa...@cloudera.com> wrote:

> There is no in-MapReduce mechanism for cross-task synchronization. You'll
> need to use something like ZooKeeper for this, or another external
> database. Note that this will greatly complicate your life.
>
> If I were you, I'd try to either redesign my pipeline elsewhere to
> eliminate this need, or maybe get really clever. For example, do your
> numbers need to be sequential, or just unique?
>
> If the latter, then take the byte offset into the reducer's current output
> file and combine that with the reducer id (e.g.,
> <current-byte-offset><zero-padded-reducer-id>) to guarantee that they're
> all building unique sequences. If the former... rethink your pipeline? :)
>
> - Aaron

Re: How to give consecutive numbers to output records?

Posted by Aaron Kimball <aa...@cloudera.com>.
There is no in-MapReduce mechanism for cross-task synchronization. You'll
need to use something like ZooKeeper for this, or another external database.
Note that this will greatly complicate your life.

If I were you, I'd try to either redesign my pipeline elsewhere to eliminate
this need, or maybe get really clever. For example, do your numbers need to
be sequential, or just unique?

If the latter, then take the byte offset into the reducer's current output
file and combine that with the reducer id (e.g.,
<current-byte-offset><zero-padded-reducer-id>) to guarantee that they're all
building unique sequences. If the former... rethink your pipeline? :)
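
The <current-byte-offset><zero-padded-reducer-id> scheme could be sketched in plain Java roughly as follows. The class name UniqueRecordId, the compose method, and the five-digit pad width are all assumptions made for illustration; how a job would obtain the byte offset and the reducer id at run time is not shown.

```java
import java.util.Locale;

final class UniqueRecordId {
    // Assumed width of the zero-padded reducer id; five digits covers
    // reducer ids 0 through 99999.
    static final int REDUCER_ID_WIDTH = 5;
    static final int MAX_REDUCER_ID = 99_999;

    // Concatenates the byte offset with the fixed-width reducer id. Because
    // the suffix always has the same width, the id splits unambiguously back
    // into (offset, reducer id), so two different pairs never collide.
    static String compose(long byteOffset, int reducerId) {
        if (byteOffset < 0) {
            throw new IllegalArgumentException("byteOffset must be >= 0");
        }
        if (reducerId < 0 || reducerId > MAX_REDUCER_ID) {
            throw new IllegalArgumentException("reducerId out of range");
        }
        String pattern = "%d%0" + REDUCER_ID_WIDTH + "d";
        return String.format(Locale.ROOT, pattern, byteOffset, reducerId);
    }

    public static void main(String[] args) {
        System.out.println(compose(0L, 3));    // prints 000003
        System.out.println(compose(128L, 3));  // prints 12800003
        System.out.println(compose(128L, 12)); // prints 12800012
    }
}
```

The resulting ids are unique but neither consecutive nor dense, and they stay unique within one reducer only if every record advances the output offset by at least one byte.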

- Aaron
