Posted to common-user@hadoop.apache.org by Kai Londenberg <Lo...@nurago.com> on 2010/02/23 16:00:10 UTC

New or old Map/Reduce API ?

Hi ..

I'm trying to gather information to decide which version of the Map/Reduce API to use for our Map/Reduce jobs. We have existing production code that uses the old API and some new, non-production code that uses the new API. Some of our developers will definitely not have much time to dig into the Hadoop sources to figure out how to do things right in the new API (instead of being able to look it up in a book), so the state of the documentation does matter.

So far I got the following information:


- The old API is deprecated and will be removed, but that will probably take at least a year.
- The new API does not really provide new functionality.
- There is library and contrib code that has not been ported to the new API yet.
- Most of the existing thorough documentation (like "Hadoop: The Definitive Guide") covers the old API.
- Porting to the new API will probably become easier in future versions of Hadoop, once more of the lib code and docs have been ported.

So, what are your experiences with the new vs. the old API? Would you recommend switching to the new API right now, or waiting for a later release? Is it problematic to have applications using the old and new APIs side by side? How hard is it currently to port old code to the new API?

If these questions have been covered by some other thread already, please point me to it. I could not find much of a discussion browsing the mailing list archives, though.

Thanks in advance for any advice you can give,


Kai Londenberg


Re: New or old Map/Reduce API ?

Posted by Patrick Angeles <pa...@cloudera.com>.
Hello Kai,

To answer your questions:

- Most of what is missing from the new API is convenience classes --
InputFormats, OutputFormats, etc. One very handy class that is missing from
the new API is MultipleOutputs, which allows you to write multiple files in a
single pass.
- You cannot mix classes from the old API and the new API in the same
MapReduce job.
- However, you can run different jobs that use either the old or the new API
in the same cluster, and their data inputs and outputs should be compatible.
- Figuring out how to do things right in the new versus the old API is
really a matter of having a template MapReduce job (e.g., WordCount) for
each API. The surrounding classes are different, but the way you write your
map and reduce functions is largely the same.
- Porting old code to the new API is not hard (especially with an IDE), but
it can get tedious. As an exercise, you could try porting WordCount from the
old API to the new one.
- It will probably take more than a year to fully remove the old APIs.
Today, the old API is marked as 'deprecated', which is really a misnomer
since the new API doesn't yet provide a full alternative to the old one.
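
To illustrate the point about templates: here is a rough, self-contained
sketch of how the shape of a WordCount mapper differs between the two APIs
while the body stays the same. Note that every type below (OldMapper,
OldCollector, OldReporter, NewMapper, NewContext) is a simplified stand-in
written for this example, not the real Hadoop classes. They only mimic
org.apache.hadoop.mapred (Mapper as an interface, output via
OutputCollector plus a Reporter) and org.apache.hadoop.mapreduce (Mapper as a
class you extend, output via a single Context), with plain Long/String/Integer
instead of Writables, so the snippet runs without Hadoop on the classpath.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Stand-in types only: they mimic the *shape* of the two Hadoop APIs and
// are NOT the real org.apache.hadoop classes.
class ApiShapeSketch {

    // Old-API shape: Mapper is an interface; output goes through a
    // collector, and a separate reporter is passed alongside it.
    interface OldCollector<K, V> { void collect(K key, V value); }
    interface OldReporter { void progress(); }
    interface OldMapper<K1, V1, K2, V2> {
        void map(K1 key, V1 value, OldCollector<K2, V2> output, OldReporter reporter);
    }

    // New-API shape: Mapper is a class to extend; one context object
    // replaces both collector and reporter. (Simplified: abstract here.)
    interface NewContext<K, V> { void write(K key, V value); }
    abstract static class NewMapper<K1, V1, K2, V2> {
        protected abstract void map(K1 key, V1 value, NewContext<K2, V2> context);
    }

    // WordCount mapper written against the old-API shape.
    static class OldWordCountMapper implements OldMapper<Long, String, String, Integer> {
        @Override
        public void map(Long offset, String line,
                        OldCollector<String, Integer> output, OldReporter reporter) {
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                output.collect(tokens.nextToken(), 1);
            }
        }
    }

    // The same mapper ported to the new-API shape: only the signature and
    // the emit call change, the tokenizing loop is identical.
    static class NewWordCountMapper extends NewMapper<Long, String, String, Integer> {
        @Override
        protected void map(Long offset, String line, NewContext<String, Integer> context) {
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                context.write(tokens.nextToken(), 1);
            }
        }
    }

    // Tiny in-memory "driver" for the old shape; sums the emitted ones per word.
    static Map<String, Integer> countWithOldShape(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        OldCollector<String, Integer> collector =
                (word, one) -> counts.merge(word, one, Integer::sum);
        OldReporter reporter = () -> { };
        OldMapper<Long, String, String, Integer> mapper = new OldWordCountMapper();
        long offset = 0;
        for (String line : lines) {
            mapper.map(offset, line, collector, reporter);
            offset += line.length() + 1;
        }
        return counts;
    }

    // Same driver for the new shape.
    static Map<String, Integer> countWithNewShape(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        NewContext<String, Integer> context =
                (word, one) -> counts.merge(word, one, Integer::sum);
        NewWordCountMapper mapper = new NewWordCountMapper();
        long offset = 0;
        for (String line : lines) {
            mapper.map(offset, line, context);
            offset += line.length() + 1;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("new api old api", "old api");
        System.out.println("old shape: " + countWithOldShape(lines));
        System.out.println("new shape: " + countWithNewShape(lines));
    }
}
```

In a real port you would additionally switch the job setup (JobConf and
JobClient in the old API, Job in the new one) and the input/output format
classes to the matching package, which is where most of the tedium comes from.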

My recommendation would be to learn and use both. Personally, I find the new
API to be cleaner and more 'Java-like'. For someone just picking this stuff
up, the old API might be easier, as it corresponds exactly to the examples
in print.

Regards,

- Patrick
