Posted to mapreduce-user@hadoop.apache.org by harry lippy <ha...@gmail.com> on 2011/09/16 15:26:35 UTC

Question about how input data is presented to the map function

Hi from a total noob:

I'm working my way through 'Hadoop: The Definitive Guide' by Tom White.
In chapter 2, he works through an example of taking weather data from the
NCDC and computing the maximum temperature for the given years. There is a
small sample test file to try out the code, and it looks like:

0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999

In the middle of page 19, he says

"These lines are presented to the map function as the key-value pairs:

(0,
0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999)
(106,
0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999)
(212,
0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999)
(318,
0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999)
(424,
0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999"

The keys are file offsets into the input file. My question: how does the
'presented to the map function as key-value pairs' part happen? I've run
the example on the input file using the Java Mapper, Reducer, and the code
that runs the job - none of which seems, to my novice eye, to massage the
input from the file into the (file offset, line of data from file)
key-value format - and yet the results are correct.
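(Editor's aside: the keys above can be reproduced just by summing line lengths. A minimal standalone sketch - the `OffsetDemo` class and sample lines are hypothetical, not from the book - mirroring how line-start byte offsets are derived:)

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetDemo {
    // Byte offset at which each '\n'-terminated line begins,
    // mirroring the keys the framework hands to the mapper.
    static List<Long> lineOffsets(String data) {
        List<Long> offsets = new ArrayList<>();
        long offset = 0;
        for (String line : data.split("\n", -1)) {
            // skip the empty element produced by a trailing newline
            if (line.isEmpty() && offset == data.length()) break;
            offsets.add(offset);
            offset += line.length() + 1; // +1 for the newline byte
        }
        return offsets;
    }

    public static void main(String[] args) {
        String data = "line-one\nline-two\nline-three\n";
        System.out.println(lineOffsets(data)); // [0, 9, 18]
    }
}
```

With the 106-byte-spaced NCDC sample lines this yields exactly the keys 0, 106, 212, 318, 424 quoted above.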

Does Hadoop automagically create key-value pairs in this format (file
offset, line of data from file)? If so, is there a way to get Hadoop to
present the data to the map function in a different format?

I should probably finish reading the book, as my question is likely
answered there, but I hate moving forward with the feeling that I'm
missing something.

Thanks, everybody!

Shaun

Re: Question about how input data is presented to the map function

Posted by John Armstrong <jo...@ccri.com>.
On Fri, 16 Sep 2011 08:26:35 -0500, harry lippy <ha...@gmail.com>
wrote:
> The keys are file offsets into the input file.  My question:  how did
> the 'are presented to the map function as key-value pairs' happen?
> I've run the example on the input file using the java Mapper, Reducer,
> and the code that runs the job - none of which seems, to my novice eye,
> to massage the input from the file to the map function in the (file
> offset, line of data from file) key-value format - and the results are
> correct.

There are actually MANY classes floating around in the framework, most of
which you Don't Need to Know About on a day-to-day basis.  One of them is
called an InputFormat, which handles reading the input and parsing it into
records.  These classes can all be swapped out with the appropriate
configuration, but the defaults are usually good enough for most purposes.
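For instance (a sketch, not tested against your job): with the org.apache.hadoop.mapreduce API, the default TextInputFormat is what produces your (LongWritable byte offset, Text line) pairs, and you can swap in another stock implementation such as KeyValueTextInputFormat, which splits each line at the first tab instead.

```java
// Sketch of job setup; assumes the Hadoop jars are on the classpath
// and an era-appropriate (0.21/1.x) version of the new MapReduce API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "max temperature");
        // Default is TextInputFormat: keys are LongWritable byte offsets,
        // values are Text lines.  KeyValueTextInputFormat instead splits
        // each line at the first tab into a (Text, Text) pair, so the
        // Mapper's generic type parameters must change to match.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
    }
}
```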