Posted to common-user@hadoop.apache.org by praveenesh kumar <pr...@gmail.com> on 2011/05/05 13:31:53 UTC

How does Hadoop parse input files into (Key,Value) pairs?

Hi,

As we know, a Hadoop mapper takes its input as (Key,Value) pairs and generates
intermediate (Key,Value) pairs, and usually we give the input to our Mapper as
a text file.
How does Hadoop understand this and parse our input text file into (Key,Value)
pairs?

Usually our mapper looks like this:

public void map(LongWritable key, Text value,
    OutputCollector<Text, Text> outputCollector, Reporter reporter)
    throws IOException {
  String word = value.toString();
  // some lines of code
}

So if I pass any text file as input, every line is handed to the Mapper as the
VALUE, on which I do some processing before putting it to the OutputCollector.
But how did Hadoop parse my text file into (Key,Value) pairs, and how can we
tell Hadoop which (key,value) it should give to the mapper?

Thanks.

Re: How does Hadoop parse input files into (Key,Value) pairs?

Posted by praveenesh kumar <pr...@gmail.com>.
Hi,
So I have a file in which the records are comma separated
(Record1,Record2). I want to make the first field (Record1) the key and
Record2 the value.
I am using the hadoop-0.20-append version.
I am looking to use KeyValueTextInputFormat and set
key.value.separator.in.input.line to ",". Is this possible with
hadoop-0.20-append? I have not been able to get it working.
Any help?
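
For reference, this is roughly how such a job is wired up with the old
mapred API that 0.20 ships (a minimal sketch; the driver class name, job
name, and argument handling are made up, and the mapper/reducer setup is
omitted):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

public class CommaSeparatedJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CommaSeparatedJob.class); // hypothetical driver class
    conf.setJobName("comma-separated"); // hypothetical job name
    // Split each line at the first "," so that Record1 becomes the key
    // and the rest of the line becomes the value.
    conf.set("key.value.separator.in.input.line", ",");
    conf.setInputFormat(KeyValueTextInputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

With this setup the mapper receives (Text, Text) pairs, so its signature is
map(Text key, Text value, ...) rather than the (LongWritable, Text) that
TextInputFormat produces.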

Thanks,
Praveenesh

On Mon, May 23, 2011 at 3:45 AM, Mark question <ma...@gmail.com> wrote:

> [snip]

Re: How does Hadoop parse input files into (Key,Value) pairs?

Posted by Mark question <ma...@gmail.com>.
The case you're talking about is when you use FileInputFormat, so it's the
InputFormat interface that is responsible for that.

The default text flavor, TextInputFormat, uses a LineRecordReader, which
takes your text file and assigns the key to be the byte offset within the
file and the value to be the line, up to where '\n' is seen.
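
For instance, take a made-up two-line file:

apple
banana

TextInputFormat would hand the mapper (0, "apple") and (6, "banana"): each
key is the byte offset at which the line starts, and "apple" plus its
trailing newline occupies bytes 0 through 5.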

If you want to use other InputFormats, check the API and pick what is
suitable for you. In my case, I'm hooked on SequenceFileInputFormat, where
my input files are <key,value> records written by a regular Java program (or
parser). My Hadoop job then sees the keys and values that I wrote.
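
As an illustration, such a regular Java program could look roughly like this
(a minimal sketch; the output path and the record contents are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/records.seq"); // hypothetical output path
    // Each append() writes one <key,value> record; a job that reads this
    // file with SequenceFileInputFormat receives exactly these pairs.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, Text.class, IntWritable.class);
    try {
      writer.append(new Text("apple"), new IntWritable(1));
      writer.append(new Text("banana"), new IntWritable(2));
    } finally {
      writer.close();
    }
  }
}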

I hope this helps a little,
Mark

On Thu, May 5, 2011 at 4:31 AM, praveenesh kumar <pr...@gmail.com> wrote:

> [snip]

Re: How does Hadoop parse input files into (Key,Value) pairs?

Posted by Joey Echeverria <jo...@cloudera.com>.
Hadoop uses an InputFormat class to parse files and generate key,
value pairs for your Mapper. An InputFormat is any class which extends
the base abstract class:

http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapreduce/InputFormat.html

The default InputFormat, TextInputFormat, parses text files, generating keys
that are byte offsets and values that are complete lines of text:

http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.html

You can write your own InputFormat and configure your job to use it by
calling setInputFormatClass() on your Job before submitting it:

http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapreduce/Job.html#setInputFormatClass(java.lang.Class)
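
A minimal driver sketch showing where that call goes (the job name and
argument handling are made up; SequenceFileInputFormat stands in for
whatever InputFormat class you actually want):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "my-job"); // 0.20-era constructor
    job.setJarByClass(Driver.class);
    // Replace the default TextInputFormat; your own custom
    // InputFormat class would go here instead.
    job.setInputFormatClass(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}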

Hope that helps.

-Joey

P.S. I moved this over to the mapreduce-user alias since it's
MapReduce specific.

On Thu, May 5, 2011 at 7:31 AM, praveenesh kumar <pr...@gmail.com> wrote:

> [snip]



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434