Posted to mapreduce-user@hadoop.apache.org by Steve Lewis <lo...@gmail.com> on 2013/09/21 21:30:05 UTC

A couple of Questions on InputFormat

Classes implementing InputFormat implement
 public List<InputSplit> getSplits(JobContext job), which returns a List of
InputSplits. For FileInputFormat the splits have a Path, a start, and an end.

1) When is this method called, in which JVM on which machine, and is it
called only once?

2) Does the number of map tasks correspond to the number of splits returned by
getSplits?

3) InputFormat declares a method
 RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext
context). Is this executed within the JVM of the Mapper on the slave
machine, and does the RecordReader run within that JVM?

4) The default RecordReaders read a file from the start position to the end
position, emitting values in the order read. With such a reader, assuming it
is reading lines of text, is it reasonable to assume that the values the
mapper receives are in the same order they were found in the file? Would it,
for example, be possible for WordCount to see a word that was hyphen-
ated at the end of one line and append the first word of the next line it
sees (ignoring the case where the word is at the end of a split)?

Re: A couple of Questions on InputFormat

Posted by Harsh J <ha...@cloudera.com>.
Hi,

Yes, that is right.

On Mon, Sep 23, 2013 at 9:04 PM, Steve Lewis <lo...@gmail.com> wrote:
> Thank you for your thorough answer.
> The last question is essentially this: while I can write a custom input
> format to handle things like hyphens, I could do almost the same thing in
> the mapper by saving any hyphenated word from the previous line (ignoring
> hyphenated words that cross a split boundary), as long as LineRecordReader
> guarantees that each line in the split is sent to the same mapper in the
> order read. This seems to be the case, right?



-- 
Harsh J

Re: A couple of Questions on InputFormat

Posted by Steve Lewis <lo...@gmail.com>.
Thank you for your thorough answer.
The last question is essentially this: while I can write a custom input
format to handle things like hyphens, I could do almost the same thing in
the mapper by saving any hyphenated word from the previous line (ignoring
hyphenated words that cross a split boundary), as long as LineRecordReader
guarantees that each line in the split is sent to the same mapper in the
order read. This seems to be the case, right?
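
The mapper-side idea can be sketched in plain Java, with no Hadoop
dependency. This is only an illustration under stated assumptions:
HyphenCarryTokenizer, acceptLine, and flush are hypothetical names, and in a
real job the carried fragment would presumably live in a field of the Mapper,
with acceptLine called from map() and flush from cleanup(). It relies on the
lines of a split arriving in file order, and it does not handle a word
hyphenated across a split boundary.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: joins a word hyphenated at a line end with the first word of the next line. */
final class HyphenCarryTokenizer {
    // Fragment left over from the previous line, without its trailing hyphen; empty if none.
    private String carry = "";

    /** Returns the complete words of one line. Must be called once per line, in file order. */
    List<String> acceptLine(String line) {
        List<String> words = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return words; // blank line: keep any pending fragment for the next line
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            String token = tokens[i];
            if (i == 0 && !carry.isEmpty()) {
                token = carry + token; // complete the word started on the previous line
                carry = "";
            }
            boolean lastOnLine = (i == tokens.length - 1);
            if (lastOnLine && token.length() > 1 && token.endsWith("-")) {
                // Hold the fragment back until the next line arrives.
                carry = token.substring(0, token.length() - 1);
            } else {
                words.add(token);
            }
        }
        return words;
    }

    /** Returns any fragment still pending after the last line (for example at the end of a split). */
    List<String> flush() {
        List<String> words = new ArrayList<>();
        if (!carry.isEmpty()) {
            words.add(carry);
            carry = "";
        }
        return words;
    }

    public static void main(String[] args) {
        HyphenCarryTokenizer tokenizer = new HyphenCarryTokenizer();
        System.out.println(tokenizer.acceptLine("a word that was hyphen-"));
        System.out.println(tokenizer.acceptLine("ated at the end"));
        System.out.println(tokenizer.flush());
    }
}
```

The sketch treats every trailing hyphen as a line-break hyphen, so a line that
legitimately ends in a dash would be joined to the next word as well.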




-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Re: A couple of Questions on InputFormat

Posted by Harsh J <ha...@cloudera.com>.
Hi,

(I'm assuming 1.0~ MR here)

On Sun, Sep 22, 2013 at 1:00 AM, Steve Lewis <lo...@gmail.com> wrote:
> Classes implementing InputFormat implement
>  public List<InputSplit> getSplits(JobContext job) which a List if
> InputSplits. for FileInputFormat the Splits have Path.start and End
>
> 1) When is this method called and on which JVM on Which Machine and is it
> called only once?

It is called only on the client, i.e. in your "hadoop jar" JVM, and only once.

> 2) Do the number of Map task correspond to the number of splits returned by
> getSplits?

Yes, the number of split objects == the number of map tasks.
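
To make that correspondence concrete, here is a simplified sketch in plain
Java of cutting one file into fixed-size byte ranges. It is not the
FileInputFormat implementation, which also takes the block size and the
configured minimum and maximum split sizes into account; SplitSketch,
SimpleSplit, and computeSplits are hypothetical names. One map task would be
scheduled per element of the returned list.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitSketch {
    /** Hypothetical stand-in for a file split: the byte range [start, start + length). */
    static final class SimpleSplit {
        final long start;
        final long length;

        SimpleSplit(long start, long length) {
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return "start=" + start + ",length=" + length;
        }
    }

    /** Cuts a file of fileLength bytes into consecutive ranges of at most splitSize bytes. */
    static List<SimpleSplit> computeSplits(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<SimpleSplit> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            // The last range may be shorter than splitSize.
            splits.add(new SimpleSplit(start, Math.min(splitSize, fileLength - start)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<SimpleSplit> splits = computeSplits(250, 100);
        System.out.println(splits.size() + " splits, so " + splits.size() + " map tasks: " + splits);
    }
}
```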

> 3) InputFormat implements a method
>  RecordReader<K,V> createRecordReader(InputSplit split,TaskAttemptContext
> context ). Is this  executed within the JVM of the Mapper on the slave
> machine and does the RecordReader run within that JVM

RecordReaders are not created in the client-side JVM. They are created
in the map task JVMs and run inside them.

> 4) The default RecordReaders read a file from the start position to the end
> position emitting values in the order read. With such a reader, assume it is
> reading lines of text, is it reasonable to assume that the values the mapper
> received are in the same order they were found in a file? Would it, for
> example, be possible for WordCount to see a word that was hyphen-
> ated at the end of one line and append the first word of the next line it
> sees (ignoring the case where the word is at the end of a split)

If you mean the LineRecordReader, each map() call simply receives one
line, i.e. the text up to the next \n. It is not language-aware, so it
does not understand the meaning of hyphens, etc.

You can, however, implement a custom reader to do this. There should be
no problems so long as your logic ensures that no record is read twice
across multiple maps.
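
One common way to avoid duplicate reads is a boundary convention, sketched
here in plain Java over an in-memory string. A reader whose split does not
start at offset 0 discards everything up to and including the first newline,
and every reader keeps going past the end of its split to finish a line that
starts at or before that end. Each line is then produced by exactly one
reader. As far as I know this is the same idea LineRecordReader is built on,
but the code below is not taken from it; SplitLineReaderSketch and readSplit
are hypothetical names, and the sketch assumes \n line endings.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitLineReaderSketch {
    /** Returns the lines owned by the split [start, start + length) of data. */
    static List<String> readSplit(String data, int start, int length) {
        int end = start + length;
        int pos = start;
        List<String> lines = new ArrayList<>();
        if (start != 0) {
            // The first (possibly partial) line belongs to the previous split: skip it.
            int newline = data.indexOf('\n', pos);
            if (newline < 0) {
                return lines;
            }
            pos = newline + 1;
        }
        // Read every line that starts at or before 'end', even if it finishes after it.
        while (pos <= end && pos < data.length()) {
            int newline = data.indexOf('\n', pos);
            int lineEnd = (newline < 0) ? data.length() : newline;
            lines.add(data.substring(pos, lineEnd));
            pos = lineEnd + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        String data = "alpha\nbeta\ngamma\ndelta\n";
        int splitSize = 7;
        for (int start = 0; start < data.length(); start += splitSize) {
            int length = Math.min(splitSize, data.length() - start);
            System.out.println("split at " + start + ": " + readSplit(data, start, length));
        }
    }
}
```

A custom reader that joins hyphenated words would need an equivalent rule for
a word that straddles two splits, so that exactly one of the two readers
emits it.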

-- 
Harsh J