Posted to common-user@hadoop.apache.org by maha <ma...@umail.ucsb.edu> on 2010/12/15 11:13:59 UTC
Deprecated ... damaged?
Hi everyone,
Using Hadoop-0.20.2, I'm trying to use MultiFileInputFormat which is supposed to put each file from the input directory in a SEPARATE split. So the number of maps should equal the number of input files. Yet what I get is that each split contains the paths of multiple input files, hence # of maps < # of input files. Is it because "MultiFileInputFormat" is deprecated?
In my implemented myMultiFileInputFormat I have only the following:
public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf job, Reporter reporter) {
    return new myRecordReader((MultiFileSplit) split);
}
Yet in myRecordReader, one split contains, for example, the following:
" /tmp/input/file1:0+300
/tmp/input/file2:0+199 "
instead of each of these lines (files) being in its own split.
Why? Any clues?
Thank you,
Maha
Re: Deprecated ... damaged?
Posted by maha <ma...@umail.ucsb.edu>.
Actually, I just realized that numSplits can't be set definitively. Even if I write numSplits = 5, it's just a hint.
Then how come MultiFileInputFormat claims to use MultiFileSplit to hold one file per split? Or is that also just a hint?
Maha
Re: Deprecated ... damaged?
Posted by maha <ma...@umail.ucsb.edu>.
Hi Allen, and thanks for responding.
Your answer actually gave me another clue: I set numSplits = numFiles*100; in myInputFormat and it worked :D ... Do you think there are side effects of doing that?
Thank you,
Maha
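A rough way to see why numSplits acts only as a hint, and what inflating it does, is to model the packing in plain Java. The sketch below assumes files are assigned to splits in listing order until the cumulative byte count reaches (i + 1) * totalLength / numSplits for split i. That is a simplified reading of how MultiFileInputFormat in 0.20.x appears to balance splits by size, not the Hadoop source, and the names SplitPackingModel and filesPerSplit are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of size-balanced split packing.
// It does not use or reproduce any Hadoop class.
class SplitPackingModel {

    // Returns the number of files placed in each non-empty split.
    static List<Integer> filesPerSplit(long[] lengths, int numSplits) {
        List<Integer> result = new ArrayList<>();
        long total = 0;
        for (long len : lengths) {
            total += len;
        }
        double avgPerSplit = (double) total / numSplits;

        long cumulative = 0; // bytes already assigned to earlier splits
        int start = 0;       // index of the first unassigned file
        for (int i = 0; i < numSplits && start < lengths.length; i++) {
            long goal = (long) ((i + 1) * avgPerSplit);
            long partial = 0;
            int count = lengths.length - start; // fallback: take all remaining files
            for (int j = start; j < lengths.length; j++) {
                partial += lengths[j];
                if (cumulative + partial >= goal) {
                    count = j - start + 1;
                    break;
                }
            }
            cumulative += partial; // partial is exactly the bytes of the files taken
            start += count;
            result.add(count);
        }
        return result;
    }

    public static void main(String[] args) {
        // The two files from the original post, with a hint of one split.
        System.out.println(filesPerSplit(new long[] {300, 199}, 1));         // [2]
        // A hint equal to the file count can still merge small files.
        System.out.println(filesPerSplit(new long[] {10, 10, 300, 199}, 4)); // [3, 1]
        // numFiles * 100 separates these files.
        System.out.println(filesPerSplit(new long[] {10, 10, 300, 199}, 400)); // [1, 1, 1, 1]
        // But a tiny file next to a huge one is still merged.
        System.out.println(filesPerSplit(new long[] {1, 1000000}, 200));     // [2]
    }
}
```

Under this model, multiplying numSplits shrinks the per-split byte target so that most files end up alone, but it is not a guarantee: a file much smaller than the target, as in the last case, is still merged with its neighbour. If one map per file is a hard requirement, building one split per path explicitly would be more reliable than tuning the hint.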
On Dec 15, 2010, at 12:16 PM, Allen Wittenauer wrote:
>
> On Dec 15, 2010, at 2:13 AM, maha wrote:
>
>> Hi everyone,
>>
>> Using Hadoop-0.20.2, I'm trying to use MultiFileInputFormat which is supposed to put each file from the input directory in a SEPARATE split.
>
>
> Is there some reason you don't just use normal InputFormat with an extremely high min.split.size?
>
Re: Deprecated ... damaged?
Posted by Allen Wittenauer <aw...@linkedin.com>.
On Dec 15, 2010, at 2:13 AM, maha wrote:
> Hi everyone,
>
> Using Hadoop-0.20.2, I'm trying to use MultiFileInputFormat which is supposed to put each file from the input directory in a SEPARATE split.
Is there some reason you don't just use normal InputFormat with an extremely high min.split.size?
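To illustrate the suggestion above: the old-API FileInputFormat picks a split size of max(minSize, min(goalSize, blockSize)) and cuts each file separately, never merging files. The plain-Java sketch below models that arithmetic to show why a very large minimum split size leaves every non-empty file in exactly one split. The class name MinSplitSizeModel and the method splitsForFile are hypothetical, the 1.1 slop factor is an assumption about the 0.20 behaviour, and empty files are ignored. In 0.20 the setting is, as far as I know, the mapred.min.split.size property.

```java
// Hypothetical model of per-file split counting; not Hadoop code.
class MinSplitSizeModel {

    // Assumed tolerance: the last chunk may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    // Split size formula used by the old-API FileInputFormat.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Number of splits a single splittable file of fileLength bytes is cut into.
    static int splitsForFile(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // the final, possibly short, chunk
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb;
        long fileLength = 200 * mb;
        long goalSize = fileLength; // total input bytes divided by a hint of 1

        // Default-like minimum: the file is cut at block-sized boundaries.
        long small = computeSplitSize(goalSize, 1, blockSize);
        System.out.println(splitsForFile(fileLength, small)); // 4

        // Extremely high minimum: the whole file stays in a single split.
        long huge = computeSplitSize(goalSize, Long.MAX_VALUE, blockSize);
        System.out.println(splitsForFile(fileLength, huge));  // 1
    }
}
```

Because files are never combined in this scheme, one split per file follows as long as no file is cut, which is what the high minimum achieves.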