Posted to common-user@hadoop.apache.org by maha <ma...@umail.ucsb.edu> on 2010/12/15 11:13:59 UTC

Deprecated ... damaged?

Hi everyone,

  Using Hadoop-0.20.2, I'm trying to use MultiFileInputFormat, which is supposed to put each file from the input directory in a SEPARATE split, so that the number of maps equals the number of input files. Yet what I get is that each split contains multiple input file paths, hence the number of maps is less than the number of input files. Is it because "MultiFileInputFormat" is deprecated?

  In my implementation, myMultiFileInputFormat, I have only the following:

public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf job, Reporter reporter) {
    return new myRecordReader((MultiFileSplit) split);
}

Yet, in myRecordReader, one split contains, for example, the following:
  
  " /tmp/input/file1:0+300
    /tmp/input/file2:0+199  "

  instead of each file being in its own split.

    Why? Any clues?

          Thank you,
              Maha

Re: Deprecated ... damaged?

Posted by maha <ma...@umail.ucsb.edu>.

Actually, I just realized that numSplits can't be set definitively. Even if I write numSplits = 5, it's just a hint.

Then how come MultiFileInputFormat claims to use MultiFileSplit to hold one file per split? Or is that also just a hint?

Maha

Re: Deprecated ... damaged?

Posted by maha <ma...@umail.ucsb.edu>.

Hi Allen, and thanks for responding.

   Your answer actually gave me another clue: I set numSplits = numFiles*100; in myInputFormat and it worked :D ... Do you think there are any side effects of doing that?

   Thank you,

       Maha
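
The numFiles*100 result above is consistent with a simple model of split grouping. The sketch below assumes, as a simplification, that the input format aims for splits of roughly equal total length (total bytes divided by numSplits) and packs whole files into a split until that running target is reached. It is plain Java, not Hadoop code, and the class and method names are hypothetical. The lengths 300 and 199 come from the split shown in the original message; the two 50-byte files are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of length-balanced file grouping; not part of Hadoop.
class SplitGroupingSketch {

    // Returns how many whole files end up in each split.
    static List<Integer> filesPerSplit(long[] lengths, int numSplits) {
        long total = 0;
        for (long len : lengths) {
            total += len;
        }
        // Target number of bytes per split if the input were divided evenly.
        double avgPerSplit = (double) total / numSplits;

        List<Integer> counts = new ArrayList<>();
        long cumulative = 0; // bytes assigned to splits so far
        int next = 0;        // index of the next unassigned file
        for (int i = 0; i < numSplits && next < lengths.length; i++) {
            long goal = (long) ((i + 1) * avgPerSplit);
            int count = 0;
            while (next < lengths.length) {
                cumulative += lengths[next++];
                count++;
                // Stop once the running target is reached; the last
                // requested split simply takes every remaining file.
                if (cumulative >= goal && i < numSplits - 1) {
                    break;
                }
            }
            counts.add(count);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] lengths = {300, 199, 50, 50};
        System.out.println(filesPerSplit(lengths, 1));   // [4]
        System.out.println(filesPerSplit(lengths, 2));   // [1, 3]
        System.out.println(filesPerSplit(lengths, 400)); // [1, 1, 1, 1]
    }
}
```

Under this model, a numSplits far above the file count pushes the per-split target below the size of any single file, so every split closes after one file and the extra requested splits are never created.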

On Dec 15, 2010, at 12:16 PM, Allen Wittenauer wrote:

> 
> On Dec 15, 2010, at 2:13 AM, maha wrote:
> 
>> Hi everyone,
>> 
>> Using Hadoop-0.20.2, I'm trying to use MultiFileInputFormat which is supposed to put each file from the input directory in a SEPARATE split.
> 
> 
> 	Is there some reason you don't just use a normal InputFormat with an extremely high min.split.size?
>

Re: Deprecated ... damaged?

Posted by Allen Wittenauer <aw...@linkedin.com>.

On Dec 15, 2010, at 2:13 AM, maha wrote:

> Hi everyone,
> 
>  Using Hadoop-0.20.2, I'm trying to use MultiFileInputFormat which is supposed to put each file from the input directory in a SEPARATE split.


	Is there some reason you don't just use a normal InputFormat with an extremely high min.split.size?
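
For reference, the suggestion above relies on how the old-API FileInputFormat chooses a split size, which as far as I recall is max(minSize, min(goalSize, blockSize)), where goalSize is the total input size divided by the numSplits hint. FileInputFormat does not combine different files into one split, so a very large minimum leaves each file as exactly one split. The sketch below is plain Java with hypothetical names, not Hadoop code; it ignores the small slack Hadoop allows on a file's last split, and the 64 MB block size and 200 MB file are assumed values. In 0.20 the setting is presumably the mapred.min.split.size property.

```java
// Hypothetical model of FileInputFormat split sizing; not part of Hadoop.
class SplitSizeSketch {

    // Split size rule: never below minSize, otherwise the smaller of
    // the goal size and the block size.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Number of splits one non-empty file is cut into (simplified:
    // plain ceiling division, written to avoid overflow for huge sizes).
    static long splitsForFile(long fileLength, long splitSize) {
        return fileLength / splitSize + (fileLength % splitSize == 0 ? 0 : 1);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb;    // assumed block size
        long fileLength = 200 * mb;  // assumed single input file
        long goalSize = fileLength / 2; // numSplits hint of 2

        // Default-like minimum of 1 byte: the block size wins.
        long small = computeSplitSize(goalSize, 1, blockSize);
        System.out.println(splitsForFile(fileLength, small)); // 4

        // Extremely high minimum: the whole file is a single split.
        long huge = computeSplitSize(goalSize, Long.MAX_VALUE, blockSize);
        System.out.println(splitsForFile(fileLength, huge)); // 1
    }
}
```

One trade-off of this route is that a large file is then read by a single map task, so it suits inputs made of many small files better than a few very large ones.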