Posted to common-user@hadoop.apache.org by Shi Yu <sh...@uchicago.edu> on 2010/10/02 18:31:01 UTC

Total input paths number and output

Hi,

I am running some code on a cluster with several nodes (ranging from 1
to 30) using hadoop-0.19.2. In a test, I put only a single file under
the input folder; however, each time I find the logged "total input
paths to process" is 2 (not 1).

INFO mapred.FileInputFormat: Total input paths to process : 2

The obtained results are two identical output files, one named -00000 and
the other -00001. Nothing is really wrong, but why are there 2 inputs
and 2 outputs? I also tried reducing the cluster to 1 node (removing all
the nodes in the conf/slaves file) and changing the dfs.replication
property in the xml file to 1, but with no effect. I tried different
input formats; they all behave the same. Where could I find the parameter
that controls this? Thanks.
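
(To check what FileInputFormat will actually see, one quick sanity test
is to list the input directory; a sketch, with a placeholder path:

    hadoop fs -ls /user/yourname/input

If the listing shows two entries, the two "input paths" are explained.)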
-- 

Shi


Re: Total input paths number and output

Posted by Shi Yu <sh...@uchicago.edu>.
Sorry, not "primal number" but "the prime number".

Shi

On 2010-10-2 14:56, Shi Yu wrote:
> Hi Harsh,
>
> I found the bug in my code; I had two buggy lines:
>
>         FileInputFormat.addInputPath(conf, new Path(args[0]));
>         FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>
> after the line
>        JobConf conf = 
> JobBuilder.parseInputAndOutput(this,getConf(),args);
>
> Thus the input and output paths were added twice.
>
> If I set the reduce task number higher than 1, why is it best for it to
> be a "primal number"? That's interesting; could you explain or link to
> any web page? Is there any theory behind that?
>
> Thanks again for your kind help. I look forward to your interesting
> comment on the "primal number".
>
> Best Regards,
>
> Shi
>
> On 2010-10-2 13:50, Harsh J wrote:
>> On Sat, Oct 2, 2010 at 11:35 PM, Shi Yu<sh...@uchicago.edu>  wrote:
>>> On 2010-10-2 12:01, Harsh J wrote:
>>>> The mapred.min.split.size and minimum map tasks properties of Hadoop
>>>> MR also control the splitting of input for map tasks.
>>>>
>>>> On Oct 2, 2010 10:28 PM, "Harsh J"<qw...@gmail.com>    wrote:
>>>>
>>>> Outputs are not dependent on the number of inputs, but instead on the
>>>> number of reducers (if MapReduce) or the number of input splits if
>>>> just plain maps.
>>>>
>>>> The number of splits is determined in most cases by the input file
>>>> sizes and the HDFS block size (dfs.block.size) the files were created
>>>> under.
>>>>
>>>>
>>>>
>>>>> On Oct 2, 2010 10:01 PM, "Shi Yu"<sh...@uchicago.edu>    wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am running some cod...
>>>>>
>>>>
>>> Hi Harsh,
>>>
>>> Thanks for the answer. I understand what you have said. However, I was
>>> trying to see the effect in an experiment. For example, I use the exact
>>> same input (a 13M file) and try the simple WordCount example. I would
>>> like to see whether my configuration could change the number that
>>> appears in the log. The configuration in my main function is as follows:
>>>
>>>           JobConf conf = new JobConf(WordCount.class);
>>>           conf.setJobName("wordcount");
>>>           conf.setOutputKeyClass(Text.class);
>>>           conf.setOutputValueClass(IntWritable.class);
>>>           conf.setMapperClass(Map.class);
>>>           conf.setCombinerClass(Reduce.class);
>>>           conf.setReducerClass(Reduce.class);
>>>           conf.setMapOutputKeyClass(Text.class);
>>>           conf.setMapOutputValueClass(IntWritable.class);
>>>           conf.setInputFormat(ZipInputFormat.class);
>>>           conf.setInt("mapred.min.split.size",2);
>> The property "mapred.min.split.size" takes its value in bytes. Some
>> input formats have their own splitting techniques, so be aware that
>> it is not an enforced setting.
>>>           conf.setNumMapTasks(3);
>> For information's sake: by default, mapred.map.tasks is set to 2 in
>> Hadoop MR. It is considered a hint, since the input size/files
>> determine the number of required maps, but with less data it still
>> runs the minimum set number of maps (in order to use your cluster or
>> machine efficiently, I suppose).
>>> In the last two lines (mapred.min.split.size and setNumMapTasks) I set
>>> different values, from 2 to 10, but the log is always
>>>
>>> INFO mapred.FileInputFormat: Total input paths to process : 1
>>>
>>>
>>> Then I change to my real code using the exact same input, and set
>>>       conf.setNumMapTasks(1);
>>>       conf.setNumReduceTasks(1);
>>>
>>> The log shows
>>> INFO mapred.FileInputFormat: Total input paths to process : 2
>> I find this odd; FileInputFormat reports only the number of paths it
>> has to process under the directory it is given. If you specify a file
>> directly, it should not report 2.
>>
>> Unless it's the doing of the ZipInputFormat, wherein (I assume) it
>> reports the number of files inside the zip file?
>>> What's wrong? Why can't I see the direct effect of my settings? The
>>> input file is 13M, so it is smaller than the default block size of
>>> 64M. I leave the block size setting at its default.
>>>
>>> Thanks.
>>>
>>> Best Regards,
>>>
>>> Shi
>>>
>>>
>> I was replying from a mobile device earlier so couldn't be very clear,
>> apologies.
>>
>> What you're asking for is a way to control the number of outputs,
>> correct? Neither the number of input paths detected nor the number of
>> maps launched determines the number of final output files for jobs
>> that have a Reduce phase.
>>
>> If you want a single-file output, you'd set job.setNumReduceTasks(1);
>> and so on for as many as you need. Usually the property
>> mapred.reduce.tasks (which the above method sets anyway) is set to a
>> prime number nearest to the number of tasktracker nodes. Although it
>> is not a necessity to do so, it helps parallelize the operation in a
>> neat manner.
>>
>> About controlling the input split behavior, it depends on the
>> InputFormat derivative you are using. FileInputFormats generate a
>> minimum of n splits for n files, but may run n+m mappers based on the
>> factoring of the files as per the block size (or mapred.min.split.size,
>> if set to a valid number other than 0, as it works with FIF). But yes,
>> the "Total input paths to process" message it logs is basically the
>> size of the array of files it found valid under the path or list of
>> paths you supplied (FIF ignores . and _ prefixes, if I am right, and
>> doesn't count a directory).
>>
>> Are you sure that the directory which you are passing to FIF has only
>> one file under it? Or perhaps the ZipInputFormat has its own
>> path-listing techniques?
>>
>
>


-- 
Postdoctoral Scholar
Institute for Genomics and Systems Biology
Department of Medicine, the University of Chicago
Knapp Center for Biomedical Discovery
900 E. 57th St. Room 10148
Chicago, IL 60637, US
Tel: 773-702-6799


Re: Total input paths number and output

Posted by Shi Yu <sh...@uchicago.edu>.
Hi Harsh,

I found the bug in my code; I had two buggy lines:

         FileInputFormat.addInputPath(conf, new Path(args[0]));
         FileOutputFormat.setOutputPath(conf, new Path(args[1]));

after the line
        JobConf conf = JobBuilder.parseInputAndOutput(this,getConf(),args);

Thus the input and output paths were added twice.
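
For reference, a sketch of the corrected driver (this assumes, as in my
code, that JobBuilder.parseInputAndOutput() wires up args[0]/args[1] as
the input/output paths and returns null on a usage error):

         public int run(String[] args) throws Exception {
             // JobBuilder already calls FileInputFormat.addInputPath() and
             // FileOutputFormat.setOutputPath() internally, so the driver
             // must not add the paths a second time.
             JobConf conf = JobBuilder.parseInputAndOutput(this, getConf(), args);
             if (conf == null) {
                 return -1;   // usage error
             }
             // ... set mapper/reducer and key/value classes here ...
             JobClient.runJob(conf);
             return 0;
         }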

If I set the reduce task number higher than 1, why is it best for it to be
a "primal number"? That's interesting; could you explain or link to any web
page? Is there any theory behind that?

Thanks again for your kind help. I look forward to your interesting
comment on the "primal number".

Best Regards,

Shi

On 2010-10-2 13:50, Harsh J wrote:
> On Sat, Oct 2, 2010 at 11:35 PM, Shi Yu<sh...@uchicago.edu>  wrote:
>    
>> On 2010-10-2 12:01, Harsh J wrote:
>>      
>>> The mapred.min.split.size and minimum map tasks properties of Hadoop MR
>>> also control the splitting of input for map tasks.
>>>
>>> On Oct 2, 2010 10:28 PM, "Harsh J"<qw...@gmail.com>    wrote:
>>>
>>> Outputs are not dependent on the number of inputs, but instead on the
>>> number of reducers (if MapReduce) or the number of input splits if just
>>> plain maps.
>>>
>>> The number of splits is determined in most cases by the input file sizes
>>> and the HDFS block size (dfs.block.size) the files were created under.
>>>
>>>
>>>
>>>        
>>>> On Oct 2, 2010 10:01 PM, "Shi Yu"<sh...@uchicago.edu>    wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am running some cod...
>>>>
>>>>          
>>>
>>>        
>> Hi Harsh,
>>
>> Thanks for the answer. I understand what you have said. However, I was
>> trying to see the effect in an experiment. For example, I use the exact
>> same input (a 13M file) and try the simple WordCount example. I would
>> like to see whether my configuration could change the number that appears
>> in the log. The configuration in my main function is as follows:
>>
>>           JobConf conf = new JobConf(WordCount.class);
>>           conf.setJobName("wordcount");
>>           conf.setOutputKeyClass(Text.class);
>>           conf.setOutputValueClass(IntWritable.class);
>>           conf.setMapperClass(Map.class);
>>           conf.setCombinerClass(Reduce.class);
>>           conf.setReducerClass(Reduce.class);
>>           conf.setMapOutputKeyClass(Text.class);
>>           conf.setMapOutputValueClass(IntWritable.class);
>>           conf.setInputFormat(ZipInputFormat.class);
>>           conf.setInt("mapred.min.split.size",2);
>>      
> The property "mapred.min.split.size" takes its value in bytes. Some
> input formats have their own splitting techniques, so be aware that
> it is not an enforced setting.
>    
>>           conf.setNumMapTasks(3);
>>      
> For information's sake: by default, mapred.map.tasks is set to 2 in
> Hadoop MR. It is considered a hint, since the input size/files
> determine the number of required maps, but with less data it still
> runs the minimum set number of maps (in order to use your cluster or
> machine efficiently, I suppose).
>    
>> In the last two lines (mapred.min.split.size and setNumMapTasks) I set
>> different values, from 2 to 10, but the log is always
>>
>> INFO mapred.FileInputFormat: Total input paths to process : 1
>>
>>
>> Then I change to my real code using the exact same input, and set
>>       conf.setNumMapTasks(1);
>>       conf.setNumReduceTasks(1);
>>
>> The log shows
>> INFO mapred.FileInputFormat: Total input paths to process : 2
>>      
> I find this odd; FileInputFormat reports only the number of paths it
> has to process under the directory it is given. If you specify a file
> directly, it should not report 2.
>
> Unless it's the doing of the ZipInputFormat, wherein (I assume) it
> reports the number of files inside the zip file?
>    
>> What's wrong? Why can't I see the direct effect of my settings? The input
>> file is 13M, so it is smaller than the default block size of 64M. I leave
>> the block size setting at its default.
>>
>> Thanks.
>>
>> Best Regards,
>>
>> Shi
>>
>>
>>      
> I was replying from a mobile device earlier so couldn't be very clear,
> apologies.
>
> What you're asking for is a way to control the number of outputs,
> correct? Neither the number of input paths detected nor the number of
> maps launched determines the number of final output files for jobs
> that have a Reduce phase.
>
> If you want a single-file output, you'd set job.setNumReduceTasks(1);
> and so on for as many as you need. Usually the property
> mapred.reduce.tasks (which the above method sets anyway) is set to a
> prime number nearest to the number of tasktracker nodes. Although it
> is not a necessity to do so, it helps parallelize the operation in a
> neat manner.
>
> About controlling the input split behavior, it depends on the
> InputFormat derivative you are using. FileInputFormats generate a
> minimum of n splits for n files, but may run n+m mappers based on the
> factoring of the files as per the block size (or mapred.min.split.size,
> if set to a valid number other than 0, as it works with FIF). But yes,
> the "Total input paths to process" message it logs is basically the
> size of the array of files it found valid under the path or list of
> paths you supplied (FIF ignores . and _ prefixes, if I am right, and
> doesn't count a directory).
>
> Are you sure that the directory which you are passing to FIF has only
> one file under it? Or perhaps the ZipInputFormat has its own
> path-listing techniques?
>
>    


-- 
Postdoctoral Scholar
Institute for Genomics and Systems Biology
Department of Medicine, the University of Chicago
Knapp Center for Biomedical Discovery
900 E. 57th St. Room 10148
Chicago, IL 60637, US
Tel: 773-702-6799


Re: Total input paths number and output

Posted by Harsh J <qw...@gmail.com>.
On Sat, Oct 2, 2010 at 11:35 PM, Shi Yu <sh...@uchicago.edu> wrote:
> On 2010-10-2 12:01, Harsh J wrote:
>>
>> The mapred.min.split.size and minimum map tasks properties of Hadoop MR
>> also control the splitting of input for map tasks.
>>
>> On Oct 2, 2010 10:28 PM, "Harsh J"<qw...@gmail.com>  wrote:
>>
>> Outputs are not dependent on the number of inputs, but instead on the
>> number of reducers (if MapReduce) or the number of input splits if just
>> plain maps.
>>
>> The number of splits is determined in most cases by the input file sizes
>> and the HDFS block size (dfs.block.size) the files were created under.
>>
>>
>>
>>>
>>> On Oct 2, 2010 10:01 PM, "Shi Yu"<sh...@uchicago.edu>  wrote:
>>>
>>> Hi,
>>>
>>> I am running some cod...
>>>
>>
>>
>
> Hi Harsh,
>
> Thanks for the answer. I understand what you have said. However, I was
> trying to see the effect in an experiment. For example, I use the exact
> same input (a 13M file) and try the simple WordCount example. I would
> like to see whether my configuration could change the number that appears
> in the log. The configuration in my main function is as follows:
>
>          JobConf conf = new JobConf(WordCount.class);
>          conf.setJobName("wordcount");
>          conf.setOutputKeyClass(Text.class);
>          conf.setOutputValueClass(IntWritable.class);
>          conf.setMapperClass(Map.class);
>          conf.setCombinerClass(Reduce.class);
>          conf.setReducerClass(Reduce.class);
>          conf.setMapOutputKeyClass(Text.class);
>          conf.setMapOutputValueClass(IntWritable.class);
>          conf.setInputFormat(ZipInputFormat.class);
>          conf.setInt("mapred.min.split.size",2);
The property "mapred.min.split.size" takes its value in bytes. Some
input formats have their own splitting techniques, so be aware that
it is not an enforced setting.
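
For instance, a sketch of a meaningful value (the 32 MB figure is an
arbitrary example, not a recommendation):

    // The value is in bytes, so setInt("mapred.min.split.size", 2)
    // asks for a 2-byte minimum and effectively changes nothing.
    conf.setLong("mapred.min.split.size", 32L * 1024 * 1024);  // 32 MB
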
>          conf.setNumMapTasks(3);
For information's sake: by default, mapred.map.tasks is set to 2 in
Hadoop MR. It is considered a hint, since the input size/files
determine the number of required maps, but with less data it still
runs the minimum set number of maps (in order to use your cluster or
machine efficiently, I suppose).
>
> In the last two lines (mapred.min.split.size and setNumMapTasks) I set
> different values, from 2 to 10, but the log is always
>
> INFO mapred.FileInputFormat: Total input paths to process : 1
>
>
> Then I change to my real code using the exact same input, and set
>      conf.setNumMapTasks(1);
>      conf.setNumReduceTasks(1);
>
> The log shows
> INFO mapred.FileInputFormat: Total input paths to process : 2
I find this odd; FileInputFormat reports only the number of paths it
has to process under the directory it is given. If you specify a file
directly, it should not report 2.

Unless it's the doing of the ZipInputFormat, wherein (I assume) it
reports the number of files inside the zip file?
>
> What's wrong? Why can't I see the direct effect of my settings? The input
> file is 13M, so it is smaller than the default block size of 64M. I leave
> the block size setting at its default.
>
> Thanks.
>
> Best Regards,
>
> Shi
>
>

I was replying from a mobile device earlier so couldn't be very clear,
apologies.

What you're asking for is a way to control the number of outputs,
correct? Neither the number of input paths detected nor the number of
maps launched determines the number of final output files for jobs
that have a Reduce phase.
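
Concretely (assuming the stock file output format): a job with N reduce
tasks writes one file per reducer, named part-00000, part-00001, and so
on, no matter how many input paths or maps there were.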

If you want a single-file output, you'd set job.setNumReduceTasks(1);
and so on for as many as you need. Usually the property
mapred.reduce.tasks (which the above method sets anyway) is set to a
prime number nearest to the number of tasktracker nodes. Although it
is not a necessity to do so, it helps parallelize the operation in a
neat manner.
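
If you'd rather keep several reducers for speed and still end up with a
single file, one option is to merge the part files afterwards; a sketch,
with placeholder paths:

    hadoop fs -getmerge /path/to/job/output /tmp/merged-output.txt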

About controlling the input split behavior, it depends on the
InputFormat derivative you are using. FileInputFormats generate a
minimum of n splits for n files, but may run n+m mappers based on the
factoring of the files as per the block size (or mapred.min.split.size,
if set to a valid number other than 0, as it works with FIF). But yes,
the "Total input paths to process" message it logs is basically the
size of the array of files it found valid under the path or list of
paths you supplied (FIF ignores . and _ prefixes, if I am right, and
doesn't count a directory).
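
(From memory, so treat it as a sketch rather than gospel: the per-file
split size in FileInputFormat works out roughly as

    // goalSize = totalInputSize / requestedNumMapTasks
    long splitSize = Math.max(minSplitSize, Math.min(goalSize, blockSize));

which is why a 13M file under a 64M block size lands in a single split
no matter how small mapred.min.split.size is set.)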

Are you sure that the directory which you are passing to FIF has only
one file under it? Or perhaps the ZipInputFormat has its own
path-listing techniques?

-- 
Harsh J
www.harshj.com

Re: Total input paths number and output

Posted by Shi Yu <sh...@uchicago.edu>.
On 2010-10-2 12:01, Harsh J wrote:
> The mapred.min.split.size and minimum map tasks properties of Hadoop MR
> also control the splitting of input for map tasks.
>
> On Oct 2, 2010 10:28 PM, "Harsh J"<qw...@gmail.com>  wrote:
>
> Outputs are not dependent on the number of inputs, but instead on the
> number of reducers (if MapReduce) or the number of input splits if just
> plain maps.
>
> The number of splits is determined in most cases by the input file sizes
> and the HDFS block size (dfs.block.size) the files were created under.
>
>
>    
>> On Oct 2, 2010 10:01 PM, "Shi Yu"<sh...@uchicago.edu>  wrote:
>>
>> Hi,
>>
>> I am running some cod...
>>      
>    

Hi Harsh,

Thanks for the answer. I understand what you have said. However, I was
trying to see the effect in an experiment. For example, I use the exact
same input (a 13M file) and try the simple WordCount example. I would
like to see whether my configuration could change the number that appears
in the log. The configuration in my main function is as follows:

           JobConf conf = new JobConf(WordCount.class);
           conf.setJobName("wordcount");
           conf.setOutputKeyClass(Text.class);
           conf.setOutputValueClass(IntWritable.class);
           conf.setMapperClass(Map.class);
           conf.setCombinerClass(Reduce.class);
           conf.setReducerClass(Reduce.class);
           conf.setMapOutputKeyClass(Text.class);
           conf.setMapOutputValueClass(IntWritable.class);
           conf.setInputFormat(ZipInputFormat.class);
           conf.setInt("mapred.min.split.size",2);
           conf.setNumMapTasks(3);

In the last two lines (mapred.min.split.size and setNumMapTasks) I set
different values, from 2 to 10, but the log is always

INFO mapred.FileInputFormat: Total input paths to process : 1


Then I change to my real code using the exact same input, and set
       conf.setNumMapTasks(1);
       conf.setNumReduceTasks(1);

The log shows
INFO mapred.FileInputFormat: Total input paths to process : 2

What's wrong? Why can't I see the direct effect of my settings? The input
file is 13M, so it is smaller than the default block size of 64M. I leave
the block size setting at its default.

Thanks.

Best Regards,

Shi
  


Re: Total input paths number and output

Posted by Harsh J <qw...@gmail.com>.
The mapred.min.split.size and minimum map tasks properties of Hadoop MR
also control the splitting of input for map tasks.

On Oct 2, 2010 10:28 PM, "Harsh J" <qw...@gmail.com> wrote:

Outputs are not dependent on the number of inputs, but instead on the
number of reducers (if MapReduce) or the number of input splits if just
plain maps.

The number of splits is determined in most cases by the input file sizes
and the HDFS block size (dfs.block.size) the files were created under.


>
> On Oct 2, 2010 10:01 PM, "Shi Yu" <sh...@uchicago.edu> wrote:
>
> Hi,
>
> I am running some cod...

Re: Total input paths number and output

Posted by Harsh J <qw...@gmail.com>.
Outputs are not dependent on the number of inputs, but instead on the
number of reducers (if MapReduce) or the number of input splits if just
plain maps.

The number of splits is determined in most cases by the input file sizes
and the HDFS block size (dfs.block.size) the files were created under.
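
A rough worked example, assuming stock FileInputFormat splitting and the
default 64M block size:

    13M file,  64M blocks  ->  1 split,  1 map
    200M file, 64M blocks  ->  4 splits, 4 maps  (roughly ceil(200/64))

So an input file smaller than one block should normally produce exactly
one map.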

On Oct 2, 2010 10:01 PM, "Shi Yu" <sh...@uchicago.edu> wrote:

Hi,

I am running some code on a cluster with several nodes (ranging from 1 to
30) using hadoop-0.19.2. In a test, I put only a single file under the
input folder; however, each time I find the logged "total input paths to
process" is 2 (not 1).

INFO mapred.FileInputFormat: Total input paths to process : 2

The obtained results are two identical output files, one named -00000 and
the other -00001. Nothing is really wrong, but why are there 2 inputs and
2 outputs? I also tried reducing the cluster to 1 node (removing all the
nodes in the conf/slaves file) and changing the dfs.replication property
in the xml file to 1, but with no effect. I tried different input formats;
they all behave the same. Where could I find the parameter that controls
this? Thanks.
-- 

Shi