Posted to mapreduce-user@hadoop.apache.org by "Agarwal, Nikhil" <Ni...@netapp.com> on 2013/05/13 09:20:21 UTC

How to combine input files for a MapReduce job

Hi,

I have a 3-node cluster, with the JobTracker running on one machine and TaskTrackers on the other two. Instead of using HDFS, I have written my own FileSystem implementation. As an experiment, I kept 1000 text files (all of the same size) on both slave nodes and ran a simple WordCount MR job. It took around 50 minutes to complete. Afterwards, I concatenated all 1000 files into a single file and ran the same WordCount MR job; it took 35 seconds. From the JobTracker UI I could make out that the problem is the number of mappers the JobTracker creates: for 1000 files it creates 1000 maps, and for 1 file it creates 1 map (irrespective of file size).

Thus, is there a way to reduce the number of mappers? That is, can I control the number of mappers through some configuration parameter so that Hadoop clubs files together until it reaches some specified size (say, 64 MB) and then creates 1 map per 64 MB block?

Also, I wanted to know how to see which file is being submitted to which TaskTracker, or, if that is not possible, how to check whether any data transfer happens between my slave nodes during an MR job.

Sorry for so many questions, and thank you for your time.

Regards,
Nikhil

RE: How to combine input files for a MapReduce job

Posted by "Agarwal, Nikhil" <Ni...@netapp.com>.
Hi,

I got it. The log info is printed in the userlogs folder on the slave nodes, in the file named syslog.

Thanks,
Nikhil

-----Original Message-----
From: Agarwal, Nikhil 
Sent: Monday, May 13, 2013 4:10 PM
To: 'user@hadoop.apache.org'
Subject: RE: How to combine input files for a MapReduce job

Hi Harsh,

I applied the changes of the patch to the Hadoop source code, but can you please tell me exactly where this log is being printed? I checked the log files of the JobTracker and TaskTracker but it is not there. It is also not printed in the _logs folder created inside the output directory of the MR job.

Regards,
Nikhil 

-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com]
Sent: Monday, May 13, 2013 1:28 PM
To: <us...@hadoop.apache.org>
Subject: Re: How to combine input files for a MapReduce job

Yes I believe the branch-1 patch attached there should apply cleanly to 1.0.4.

On Mon, May 13, 2013 at 1:25 PM, Agarwal, Nikhil <Ni...@netapp.com> wrote:
> Hi,
>
> @Harsh: Thanks for the reply. Would the patch work in Hadoop 1.0.4 release?
>
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: Monday, May 13, 2013 1:03 PM
> To: <us...@hadoop.apache.org>
> Subject: Re: How to combine input files for a MapReduce job
>
> For "control number of mappers" question: You can use 
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib
> /CombineFileInputFormat.html which is designed to solve similar cases. 
> However, you cannot beat the speed you get out of a single large file (or a few large files), as you'll still have file open/close overheads which will bog you down.
>
> For "which file is being submitted to which" question: Having
> https://issues.apache.org/jira/browse/MAPREDUCE-3678 in the version/distribution of Apache Hadoop you use would help.
>
> On Mon, May 13, 2013 at 12:50 PM, Agarwal, Nikhil <Ni...@netapp.com> wrote:
>> Hi,
>>
>>
>>
>> I  have a 3-node cluster, with JobTracker running on one machine and 
>> TaskTrackers on other two. Instead of using HDFS, I have written my 
>> own FileSystem implementation. As an experiment, I kept 1000 text 
>> files (all of same size) on both the slave nodes and ran a simple 
>> Wordcount MR job. It took around 50 mins to complete the task.
>> Afterwards, I concatenated all the
>> 1000 files into a single file and then ran a Wordcount MR job, it 
>> took
>> 35 secs. From the JobTracker UI I could make out that the problem is 
>> because of the number of mappers that JobTracker is creating. For
>> 1000 files it creates
>> 1000 maps and for 1 file it creates 1 map (irrespective of file size).
>>
>>
>>
>> Thus, is there a way to reduce the number of mappers i.e. can I 
>> control the number of mappers through some configuration parameter so 
>> that Hadoop would club all the files until it reaches some specified 
>> size (say, 64 MB) and then make 1 map per 64 MB block?
>>
>>
>>
>> Also, I wanted to know how to see which file is being submitted to 
>> which TaskTracker or if that is not possible then how do I check if 
>> some data transfer is happening in between my slave nodes during a MR job?
>>
>>
>>
>> Sorry for so many questions and Thank you for your time.
>>
>>
>>
>> Regards,
>>
>> Nikhil
>
>
>
> --
> Harsh J



--
Harsh J

RE: How to combine input files for a MapReduce job

Posted by "Agarwal, Nikhil" <Ni...@netapp.com>.
Hi Harsh,

I applied the changes of the patch to the Hadoop source code, but can you please tell me exactly where this log is being printed? I checked the log files of the JobTracker and TaskTracker but it is not there. It is also not printed in the _logs folder created inside the output directory of the MR job.

Regards,
Nikhil 

-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com] 
Sent: Monday, May 13, 2013 1:28 PM
To: <us...@hadoop.apache.org>
Subject: Re: How to combine input files for a MapReduce job

Yes I believe the branch-1 patch attached there should apply cleanly to 1.0.4.

On Mon, May 13, 2013 at 1:25 PM, Agarwal, Nikhil <Ni...@netapp.com> wrote:
> Hi,
>
> @Harsh: Thanks for the reply. Would the patch work in Hadoop 1.0.4 release?
>
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: Monday, May 13, 2013 1:03 PM
> To: <us...@hadoop.apache.org>
> Subject: Re: How to combine input files for a MapReduce job
>
> For "control number of mappers" question: You can use 
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib
> /CombineFileInputFormat.html which is designed to solve similar cases. 
> However, you cannot beat the speed you get out of a single large file (or a few large files), as you'll still have file open/close overheads which will bog you down.
>
> For "which file is being submitted to which" question: Having
> https://issues.apache.org/jira/browse/MAPREDUCE-3678 in the version/distribution of Apache Hadoop you use would help.
>
> On Mon, May 13, 2013 at 12:50 PM, Agarwal, Nikhil <Ni...@netapp.com> wrote:
>> Hi,
>>
>>
>>
>> I  have a 3-node cluster, with JobTracker running on one machine and 
>> TaskTrackers on other two. Instead of using HDFS, I have written my 
>> own FileSystem implementation. As an experiment, I kept 1000 text 
>> files (all of same size) on both the slave nodes and ran a simple 
>> Wordcount MR job. It took around 50 mins to complete the task.
>> Afterwards, I concatenated all the
>> 1000 files into a single file and then ran a Wordcount MR job, it 
>> took
>> 35 secs. From the JobTracker UI I could make out that the problem is 
>> because of the number of mappers that JobTracker is creating. For 
>> 1000 files it creates
>> 1000 maps and for 1 file it creates 1 map (irrespective of file size).
>>
>>
>>
>> Thus, is there a way to reduce the number of mappers i.e. can I 
>> control the number of mappers through some configuration parameter so 
>> that Hadoop would club all the files until it reaches some specified 
>> size (say, 64 MB) and then make 1 map per 64 MB block?
>>
>>
>>
>> Also, I wanted to know how to see which file is being submitted to 
>> which TaskTracker or if that is not possible then how do I check if 
>> some data transfer is happening in between my slave nodes during a MR job?
>>
>>
>>
>> Sorry for so many questions and Thank you for your time.
>>
>>
>>
>> Regards,
>>
>> Nikhil
>
>
>
> --
> Harsh J



--
Harsh J

Re: How to combine input files for a MapReduce job

Posted by Harsh J <ha...@cloudera.com>.
Yes I believe the branch-1 patch attached there should apply cleanly to 1.0.4.

On Mon, May 13, 2013 at 1:25 PM, Agarwal, Nikhil
<Ni...@netapp.com> wrote:
> Hi,
>
> @Harsh: Thanks for the reply. Would the patch work in Hadoop 1.0.4 release?
>
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: Monday, May 13, 2013 1:03 PM
> To: <us...@hadoop.apache.org>
> Subject: Re: How to combine input files for a MapReduce job
>
> For "control number of mappers" question: You can use http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html
> which is designed to solve similar cases. However, you cannot beat the speed you get out of a single large file (or a few large files), as you'll still have file open/close overheads which will bog you down.
>
> For "which file is being submitted to which" question: Having
> https://issues.apache.org/jira/browse/MAPREDUCE-3678 in the version/distribution of Apache Hadoop you use would help.
>
> On Mon, May 13, 2013 at 12:50 PM, Agarwal, Nikhil <Ni...@netapp.com> wrote:
>> Hi,
>>
>>
>>
>> I  have a 3-node cluster, with JobTracker running on one machine and
>> TaskTrackers on other two. Instead of using HDFS, I have written my
>> own FileSystem implementation. As an experiment, I kept 1000 text
>> files (all of same size) on both the slave nodes and ran a simple
>> Wordcount MR job. It took around 50 mins to complete the task.
>> Afterwards, I concatenated all the
>> 1000 files into a single file and then ran a Wordcount MR job, it took
>> 35 secs. From the JobTracker UI I could make out that the problem is
>> because of the number of mappers that JobTracker is creating. For 1000
>> files it creates
>> 1000 maps and for 1 file it creates 1 map (irrespective of file size).
>>
>>
>>
>> Thus, is there a way to reduce the number of mappers i.e. can I
>> control the number of mappers through some configuration parameter so
>> that Hadoop would club all the files until it reaches some specified
>> size (say, 64 MB) and then make 1 map per 64 MB block?
>>
>>
>>
>> Also, I wanted to know how to see which file is being submitted to
>> which TaskTracker or if that is not possible then how do I check if
>> some data transfer is happening in between my slave nodes during a MR job?
>>
>>
>>
>> Sorry for so many questions and Thank you for your time.
>>
>>
>>
>> Regards,
>>
>> Nikhil
>
>
>
> --
> Harsh J



-- 
Harsh J

RE: How to combine input files for a MapReduce job

Posted by "Agarwal, Nikhil" <Ni...@netapp.com>.
Hi,

@Harsh: Thanks for the reply. Would the patch work in Hadoop 1.0.4 release?

-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com] 
Sent: Monday, May 13, 2013 1:03 PM
To: <us...@hadoop.apache.org>
Subject: Re: How to combine input files for a MapReduce job

For "control number of mappers" question: You can use http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html
which is designed to solve similar cases. However, you cannot beat the speed you get out of a single large file (or a few large files), as you'll still have file open/close overheads which will bog you down.

For "which file is being submitted to which" question: Having
https://issues.apache.org/jira/browse/MAPREDUCE-3678 in the version/distribution of Apache Hadoop you use would help.

On Mon, May 13, 2013 at 12:50 PM, Agarwal, Nikhil <Ni...@netapp.com> wrote:
> Hi,
>
>
>
> I  have a 3-node cluster, with JobTracker running on one machine and 
> TaskTrackers on other two. Instead of using HDFS, I have written my 
> own FileSystem implementation. As an experiment, I kept 1000 text 
> files (all of same size) on both the slave nodes and ran a simple 
> Wordcount MR job. It took around 50 mins to complete the task. 
> Afterwards, I concatenated all the
> 1000 files into a single file and then ran a Wordcount MR job, it took 
> 35 secs. From the JobTracker UI I could make out that the problem is 
> because of the number of mappers that JobTracker is creating. For 1000 
> files it creates
> 1000 maps and for 1 file it creates 1 map (irrespective of file size).
>
>
>
> Thus, is there a way to reduce the number of mappers i.e. can I 
> control the number of mappers through some configuration parameter so 
> that Hadoop would club all the files until it reaches some specified 
> size (say, 64 MB) and then make 1 map per 64 MB block?
>
>
>
> Also, I wanted to know how to see which file is being submitted to 
> which TaskTracker or if that is not possible then how do I check if 
> some data transfer is happening in between my slave nodes during a MR job?
>
>
>
> Sorry for so many questions and Thank you for your time.
>
>
>
> Regards,
>
> Nikhil



--
Harsh J

Re: How to combine input files for a MapReduce job

Posted by Harsh J <ha...@cloudera.com>.
For "control number of mappers" question: You can use
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html
which is designed to solve similar cases. However, you cannot beat the
speed you get out of a single large file (or a few large files), as
you'll still have file open/close overheads which will bog you down.
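
Roughly, a minimal old-API (mapred) sketch for Hadoop 1.x could look like the following. The class names CombinedTextInputFormat and SingleFileLineReader are illustrative (not something Hadoop ships, and not from this thread); the constructor signatures follow the 1.x javadocs, so verify them against your build before relying on this.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

// Packs many small text files into a few splits; each map task then reads
// several files instead of exactly one.
public class CombinedTextInputFormat
    extends CombineFileInputFormat<LongWritable, Text> {

  @Override
  @SuppressWarnings("unchecked")
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    // CombineFileRecordReader walks the files of the combined split and
    // instantiates one SingleFileLineReader per file.
    return new CombineFileRecordReader<LongWritable, Text>(
        job, (CombineFileSplit) split, reporter,
        (Class) SingleFileLineReader.class);
  }

  // Reads one file of a CombineFileSplit with the standard line reader.
  // CombineFileRecordReader constructs it reflectively with
  // (CombineFileSplit, Configuration, Reporter, Integer index).
  public static class SingleFileLineReader extends LineRecordReader {
    public SingleFileLineReader(CombineFileSplit split, Configuration conf,
                                Reporter reporter, Integer index)
        throws IOException {
      super(conf, new FileSplit(split.getPath(index),
                                split.getOffset(index),
                                split.getLength(index),
                                split.getLocations()));
    }
  }
}

Later Hadoop releases ship a ready-made CombineTextInputFormat, but on 1.0.4 a small subclass like this is needed. A driver-side sketch that wires this format together with the split-size settings appears further down in the thread, after the reply about tweaking split sizes.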

For "which file is being submitted to which" question: Having
https://issues.apache.org/jira/browse/MAPREDUCE-3678 in the
version/distribution of Apache Hadoop you use would help.

On Mon, May 13, 2013 at 12:50 PM, Agarwal, Nikhil
<Ni...@netapp.com> wrote:
> Hi,
>
>
>
> I  have a 3-node cluster, with JobTracker running on one machine and
> TaskTrackers on other two. Instead of using HDFS, I have written my own
> FileSystem implementation. As an experiment, I kept 1000 text files (all of
> same size) on both the slave nodes and ran a simple Wordcount MR job. It
> took around 50 mins to complete the task. Afterwards, I concatenated all the
> 1000 files into a single file and then ran a Wordcount MR job, it took 35
> secs. From the JobTracker UI I could make out that the problem is because of
> the number of mappers that JobTracker is creating. For 1000 files it creates
> 1000 maps and for 1 file it creates 1 map (irrespective of file size).
>
>
>
> Thus, is there a way to reduce the number of mappers i.e. can I control the
> number of mappers through some configuration parameter so that Hadoop would
> club all the files until it reaches some specified size (say, 64 MB) and
> then make 1 map per 64 MB block?
>
>
>
> Also, I wanted to know how to see which file is being submitted to which
> TaskTracker or if that is not possible then how do I check if some data
> transfer is happening in between my slave nodes during a MR job?
>
>
>
> Sorry for so many questions and Thank you for your time.
>
>
>
> Regards,
>
> Nikhil



--
Harsh J

Re: How to combine input files for a MapReduce job

Posted by Harsh J <ha...@cloudera.com>.
Shashwat,

Tweaking the split sizes affects the size of a single input split, not how
splits are packed together. Those settings can be used with the
CombineFileInputFormat to control the packed split sizes, but otherwise they
are of no use for "merging" the processing of several blocks across files
into the same map task.
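
As a hedged illustration (not from the thread), a WordCount driver on Hadoop 1.x might wire the split-size knobs together with the CombinedTextInputFormat subclass sketched earlier in this thread roughly as follows. WordCountMapper, WordCountReducer and the input/output paths are assumed placeholders, and the property names are taken from the 1.x documentation, so treat this as a sketch rather than a tested recipe.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class CombinedWordCount {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CombinedWordCount.class);
    conf.setJobName("wordcount-combined");

    // Pack many small files into roughly 64 MB splits instead of one map per file.
    conf.setInputFormat(CombinedTextInputFormat.class); // subclass sketched earlier
    conf.setLong("mapred.max.split.size", 64L * 1024 * 1024);
    // Optional: avoid tiny leftover splits per node/rack (names per the 1.x docs;
    // unrecognized property names are simply ignored).
    conf.setLong("mapred.min.split.size.per.node", 32L * 1024 * 1024);
    conf.setLong("mapred.min.split.size.per.rack", 32L * 1024 * 1024);

    conf.setMapperClass(WordCountMapper.class);   // placeholder mapper
    conf.setReducerClass(WordCountReducer.class); // placeholder reducer
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}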

On Mon, May 13, 2013 at 1:03 PM, shashwat shriparv
<dw...@gmail.com> wrote:
> Look into mapred.max.split.size mapred.min.split.size and number of mapper
> in mapred-site.xml
>
> Thanks & Regards
>
> ∞
>
> Shashwat Shriparv
>
>
>
> On Mon, May 13, 2013 at 12:50 PM, Agarwal, Nikhil
> <Ni...@netapp.com> wrote:
>>
>> Hi,
>>
>>
>>
>> I  have a 3-node cluster, with JobTracker running on one machine and
>> TaskTrackers on other two. Instead of using HDFS, I have written my own
>> FileSystem implementation. As an experiment, I kept 1000 text files (all of
>> same size) on both the slave nodes and ran a simple Wordcount MR job. It
>> took around 50 mins to complete the task. Afterwards, I concatenated all the
>> 1000 files into a single file and then ran a Wordcount MR job, it took 35
>> secs. From the JobTracker UI I could make out that the problem is because of
>> the number of mappers that JobTracker is creating. For 1000 files it creates
>> 1000 maps and for 1 file it creates 1 map (irrespective of file size).
>>
>>
>>
>> Thus, is there a way to reduce the number of mappers i.e. can I control
>> the number of mappers through some configuration parameter so that Hadoop
>> would club all the files until it reaches some specified size (say, 64 MB)
>> and then make 1 map per 64 MB block?
>>
>>
>>
>> Also, I wanted to know how to see which file is being submitted to which
>> TaskTracker or if that is not possible then how do I check if some data
>> transfer is happening in between my slave nodes during a MR job?
>>
>>
>>
>> Sorry for so many questions and Thank you for your time.
>>
>>
>>
>> Regards,
>>
>> Nikhil
>
>



-- 
Harsh J

Re: How to combine input files for a MapReduce job

Posted by shashwat shriparv <dw...@gmail.com>.
Look into mapred.max.split.size, mapred.min.split.size and the number of
mappers in mapred-site.xml.

Thanks & Regards

∞
Shashwat Shriparv



On Mon, May 13, 2013 at 12:50 PM, Agarwal, Nikhil <Nikhil.Agarwal@netapp.com
> wrote:

>  Hi,
>
> I  have a 3-node cluster, with JobTracker running on one machine and
> TaskTrackers on other two. Instead of using HDFS, I have written my own
> FileSystem implementation. As an experiment, I kept 1000 text files (all of
> same size) on both the slave nodes and ran a simple Wordcount MR job. It
> took around 50 mins to complete the task. Afterwards, I concatenated all
> the 1000 files into a single file and then ran a Wordcount MR job, it took
> 35 secs. From the JobTracker UI I could make out that the problem is
> because of the number of mappers that JobTracker is creating. For 1000
> files it creates 1000 maps and for 1 file it creates 1 map (irrespective of
> file size).
>
> Thus, is there a way to reduce the number of mappers i.e. can I control
> the number of mappers through some configuration parameter so that Hadoop
> would club all the files until it reaches some specified size (say, 64 MB)
> and then make 1 map per 64 MB block?
>
> Also, I wanted to know how to see which file is being submitted to which
> TaskTracker or if that is not possible then how do I check if some data
> transfer is happening in between my slave nodes during a MR job?
>
> Sorry for so many questions and Thank you for your time.
>
> Regards,
>
> Nikhil
>

Re: How to combine input files for a MapReduce job

Posted by Harsh J <ha...@cloudera.com>.
For "control number of mappers" question: You can use
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html
which is designed to solve similar cases. However, you cannot beat the
speed you get out of a single large file (or a few large files), as
you'll still have file open/close overheads which will bog you down.
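
As a rough sketch only (the class names are illustrative and plain
line-oriented text input is assumed), a subclass of the old-API
CombineFileInputFormat linked above could look like this:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

// Packs many small text files into fewer splits (old "mapred" API);
// each packed split is capped at roughly 64 MB here.
public class CombinedTextInputFormat
    extends CombineFileInputFormat<LongWritable, Text> {

  public CombinedTextInputFormat() {
    setMaxSplitSize(64L * 1024 * 1024);  // could also come from mapred.max.split.size
  }

  @SuppressWarnings({"unchecked", "rawtypes"})
  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new CombineFileRecordReader<LongWritable, Text>(
        job, (CombineFileSplit) split, reporter,
        (Class) TextRecordReaderWrapper.class);
  }

  // Reads one file of a CombineFileSplit by delegating to LineRecordReader.
  public static class TextRecordReaderWrapper
      implements RecordReader<LongWritable, Text> {

    private final LineRecordReader delegate;

    // CombineFileRecordReader instantiates this class reflectively with
    // exactly this constructor signature; 'idx' is the file to read.
    public TextRecordReaderWrapper(CombineFileSplit split, Configuration conf,
                                   Reporter reporter, Integer idx) throws IOException {
      FileSplit fileSplit = new FileSplit(split.getPath(idx),
          split.getOffset(idx), split.getLength(idx), split.getLocations());
      delegate = new LineRecordReader(conf, fileSplit);
    }

    public boolean next(LongWritable key, Text value) throws IOException {
      return delegate.next(key, value);
    }
    public LongWritable createKey() { return delegate.createKey(); }
    public Text createValue() { return delegate.createValue(); }
    public long getPos() throws IOException { return delegate.getPos(); }
    public float getProgress() throws IOException { return delegate.getProgress(); }
    public void close() throws IOException { delegate.close(); }
  }
}

The driver would then register it with something like
jobConf.setInputFormat(CombinedTextInputFormat.class).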

For "which file is being submitted to which" question: Having
https://issues.apache.org/jira/browse/MAPREDUCE-3678 in the
version/distribution of Apache Hadoop you use would help.

On Mon, May 13, 2013 at 12:50 PM, Agarwal, Nikhil
<Ni...@netapp.com> wrote:
> Hi,
>
>
>
> I  have a 3-node cluster, with JobTracker running on one machine and
> TaskTrackers on other two. Instead of using HDFS, I have written my own
> FileSystem implementation. As an experiment, I kept 1000 text files (all of
> same size) on both the slave nodes and ran a simple Wordcount MR job. It
> took around 50 mins to complete the task. Afterwards, I concatenated all the
> 1000 files into a single file and then ran a Wordcount MR job, it took 35
> secs. From the JobTracker UI I could make out that the problem is because of
> the number of mappers that JobTracker is creating. For 1000 files it creates
> 1000 maps and for 1 file it creates 1 map (irrespective of file size).
>
>
>
> Thus, is there a way to reduce the number of mappers i.e. can I control the
> number of mappers through some configuration parameter so that Hadoop would
> club all the files until it reaches some specified size (say, 64 MB) and
> then make 1 map per 64 MB block?
>
>
>
> Also, I wanted to know how to see which file is being submitted to which
> TaskTracker or if that is not possible then how do I check if some data
> transfer is happening in between my slave nodes during a MR job?
>
>
>
> Sorry for so many questions and Thank you for your time.
>
>
>
> Regards,
>
> Nikhil



--
Harsh J
