Posted to hdfs-user@hadoop.apache.org by Gaurav Dasgupta <gd...@gmail.com> on 2012/08/16 16:13:07 UTC

Number of Maps running more than expected

Hi users,

I am working on a CDH3 cluster of 12 nodes (Task Trackers running on all
the 12 nodes and 1 node running the Job Tracker).
In order to perform a WordCount benchmark test, I did the following:

   - Executed "RandomTextWriter" first to create 100 GB data (Note that I
   have changed the "test.randomtextwrite.total_bytes" parameter only, rest
   all are kept default).
   - Next, executed the "WordCount" program for that 100 GB dataset.

The "Block Size" in "hdfs-site.xml" is set as 128 MB. Now, according to my
calculation, total number of Maps to be executed by the wordcount job
should be 100 GB / 128 MB or 102400 MB / 128 MB = 800.
But when I execute the job, it runs a total of 900 Maps, i.e., 100 extra.
So why these extra Maps? The job does complete successfully without any
errors.

Again, if I don't execute the "RandomTextWriter" job to create the data for
my wordcount, and instead put my own 100 GB text file in HDFS and run
"WordCount", the number of Maps matches my calculation, i.e., 800.

Can anyone tell me why Hadoop behaves this oddly with the number of Maps for
WordCount only when the dataset is generated by RandomTextWriter? And what is
the purpose of these extra Maps?

Regards,
Gaurav Dasgupta
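
For reference, one way to check how many input splits (and therefore map
tasks) a job will actually get is to ask the input format directly. A minimal
sketch, assuming the CDH3-era org.apache.hadoop.mapred API; the class name and
the input path argument are made up for illustration:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class CountSplits {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CountSplits.class);
        // Point this at the WordCount input directory.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        TextInputFormat format = new TextInputFormat();
        format.configure(conf);
        // The second argument is only a hint; the real count is driven by
        // the per-file block/split layout.
        InputSplit[] splits = format.getSplits(conf, 1);
        System.out.println("Input splits (= map tasks): " + splits.length);
    }
}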

Re: Number of Maps running more than expected

Posted by Mohit Anchlia <mo...@gmail.com>.
It would be helpful to see some statistics from both jobs, like bytes
read/written, number of errors, etc.
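
For example, the per-job counters (bytes read/written, failed task counts,
and so on) can be pulled with the old JobClient API. A minimal sketch,
assuming the CDH3-era org.apache.hadoop.mapred classes; the job id passed on
the command line is a placeholder:

import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class DumpCounters {
    public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        // args[0] is something like "job_201208161300_0042" (hypothetical id).
        RunningJob job = client.getJob(JobID.forName(args[0]));
        Counters counters = job.getCounters();
        for (Counters.Group group : counters) {
            for (Counters.Counter counter : group) {
                System.out.println(group.getDisplayName() + "\t"
                        + counter.getDisplayName() + " = " + counter.getCounter());
            }
        }
    }
}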


Re: Number of Maps running more than expected

Posted by Raj Vishwanathan <ra...@yahoo.com>.
You probably have speculative execution on. Extra map and reduce tasks are run in case some of them fail.

Raj


Sent from my iPad
Please excuse the typos. 


Re: Number of Maps running more than expected

Posted by Bertrand Dechoux <de...@gmail.com>.
Also, could you tell us more about your task statuses?
You might also have failed tasks...


Bertrand
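
A rough sketch of listing per-map-task states and diagnostics through
JobClient, under the same old-API assumption as the earlier sketches; the job
id argument is a placeholder:

import java.util.Arrays;

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.TaskReport;

public class ListMapTasks {
    public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        // One report per map task; failed attempts show up in the diagnostics.
        for (TaskReport report : client.getMapTaskReports(JobID.forName(args[0]))) {
            System.out.println(report.getTaskID() + "\t" + report.getState()
                    + "\t" + Arrays.toString(report.getDiagnostics()));
        }
    }
}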


Re: Number of Maps running more than expected

Posted by Bertrand Dechoux <de...@gmail.com>.
Well, there is speculative execution too.

http://developer.yahoo.com/hadoop/tutorial/module4.html

*Speculative execution:* One problem with the Hadoop system is that by
> dividing the tasks across many nodes, it is possible for a few slow nodes
> to rate-limit the rest of the program. For example if one node has a slow
> disk controller, then it may be reading its input at only 10% the speed of
> all the other nodes. So when 99 map tasks are already complete, the system
> is still waiting for the final map task to check in, which takes much
> longer than all the other nodes.
> By forcing tasks to run in isolation from one another, individual tasks do
> not know *where* their inputs come from. Tasks trust the Hadoop platform
> to just deliver the appropriate input. Therefore, the same input can be
> processed *multiple times in parallel*, to exploit differences in machine
> capabilities. As most of the tasks in a job are coming to a close, the
> Hadoop platform will schedule redundant copies of the remaining tasks
> across several nodes which do not have other work to perform. This process
> is known as *speculative execution*. When tasks complete, they announce
> this fact to the JobTracker. Whichever copy of a task finishes first
> becomes the definitive copy. If other copies were executing speculatively,
> Hadoop tells the TaskTrackers to abandon the tasks and discard their
> outputs. The Reducers then receive their inputs from whichever Mapper
> completed successfully, first.
> Speculative execution is enabled by default. You can disable speculative
> execution for the mappers and reducers by setting the
> mapred.map.tasks.speculative.execution and
> mapred.reduce.tasks.speculative.execution JobConf options to false,
> respectively.



Can you tell us your configuration with regards to those parameters?

Regards

Bertrand
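
A minimal sketch of setting those two properties per job, assuming the old
JobConf API (the same values can equally go into mapred-site.xml); the class
name is just an example:

import org.apache.hadoop.mapred.JobConf;

public class DisableSpeculation {
    public static JobConf apply(JobConf conf) {
        // The same properties quoted above; JobConf also exposes typed setters
        // (setMapSpeculativeExecution / setReduceSpeculativeExecution).
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        return conf;
    }
}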


Re: Number of Maps running more than expected

Posted by "in.abdul" <in...@gmail.com>.
Hi Gaurav,
   The number of maps does not depend on the number of blocks; it depends on
the number of input splits. If your 100 GB of data is divided into 10 input
splits, then you will see only 10 maps.

Please correct me if I am wrong.

Thanks and regards,
Syed abdul kather
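
To illustrate the point about splits: the split count (and hence the map
count) is steered through split-size settings rather than requested directly.
A hedged sketch of the usual knobs in the old JobConf API; the 256 MB value
and the map-count hint are only examples:

import org.apache.hadoop.mapred.JobConf;

public class SplitKnobs {
    public static void tune(JobConf conf) {
        // A larger minimum split size means larger splits and therefore fewer maps.
        conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);  // example: 256 MB
        // This is only a hint to the framework, not a hard limit on map tasks.
        conf.setNumMapTasks(10);
    }
}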

Re: Number of Maps running more than expected

Posted by Gaurav Dasgupta <gd...@gmail.com>.
Hi

I have got it. It was my mistake in understanding the calculation. Thanks for
the help.

Regards,
Gaurav Dasgupta

On Fri, Aug 17, 2012 at 5:15 PM, Bejoy Ks <be...@gmail.com> wrote:

> Hi Gaurav
>
> While calculating, you got the number of map tasks per file as 8.21, i.e., 9
> map tasks for each file. So for 100 files it is 900 map tasks, and now your
> numbers match. Doesn't that look right?
>
> Regards
> Bejoy KS
>


Re: Number of Maps running more than expected

Posted by Bejoy Ks <be...@gmail.com>.
Hi Gaurav

While calculating, you got the number of map tasks per file as 8.21, i.e., 9
map tasks for each file. So for 100 files it is 900 map tasks, and now your
numbers match. Doesn't that look right?

Regards
Bejoy KS
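
A small worked check of that per-file rounding, using the numbers reported
elsewhere in this thread (about 102.65 GB spread over 100 RandomTextWriter
output files, 128 MB splits). This is illustrative arithmetic only, not
output from the cluster:

public class PerFileRounding {
    public static void main(String[] args) {
        double totalMB       = 102.65 * 1024;   // ~105114 MB, as reported by "hadoop fs -dus"
        int    files         = 100;             // one output file per RandomTextWriter map
        double splitMB       = 128;             // split size == block size
        double blocksPerFile = totalMB / files / splitMB;        // ~8.21
        long   mapsPerFile   = (long) Math.ceil(blocksPerFile);  // rounds up to 9
        System.out.println("Naive estimate   : " + Math.ceil(totalMB / splitMB)); // ~822
        System.out.println("Per-file estimate: " + mapsPerFile * files);          // 900, as observed
    }
}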


Re: Number of Maps running more than expected

Posted by Gaurav Dasgupta <gd...@gmail.com>.
Hi Bejoy,

The total number of Maps in the RandomTextWriter execution was 100, and
hence the total number of input files for WordCount is 100.
My dfs.block.size = 128 MB; I have not changed mapred.max.split.size and
could not find it in my Job.xml file.
Hence, referring to the formula *max(minsplitsize, min(maxsplitsize,
blocksize))*, I am assuming mapred.max.split.size to be 128 MB.
Calculating the blocks per file [bytes per file / block size (128 MB)]
gives me 8.21 for each file, and summing them up gives 821.22 (same as
my previous calculation).

I have somehow managed to make a rough copy of the Job.xml into a Word doc; I
copied it from the browser as I cannot recover it from HDFS. Please find it
in the attachment. You may refer to the parameters and configuration there. I
have also attached the console output for the bytes per file in the
WordCount input.

Regards,
Gaurav Dasgupta
On Fri, Aug 17, 2012 at 3:28 PM, Bejoy Ks <be...@gmail.com> wrote:

> Hi Gaurav
>
> To add more clarity to my previous mail:
> If you are using the default TextInputFormat, there will be *at least* one
> task generated per file, even if the file size is less than
> the block size (assuming your split size equals the block size).
>
> So the right way to calculate the number of splits is per file, not on
> the whole input data size. Calculate the number of blocks per file; summing
> those values over all the files gives the number of mappers.
>
> What is the value of mapred.max.split.size in your job? If it is less than
> the HDFS block size, there will be more splits even within a single HDFS block.
>
> Regards
> Bejoy KS
>
>
>
>
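
For reference, a tiny sketch of the split-size formula referred to above. The
assumption here is that the minimum and maximum split sizes are left at their
defaults, so the 128 MB block size wins:

public class EffectiveSplitSize {
    // Mirrors max(minSplitSize, min(maxSplitSize, blockSize)) from the mail above.
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024;
        // Default-like min (1 byte) and an effectively unbounded max.
        System.out.println(splitSize(1, Long.MAX_VALUE, 128 * MB) / MB + " MB");  // prints "128 MB"
    }
}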


Re: Number of Maps running more than expected

Posted by Bejoy Ks <be...@gmail.com>.
Hi Gaurav

To add more clarity to my previous mail:
If you are using the default TextInputFormat, there will be *at least* one
task generated per file, even if the file size is less than
the block size (assuming your split size equals the block size).

So the right way to calculate the number of splits is per file, not on
the whole input data size. Calculate the number of blocks per file; summing
those values over all the files gives the number of mappers.

What is the value of mapred.max.split.size in your job? If it is less than
the HDFS block size, there will be more splits even within a single HDFS block.

Regards
Bejoy KS
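
A rough sketch of doing that per-file calculation directly against HDFS,
assuming the split size equals the 128 MB block size; the class name and
argument handling are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ExpectedMaps {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long splitSize = 128L * 1024 * 1024;   // assumed: split size == block size
        long maps = 0;
        for (FileStatus file : fs.listStatus(new Path(args[0]))) {
            if (file.isDir()) {                // skip sub-directories (old-API check)
                continue;
            }
            long blocks = (file.getLen() + splitSize - 1) / splitSize;  // ceil(len / splitSize)
            maps += Math.max(1L, blocks);      // at least one map task per file
        }
        System.out.println("Expected map tasks: " + maps);
    }
}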


Re: Number of Maps running more than expected

Posted by Bejoy Ks <be...@gmail.com>.
Hi Gaurav

How many input files are there for the wordcount map reduce job? Do you
have input files smaller than a block? If you are using the default
TextInputFormat, at least one task is generated per file, so if
you have files smaller than the block size, the calculation given here for
the number of splits won't hold. If there are small files, the number of map
tasks will definitely be higher.

Also, did you change the split sizes along with the block size?

Regards
Bejoy KS


Re: Number of Maps running more than expected

Posted by Gaurav Dasgupta <gd...@gmail.com>.
Hi Anil,

The speculative execution property has been off from the beginning.
In addition to my previous mail, I would like to add some more points:

I checked the size of the 100 GB dataset generated by RandomTextWriter,
and "hadoop fs -dus <hdfs output dir>" gives me 102.65 GB.
So, by my calculation, the number of Maps when running WordCount on this
should be ((102.65 * 1024) MB / 128 MB) = 821.22, i.e., 822 Maps, but the
actual number of Maps running is 900, i.e., 78 extra.
Also, the number of Maps run for the above RandomTextWriter job was 100.

The above does not happen if I generate data using TeraGen (hadoop jar
hadoop-examples.jar teragen 10000 <output_dir>) and then run WordCount
on it: it gives me the number of Maps = 2 (note: the number of Maps for
TeraGen was also 2).
Please find attached the screenshots of the JobTracker UI for
RandomTextWriter and WordCount for your reference.

Regards,
Gaurav Dasgupta
On Thu, Aug 16, 2012 at 7:57 PM, Anil Gupta <an...@gmail.com> wrote:

>  Hi Gaurav,
>
> Did you turn off speculative execution?
>
> Best Regards,
> Anil
>

Re: Number of Maps running more than expected

Posted by Anil Gupta <an...@gmail.com>.
Hi Gaurav,

Did you turn off speculative execution?

Best Regards,
Anil
