Posted to hdfs-user@hadoop.apache.org by Patai Sangbutsarakum <si...@gmail.com> on 2013/08/09 22:41:10 UTC

Jobtracker page hangs ..again.

A while back, I was fighting with the JobTracker page hanging: when I
browsed to http://jobtracker:50030, the browser wouldn't show job info as
usual. That turned out to be caused by allowing too much job history to be
kept in the jobtracker.

Currently, I am setting up a new cluster with a 40 GB heap on the namenode
and jobtracker, each on a dedicated machine. The not-fun part starts here:
a developer tried to test out the cluster by launching a 76k-map job (the
cluster has around 6k mappers).
Job initialization succeeded, and the job finished.

However, until the job actually started running, I couldn't access the
jobtracker page anymore; same symptom as above.

I see a bunch of this in the jobtracker log:

2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
tip:task_201307291733_0619_m_076796 has split on node: /rack/node
..
..
..

until I see this:

INFO org.apache.hadoop.mapred.JobInProgress: job_201307291733_0619
LOCALITY_WAIT_FACTOR=1.0
2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress: Job
job_201307291733_0619 initialized successfully with 76797 map tasks and 10
reduce tasks.

That's when I can access the jobtracker page again.


CPU load on the jobtracker is very low, the jobtracker's heap is far from
full (about 1 GB used out of 40), and network bandwidth is nowhere near
saturated.

I'm running the 0.20.2 branch on CentOS 6.4 with Java(TM) SE Runtime
Environment (build 1.6.0_32-b05)


What root cause should I be looking at, or at least where should I start?

Thanks in advance
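[Editor's note: one concrete place to start is measuring the hang itself from outside. A minimal sketch, assuming Python 3 on a host that can reach the JobTracker; the URL is the one from the post above:]

```python
import time
import urllib.request

def ui_response_ms(url, timeout=30):
    """Time one request to the JobTracker web UI; None on error or timeout."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read(1)  # one byte is enough to prove the UI responded
        return (time.monotonic() - start) * 1000.0
    except OSError:  # urllib's URLError is an OSError subclass
        return None

# Poll every few seconds while the 76k-map job initializes, e.g.:
#   ui_response_ms("http://jobtracker:50030/jobtracker.jsp")
```

Running this in a loop during job submission shows exactly when the UI stops answering and for how long, which can then be lined up against the jobtracker log timestamps.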

Re: Jobtracker page hangs ..again.

Posted by Patai Sangbutsarakum <si...@gmail.com>.
Thanks, Harsh.
Appreciate your input, as always.

On Aug 12, 2013, at 20:01, Harsh J <ha...@cloudera.com> wrote:

> If you're not already doing it, run a local name-caching daemon (such
> as nscd) on each cluster node. Hadoop does a lot of lookups, and a
> local cache would help reduce the load on your DNS.
> 
> On Tue, Aug 13, 2013 at 3:09 AM, Patai Sangbutsarakum
> <si...@gmail.com> wrote:
>> Update: after adjusting the network routing, DNS query speed is in
>> microseconds, as it should be. The issue is completely solved.
>> The JobTracker page doesn't hang anymore when launching a 100k-mapper job.
>> 
>> Cheers,
>> 
>> 
>> 
>> On Mon, Aug 12, 2013 at 1:29 PM, Patai Sangbutsarakum
>> <si...@gmail.com> wrote:
>>> 
>>> Ok, after some sweat, I think I found the root cause, but I still need
>>> another team to help me fix it.
>>> It lies in the DNS. For each of the tip:task lines, in the tcpdump I
>>> saw a DNS query to the DNS server. The timestamps seem to match.
>>> 
>>> 2013-08-11 20:39:23,493 INFO org.apache.hadoop.mapred.JobInProgress:
>>> tip:task_201308111631_0006_m_000000 has split on node:/rack1/host1
>>> 
>>> 127 ms
>>> 
>>> 2013-08-11 20:39:23,620 INFO org.apache.hadoop.mapred.JobInProgress:
>>> tip:task_201308111631_0006_m_000000 has split on node:/rack1/host2
>>> 
>>> 126 ms
>>> 
>>> 2013-08-11 20:39:23,746 INFO org.apache.hadoop.mapred.JobInProgress:
>>> tip:task_201308111631_0006_m_000000 has split on node:/rack2/host3
>>> 
>>> 
>>> 20:39:23.367337 IP jtk.53110 > dns1.domain: 41717+ A? host1. (37)
>>> 
>>> 20:39:23.367345 IP jtk.53110 > dns1.domain: 7221+ AAAA? host1. (37)
>>> 
>>> 20:39:23.493486 IP dns1.domain > jtk.53110: 7221* 0/1/0 (89)
>>> 
>>> 20:39:23.493505 IP dns1.domain > jtk.53110: 41717* 1/4/2 A xx.xx.xx.xx (189)
>>> 
>>> 
>>> 20:39:23.493766 IP jtk.48042 > dns1.domain: 35450+ A? host2. (37)
>>> 
>>> 20:39:23.493774 IP jtk.48042 > dns1.domain: 56007+ AAAA? host2. (37)
>>> 
>>> 20:39:23.619903 IP dns1.domain > jtk.48042: 35450* 1/4/2 A yy.yy.yy.yy
>>> (189)
>>> 
>>> 20:39:23.619921 IP dns1.domain > jtk.48042: 56007* 0/1/0 (89)
>>> 
>>> 
>>> 20:39:23.620208 IP jtk.41237 > dns2.domain: 49511+ A? host3. (37)
>>> 
>>> 20:39:23.620215 IP jtk.41237 > dns2.domain: 29199+ AAAA? host3. (37)
>>> 
>>> 20:39:23.746358 IP dns2.domain > jtk.41237: 49511* 1/4/2 A zz.zz.zz.zz
>>> (189)
>>> 
>>> I looked at the jobtracker log in another datacenter when the same job
>>> was submitted. The gaps between the tip:task timestamps there are much,
>>> much smaller.
>>> 
>>> The question this raises: does job initialization really query DNS, and
>>> if so, is there any way to suppress it? The topology file is already in
>>> place, with both names and IPs in it.
>>> 
>>> 
>>> Hope this makes sense
>>> 
>>> Patai
>>> 
>>> 
>>> 
>>> 
>>> On Fri, Aug 9, 2013 at 6:57 PM, Patai Sangbutsarakum
>>> <si...@gmail.com> wrote:
>>>> 
>>>> Appreciate your input, Bryan. I will try to reproduce it and look at
>>>> the namenode log before, during, and after the pause.
>>>> Wish me luck
>>>> 
>>>> 
>>>> On Fri, Aug 9, 2013 at 2:09 PM, Bryan Beaudreault
>>>> <bb...@hubspot.com> wrote:
>>>>> 
>>>>> When I've had problems with a slow jobtracker, i've found the issue to
>>>>> be one of the following two (so far) possibilities:
>>>>> 
>>>>> - long GC pause (I'm guessing this is not it based on your email)
>>>>> - hdfs is slow
>>>>> 
>>>>> I haven't dived into the code yet, but circumstantially I've found that
>>>>> when you submit a job the jobtracker needs to put a bunch of files in hdfs,
>>>>> such as the job.xml, the job jar, etc.  I'm not sure how this scales with
>>>>> larger and larger jobs, aside from the size of the splits serialization in
>>>>> the job.xml, but if your HDFS is slow for any reason it can cause pauses in
>>>>> your jobtracker.  This affects other jobs being able to submit, as well as
>>>>> the 50030 web ui.
>>>>> 
>>>>> I'd take a look at your namenode logs.  When the jobtracker logs pause,
>>>>> do you see a corresponding pause in the namenode logs?  What gets spewed
>>>>> before and after that pause?
>>>>> 
>>>>> 
>>>>> On Fri, Aug 9, 2013 at 4:41 PM, Patai Sangbutsarakum
>>>>> <si...@gmail.com> wrote:
>>>>>> 
>>>>>> A while back, i was fighting with the jobtracker page hangs when i
>>>>>> browse to http://jobtracker:50030 browser doesn't show jobs info as usual
>>>>>> which ends up because of allowing too much job history kept in jobtracker.
>>>>>> 
>>>>>> Currently, i am setting up a new cluster 40g heap on the namenode and
>>>>>> jobtracker in dedicated machines. Not fun part starts here; a developer
>>>>>> tried to test out the cluster by launching a 76k map job (the cluster has
>>>>>> around 6k-ish mappers)
>>>>>> Job initialization was success, and finished the job.
>>>>>> 
>>>>>> However, before the job is actually running, i can't access to the
>>>>>> jobtracker page anymore same symptom as above.
>>>>>> 
>>>>>> i see bunch of this in jobtracker log
>>>>>> 
>>>>>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
>>>>>> tip:task_201307291733_0619_m_076796 has split on node: /rack/node
>>>>>> ..
>>>>>> ..
>>>>>> ..
>>>>>> 
>>>>>> Until i see this
>>>>>> 
>>>>>> INFO org.apache.hadoop.mapred.JobInProgress: job_201307291733_0619
>>>>>> LOCALITY_WAIT_FACTOR=1.0
>>>>>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
>>>>>> Job job_201307291733_0619 initialized successfully with 76797 map tasks and
>>>>>> 10 reduce tasks.
>>>>>> 
>>>>>> that's when i can access to the jobtracker page again.
>>>>>> 
>>>>>> 
>>>>>> CPU on jobtracker is very little load, JTK's Heap is far from full like
>>>>>> 1ish gig from 40
>>>>>> network bandwidth is far from filled up.
>>>>>> 
>>>>>> I'm running on 0.20.2 branch on CentOS6.4 with Java(TM) SE Runtime
>>>>>> Environment (build 1.6.0_32-b05)
>>>>>> 
>>>>>> 
>>>>>> What would be the root cause i should looking at or at least where to
>>>>>> start?
>>>>>> 
>>>>>> Thanks you in advanced
> 
> 
> 
> -- 
> Harsh J
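[Editor's note: the per-lookup latency Patai measured with tcpdump (~126-127 ms per host) can be reproduced directly from the JobTracker machine. A minimal sketch, assuming Python 3; the host list is a placeholder for your slave hostnames:]

```python
import socket
import time

def lookup_ms(hostname):
    """Time one forward (A/AAAA) lookup -- roughly what resolving one
    split location costs the JobTracker during job initialization."""
    start = time.monotonic()
    socket.getaddrinfo(hostname, None)
    return (time.monotonic() - start) * 1000.0

for host in ["localhost"]:  # replace with e.g. ["host1", "host2", "host3"]
    print("%-12s %8.2f ms" % (host, lookup_ms(host)))
```

With a healthy resolver (or a local cache), repeated calls should come back in well under a millisecond; anything in the 100 ms range per lookup reproduces the pattern seen in the log above.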

Re: Jobtracker page hangs ..again.

Posted by Harsh J <ha...@cloudera.com>.
If you're not already doing it, run a local name-caching daemon (such
as nscd) on each cluster node. Hadoop does a lot of lookups, and a
local cache would help reduce the load on your DNS.
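[Editor's note: on CentOS 6 that typically means installing and starting the nscd package (`yum install nscd; service nscd start; chkconfig nscd on`). The relevant host-cache directives in `/etc/nscd.conf` look roughly like this; the TTL values are illustrative, not recommendations:]

```
# /etc/nscd.conf -- cache host lookups locally
# Cache successful hostname lookups
enable-cache            hosts   yes
# Seconds to keep a successful lookup
positive-time-to-live   hosts   600
# Seconds to keep a failed lookup
negative-time-to-live   hosts   20
```

After starting the daemon, `nscd -g` prints cache-hit statistics, which makes it easy to confirm the JobTracker's lookups are actually being served locally.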

On Tue, Aug 13, 2013 at 3:09 AM, Patai Sangbutsarakum
<si...@gmail.com> wrote:
> Update: after adjusting the network routing, DNS query speed is in
> microseconds, as it should be. The issue is completely solved.
> The JobTracker page doesn't hang anymore when launching a 100k-mapper job.
>
> Cheers,
>
>
>
> On Mon, Aug 12, 2013 at 1:29 PM, Patai Sangbutsarakum
> <si...@gmail.com> wrote:
>>
>> Ok, after some sweat, I think I found the root cause, but I still need
>> another team to help me fix it.
>> It lies in the DNS. For each of the tip:task lines, in the tcpdump I
>> saw a DNS query to the DNS server. The timestamps seem to match.
>>
>> 2013-08-11 20:39:23,493 INFO org.apache.hadoop.mapred.JobInProgress:
>> tip:task_201308111631_0006_m_000000 has split on node:/rack1/host1
>>
>> 127 ms
>>
>> 2013-08-11 20:39:23,620 INFO org.apache.hadoop.mapred.JobInProgress:
>> tip:task_201308111631_0006_m_000000 has split on node:/rack1/host2
>>
>> 126 ms
>>
>> 2013-08-11 20:39:23,746 INFO org.apache.hadoop.mapred.JobInProgress:
>> tip:task_201308111631_0006_m_000000 has split on node:/rack2/host3
>>
>>
>> 20:39:23.367337 IP jtk.53110 > dns1.domain: 41717+ A? host1. (37)
>>
>> 20:39:23.367345 IP jtk.53110 > dns1.domain: 7221+ AAAA? host1. (37)
>>
>> 20:39:23.493486 IP dns1.domain > jtk.53110: 7221* 0/1/0 (89)
>>
>> 20:39:23.493505 IP dns1.domain > jtk.53110: 41717* 1/4/2 A xx.xx.xx.xx (189)
>>
>>
>> 20:39:23.493766 IP jtk.48042 > dns1.domain: 35450+ A? host2. (37)
>>
>> 20:39:23.493774 IP jtk.48042 > dns1.domain: 56007+ AAAA? host2. (37)
>>
>> 20:39:23.619903 IP dns1.domain > jtk.48042: 35450* 1/4/2 A yy.yy.yy.yy
>> (189)
>>
>> 20:39:23.619921 IP dns1.domain > jtk.48042: 56007* 0/1/0 (89)
>>
>>
>> 20:39:23.620208 IP jtk.41237 > dns2.domain: 49511+ A? host3. (37)
>>
>> 20:39:23.620215 IP jtk.41237 > dns2.domain: 29199+ AAAA? host3. (37)
>>
>> 20:39:23.746358 IP dns2.domain > jtk.41237: 49511* 1/4/2 A zz.zz.zz.zz
>> (189)
>>
>> I looked at the jobtracker log in another datacenter when the same job
>> was submitted. The gaps between the tip:task timestamps there are much,
>> much smaller.
>>
>> The question this raises: does job initialization really query DNS, and
>> if so, is there any way to suppress it? The topology file is already in
>> place, with both names and IPs in it.
>>
>>
>> Hope this makes sense
>>
>> Patai
>>
>>
>>
>>
>> On Fri, Aug 9, 2013 at 6:57 PM, Patai Sangbutsarakum
>> <si...@gmail.com> wrote:
>>>
>>> Appreciate your input, Bryan. I will try to reproduce it and look at
>>> the namenode log before, during, and after the pause.
>>> Wish me luck
>>>
>>>
>>> On Fri, Aug 9, 2013 at 2:09 PM, Bryan Beaudreault
>>> <bb...@hubspot.com> wrote:
>>>>
>>>> When I've had problems with a slow jobtracker, i've found the issue to
>>>> be one of the following two (so far) possibilities:
>>>>
>>>> - long GC pause (I'm guessing this is not it based on your email)
>>>> - hdfs is slow
>>>>
>>>> I haven't dived into the code yet, but circumstantially I've found that
>>>> when you submit a job the jobtracker needs to put a bunch of files in hdfs,
>>>> such as the job.xml, the job jar, etc.  I'm not sure how this scales with
>>>> larger and larger jobs, aside from the size of the splits serialization in
>>>> the job.xml, but if your HDFS is slow for any reason it can cause pauses in
>>>> your jobtracker.  This affects other jobs being able to submit, as well as
>>>> the 50030 web ui.
>>>>
>>>> I'd take a look at your namenode logs.  When the jobtracker logs pause,
>>>> do you see a corresponding pause in the namenode logs?  What gets spewed
>>>> before and after that pause?
>>>>
>>>>
>>>> On Fri, Aug 9, 2013 at 4:41 PM, Patai Sangbutsarakum
>>>> <si...@gmail.com> wrote:
>>>>>
>>>>> A while back, i was fighting with the jobtracker page hangs when i
>>>>> browse to http://jobtracker:50030 browser doesn't show jobs info as usual
>>>>> which ends up because of allowing too much job history kept in jobtracker.
>>>>>
>>>>> Currently, i am setting up a new cluster 40g heap on the namenode and
>>>>> jobtracker in dedicated machines. Not fun part starts here; a developer
>>>>> tried to test out the cluster by launching a 76k map job (the cluster has
>>>>> around 6k-ish mappers)
>>>>> Job initialization was success, and finished the job.
>>>>>
>>>>> However, before the job is actually running, i can't access to the
>>>>> jobtracker page anymore same symptom as above.
>>>>>
>>>>> i see bunch of this in jobtracker log
>>>>>
>>>>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
>>>>> tip:task_201307291733_0619_m_076796 has split on node: /rack/node
>>>>> ..
>>>>> ..
>>>>> ..
>>>>>
>>>>> Until i see this
>>>>>
>>>>> INFO org.apache.hadoop.mapred.JobInProgress: job_201307291733_0619
>>>>> LOCALITY_WAIT_FACTOR=1.0
>>>>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
>>>>> Job job_201307291733_0619 initialized successfully with 76797 map tasks and
>>>>> 10 reduce tasks.
>>>>>
>>>>> that's when i can access to the jobtracker page again.
>>>>>
>>>>>
>>>>> CPU on jobtracker is very little load, JTK's Heap is far from full like
>>>>> 1ish gig from 40
>>>>> network bandwidth is far from filled up.
>>>>>
>>>>> I'm running on 0.20.2 branch on CentOS6.4 with Java(TM) SE Runtime
>>>>> Environment (build 1.6.0_32-b05)
>>>>>
>>>>>
>>>>> What would be the root cause i should looking at or at least where to
>>>>> start?
>>>>>
>>>>> Thanks you in advanced
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>



-- 
Harsh J
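[Editor's note: a quick back-of-envelope check ties the thread's two observations together. At the measured ~126 ms per lookup, serially resolving one hostname per logged tip:task line adds up to hours for a 76,797-map job, which matches an initialization long enough to make the web UI appear hung. A sketch; one lookup per map task is an assumption, since the real count depends on split locations:]

```python
def init_overhead_hours(num_map_tasks, lookup_ms=126.0, lookups_per_task=1):
    """Estimated serial DNS time added to job initialization, in hours."""
    return num_map_tasks * lookups_per_task * lookup_ms / 1000.0 / 3600.0

# The 76,797-map job from the thread, at the observed ~126 ms per lookup:
print(round(init_overhead_hours(76797), 1))  # roughly 2.7 hours
```

This also explains why nothing looked saturated: CPU, heap, and network were all idle while the initializer sat in blocking resolver calls.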

Re: Jobtracker page hangs ..again.

Posted by Patai Sangbutsarakum <si...@gmail.com>.
Update: after adjusting the network routing, DNS query latency is down to
microseconds, as it should be, and the issue is completely solved.
The JobTracker page no longer hangs when launching a 100k-mapper job.

Cheers,



On Mon, Aug 12, 2013 at 1:29 PM, Patai Sangbutsarakum <
silvianhadoop@gmail.com> wrote:

> Ok, after some sweat, I think I found the root cause, but I still need
> another team to help me fix it.
> It lies in the DNS. For each tip:task line, tcpdump shows a corresponding
> DNS query to the DNS server; the timestamps match.
>
> 2013-08-11 20:39:23,493 INFO org.apache.hadoop.mapred.JobInProgress:
> tip:task_201308111631_0006_m_000000 has split on node:/rack1/host1
>
> 127 ms
>
> 2013-08-11 20:39:23,620 INFO org.apache.hadoop.mapred.JobInProgress:
> tip:task_201308111631_0006_m_000000 has split on node:/rack1/host2
>
> 126 ms
>
> 2013-08-11 20:39:23,746 INFO org.apache.hadoop.mapred.JobInProgress:
> tip:task_201308111631_0006_m_000000 has split on node:/rack2/host3
>
>
> 20:39:23.367337 IP jtk.53110 > dns1.domain: 41717+ A? host1. (37)
>
> 20:39:23.367345 IP jtk.53110 > dns1.domain: 7221+ AAAA? host1. (37)
>
> 20:39:23.493486 IP dns1.domain > jtk.53110: 7221* 0/1/0 (89)
>
> 20:39:23.493505 IP dns1.domain > jtk.53110: 41717* 1/4/2 A xx.xx.xx.xx (189)
>
>
> 20:39:23.493766 IP jtk.48042 > dns1.domain: 35450+ A? host2. (37)
>
> 20:39:23.493774 IP jtk.48042 > dns1.domain: 56007+ AAAA? host2. (37)
>
> 20:39:23.619903 IP dns1.domain > jtk.48042: 35450* 1/4/2 A yy.yy.yy.yy
> (189)
>
> 20:39:23.619921 IP dns1.domain > jtk.48042: 56007* 0/1/0 (89)
>
>
> 20:39:23.620208 IP jtk.41237 > dns2.domain: 49511+ A? host3. (37)
>
> 20:39:23.620215 IP jtk.41237 > dns2.domain: 29199+ AAAA? host3. (37)
>
> 20:39:23.746358 IP dns2.domain > jtk.41237: 49511* 1/4/2 A zz.zz.zz.zz
> (189)
>
> I looked at the jobtracker log in another datacenter when the same job was
> submitted. The timestamps between the tip:task lines are much, much closer.
>
> The question this raises is whether job initialization really queries the
> DNS, and if so, whether there is any way to suppress that. The topology file
> is already in place, with both names and IPs listed.
>
>
> Hope this makes sense
>
> Patai
>
>
>
>
> On Fri, Aug 9, 2013 at 6:57 PM, Patai Sangbutsarakum <
> silvianhadoop@gmail.com> wrote:
>
>> Appreciate your input Bryan, I will try to reproduce it and look at the
>> namenode log before, during, and after the pause.
>> Wish me luck
>>
>>
>> On Fri, Aug 9, 2013 at 2:09 PM, Bryan Beaudreault <
>> bbeaudreault@hubspot.com> wrote:
>>
>>> When I've had problems with a slow jobtracker, I've found the issue to
>>> be one of the following two (so far) possibilities:
>>>
>>> - long GC pause (I'm guessing this is not it based on your email)
>>> - hdfs is slow
>>>
>>> I haven't dived into the code yet, but circumstantially I've found that
>>> when you submit a job the jobtracker needs to put a bunch of files in hdfs,
>>> such as the job.xml, the job jar, etc.  I'm not sure how this scales with
>>> larger and larger jobs, aside from the size of the splits serialization in
>>> the job.xml, but if your HDFS is slow for any reason it can cause pauses in
>>> your jobtracker.  This affects other jobs being able to submit, as well as
>>> the 50030 web ui.
>>>
>>> I'd take a look at your namenode logs.  When the jobtracker logs pause,
>>> do you see a corresponding pause in the namenode logs?  What gets spewed
>>> before and after that pause?
>>>
>>>
>>> On Fri, Aug 9, 2013 at 4:41 PM, Patai Sangbutsarakum <
>>> silvianhadoop@gmail.com> wrote:
>>>
>>>> A while back, I was fighting with the JobTracker page hanging: when I
>>>> browsed to http://jobtracker:50030, the browser didn't show job info as
>>>> usual, which turned out to be caused by keeping too much job history in
>>>> the jobtracker.
>>>>
>>>> Currently, I am setting up a new cluster with a 40 GB heap on the namenode
>>>> and jobtracker, each on a dedicated machine. The not-fun part starts here:
>>>> a developer tried to test out the cluster by launching a 76k-map job (the
>>>> cluster has around 6k map slots).
>>>> Job initialization succeeded, and the job finished.
>>>>
>>>> However, before the job actually starts running, I can't access the
>>>> jobtracker page anymore; the same symptom as above.
>>>>
>>>> I see a bunch of lines like this in the jobtracker log:
>>>>
>>>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
>>>> tip:task_201307291733_0619_m_076796 has split on node: /rack/node
>>>> ..
>>>> ..
>>>> ..
>>>>
>>>> Until I see this:
>>>>
>>>> INFO org.apache.hadoop.mapred.JobInProgress: job_201307291733_0619
>>>> LOCALITY_WAIT_FACTOR=1.0
>>>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
>>>> Job job_201307291733_0619 initialized successfully with 76797 map tasks and
>>>> 10 reduce tasks.
>>>>
>>>> That's when I can access the jobtracker page again.
>>>>
>>>>
>>>> CPU load on the jobtracker is very low, the JT's heap is far from full
>>>> (roughly 1 GB used out of 40), and
>>>> network bandwidth is far from saturated.
>>>>
>>>> I'm running the 0.20.2 branch on CentOS 6.4 with Java(TM) SE Runtime
>>>> Environment (build 1.6.0_32-b05).
>>>>
>>>>
>>>> What root cause should I be looking at, or at least where should I
>>>> start?
>>>>
>>>> Thank you in advance
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
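The per-lookup latency visible in the tcpdump trace above can be measured directly from the JobTracker host with a short script. This is a minimal sketch; the hostnames below are placeholders, not hosts from this cluster:

```python
import socket
import time

def resolve_latency(hostname):
    """Time a single forward DNS lookup the way JVM InetAddress
    resolution ultimately does, via getaddrinfo()."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        pass  # even an NXDOMAIN answer shows how long the round trip took
    return (time.monotonic() - start) * 1000.0  # milliseconds

# Placeholder hostnames; substitute real datanode names from the cluster.
for host in ["localhost", "example.com"]:
    print("%-15s %.3f ms" % (host, resolve_latency(host)))
```

A healthy resolver (or a warm local cache) should answer in well under a millisecond. At the ~127 ms per lookup seen in the trace, 76,797 map tasks with roughly three replica locations each would add up to hours of resolution time during job initialization, which is consistent with the long hang.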

Re: Jobtracker page hangs ..again.

Posted by Patai Sangbutsarakum <si...@gmail.com>.
Update, after adjust the network routing, dns query speed is in micro sec
as suppose to be. the issue is completely solve.
Jobtracker page doesn't hang anymore when launch 100k mappers job..

Cheers,



On Mon, Aug 12, 2013 at 1:29 PM, Patai Sangbutsarakum <
silvianhadoop@gmail.com> wrote:

> Ok, after some sweat, i think I found the root cause but still need
> another team to help me fix it.
> It lines on the DNS.  Somehow each of the tip:task line, through the
> tcpdump, i saw the dns query to dns server. Timestamp seems matches to me.
>
> 2013-08-11 20:39:23,493 INFO org.apache.hadoop.mapred.JobInProgress:
> tip:task_201308111631_0006_m_000000 has split on node:/rack1/host1
>
> 127 ms
>
> 2013-08-11 20:39:23,620 INFO org.apache.hadoop.mapred.JobInProgress:
> tip:task_201308111631_0006_m_000000 has split on node:/rack1/host2
>
> 126 ms
>
> 2013-08-11 20:39:23,746 INFO org.apache.hadoop.mapred.JobInProgress:
> tip:task_201308111631_0006_m_000000 has split on node:/rack2/host3
>
>
> 20:39:23.367337 IP jtk.53110 > dns1.domain: 41717+ A? host1. (37)
>
> 20:39:23.367345 IP jtk.53110 > dns1.domain: 7221+ AAAA? host1. (37)
>
> 20:39:23.493486 IP dns1.domain > jtk.53110: 7221* 0/1/0 (89)
>
> 20:39:23.493505 IP dns1.domain > : jtk.41717* 1/4/2 A xx.xx.xx.xx (189)
>
>
> 20:39:23.493766 IP jtk.48042 > dns1.domain: 35450+ A? host2. (37)
>
> 20:39:23.493774 IP jtk.48042 > dns1.domain: 56007+ AAAA? host2. (37)
>
> 20:39:23.619903 IP dns1.domain > jtk.48042: 35450* 1/4/2 A yy.yy.yy.yy
> (189)
>
> 20:39:23.619921 IP dns1.domain > jtk.48042: 56007* 0/1/0 (89)
>
>
> 20:39:23.620208 IP jtk.41237 > dns2.domain: 49511+ A? host3. (37)
>
> 20:39:23.620215 IP jtk.41237 > dns2.domain: 29199+ AAAA? host3. (37)
>
> 20:39:23.746358 IP dns2.domain > jtk.41237: 49511* 1/4/2 A zz.zz.zz.zz
> (189)
>
> I looked at the jobtracker log in other datacenter when submitted with the
> same query. Timestamp in each tip:task line is much much faster.
>
> The question that raise here is the job initialization is really request
> the DNS, if so is there any way to suppress that. topology file is already
> in place where name and ip are already there.
>
>
> Hope this make sense
>
> Patai
>
>
>
>
> On Fri, Aug 9, 2013 at 6:57 PM, Patai Sangbutsarakum <
> silvianhadoop@gmail.com> wrote:
>
>> Appreciate your input Bryant, i will try to reproduce and see the
>> namenode log before, while, and after it pause.
>> Wish me luck
>>
>>
>> On Fri, Aug 9, 2013 at 2:09 PM, Bryan Beaudreault <
>> bbeaudreault@hubspot.com> wrote:
>>
>>> When I've had problems with a slow jobtracker, i've found the issue to
>>> be one of the following two (so far) possibilities:
>>>
>>> - long GC pause (I'm guessing this is not it based on your email)
>>> - hdfs is slow
>>>
>>> I haven't dived into the code yet, but circumstantially I've found that
>>> when you submit a job the jobtracker needs to put a bunch of files in hdfs,
>>> such as the job.xml, the job jar, etc.  I'm not sure how this scales with
>>> larger and larger jobs, aside form the size of the splits serialization in
>>> the job.xml, but if your HDFS is slow for any reason it can cause pauses in
>>> your jobtracker.  This affects other jobs being able to submit, as well as
>>> the 50030 web ui.
>>>
>>> I'd take a look at your namenode logs.  When the jobtracker logs pause,
>>> do you see a corresponding pause in the namenode logs?  What gets spewed
>>> before and after that pause?
>>>
>>>
>>> On Fri, Aug 9, 2013 at 4:41 PM, Patai Sangbutsarakum <
>>> silvianhadoop@gmail.com> wrote:
>>>
>>>> A while back, i was fighting with the jobtracker page hangs when i
>>>> browse to http://jobtracker:50030 browser doesn't show jobs info as
>>>> usual which ends up because of allowing too much job history kept in
>>>> jobtracker.
>>>>
>>>> Currently, i am setting up a new cluster 40g heap on the namenode and
>>>> jobtracker in dedicated machines. Not fun part starts here; a developer
>>>> tried to test out the cluster by launching a 76k map job (the cluster has
>>>> around 6k-ish mappers)
>>>> Job initialization was success, and finished the job.
>>>>
>>>> However, before the job is actually running, i can't access to the
>>>> jobtracker page anymore same symptom as above.
>>>>
>>>> i see bunch of this in jobtracker log
>>>>
>>>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
>>>> tip:task_201307291733_0619_m_076796 has split on node: /rack/node
>>>> ..
>>>> ..
>>>> ..
>>>>
>>>> Until i see this
>>>>
>>>> INFO org.apache.hadoop.mapred.JobInProgress: job_201307291733_0619
>>>> LOCALITY_WAIT_FACTOR=1.0
>>>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
>>>> Job job_201307291733_0619 initialized successfully with 76797 map tasks and
>>>> 10 reduce tasks.
>>>>
>>>> that's when i can access to the jobtracker page again.
>>>>
>>>>
>>>> CPU on jobtracker is very little load, JTK's Heap is far from full like
>>>> 1ish gig from 40
>>>> network bandwidth is far from filled up.
>>>>
>>>> I'm running on 0.20.2 branch on CentOS6.4 with Java(TM) SE Runtime
>>>> Environment (build 1.6.0_32-b05)
>>>>
>>>>
>>>> What would be the root cause i should looking at or at least where to
>>>> start?
>>>>
>>>> Thanks you in advanced
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

Re: Jobtracker page hangs ..again.

Posted by Patai Sangbutsarakum <si...@gmail.com>.
Update, after adjust the network routing, dns query speed is in micro sec
as suppose to be. the issue is completely solve.
Jobtracker page doesn't hang anymore when launch 100k mappers job..

Cheers,



On Mon, Aug 12, 2013 at 1:29 PM, Patai Sangbutsarakum <
silvianhadoop@gmail.com> wrote:

> Ok, after some sweat, i think I found the root cause but still need
> another team to help me fix it.
> It lines on the DNS.  Somehow each of the tip:task line, through the
> tcpdump, i saw the dns query to dns server. Timestamp seems matches to me.
>
> 2013-08-11 20:39:23,493 INFO org.apache.hadoop.mapred.JobInProgress:
> tip:task_201308111631_0006_m_000000 has split on node:/rack1/host1
>
> 127 ms
>
> 2013-08-11 20:39:23,620 INFO org.apache.hadoop.mapred.JobInProgress:
> tip:task_201308111631_0006_m_000000 has split on node:/rack1/host2
>
> 126 ms
>
> 2013-08-11 20:39:23,746 INFO org.apache.hadoop.mapred.JobInProgress:
> tip:task_201308111631_0006_m_000000 has split on node:/rack2/host3
>
>
> 20:39:23.367337 IP jtk.53110 > dns1.domain: 41717+ A? host1. (37)
>
> 20:39:23.367345 IP jtk.53110 > dns1.domain: 7221+ AAAA? host1. (37)
>
> 20:39:23.493486 IP dns1.domain > jtk.53110: 7221* 0/1/0 (89)
>
> 20:39:23.493505 IP dns1.domain > : jtk.41717* 1/4/2 A xx.xx.xx.xx (189)
>
>
> 20:39:23.493766 IP jtk.48042 > dns1.domain: 35450+ A? host2. (37)
>
> 20:39:23.493774 IP jtk.48042 > dns1.domain: 56007+ AAAA? host2. (37)
>
> 20:39:23.619903 IP dns1.domain > jtk.48042: 35450* 1/4/2 A yy.yy.yy.yy
> (189)
>
> 20:39:23.619921 IP dns1.domain > jtk.48042: 56007* 0/1/0 (89)
>
>
> 20:39:23.620208 IP jtk.41237 > dns2.domain: 49511+ A? host3. (37)
>
> 20:39:23.620215 IP jtk.41237 > dns2.domain: 29199+ AAAA? host3. (37)
>
> 20:39:23.746358 IP dns2.domain > jtk.41237: 49511* 1/4/2 A zz.zz.zz.zz
> (189)
>
> I looked at the jobtracker log in other datacenter when submitted with the
> same query. Timestamp in each tip:task line is much much faster.
>
> The question that raise here is the job initialization is really request
> the DNS, if so is there any way to suppress that. topology file is already
> in place where name and ip are already there.
>
>
> Hope this make sense
>
> Patai
>
>
>
>
> On Fri, Aug 9, 2013 at 6:57 PM, Patai Sangbutsarakum <
> silvianhadoop@gmail.com> wrote:
>
>> Appreciate your input Bryant, i will try to reproduce and see the
>> namenode log before, while, and after it pause.
>> Wish me luck
>>
>>
>> On Fri, Aug 9, 2013 at 2:09 PM, Bryan Beaudreault <
>> bbeaudreault@hubspot.com> wrote:
>>
>>> When I've had problems with a slow jobtracker, i've found the issue to
>>> be one of the following two (so far) possibilities:
>>>
>>> - long GC pause (I'm guessing this is not it based on your email)
>>> - hdfs is slow
>>>
>>> I haven't dived into the code yet, but circumstantially I've found that
>>> when you submit a job the jobtracker needs to put a bunch of files in hdfs,
>>> such as the job.xml, the job jar, etc.  I'm not sure how this scales with
>>> larger and larger jobs, aside form the size of the splits serialization in
>>> the job.xml, but if your HDFS is slow for any reason it can cause pauses in
>>> your jobtracker.  This affects other jobs being able to submit, as well as
>>> the 50030 web ui.
>>>
>>> I'd take a look at your namenode logs.  When the jobtracker logs pause,
>>> do you see a corresponding pause in the namenode logs?  What gets spewed
>>> before and after that pause?
>>>
>>>
>>> On Fri, Aug 9, 2013 at 4:41 PM, Patai Sangbutsarakum <
>>> silvianhadoop@gmail.com> wrote:
>>>
>>>> A while back, i was fighting with the jobtracker page hangs when i
>>>> browse to http://jobtracker:50030 browser doesn't show jobs info as
>>>> usual which ends up because of allowing too much job history kept in
>>>> jobtracker.
>>>>
>>>> Currently, i am setting up a new cluster 40g heap on the namenode and
>>>> jobtracker in dedicated machines. Not fun part starts here; a developer
>>>> tried to test out the cluster by launching a 76k map job (the cluster has
>>>> around 6k-ish mappers)
>>>> Job initialization was success, and finished the job.
>>>>
>>>> However, before the job is actually running, i can't access to the
>>>> jobtracker page anymore same symptom as above.
>>>>
>>>> i see bunch of this in jobtracker log
>>>>
>>>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
>>>> tip:task_201307291733_0619_m_076796 has split on node: /rack/node
>>>> ..
>>>> ..
>>>> ..
>>>>
>>>> Until i see this
>>>>
>>>> INFO org.apache.hadoop.mapred.JobInProgress: job_201307291733_0619
>>>> LOCALITY_WAIT_FACTOR=1.0
>>>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
>>>> Job job_201307291733_0619 initialized successfully with 76797 map tasks and
>>>> 10 reduce tasks.
>>>>
>>>> that's when i can access to the jobtracker page again.
>>>>
>>>>
>>>> CPU on jobtracker is very little load, JTK's Heap is far from full like
>>>> 1ish gig from 40
>>>> network bandwidth is far from filled up.
>>>>
>>>> I'm running on 0.20.2 branch on CentOS6.4 with Java(TM) SE Runtime
>>>> Environment (build 1.6.0_32-b05)
>>>>
>>>>
>>>> What would be the root cause i should looking at or at least where to
>>>> start?
>>>>
>>>> Thanks you in advanced
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

Re: Jobtracker page hangs ..again.

Posted by Patai Sangbutsarakum <si...@gmail.com>.
Update: after adjusting the network routing, DNS query speed is in microseconds
as it should be, and the issue is completely solved.
The JobTracker page no longer hangs when launching a 100k-mapper job.

Cheers,
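(Editor's note: beyond fixing the resolver path, repeated lookups of the same hostnames can also be absorbed by the JVM's own positive-lookup DNS cache, controlled by the `networkaddress.cache.ttl` security property. A minimal sketch follows; the class name `DnsCacheConfig` is illustrative, and the property must be set before the first lookup to take effect.)

```java
import java.security.Security;

public class DnsCacheConfig {
    // Raise the JVM's positive-DNS-lookup cache TTL (in seconds) so repeated
    // hostname resolutions are served from the in-process cache instead of
    // going back to the resolver. "-1" means cache forever.
    public static void enableDnsCaching(String ttlSeconds) {
        Security.setProperty("networkaddress.cache.ttl", ttlSeconds);
    }

    // Read back the currently configured TTL (may be null if never set).
    public static String currentTtl() {
        return Security.getProperty("networkaddress.cache.ttl");
    }

    public static void main(String[] args) {
        enableDnsCaching("300"); // cache successful lookups for 5 minutes
        System.out.println("networkaddress.cache.ttl = " + currentTtl());
    }
}
```

Note this only helps repeated lookups within one JVM; a host-level caching daemon such as nscd (mentioned earlier in the thread) covers all processes on the node.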



On Mon, Aug 12, 2013 at 1:29 PM, Patai Sangbutsarakum <
silvianhadoop@gmail.com> wrote:

> Ok, after some sweat, i think I found the root cause but still need
> another team to help me fix it.
> It lines on the DNS.  Somehow each of the tip:task line, through the
> tcpdump, i saw the dns query to dns server. Timestamp seems matches to me.
>
> 2013-08-11 20:39:23,493 INFO org.apache.hadoop.mapred.JobInProgress:
> tip:task_201308111631_0006_m_000000 has split on node:/rack1/host1
>
> 127 ms
>
> 2013-08-11 20:39:23,620 INFO org.apache.hadoop.mapred.JobInProgress:
> tip:task_201308111631_0006_m_000000 has split on node:/rack1/host2
>
> 126 ms
>
> 2013-08-11 20:39:23,746 INFO org.apache.hadoop.mapred.JobInProgress:
> tip:task_201308111631_0006_m_000000 has split on node:/rack2/host3
>
>
> 20:39:23.367337 IP jtk.53110 > dns1.domain: 41717+ A? host1. (37)
>
> 20:39:23.367345 IP jtk.53110 > dns1.domain: 7221+ AAAA? host1. (37)
>
> 20:39:23.493486 IP dns1.domain > jtk.53110: 7221* 0/1/0 (89)
>
> 20:39:23.493505 IP dns1.domain > : jtk.41717* 1/4/2 A xx.xx.xx.xx (189)
>
>
> 20:39:23.493766 IP jtk.48042 > dns1.domain: 35450+ A? host2. (37)
>
> 20:39:23.493774 IP jtk.48042 > dns1.domain: 56007+ AAAA? host2. (37)
>
> 20:39:23.619903 IP dns1.domain > jtk.48042: 35450* 1/4/2 A yy.yy.yy.yy
> (189)
>
> 20:39:23.619921 IP dns1.domain > jtk.48042: 56007* 0/1/0 (89)
>
>
> 20:39:23.620208 IP jtk.41237 > dns2.domain: 49511+ A? host3. (37)
>
> 20:39:23.620215 IP jtk.41237 > dns2.domain: 29199+ AAAA? host3. (37)
>
> 20:39:23.746358 IP dns2.domain > jtk.41237: 49511* 1/4/2 A zz.zz.zz.zz
> (189)
>
> I looked at the jobtracker log in other datacenter when submitted with the
> same query. Timestamp in each tip:task line is much much faster.
>
> The question that raise here is the job initialization is really request
> the DNS, if so is there any way to suppress that. topology file is already
> in place where name and ip are already there.
>
>
> Hope this make sense
>
> Patai
>
>
>
>
> On Fri, Aug 9, 2013 at 6:57 PM, Patai Sangbutsarakum <
> silvianhadoop@gmail.com> wrote:
>
>> Appreciate your input Bryant, i will try to reproduce and see the
>> namenode log before, while, and after it pause.
>> Wish me luck
>>
>>
>> On Fri, Aug 9, 2013 at 2:09 PM, Bryan Beaudreault <
>> bbeaudreault@hubspot.com> wrote:
>>
>>> When I've had problems with a slow jobtracker, i've found the issue to
>>> be one of the following two (so far) possibilities:
>>>
>>> - long GC pause (I'm guessing this is not it based on your email)
>>> - hdfs is slow
>>>
>>> I haven't dived into the code yet, but circumstantially I've found that
>>> when you submit a job the jobtracker needs to put a bunch of files in hdfs,
>>> such as the job.xml, the job jar, etc.  I'm not sure how this scales with
>>> larger and larger jobs, aside form the size of the splits serialization in
>>> the job.xml, but if your HDFS is slow for any reason it can cause pauses in
>>> your jobtracker.  This affects other jobs being able to submit, as well as
>>> the 50030 web ui.
>>>
>>> I'd take a look at your namenode logs.  When the jobtracker logs pause,
>>> do you see a corresponding pause in the namenode logs?  What gets spewed
>>> before and after that pause?
>>>
>>>
>>> On Fri, Aug 9, 2013 at 4:41 PM, Patai Sangbutsarakum <
>>> silvianhadoop@gmail.com> wrote:
>>>
>>>> A while back, i was fighting with the jobtracker page hangs when i
>>>> browse to http://jobtracker:50030 browser doesn't show jobs info as
>>>> usual which ends up because of allowing too much job history kept in
>>>> jobtracker.
>>>>
>>>> Currently, i am setting up a new cluster 40g heap on the namenode and
>>>> jobtracker in dedicated machines. Not fun part starts here; a developer
>>>> tried to test out the cluster by launching a 76k map job (the cluster has
>>>> around 6k-ish mappers)
>>>> Job initialization was success, and finished the job.
>>>>
>>>> However, before the job is actually running, i can't access to the
>>>> jobtracker page anymore same symptom as above.
>>>>
>>>> i see bunch of this in jobtracker log
>>>>
>>>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
>>>> tip:task_201307291733_0619_m_076796 has split on node: /rack/node
>>>> ..
>>>> ..
>>>> ..
>>>>
>>>> Until i see this
>>>>
>>>> INFO org.apache.hadoop.mapred.JobInProgress: job_201307291733_0619
>>>> LOCALITY_WAIT_FACTOR=1.0
>>>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
>>>> Job job_201307291733_0619 initialized successfully with 76797 map tasks and
>>>> 10 reduce tasks.
>>>>
>>>> that's when i can access to the jobtracker page again.
>>>>
>>>>
>>>> CPU on jobtracker is very little load, JTK's Heap is far from full like
>>>> 1ish gig from 40
>>>> network bandwidth is far from filled up.
>>>>
>>>> I'm running on 0.20.2 branch on CentOS6.4 with Java(TM) SE Runtime
>>>> Environment (build 1.6.0_32-b05)
>>>>
>>>>
>>>> What would be the root cause i should looking at or at least where to
>>>> start?
>>>>
>>>> Thanks you in advanced
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

Re: Jobtracker page hangs ..again.

Posted by Patai Sangbutsarakum <si...@gmail.com>.
Ok, after some sweat, I think I found the root cause, but I still need another
team to help me fix it.
It lies in the DNS. For each of the tip:task lines, tcpdump shows a DNS query
going out to the DNS server, and the timestamps seem to match.

2013-08-11 20:39:23,493 INFO org.apache.hadoop.mapred.JobInProgress:
tip:task_201308111631_0006_m_000000 has split on node:/rack1/host1

127 ms

2013-08-11 20:39:23,620 INFO org.apache.hadoop.mapred.JobInProgress:
tip:task_201308111631_0006_m_000000 has split on node:/rack1/host2

126 ms

2013-08-11 20:39:23,746 INFO org.apache.hadoop.mapred.JobInProgress:
tip:task_201308111631_0006_m_000000 has split on node:/rack2/host3


20:39:23.367337 IP jtk.53110 > dns1.domain: 41717+ A? host1. (37)

20:39:23.367345 IP jtk.53110 > dns1.domain: 7221+ AAAA? host1. (37)

20:39:23.493486 IP dns1.domain > jtk.53110: 7221* 0/1/0 (89)

20:39:23.493505 IP dns1.domain > jtk.53110: 41717* 1/4/2 A xx.xx.xx.xx (189)


20:39:23.493766 IP jtk.48042 > dns1.domain: 35450+ A? host2. (37)

20:39:23.493774 IP jtk.48042 > dns1.domain: 56007+ AAAA? host2. (37)

20:39:23.619903 IP dns1.domain > jtk.48042: 35450* 1/4/2 A yy.yy.yy.yy (189)

20:39:23.619921 IP dns1.domain > jtk.48042: 56007* 0/1/0 (89)


20:39:23.620208 IP jtk.41237 > dns2.domain: 49511+ A? host3. (37)

20:39:23.620215 IP jtk.41237 > dns2.domain: 29199+ AAAA? host3. (37)

20:39:23.746358 IP dns2.domain > jtk.41237: 49511* 1/4/2 A zz.zz.zz.zz (189)

I looked at the jobtracker log in another datacenter when submitting the same
query; the timestamp gap between tip:task lines there is much, much smaller.

The question this raises is whether job initialization really queries DNS, and
if so, whether there is any way to suppress that. The topology file is already
in place, with both names and IPs.
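(Editor's note: the ~126 ms per-host gaps above can be reproduced outside Hadoop by timing forward lookups directly, which isolates resolver latency from anything the JobTracker is doing. A minimal sketch follows; the class name `LookupTimer` is illustrative, and `localhost` stands in for a real cluster hostname such as host1.)

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class LookupTimer {
    // Time a single forward (name -> address) resolution and report it
    // in milliseconds. On a healthy resolver this should be sub-millisecond
    // for cached entries; ~100 ms per lookup adds up fast across 76k tasks.
    public static long timeLookupMillis(String host) throws UnknownHostException {
        long start = System.nanoTime();
        InetAddress addr = InetAddress.getByName(host);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(host + " -> " + addr.getHostAddress()
                + " in " + elapsedMs + " ms");
        return elapsedMs;
    }

    public static void main(String[] args) throws UnknownHostException {
        timeLookupMillis("localhost");
    }
}
```

Running this twice in the same JVM also shows whether the JDK-level lookup cache is taking effect (the second call should be near-instant).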


Hope this makes sense

Patai




On Fri, Aug 9, 2013 at 6:57 PM, Patai Sangbutsarakum <
silvianhadoop@gmail.com> wrote:

> Appreciate your input Bryant, i will try to reproduce and see the namenode
> log before, while, and after it pause.
> Wish me luck
>
>
> On Fri, Aug 9, 2013 at 2:09 PM, Bryan Beaudreault <
> bbeaudreault@hubspot.com> wrote:
>
>> When I've had problems with a slow jobtracker, i've found the issue to be
>> one of the following two (so far) possibilities:
>>
>> - long GC pause (I'm guessing this is not it based on your email)
>> - hdfs is slow
>>
>> I haven't dived into the code yet, but circumstantially I've found that
>> when you submit a job the jobtracker needs to put a bunch of files in hdfs,
>> such as the job.xml, the job jar, etc.  I'm not sure how this scales with
>> larger and larger jobs, aside form the size of the splits serialization in
>> the job.xml, but if your HDFS is slow for any reason it can cause pauses in
>> your jobtracker.  This affects other jobs being able to submit, as well as
>> the 50030 web ui.
>>
>> I'd take a look at your namenode logs.  When the jobtracker logs pause,
>> do you see a corresponding pause in the namenode logs?  What gets spewed
>> before and after that pause?
>>
>>
>> On Fri, Aug 9, 2013 at 4:41 PM, Patai Sangbutsarakum <
>> silvianhadoop@gmail.com> wrote:
>>
>>> A while back, i was fighting with the jobtracker page hangs when i
>>> browse to http://jobtracker:50030 browser doesn't show jobs info as
>>> usual which ends up because of allowing too much job history kept in
>>> jobtracker.
>>>
>>> Currently, i am setting up a new cluster 40g heap on the namenode and
>>> jobtracker in dedicated machines. Not fun part starts here; a developer
>>> tried to test out the cluster by launching a 76k map job (the cluster has
>>> around 6k-ish mappers)
>>> Job initialization was success, and finished the job.
>>>
>>> However, before the job is actually running, i can't access to the
>>> jobtracker page anymore same symptom as above.
>>>
>>> i see bunch of this in jobtracker log
>>>
>>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
>>> tip:task_201307291733_0619_m_076796 has split on node: /rack/node
>>> ..
>>> ..
>>> ..
>>>
>>> Until i see this
>>>
>>> INFO org.apache.hadoop.mapred.JobInProgress: job_201307291733_0619
>>> LOCALITY_WAIT_FACTOR=1.0
>>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress: Job
>>> job_201307291733_0619 initialized successfully with 76797 map tasks and 10
>>> reduce tasks.
>>>
>>> that's when i can access to the jobtracker page again.
>>>
>>>
>>> CPU on jobtracker is very little load, JTK's Heap is far from full like
>>> 1ish gig from 40
>>> network bandwidth is far from filled up.
>>>
>>> I'm running on 0.20.2 branch on CentOS6.4 with Java(TM) SE Runtime
>>> Environment (build 1.6.0_32-b05)
>>>
>>>
>>> What would be the root cause i should looking at or at least where to
>>> start?
>>>
>>> Thanks you in advanced
>>>
>>>
>>>
>>>
>>
>


Re: Jobtracker page hangs ..again.

Posted by Patai Sangbutsarakum <si...@gmail.com>.
Appreciate your input, Bryan. I will try to reproduce this and watch the
namenode log before, during, and after the pause.
Wish me luck


On Fri, Aug 9, 2013 at 2:09 PM, Bryan Beaudreault
<bb...@hubspot.com>wrote:

> When I've had problems with a slow jobtracker, i've found the issue to be
> one of the following two (so far) possibilities:
>
> - long GC pause (I'm guessing this is not it based on your email)
> - hdfs is slow
>
> I haven't dived into the code yet, but circumstantially I've found that
> when you submit a job the jobtracker needs to put a bunch of files in hdfs,
> such as the job.xml, the job jar, etc.  I'm not sure how this scales with
> larger and larger jobs, aside form the size of the splits serialization in
> the job.xml, but if your HDFS is slow for any reason it can cause pauses in
> your jobtracker.  This affects other jobs being able to submit, as well as
> the 50030 web ui.
>
> I'd take a look at your namenode logs.  When the jobtracker logs pause, do
> you see a corresponding pause in the namenode logs?  What gets spewed
> before and after that pause?
>
>
> On Fri, Aug 9, 2013 at 4:41 PM, Patai Sangbutsarakum <
> silvianhadoop@gmail.com> wrote:
>
>> A while back, i was fighting with the jobtracker page hangs when i browse
>> to http://jobtracker:50030 browser doesn't show jobs info as usual which
>> ends up because of allowing too much job history kept in jobtracker.
>>
>> Currently, i am setting up a new cluster 40g heap on the namenode and
>> jobtracker in dedicated machines. Not fun part starts here; a developer
>> tried to test out the cluster by launching a 76k map job (the cluster has
>> around 6k-ish mappers)
>> Job initialization was success, and finished the job.
>>
>> However, before the job is actually running, i can't access to the
>> jobtracker page anymore same symptom as above.
>>
>> i see bunch of this in jobtracker log
>>
>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
>> tip:task_201307291733_0619_m_076796 has split on node: /rack/node
>> ..
>> ..
>> ..
>>
>> Until i see this
>>
>> INFO org.apache.hadoop.mapred.JobInProgress: job_201307291733_0619
>> LOCALITY_WAIT_FACTOR=1.0
>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress: Job
>> job_201307291733_0619 initialized successfully with 76797 map tasks and 10
>> reduce tasks.
>>
>> that's when i can access to the jobtracker page again.
>>
>>
>> CPU on jobtracker is very little load, JTK's Heap is far from full like
>> 1ish gig from 40
>> network bandwidth is far from filled up.
>>
>> I'm running on 0.20.2 branch on CentOS6.4 with Java(TM) SE Runtime
>> Environment (build 1.6.0_32-b05)
>>
>>
>> What would be the root cause i should looking at or at least where to
>> start?
>>
>> Thanks you in advanced
>>
>>
>>
>>
>

Re: Jobtracker page hangs ..again.

Posted by Patai Sangbutsarakum <si...@gmail.com>.
Appreciate your input Bryant, i will try to reproduce and see the namenode
log before, while, and after it pause.
Wish me luck


On Fri, Aug 9, 2013 at 2:09 PM, Bryan Beaudreault
<bb...@hubspot.com>wrote:

> When I've had problems with a slow jobtracker, i've found the issue to be
> one of the following two (so far) possibilities:
>
> - long GC pause (I'm guessing this is not it based on your email)
> - hdfs is slow
>
> I haven't dived into the code yet, but circumstantially I've found that
> when you submit a job the jobtracker needs to put a bunch of files in hdfs,
> such as the job.xml, the job jar, etc.  I'm not sure how this scales with
> larger and larger jobs, aside form the size of the splits serialization in
> the job.xml, but if your HDFS is slow for any reason it can cause pauses in
> your jobtracker.  This affects other jobs being able to submit, as well as
> the 50030 web ui.
>
> I'd take a look at your namenode logs.  When the jobtracker logs pause, do
> you see a corresponding pause in the namenode logs?  What gets spewed
> before and after that pause?
>
>
> On Fri, Aug 9, 2013 at 4:41 PM, Patai Sangbutsarakum <
> silvianhadoop@gmail.com> wrote:
>
>> A while back, i was fighting with the jobtracker page hangs when i browse
>> to http://jobtracker:50030 browser doesn't show jobs info as usual which
>> ends up because of allowing too much job history kept in jobtracker.
>>
>> Currently, i am setting up a new cluster 40g heap on the namenode and
>> jobtracker in dedicated machines. Not fun part starts here; a developer
>> tried to test out the cluster by launching a 76k map job (the cluster has
>> around 6k-ish mappers)
>> Job initialization was success, and finished the job.
>>
>> However, before the job is actually running, i can't access to the
>> jobtracker page anymore same symptom as above.
>>
>> i see bunch of this in jobtracker log
>>
>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
>> tip:task_201307291733_0619_m_076796 has split on node: /rack/node
>> ..
>> ..
>> ..
>>
>> Until i see this
>>
>> INFO org.apache.hadoop.mapred.JobInProgress: job_201307291733_0619
>> LOCALITY_WAIT_FACTOR=1.0
>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress: Job
>> job_201307291733_0619 initialized successfully with 76797 map tasks and 10
>> reduce tasks.
>>
>> that's when i can access to the jobtracker page again.
>>
>>
>> CPU on jobtracker is very little load, JTK's Heap is far from full like
>> 1ish gig from 40
>> network bandwidth is far from filled up.
>>
>> I'm running on 0.20.2 branch on CentOS6.4 with Java(TM) SE Runtime
>> Environment (build 1.6.0_32-b05)
>>
>>
>> What would be the root cause i should looking at or at least where to
>> start?
>>
>> Thanks you in advanced
>>
>>
>>
>>
>

Re: Jobtracker page hangs ..again.

Posted by Patai Sangbutsarakum <si...@gmail.com>.
Appreciate your input Bryant, i will try to reproduce and see the namenode
log before, while, and after it pause.
Wish me luck


On Fri, Aug 9, 2013 at 2:09 PM, Bryan Beaudreault
<bb...@hubspot.com>wrote:

> When I've had problems with a slow jobtracker, i've found the issue to be
> one of the following two (so far) possibilities:
>
> - long GC pause (I'm guessing this is not it based on your email)
> - hdfs is slow
>
> I haven't dived into the code yet, but circumstantially I've found that
> when you submit a job the jobtracker needs to put a bunch of files in hdfs,
> such as the job.xml, the job jar, etc.  I'm not sure how this scales with
> larger and larger jobs, aside form the size of the splits serialization in
> the job.xml, but if your HDFS is slow for any reason it can cause pauses in
> your jobtracker.  This affects other jobs being able to submit, as well as
> the 50030 web ui.
>
> I'd take a look at your namenode logs.  When the jobtracker logs pause, do
> you see a corresponding pause in the namenode logs?  What gets spewed
> before and after that pause?
>
>
> On Fri, Aug 9, 2013 at 4:41 PM, Patai Sangbutsarakum <
> silvianhadoop@gmail.com> wrote:
>
>> A while back, I was fighting with the jobtracker page hanging: when I
>> browse to http://jobtracker:50030, the browser doesn't show job info as
>> usual. That turned out to be because too much job history was kept in
>> the jobtracker.
>>
>> Currently, I am setting up a new cluster with a 40 GB heap on the
>> namenode and jobtracker, each on a dedicated machine. The not-fun part
>> starts here: a developer tried to test out the cluster by launching a
>> 76k-map job (the cluster has around 6k map slots).
>> Job initialization succeeded, and the job finished.
>>
>> However, before the job actually starts running, I can't access the
>> jobtracker page anymore; same symptom as above.
>>
>> I see a bunch of these in the jobtracker log:
>>
>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
>> tip:task_201307291733_0619_m_076796 has split on node: /rack/node
>> ..
>> ..
>> ..
>>
>> until I see this:
>>
>> INFO org.apache.hadoop.mapred.JobInProgress: job_201307291733_0619
>> LOCALITY_WAIT_FACTOR=1.0
>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress: Job
>> job_201307291733_0619 initialized successfully with 76797 map tasks and
>> 10 reduce tasks.
>>
>> That's when I can access the jobtracker page again.
>>
>> CPU load on the jobtracker is very low, the JT's heap is far from full
>> (about 1 GB of 40), and network bandwidth is far from saturated.
>>
>> I'm running the 0.20.2 branch on CentOS 6.4 with Java(TM) SE Runtime
>> Environment (build 1.6.0_32-b05).
>>
>> What root cause should I be looking at, or at least where should I
>> start?
>>
>> Thank you in advance.
>>
>>
>>
>>
>

Re: Jobtracker page hangs ..again.

Posted by Bryan Beaudreault <bb...@hubspot.com>.
When I've had problems with a slow jobtracker, I've found the issue to be
one of the following two (so far) possibilities:

- a long GC pause (I'm guessing this is not it, based on your email)
- HDFS is slow

I haven't dug into the code yet, but circumstantially I've found that when
you submit a job the jobtracker needs to put a bunch of files in HDFS,
such as the job.xml, the job jar, etc.  I'm not sure how this scales with
larger and larger jobs, aside from the size of the splits serialization in
the job.xml, but if your HDFS is slow for any reason it can cause pauses
in your jobtracker.  This affects other jobs being able to submit, as well
as the 50030 web UI.

I'd take a look at your namenode logs.  When the jobtracker logs pause, do
you see a corresponding pause in the namenode logs?  What gets spewed
before and after that pause?
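For a sense of scale on that splits point: a rough back-of-envelope sketch
of how split metadata grows with map count. The per-entry byte counts here
are my own guesses for illustration, not measured numbers from the
serialization format:

```python
def split_meta_bytes(num_splits, locations_per_split=3,
                     bytes_per_location=50, overhead_per_split=40):
    """Rough size of serialized split info: each split typically carries
    one host/rack location string per replica plus some bookkeeping
    overhead (offsets, lengths, class names)."""
    per_split = locations_per_split * bytes_per_location + overhead_per_split
    return num_splits * per_split

# A 76,797-map job vs. a more typical 1,000-map job, under these guesses:
big = split_meta_bytes(76797)    # ~14 MB of split metadata
small = split_meta_bytes(1000)   # ~190 KB
```

Even with generous assumptions this stays in the tens of megabytes, which
is why a slow HDFS round-trip, rather than raw data volume, is the more
likely culprit for a multi-minute stall.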


On Fri, Aug 9, 2013 at 4:41 PM, Patai Sangbutsarakum <
silvianhadoop@gmail.com> wrote:

> A while back, I was fighting with the jobtracker page hanging: when I
> browse to http://jobtracker:50030, the browser doesn't show job info as
> usual. That turned out to be because too much job history was kept in
> the jobtracker.
>
> Currently, I am setting up a new cluster with a 40 GB heap on the
> namenode and jobtracker, each on a dedicated machine. The not-fun part
> starts here: a developer tried to test out the cluster by launching a
> 76k-map job (the cluster has around 6k map slots).
> Job initialization succeeded, and the job finished.
>
> However, before the job actually starts running, I can't access the
> jobtracker page anymore; same symptom as above.
>
> I see a bunch of these in the jobtracker log:
>
> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
> tip:task_201307291733_0619_m_076796 has split on node: /rack/node
> ..
> ..
> ..
>
> until I see this:
>
> INFO org.apache.hadoop.mapred.JobInProgress: job_201307291733_0619
> LOCALITY_WAIT_FACTOR=1.0
> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress: Job
> job_201307291733_0619 initialized successfully with 76797 map tasks and
> 10 reduce tasks.
>
> That's when I can access the jobtracker page again.
>
> CPU load on the jobtracker is very low, the JT's heap is far from full
> (about 1 GB of 40), and network bandwidth is far from saturated.
>
> I'm running the 0.20.2 branch on CentOS 6.4 with Java(TM) SE Runtime
> Environment (build 1.6.0_32-b05).
>
> What root cause should I be looking at, or at least where should I
> start?
>
> Thank you in advance.
>
>
>
>
