Posted to common-user@hadoop.apache.org by Silnov <si...@sina.com> on 2016/02/24 07:52:16 UTC

MapReduce job makes no progress for a very long time after one node becomes unusable.

Hello everyone! I am new to Hadoop, and I have a question I hope you can help with.

I have several nodes running Hadoop 2.6.0, with the cluster configuration largely left at the defaults.
Every day I run jobs on the cluster, including some that process a lot of data.
Sometimes a job stays at the same progress value for a very long time, so I kill it manually and re-submit it to the cluster. That always worked before (the re-submitted job ran to completion), but today something went wrong.
After I re-submitted the same job three times, every run deadlocked: the progress stopped changing for a long time, stalling at a different value each time (e.g. 33.01%, 45.8%, 73.21%). I checked the Hadoop web UI and found 98 map tasks pending while the running reduce tasks had consumed all the available memory. I stopped YARN, added the configuration below to yarn-site.xml, and restarted YARN.
<property>
  <name>yarn.app.mapreduce.am.job.reduce.rampup.limit</name>
  <value>0.1</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.job.reduce.preemption.limit</name>
  <value>1.0</value>
</property>
(The intent was for YARN to preempt the reduce tasks' resources so the pending map tasks could run.)
After restarting YARN, I submitted the job with the property mapreduce.job.reduce.slowstart.completedmaps=1, but the same thing happened again: the job stayed at the same progress value for a very long time. I checked the web UI again and found that the pending map tasks had been re-created, each with the note: "TaskAttempt killed because it ran on unusable node node02:21349".
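For clarity on what that property means: slowstart is the fraction of map tasks that must complete before the ApplicationMaster starts requesting reduce containers, so a value of 1 means reduces should only start once every map has finished. A toy sketch of that semantics in plain shell (not the real AM code; the task counts are hypothetical):

```shell
# Toy illustration of mapreduce.job.reduce.slowstart.completedmaps (not
# Hadoop code): with slowstart = 1.0, no reduce container should be
# requested until every map has completed.
completed_maps=98
total_maps=196          # hypothetical counts for illustration
slowstart_pct=100       # slowstart 1.0, expressed as a percent

if [ $((completed_maps * 100 / total_maps)) -lt "$slowstart_pct" ]; then
  echo "hold reduces: only maps should be scheduled"
else
  echo "reduces may ramp up"
fi
```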
Then I checked the ResourceManager's log and found these messages:
******Deactivating Node node02:21349 as it is now LOST.
******node02:21349 Node Transitioned from RUNNING to LOST.
I think this happened because the network across my cluster is unreliable, so the RM did not receive the NM's heartbeats in time. But I wonder why the YARN framework cannot preempt the running reduce tasks' resources to run the pending map tasks (this is what keeps the job stuck at the same progress value for so long). Can anyone help?
Thank you very much!

Re: MapReduce job makes no progress for a very long time after one node becomes unusable.

Posted by Namikaze Minato <ll...@gmail.com>.
Hello.

In very rare cases I have seen map and/or reduce tasks stop, or continue at
a very slow pace (0.1% in an hour). A piece of advice: instead of killing the
whole job, kill only the slow/stuck task; it will be restarted and will
(hopefully) not get stuck this time.
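If it helps, this can be done from the command line with `mapred job`; the attempt ID below is hypothetical (copy the real one from the stuck task's page in the job web UI):

```shell
# Hypothetical attempt ID -- take the real one from the stuck task's page
# in the MapReduce job web UI.
ATTEMPT="attempt_1456297936103_0042_m_000007_0"

# Kill just this attempt (does not count against mapreduce.map.maxattempts):
#   mapred job -kill-task "$ATTEMPT"
# Or fail it, which also reschedules it but counts toward the retry limit:
#   mapred job -fail-task "$ATTEMPT"
echo "mapred job -kill-task $ATTEMPT"
```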

Regards,
LLoyd

On 24 February 2016 at 08:56, Varun saxena <va...@huawei.com> wrote:

> Hi Silnov,
>
>
>
> Can you check your AM logs and compare them with the MAPREDUCE-6513 scenario?
>
> I suspect it is the same issue.
>
> MAPREDUCE-6513 is marked to go into 2.7.3.
>
>
>
> Regards,
>
> Varun Saxena.


RE: MapReduce job makes no progress for a very long time after one node becomes unusable.

Posted by Varun saxena <va...@huawei.com>.
Hi Silnov,

Can you check your AM logs and compare them with the MAPREDUCE-6513 scenario?
I suspect it is the same issue.
MAPREDUCE-6513 is marked to go into 2.7.3.
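For reference, the aggregated AM logs can be pulled with the yarn CLI once the application has finished; the application ID below is hypothetical (take the real one from the RM web UI or `yarn application -list`), and log aggregation must be enabled:

```shell
# Hypothetical application ID -- take the real one from the RM web UI or
# `yarn application -list`.
APP_ID="application_1456297936103_0042"

# Fetch the aggregated logs (requires yarn.log-aggregation-enable=true),
# then look for the killed-attempt messages mentioned in this thread:
#   yarn logs -applicationId "$APP_ID" | grep -i "unusable node"
echo "yarn logs -applicationId $APP_ID"
```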

Regards,
Varun Saxena.


From: Silnov [mailto:silnov@sina.com]
Sent: 24 February 2016 14:52
To: user
Subject: MapReduce job makes no progress for a very long time after one node becomes unusable.


