Posted to dev@mesos.apache.org by 王国栋 <wa...@gmail.com> on 2013/05/23 08:50:30 UTC

slave don't update task status correctly

Hi folks,

While testing the Hadoop framework, I found that one of the slaves does
not report task status to the master correctly. As a result, the slave
hangs and cannot launch a new TaskTracker for the incoming job.

The slave's web UI looks like this. It shows that the slave believes
Task_Tracker_93 and Task_Tracker_102 are still running, but these two
TaskTrackers have actually shut down.

Executors

ID                         Name                Source            Active Tasks  Queued Tasks  CPUs (Used / Allocated)  Mem (Used / Allocated)  Sandbox
executor_Task_Tracker_93   Hadoop TaskTracker  Task_Tracker_93   0             0             6.635 / 1                5 GB / 1 GB             browse
executor_Task_Tracker_124  Hadoop TaskTracker  Task_Tracker_124  0             1             / 1                      / 1 GB                  browse
executor_Task_Tracker_115  Hadoop TaskTracker  Task_Tracker_115  0             1             / 1                      / 1 GB                  browse
executor_Task_Tracker_102  Hadoop TaskTracker  Task_Tracker_102  0             0             8.886 / 1                6 GB / 1 GB             browse

Links (executor page, then sandbox "browse" page):

executor_Task_Tracker_93
<http://hd1dz.prod.mediav.com:5050/#/slaves/201305221443-252063498-5050-2128-6/frameworks/201305221443-252063498-5050-2128-0000/executors/executor_Task_Tracker_93>
<http://hd1dz.prod.mediav.com:5050/#/slaves/201305221443-252063498-5050-2128-6/browse?path=%2Fdata%2Fmesos-slave-work-dir%2Fslaves%2F201305221443-252063498-5050-2128-6%2Fframeworks%2F201305221443-252063498-5050-2128-0000%2Fexecutors%2Fexecutor_Task_Tracker_93%2Fruns%2F19a8d258-fde4-43dc-80be-4280cec442bb>

executor_Task_Tracker_124
<http://hd1dz.prod.mediav.com:5050/#/slaves/201305221443-252063498-5050-2128-6/frameworks/201305221443-252063498-5050-2128-0000/executors/executor_Task_Tracker_124>
<http://hd1dz.prod.mediav.com:5050/#/slaves/201305221443-252063498-5050-2128-6/browse?path=%2Fdata%2Fmesos-slave-work-dir%2Fslaves%2F201305221443-252063498-5050-2128-6%2Fframeworks%2F201305221443-252063498-5050-2128-0000%2Fexecutors%2Fexecutor_Task_Tracker_124%2Fruns%2F6c4f0869-facd-4711-b962-8603d4023647>

executor_Task_Tracker_115
<http://hd1dz.prod.mediav.com:5050/#/slaves/201305221443-252063498-5050-2128-6/frameworks/201305221443-252063498-5050-2128-0000/executors/executor_Task_Tracker_115>
<http://hd1dz.prod.mediav.com:5050/#/slaves/201305221443-252063498-5050-2128-6/browse?path=%2Fdata%2Fmesos-slave-work-dir%2Fslaves%2F201305221443-252063498-5050-2128-6%2Fframeworks%2F201305221443-252063498-5050-2128-0000%2Fexecutors%2Fexecutor_Task_Tracker_115%2Fruns%2F624ce615-4a25-4d15-b658-ed0b2172919a>

executor_Task_Tracker_102
<http://hd1dz.prod.mediav.com:5050/#/slaves/201305221443-252063498-5050-2128-6/frameworks/201305221443-252063498-5050-2128-0000/executors/executor_Task_Tracker_102>
<http://hd1dz.prod.mediav.com:5050/#/slaves/201305221443-252063498-5050-2128-6/browse?path=%2Fdata%2Fmesos-slave-work-dir%2Fslaves%2F201305221443-252063498-5050-2128-6%2Fframeworks%2F201305221443-252063498-5050-2128-0000%2Fexecutors%2Fexecutor_Task_Tracker_102%2Fruns%2F242f8569-c112-40d2-9709-e6f758fbfb05>

I checked the log for Task_Tracker_93, and I am sure it shows that the
TaskTracker shut down gracefully. The log is as follows:


13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000233_0 not found in cache
13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000075_0 not found in cache
13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000041_0 not found in cache
13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000003_0 not found in cache
13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000226_0 not found in cache
13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000223_0 not found in cache
13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000246_0 not found in cache
13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000124_0 not found in cache
13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000019_0 not found in cache
13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000169_0 not found in cache
13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000134_0 not found in cache
13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000076_0 not found in cache
13/05/23 09:50:30 INFO mapred.UserLogCleaner: Adding job_201305221443_0242 for user-log deletion with retainTimeStamp:1369360230977
13/05/23 09:50:30 INFO util.AsyncDiskService: Shutting down all AsyncDiskService threads...
13/05/23 09:50:30 INFO util.AsyncDiskService: All AsyncDiskService threads are terminated.
13/05/23 09:50:30 INFO util.MRAsyncDiskService: Deleting toBeDeleted directory.


Is this related to the "--executor_shutdown_grace_period" parameter? I can
see that its default value is 5 seconds. What happens if the executor takes
longer than 5 seconds to shut down?


Thanks.

Guodong

Re: slave don't update task status correctly

Posted by 王国栋 <wa...@gmail.com>.

Hi Vinod,

I found some WARNING entries in the slave log. I think the problem may be
related to "Failed to collect resource usage for executor".

The entries look like this:

W0523 00:19:35.861829 14117 process_isolator.cpp:402] Failed to get status of descendant process 18874 of parent 20448: Failed to open '/proc/18874/stat'
W0523 00:49:54.263650 14118 process_isolator.cpp:402] Failed to get status of descendant process 24563 of parent 20990: Failed to open '/proc/24563/stat'
W0523 01:00:52.704118 14108 process_isolator.cpp:402] Failed to get status of descendant process 5462 of parent 20990: Failed to open '/proc/5462/stat'
W0523 04:25:26.100183 14116 monitor.cpp:167] Failed to collect resource usage for executor 'executor_Task_Tracker_93' of framework '201305221443-252063498-5050-2128-0000': 0
W0523 04:25:27.095105 14106 monitor.cpp:167] Failed to collect resource usage for executor 'executor_Task_Tracker_102' of framework '201305221443-252063498-5050-2128-0000': 0
W0523 04:25:31.101133 14106 monitor.cpp:167] Failed to collect resource usage for executor 'executor_Task_Tracker_93' of framework '201305221443-252063498-5050-2128-0000': 0
W0523 04:25:32.096012 14106 monitor.cpp:167] Failed to collect resource usage for executor 'executor_Task_Tracker_102' of framework '201305221443-252063498-5050-2128-0000': 0
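
One plausible reading of the process_isolator.cpp warnings (not confirmed anywhere in this thread) is that a descendant process exited between being listed and having its /proc/<pid>/stat file opened. The sketch below is not Mesos code; it is a minimal Python illustration of reading /proc/<pid>/stat while treating a vanished process as a normal outcome. The function names parse_stat_line and read_proc_stat and the proc_root parameter are hypothetical.

```python
import os


def parse_stat_line(line):
    """Extract (pid, comm, state, ppid) from one /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the LAST ')' instead of whitespace.
    """
    head, _, tail = line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    fields = tail.split()
    return int(pid_text), comm, fields[0], int(fields[1])


def read_proc_stat(pid, proc_root="/proc"):
    """Return the parsed stat fields for pid, or None if the process is gone.

    A process can exit at any moment, so a missing stat file is expected
    and is reported as None rather than raised as an error.
    """
    path = os.path.join(proc_root, str(pid), "stat")
    try:
        with open(path) as stat_file:
            return parse_stat_line(stat_file.read())
    except (FileNotFoundError, ProcessLookupError):
        # FileNotFoundError: the process exited before the file was opened.
        # ProcessLookupError: it exited between open() and read().
        return None
```

A caller that walks a process tree could skip any PID for which read_proc_stat returns None instead of failing the whole collection pass. Whether that is how the slave should behave here is a separate question for the Mesos developers.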



Guodong


On Fri, May 24, 2013 at 1:48 AM, Vinod Kone <vi...@gmail.com> wrote:

> Unfortunately, I cannot access the web links you pasted. It would be much
> better if you could paste the slave logs so that I can diagnose the problem.
>
> > Is this related to this param "--executor_shutdown_grace_period", I can see
> > the default value is 5 seconds, if the executor shutdown after 5 seconds,
> > what will happen then?
> >
> >
> The grace period applies when the slave tries to shut down an executor. The
> slave typically shuts down an executor if the framework is shutting down or
> if the slave itself is shutting down. After sending a shutdown request to
> the executor, the slave expects the executor process to terminate. If the
> process does not terminate within the "executor_shutdown_grace_period"
> duration, the slave issues a Unix 'kill'.
>
>
>
> >
> > Thanks.
> >
> > Guodong
> >
>

Re: slave don't update task status correctly

Posted by Vinod Kone <vi...@gmail.com>.

Unfortunately, I cannot access the web links you pasted. It would be much
better if you could paste the slave logs so that I can diagnose the problem.

> Is this related to this param "--executor_shutdown_grace_period", I can see
> the default value is 5 seconds, if the executor shutdown after 5 seconds,
> what will happen then?
>
>
The grace period applies when the slave tries to shut down an executor. The
slave typically shuts down an executor if the framework is shutting down or
if the slave itself is shutting down. After sending a shutdown request to
the executor, the slave expects the executor process to terminate. If the
process does not terminate within the "executor_shutdown_grace_period"
duration, the slave issues a Unix 'kill'.
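
The "ask first, then kill after a grace period" pattern described above can be sketched in a few lines of Python. This is only an illustration of the general pattern, not the slave's implementation: the function name shutdown_with_grace is hypothetical, and terminate() (SIGTERM on POSIX) stands in for the shutdown request that the slave sends to the executor.

```python
import subprocess
import sys


def shutdown_with_grace(proc, grace_seconds):
    """Ask proc to exit, then force-kill it if it outlives the grace period.

    Returns "terminated" if the process exited within grace_seconds,
    or "killed" if it had to be killed.
    """
    proc.terminate()  # polite request; stand-in for a shutdown message
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; cannot be caught or ignored
        proc.wait()  # reap the process so it does not linger as a zombie
        return "killed"


if __name__ == "__main__":
    # Demo: a child that would sleep for a minute if left alone.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    print(shutdown_with_grace(child, grace_seconds=5.0))
```

With a 5 second grace period, as in the default mentioned in the question, a process that ignores the polite request is killed roughly 5 seconds after the request is sent.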



>
> Thanks.
>
> Guodong
>