Posted to common-user@hadoop.apache.org by patrick sang <si...@gmail.com> on 2011/10/03 22:53:15 UTC
Monitoring Slow job.
Hi Hadoopers,
I am writing a script to detect whether any running job has been running
longer than X hours.
So far, I use
./hadoop job -jt jobtracker:port -list all |awk '{ if($2==1) print $1 }'
---> to get the list of running JobID
I am looking for a way to get how long a job has been running from its jobId.
In the web admin page (http://job:50030) we can easily see the duration of
each running job from its *Started at:* and *Running for:* fields.
How do we get such information from the command line?
Hope this makes sense.
Thank you in advance,
-P
Re: Monitoring Slow job.
Posted by patrick sang <si...@gmail.com>.
Hi all,
Thanks, Vitthal; your reply led me to the solution.
Here is what I did, hopefully it counts as a contribution back to the
community.
1. ./hadoop job -list all |awk '{ if($2==1) print $1 }'
---> to get the list of running JobID
2. ./hadoop job -status JobID
--> file: hdfs://xxx.xx.xxx/zz/aa/job.xml
3. ./hadoop fs -cat hdfs://xxx.xx.xxx/zz/aa/job.xml
--> here we get mapred.output.dir
4. ./hadoop job -history <mapred.output.dir>
--> Launched At:
5. Get the time diff between "launched at" and now.
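Strung together, the five steps above can be sketched as one script. This is a sketch, not a tested tool: the exact "file:" and "Launched At:" labels, the one-line name/value layout of job.xml, and the MAX_HOURS threshold are all assumptions here, and the date arithmetic needs GNU date.

```shell
#!/bin/sh
# Sketch of steps 1-5. Assumes 0.20-era CLI output labels ("file:",
# "Launched At:"), job.xml name/value pairs on one line, and GNU date.
check_slow_jobs() {
    max_hours=${1:-6}   # invented threshold; tune to taste

    # step 1: running JobIDs (field 2 == 1 means RUNNING in -list output)
    hadoop job -list all | awk '{ if ($2 == 1) print $1 }' |
    while read -r jobid; do
        # step 2: path of job.xml from -status (label may differ by version)
        conf=$(hadoop job -status "$jobid" | sed -n 's/^.*file: //p')

        # step 3: pull mapred.output.dir out of the job configuration
        outdir=$(hadoop fs -cat "$conf" |
            sed -n 's:.*mapred\.output\.dir</name><value>\([^<]*\)</value>.*:\1:p')

        # step 4: "Launched At: 3-Oct-2011 17:31:17 (0sec)" -> "3 Oct 2011 17:31:17"
        launched=$(hadoop job -history "$outdir" |
            sed -n 's/^Launched At: \(.*\) (.*$/\1/p' | tr '-' ' ')

        # step 5: elapsed whole hours between launch time and now
        hours=$(( ($(date +%s) - $(date -d "$launched" +%s)) / 3600 ))
        if [ "$hours" -ge "$max_hours" ]; then
            echo "SLOW: $jobid running for ${hours}h"
        fi
    done
}

# guarded so the sketch is harmless where hadoop is absent
if command -v hadoop >/dev/null 2>&1; then
    check_slow_jobs 6
fi
```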
It would be much easier to just scrape
http://jobtracker:50030/jobdetails.jsp?jobid=jobId
or there might be some more elegant way of getting this; it's just how I did
it this time.
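For the record, that scrape could be as small as the sketch below. The "Running for:" label comes from the web UI, but I have not checked the actual markup of jobdetails.jsp, so treat the grep pattern (and the jobtracker:50030 host) as placeholders.

```shell
# Hypothetical scrape of the job tracker's job page; the markup around
# "Running for:" is an assumption, not verified against jobdetails.jsp.
running_for() {
    jobid=$1
    curl -s "http://jobtracker:50030/jobdetails.jsp?jobid=$jobid" |
        grep -o 'Running for:[^<]*'
}
```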
Cheers,
P
On Mon, Oct 3, 2011 at 5:41 PM, Vitthal "Suhas" Gogate <
gogate@hortonworks.com> wrote:
> [...]
Re: Monitoring Slow job.
Posted by "Vitthal \"Suhas\" Gogate" <go...@hortonworks.com>.
I am not sure there is an easy way to get what you want on the command line. One
option is to use the following command, which gives you a verbose job history
where you can find the Submit, Launch, and Finish times (including the duration
on the Finished At line). I am using the hadoop-0.20.205.0 branch, so check
whether the version of Hadoop you are using supports this option.
Here is sample output for my wordcount program:
bin/hadoop job -history <job_output_directory_on_hdfs>
======================================
horton-mac:hadoop-0.20.205.0 vgogate$ bin/hadoop job -history output
Warning: $HADOOP_HOME is deprecated.
Hadoop job: 0001_1317688277686_vgogate
=====================================
Job tracker host name: job
job tracker start time: Sun May 16 08:53:51 PDT 1976
User: vgogate
JobName: word count
JobConf:
hdfs://horton-mac.local:54310/tmp/mapred/staging/vgogate/.staging/job_201110031726_0001/job.xml
Submitted At: 3-Oct-2011 17:31:17
Launched At: 3-Oct-2011 17:31:17 (0sec)
Finished At: 3-Oct-2011 17:31:50 (32sec)
Status: SUCCESS
Counters:
|Group Name|Counter name|Map Value|Reduce Value|Total Value|
---------------------------------------------------------------------------------------
|Job Counters|Launched reduce tasks|0|0|1|
|Job Counters|SLOTS_MILLIS_MAPS|0|0|12,257|
|Job Counters|Total time spent by all reduces waiting after reserving slots (ms)|0|0|0|
|Job Counters|Total time spent by all maps waiting after reserving slots (ms)|0|0|0|
|Job Counters|Launched map tasks|0|0|1|
|Job Counters|Data-local map tasks|0|0|1|
|Job Counters|SLOTS_MILLIS_REDUCES|0|0|10,082|
|File Output Format Counters|Bytes Written|0|61,192|61,192|
|FileSystemCounters|FILE_BYTES_READ|0|70,766|70,766|
|FileSystemCounters|HDFS_BYTES_READ|112,056|0|112,056|
|FileSystemCounters|FILE_BYTES_WRITTEN|92,325|92,294|184,619|
|FileSystemCounters|HDFS_BYTES_WRITTEN|0|61,192|61,192|
|File Input Format Counters|Bytes Read|111,933|0|111,933|
|Map-Reduce Framework|Reduce input groups|0|2,411|2,411|
|Map-Reduce Framework|Map output materialized bytes|70,766|0|70,766|
|Map-Reduce Framework|Combine output records|2,411|0|2,411|
|Map-Reduce Framework|Map input records|2,643|0|2,643|
|Map-Reduce Framework|Reduce shuffle bytes|0|0|0|
|Map-Reduce Framework|Reduce output records|0|2,411|2,411|
|Map-Reduce Framework|Spilled Records|2,411|2,411|4,822|
|Map-Reduce Framework|Map output bytes|120,995|0|120,995|
|Map-Reduce Framework|Combine input records|5,849|0|5,849|
|Map-Reduce Framework|Map output records|5,849|0|5,849|
|Map-Reduce Framework|SPLIT_RAW_BYTES|123|0|123|
|Map-Reduce Framework|Reduce input records|0|2,411|2,411|
=====================================
Task Summary
============================
Kind Total Successful Failed Killed StartTime FinishTime
Setup    1  1  0  0  3-Oct-2011 17:31:20  3-Oct-2011 17:31:24 (4sec)
Map      1  1  0  0  3-Oct-2011 17:31:26  3-Oct-2011 17:31:30 (4sec)
Reduce   1  1  0  0  3-Oct-2011 17:31:32  3-Oct-2011 17:31:42 (10sec)
Cleanup  1  1  0  0  3-Oct-2011 17:31:44  3-Oct-2011 17:31:48 (4sec)
============================
Analysis
=========
Time taken by best performing map task task_201110031726_0001_m_000000: 4sec
Average time taken by map tasks: 4sec
Worse performing map tasks:
TaskId Timetaken
task_201110031726_0001_m_000000 4sec
The last map task task_201110031726_0001_m_000000 finished at (relative to the Job launch time): 3-Oct-2011 17:31:30 (12sec)

Time taken by best performing shuffle task task_201110031726_0001_r_000000: 7sec
Average time taken by shuffle tasks: 7sec
Worse performing shuffle tasks:
TaskId Timetaken
task_201110031726_0001_r_000000 7sec
The last shuffle task task_201110031726_0001_r_000000 finished at (relative to the Job launch time): 3-Oct-2011 17:31:39 (21sec)

Time taken by best performing reduce task task_201110031726_0001_r_000000: 2sec
Average time taken by reduce tasks: 2sec
Worse performing reduce tasks:
TaskId Timetaken
task_201110031726_0001_r_000000 2sec
The last reduce task task_201110031726_0001_r_000000 finished at (relative to the Job launch time): 3-Oct-2011 17:31:42 (24sec)
=========
FAILED SETUP task list for 0001_1317688277686_vgogate
TaskId StartTime FinishTime Error
====================================================
KILLED SETUP task list for 0001_1317688277686_vgogate
TaskId StartTime FinishTime Error
====================================================
FAILED MAP task list for 0001_1317688277686_vgogate
TaskId StartTime FinishTime Error InputSplits
====================================================
KILLED MAP task list for 0001_1317688277686_vgogate
TaskId StartTime FinishTime Error InputSplits
====================================================
FAILED REDUCE task list for 0001_1317688277686_vgogate
TaskId StartTime FinishTime Error
====================================================
KILLED REDUCE task list for 0001_1317688277686_vgogate
TaskId StartTime FinishTime Error
====================================================
FAILED CLEANUP task list for 0001_1317688277686_vgogate
TaskId StartTime FinishTime Error
====================================================
KILLED CLEANUP task list for 0001_1317688277686_vgogate
TaskId StartTime FinishTime Error
====================================================
======================================
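If all you need from that verbose report is the job's age, the "Launched At:" line is enough. Here is a small helper, sketched under the assumptions that GNU date is available and that the line keeps the exact format shown above.

```shell
# Hypothetical helper: whole hours since the job whose HDFS output dir is $1
# was launched. Assumes GNU date and the "Launched At: 3-Oct-2011 17:31:17
# (0sec)" line format from `hadoop job -history`.
hours_since_launch() {
    launched=$(hadoop job -history "$1" |
        sed -n 's/^Launched At: \(.*\) (.*$/\1/p' | tr '-' ' ')
    echo $(( ($(date +%s) - $(date -d "$launched" +%s)) / 3600 ))
}
```

For example, `hours_since_launch /user/vgogate/output` would print the number of hours since that job's launch.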
On Mon, Oct 3, 2011 at 1:53 PM, patrick sang <si...@gmail.com> wrote:
> [...]