Posted to common-user@hadoop.apache.org by patrick sang <si...@gmail.com> on 2011/10/03 22:53:15 UTC

Monitoring Slow job.

Hi Hadoopers,

I am writing a script to detect whether any running job has been running
longer than X hours.

So far, I use
./hadoop job -jt jobtracker:port -list all | awk '{ if($2==1) print $1 }'
---> to get the list of running JobIDs
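
(For anyone reusing this: the $2==1 test assumes the 0.20-era layout of
"hadoop job -list all", where the JobId is in column 1 and the numeric job
state in column 2, with 1 meaning RUNNING. An annotated version:

# "hadoop job -list all" prints one row per job: JobId in column 1,
# numeric State in column 2 (1 = RUNNING in 0.20-era Hadoop).
./hadoop job -jt jobtracker:port -list all | awk '$2 == 1 { print $1 }'

Check the header your version prints if the states are numbered differently.)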

I am looking for a way to get, from the jobId, how long the job has been running.

In the web admin page (http://job:50030), we can easily see the duration of
each running job from its "Started at:" and "Running for:" fields.

How do we get such information from the command line?

Hope this makes sense.
Thank you in advance,

-P

Re: Monitoring Slow job.

Posted by patrick sang <si...@gmail.com>.
Hi all,
Thanks Vitthal, your reply led me to the solution.
Here is what I did; hopefully it counts as a contribution back to the
community.

1. ./hadoop job -list all | awk '{ if($2==1) print $1 }'
---> to get the list of running JobIDs

2. ./hadoop job -status JobID
--> file: hdfs://xxx.xx.xxx/zz/aa/job.xml

3. ./hadoop fs -cat hdfs://xxx.xx.xxx/zz/aa/job.xml
here we get mapred.output.dir

4. ./hadoop job -history <mapred.output.dir>
--> Launched At:

5. Get the time diff between "Launched At" and now (the whole pipeline is
sketched below).
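
Putting it all together, here is a rough, untested bash sketch of the whole
pipeline. It assumes GNU date, 0.20-era "hadoop job" output formats, and
that the mapred.output.dir property sits on a single line of job.xml, so
treat the grep/sed patterns as assumptions to adjust for your version:

#!/bin/bash
# Alert on jobs running longer than MAX_HOURS; steps 1-5 from above.
MAX_HOURS=4
NOW=$(date +%s)

# 1. JobIDs of running jobs (numeric state 1 in column 2)
for jobid in $(hadoop job -list all | awk '$2 == 1 { print $1 }'); do
  # 2. job.xml location from "hadoop job -status" (the "file:" line)
  jobxml=$(hadoop job -status "$jobid" | awk '/^file:/ { print $2 }')

  # 3. mapred.output.dir out of job.xml
  outdir=$(hadoop fs -cat "$jobxml" |
    sed -n 's|.*<name>mapred.output.dir</name><value>\([^<]*\)</value>.*|\1|p')

  # 4. "Launched At: 3-Oct-2011 17:31:17 (...)" from the job history
  launched=$(hadoop job -history "$outdir" |
    sed -n 's/^Launched At: //p' | sed 's/ *(.*$//')

  # 5. Time diff between "Launched At" and now (GNU date parses
  #    "3 Oct 2011 17:31:17", hence the tr that drops the hyphens)
  start=$(date -d "$(echo "$launched" | tr '-' ' ')" +%s)
  hours=$(( (NOW - start) / 3600 ))
  [ "$hours" -ge "$MAX_HOURS" ] && echo "SLOW: $jobid running for ${hours}h"
done

Something like this could run from cron, with the SLOW lines piped to mail.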

It would be much easier to just scrape
http://jobtracker:50030/jobdetails.jsp?jobid=jobId
or there might be some more elegant way of getting this. It's just how I did
it this time.
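
For the record, that scraping version might look something like this (the
"Started at:" label is what the job page shows, but the markup around it
differs between versions, so the pattern is a guess):

# Pull the "Started at:" value off the JobTracker job page (GNU grep -o).
curl -s "http://jobtracker:50030/jobdetails.jsp?jobid=job_201110031726_0001" |
  grep -o 'Started at: [^<]*'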

Cheers,
P

On Mon, Oct 3, 2011 at 5:41 PM, Vitthal "Suhas" Gogate
<gogate@hortonworks.com> wrote:

> I am not sure there is an easy way to get what you want on the command
> line. One option is to use the following command, which gives you a
> verbose job history where you can find the Submit, Launch & Finish times
> (including the duration on the Finished At line). I am using the
> hadoop-0.20.205.0 branch, so check whether the version of Hadoop you are
> using has a similar option.
>
> I am pasting sample output from my wordcount program:
>
> bin/hadoop job -history <job_output_directory_on_hdfs>
>
> ======================================
>
> horton-mac:hadoop-0.20.205.0 vgogate$ bin/hadoop job -history output
> Warning: $HADOOP_HOME is deprecated.
>
>
> Hadoop job: 0001_1317688277686_vgogate
> =====================================
> Job tracker host name: job
> job tracker start time: Sun May 16 08:53:51 PDT 1976
> User: vgogate
> JobName: word count
> JobConf: hdfs://horton-mac.local:54310/tmp/mapred/staging/vgogate/.staging/job_201110031726_0001/job.xml
> Submitted At: 3-Oct-2011 17:31:17
> Launched At: 3-Oct-2011 17:31:17 (0sec)
> Finished At: 3-Oct-2011 17:31:50 (32sec)
> Status: SUCCESS
> Counters:
>
> |Group Name                    |Counter name                  |Map Value |Reduce Value|Total Value|
> ---------------------------------------------------------------------------------------
> |Job Counters                  |Launched reduce tasks         |0         |0         |1
> |Job Counters                  |SLOTS_MILLIS_MAPS             |0         |0         |12,257
> |Job Counters                  |Total time spent by all reduces waiting after reserving slots (ms)|0         |0         |0
> |Job Counters                  |Total time spent by all maps waiting after reserving slots (ms)|0         |0         |0
> |Job Counters                  |Launched map tasks            |0         |0         |1
> |Job Counters                  |Data-local map tasks          |0         |0         |1
> |Job Counters                  |SLOTS_MILLIS_REDUCES          |0         |0         |10,082
> |File Output Format Counters   |Bytes Written                 |0         |61,192    |61,192
> |FileSystemCounters            |FILE_BYTES_READ               |0         |70,766    |70,766
> |FileSystemCounters            |HDFS_BYTES_READ               |112,056   |0         |112,056
> |FileSystemCounters            |FILE_BYTES_WRITTEN            |92,325    |92,294    |184,619
> |FileSystemCounters            |HDFS_BYTES_WRITTEN            |0         |61,192    |61,192
> |File Input Format Counters    |Bytes Read                    |111,933   |0         |111,933
> |Map-Reduce Framework          |Reduce input groups           |0         |2,411     |2,411
> |Map-Reduce Framework          |Map output materialized bytes |70,766    |0         |70,766
> |Map-Reduce Framework          |Combine output records        |2,411     |0         |2,411
> |Map-Reduce Framework          |Map input records             |2,643     |0         |2,643
> |Map-Reduce Framework          |Reduce shuffle bytes          |0         |0         |0
> |Map-Reduce Framework          |Reduce output records         |0         |2,411     |2,411
> |Map-Reduce Framework          |Spilled Records               |2,411     |2,411     |4,822
> |Map-Reduce Framework          |Map output bytes              |120,995   |0         |120,995
> |Map-Reduce Framework          |Combine input records         |5,849     |0         |5,849
> |Map-Reduce Framework          |Map output records            |5,849     |0         |5,849
> |Map-Reduce Framework          |SPLIT_RAW_BYTES               |123       |0         |123
> |Map-Reduce Framework          |Reduce input records          |0         |2,411     |2,411
> =====================================
>
> Task Summary
> ============================
> Kind    Total    Successful    Failed    Killed    StartTime    FinishTime
>
> Setup    1    1        0    0    3-Oct-2011 17:31:20    3-Oct-2011 17:31:24 (4sec)
> Map    1    1        0    0    3-Oct-2011 17:31:26    3-Oct-2011 17:31:30 (4sec)
> Reduce    1    1        0    0    3-Oct-2011 17:31:32    3-Oct-2011 17:31:42 (10sec)
> Cleanup    1    1        0    0    3-Oct-2011 17:31:44    3-Oct-2011 17:31:48 (4sec)
> ============================
>
>
> Analysis
> =========
>
> Time taken by best performing map task task_201110031726_0001_m_000000: 4sec
> Average time taken by map tasks: 4sec
> Worse performing map tasks:
> TaskId        Timetaken
> task_201110031726_0001_m_000000 4sec
> The last map task task_201110031726_0001_m_000000 finished at (relative to the Job launch time): 3-Oct-2011 17:31:30 (12sec)
>
> Time taken by best performing shuffle task task_201110031726_0001_r_000000: 7sec
> Average time taken by shuffle tasks: 7sec
> Worse performing shuffle tasks:
> TaskId        Timetaken
> task_201110031726_0001_r_000000 7sec
> The last shuffle task task_201110031726_0001_r_000000 finished at (relative to the Job launch time): 3-Oct-2011 17:31:39 (21sec)
>
> Time taken by best performing reduce task task_201110031726_0001_r_000000: 2sec
> Average time taken by reduce tasks: 2sec
> Worse performing reduce tasks:
> TaskId        Timetaken
> task_201110031726_0001_r_000000 2sec
> The last reduce task task_201110031726_0001_r_000000 finished at (relative to the Job launch time): 3-Oct-2011 17:31:42 (24sec)
> =========
>
> FAILED SETUP task list for 0001_1317688277686_vgogate
> TaskId        StartTime    FinishTime    Error
> ====================================================
>
> KILLED SETUP task list for 0001_1317688277686_vgogate
> TaskId        StartTime    FinishTime    Error
> ====================================================
>
> FAILED MAP task list for 0001_1317688277686_vgogate
> TaskId        StartTime    FinishTime    Error    InputSplits
> ====================================================
>
> KILLED MAP task list for 0001_1317688277686_vgogate
> TaskId        StartTime    FinishTime    Error    InputSplits
> ====================================================
>
> FAILED REDUCE task list for 0001_1317688277686_vgogate
> TaskId        StartTime    FinishTime    Error
> ====================================================
>
> KILLED REDUCE task list for 0001_1317688277686_vgogate
> TaskId        StartTime    FinishTime    Error
> ====================================================
>
> FAILED CLEANUP task list for 0001_1317688277686_vgogate
> TaskId        StartTime    FinishTime    Error
> ====================================================
>
> KILLED CLEANUP task list for 0001_1317688277686_vgogate
> TaskId        StartTime    FinishTime    Error
> ====================================================
>
> ======================================
>
> On Mon, Oct 3, 2011 at 1:53 PM, patrick sang <silvianhadoop@gmail.com> wrote:
>
> > Hi Hadoopers,
> >
> > I am writing a script to detect whether any running job has been
> > running longer than X hours.
> >
> > So far, I use
> > ./hadoop job -jt jobtracker:port -list all | awk '{ if($2==1) print $1 }'
> > ---> to get the list of running JobIDs
> >
> > I am looking for a way to get, from the jobId, how long the job has
> > been running.
> >
> > In the web admin page (http://job:50030), we can easily see the
> > duration of each running job from its "Started at:" and "Running for:"
> > fields.
> >
> > How do we get such information from the command line?
> >
> > Hope this makes sense.
> > Thank you in advance,
> >
> > -P
> >
>

Re: Monitoring Slow job.

Posted by "Vitthal \"Suhas\" Gogate" <go...@hortonworks.com>.
I am not sure there is an easy way to get what you want on the command
line. One option is to use the following command, which gives you a
verbose job history where you can find the Submit, Launch & Finish times
(including the duration on the Finished At line). I am using the
hadoop-0.20.205.0 branch, so check whether the version of Hadoop you are
using has a similar option.

I am pasting sample output from my wordcount program:

bin/hadoop job -history <job_output_directory_on_hdfs>

======================================

horton-mac:hadoop-0.20.205.0 vgogate$ bin/hadoop job -history output
Warning: $HADOOP_HOME is deprecated.


Hadoop job: 0001_1317688277686_vgogate
=====================================
Job tracker host name: job
job tracker start time: Sun May 16 08:53:51 PDT 1976
User: vgogate
JobName: word count
JobConf: hdfs://horton-mac.local:54310/tmp/mapred/staging/vgogate/.staging/job_201110031726_0001/job.xml
Submitted At: 3-Oct-2011 17:31:17
Launched At: 3-Oct-2011 17:31:17 (0sec)
Finished At: 3-Oct-2011 17:31:50 (32sec)
Status: SUCCESS
Counters:

|Group Name                    |Counter name                  |Map Value |Reduce Value|Total Value|
---------------------------------------------------------------------------------------
|Job Counters                  |Launched reduce tasks         |0         |0         |1
|Job Counters                  |SLOTS_MILLIS_MAPS             |0         |0         |12,257
|Job Counters                  |Total time spent by all reduces waiting after reserving slots (ms)|0         |0         |0
|Job Counters                  |Total time spent by all maps waiting after reserving slots (ms)|0         |0         |0
|Job Counters                  |Launched map tasks            |0         |0         |1
|Job Counters                  |Data-local map tasks          |0         |0         |1
|Job Counters                  |SLOTS_MILLIS_REDUCES          |0         |0         |10,082
|File Output Format Counters   |Bytes Written                 |0         |61,192    |61,192
|FileSystemCounters            |FILE_BYTES_READ               |0         |70,766    |70,766
|FileSystemCounters            |HDFS_BYTES_READ               |112,056   |0         |112,056
|FileSystemCounters            |FILE_BYTES_WRITTEN            |92,325    |92,294    |184,619
|FileSystemCounters            |HDFS_BYTES_WRITTEN            |0         |61,192    |61,192
|File Input Format Counters    |Bytes Read                    |111,933   |0         |111,933
|Map-Reduce Framework          |Reduce input groups           |0         |2,411     |2,411
|Map-Reduce Framework          |Map output materialized bytes |70,766    |0         |70,766
|Map-Reduce Framework          |Combine output records        |2,411     |0         |2,411
|Map-Reduce Framework          |Map input records             |2,643     |0         |2,643
|Map-Reduce Framework          |Reduce shuffle bytes          |0         |0         |0
|Map-Reduce Framework          |Reduce output records         |0         |2,411     |2,411
|Map-Reduce Framework          |Spilled Records               |2,411     |2,411     |4,822
|Map-Reduce Framework          |Map output bytes              |120,995   |0         |120,995
|Map-Reduce Framework          |Combine input records         |5,849     |0         |5,849
|Map-Reduce Framework          |Map output records            |5,849     |0         |5,849
|Map-Reduce Framework          |SPLIT_RAW_BYTES               |123       |0         |123
|Map-Reduce Framework          |Reduce input records          |0         |2,411     |2,411
=====================================

Task Summary
============================
Kind    Total    Successful    Failed    Killed    StartTime    FinishTime

Setup    1    1        0    0    3-Oct-2011 17:31:20    3-Oct-2011 17:31:24 (4sec)
Map    1    1        0    0    3-Oct-2011 17:31:26    3-Oct-2011 17:31:30 (4sec)
Reduce    1    1        0    0    3-Oct-2011 17:31:32    3-Oct-2011 17:31:42 (10sec)
Cleanup    1    1        0    0    3-Oct-2011 17:31:44    3-Oct-2011 17:31:48 (4sec)
============================


Analysis
=========

Time taken by best performing map task task_201110031726_0001_m_000000: 4sec
Average time taken by map tasks: 4sec
Worse performing map tasks:
TaskId        Timetaken
task_201110031726_0001_m_000000 4sec
The last map task task_201110031726_0001_m_000000 finished at (relative to the Job launch time): 3-Oct-2011 17:31:30 (12sec)

Time taken by best performing shuffle task task_201110031726_0001_r_000000: 7sec
Average time taken by shuffle tasks: 7sec
Worse performing shuffle tasks:
TaskId        Timetaken
task_201110031726_0001_r_000000 7sec
The last shuffle task task_201110031726_0001_r_000000 finished at (relative to the Job launch time): 3-Oct-2011 17:31:39 (21sec)

Time taken by best performing reduce task task_201110031726_0001_r_000000: 2sec
Average time taken by reduce tasks: 2sec
Worse performing reduce tasks:
TaskId        Timetaken
task_201110031726_0001_r_000000 2sec
The last reduce task task_201110031726_0001_r_000000 finished at (relative to the Job launch time): 3-Oct-2011 17:31:42 (24sec)
=========

FAILED SETUP task list for 0001_1317688277686_vgogate
TaskId        StartTime    FinishTime    Error
====================================================

KILLED SETUP task list for 0001_1317688277686_vgogate
TaskId        StartTime    FinishTime    Error
====================================================

FAILED MAP task list for 0001_1317688277686_vgogate
TaskId        StartTime    FinishTime    Error    InputSplits
====================================================

KILLED MAP task list for 0001_1317688277686_vgogate
TaskId        StartTime    FinishTime    Error    InputSplits
====================================================

FAILED REDUCE task list for 0001_1317688277686_vgogate
TaskId        StartTime    FinishTime    Error
====================================================

KILLED REDUCE task list for 0001_1317688277686_vgogate
TaskId        StartTime    FinishTime    Error
====================================================

FAILED CLEANUP task list for 0001_1317688277686_vgogate
TaskId        StartTime    FinishTime    Error
====================================================

KILLED CLEANUP task list for 0001_1317688277686_vgogate
TaskId        StartTime    FinishTime    Error
====================================================

======================================
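
If you only need the timing out of that verbose report, the three "At:"
lines can be pulled with a one-liner like this (matching the 0.20.205
output format above):

bin/hadoop job -history output | grep -E '^(Submitted|Launched|Finished) At:'

For a finished job the duration is in the parentheses on the Finished At
line; for a still-running job, diff the Launched At time against the clock.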

On Mon, Oct 3, 2011 at 1:53 PM, patrick sang <si...@gmail.com> wrote:

> Hi Hadoopers,
>
> I am writing a script to detect whether any running job has been running
> longer than X hours.
>
> So far, I use
> ./hadoop job -jt jobtracker:port -list all | awk '{ if($2==1) print $1 }'
> ---> to get the list of running JobIDs
>
> I am looking for a way to get, from the jobId, how long the job has been
> running.
>
> In the web admin page (http://job:50030), we can easily see the duration
> of each running job from its "Started at:" and "Running for:" fields.
>
> How do we get such information from the command line?
>
> Hope this makes sense.
> Thank you in advance,
>
> -P
>