Posted to user@hadoop.apache.org by Suma Shivaprasad <su...@gmail.com> on 2015/02/03 17:05:02 UTC

QueueMetrics.AppsKilled/Failed metrics and failure reasons

Hello,

I was trying to debug the reasons for Killed/Failed apps and was checking the
RM logs (from RMAuditLogger) for the applications that were killed/failed.
The QueueMetrics.AppsKilled/Failed metrics show much higher numbers, i.e. ~100.
However, RMAuditLogger shows only 1 or 2 apps as Killed/Failed in the logs. Is
it possible that some logs are missed by the AuditLogger, or is it the other
way round and the metrics are being over-reported?

Thanks
Suma

Re: QueueMetrics.AppsKilled/Failed metrics and failure reasons

Posted by Suma Shivaprasad <su...@gmail.com>.

Thanks for your inputs. The Cluster Metrics API is giving correct numbers
for the failed/killed apps, and they match the RM audit logs, so we are
planning to use that instead.
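
For readers who want to try the same approach, here is a minimal sketch of reading the failed/killed counters from the ResourceManager's Cluster Metrics REST endpoint (/ws/v1/cluster/metrics). The host name rm-host.example.com, the port, and the sample values are assumptions for illustration only; the thread does not say how the API was actually called.

```python
import json
from urllib.request import urlopen

# Hypothetical ResourceManager web address; replace with your own RM host:port.
RM_URL = "http://rm-host.example.com:8088"


def extract_failed_killed(payload: str) -> dict:
    """Pull the failed/killed counters out of a Cluster Metrics JSON response."""
    metrics = json.loads(payload)["clusterMetrics"]
    return {"appsFailed": metrics["appsFailed"], "appsKilled": metrics["appsKilled"]}


def fetch_failed_killed(rm_url: str = RM_URL) -> dict:
    """Query /ws/v1/cluster/metrics on the RM and return the two counters."""
    with urlopen(f"{rm_url}/ws/v1/cluster/metrics", timeout=10) as resp:
        return extract_failed_killed(resp.read().decode("utf-8"))


# Trimmed sample response with made-up values, used here instead of a live RM.
SAMPLE = (
    '{"clusterMetrics": {"appsSubmitted": 120, "appsCompleted": 115, '
    '"appsFailed": 1, "appsKilled": 2}}'
)

if __name__ == "__main__":
    print(extract_failed_killed(SAMPLE))
```

Keeping the parsing separate from the HTTP call makes it easy to compare the counters against the audit logs offline, using a saved response.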

Suma

On Wed, Feb 4, 2015 at 12:04 PM, Rohith Sharma K S <
rohithsharmaks@huawei.com> wrote:

> There are several ways to confirm the total number of killed/failed
> applications in the cluster from YARN:
> 1. Check the application lists in the RM web UI, or
> 2. As admin, list the failed and killed applications with:
> ./yarn application -list -appStates FAILED,KILLED
> 3. Use the client APIs.
>
> Since the metric values displayed in Ganglia are incorrect, I have a few
> doubts:
> 1. Is Ganglia pointing to the correct RM cluster?
> 2. What method does Ganglia use to retrieve QueueMetrics?
> 3. Have you written any client program that retrieves the apps and
> calculates the count?
>
>
> Thanks & Regards
> Rohith Sharma K S
>
> -----Original Message-----
> From: Suma Shivaprasad [mailto:sumasai.shivaprasad@gmail.com]
> Sent: 04 February 2015 11:03
> To: user@hadoop.apache.org
> Cc: yarn-dev@hadoop.apache.org
> Subject: Re: QueueMetrics.AppsKilled/Failed metrics and failure reasons
>
> Using Hadoop 2.4.0. The number of applications running on average is small,
> ~40-60. The metrics in Ganglia show around 10-30 apps killed every 5 minutes,
> which is very high with respect to the apps running at any given time
> (40-60). The RM audit logs, though, show 0 failed apps during that hour.
> The RM UI also doesn't show any apps in the Applications->Failed tab. The
> logs are rolled over at a slower rate, every 1-2 hours. I am searching for
> "Application Finished - Failed" to find the failed apps. Please let me know
> if I am missing something here.
>
> Thanks
> Suma
>
>
>
>
> On Wed, Feb 4, 2015 at 10:03 AM, Rohith Sharma K S <
> rohithsharmaks@huawei.com> wrote:
>
> >  Hi
> >
> >
> >
> > Could you give more information, which version of hadoop are you using?
> >
> >
> >
> > >> QueueMetrics.AppsKilled/Failed metrics show much higher numbers, i.e. ~100.
> > >> However, RMAuditLogger shows only 1 or 2 apps as Killed/Failed in the logs.
> >
> > I suspect the logs might have been rolled over. Are more applications
> > running?
> >
> >
> >
> > The history of all applications is displayed on the RM web UI (provided
> > the RM has not been restarted, or RM recovery is enabled). You can check
> > these application lists.
> >
> >
> >
> > To find the reasons an application was killed/failed, one way is to
> > check the NodeManager logs as well. There you need to search using the
> > container_id of the corresponding application.
> >
> >
> >
> > Thanks & Regards
> >
> > Rohith Sharma K S
> >
> >
> >
> > *From:* Suma Shivaprasad [mailto:sumasai.shivaprasad@gmail.com]
> > *Sent:* 03 February 2015 21:35
> > *To:* user@hadoop.apache.org; yarn-dev@hadoop.apache.org
> > *Subject:* QueueMetrics.AppsKilled/Failed metrics and failure reasons
> >
> >
> >
> > Hello,
> >
> >
> > I was trying to debug the reasons for Killed/Failed apps and was checking
> > the RM logs (from RMAuditLogger) for the applications that were
> > killed/failed.
> >
> > The QueueMetrics.AppsKilled/Failed metrics show much higher numbers, i.e.
> > ~100. However, RMAuditLogger shows only 1 or 2 apps as Killed/Failed in
> > the logs. Is it possible that some logs are missed by the AuditLogger, or
> > is it the other way round and the metrics are being over-reported?
> >
> > Thanks
> >
> > Suma
> >
>

RE: QueueMetrics.AppsKilled/Failed metrics and failure reasons

Posted by Rohith Sharma K S <ro...@huawei.com>.

There are several ways to confirm the total number of killed/failed applications in the cluster from YARN:
1. Check the application lists in the RM web UI, or
2. As admin, list the failed and killed applications with: ./yarn application -list -appStates FAILED,KILLED
3. Use the client APIs.

Since the metric values displayed in Ganglia are incorrect, I have a few doubts:
1. Is Ganglia pointing to the correct RM cluster?
2. What method does Ganglia use to retrieve QueueMetrics?
3. Have you written any client program that retrieves the apps and calculates the count?
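
As one possible form of the "client API" route in the list above, the sketch below counts failed and killed applications per queue through the ResourceManager REST endpoint /ws/v1/cluster/apps with its states filter. The RM address, the queue names, and the sample application IDs are assumptions for illustration; this is not code from the thread.

```python
import json
from collections import Counter
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical ResourceManager web address; replace with your own RM host:port.
RM_URL = "http://rm-host.example.com:8088"


def count_by_queue_and_state(payload: str) -> Counter:
    """Count applications per (queue, state) in a /ws/v1/cluster/apps response."""
    # "apps" can be null when no application matches the filter.
    apps = (json.loads(payload).get("apps") or {}).get("app") or []
    return Counter((app["queue"], app["state"]) for app in apps)


def fetch_counts(rm_url: str = RM_URL, states=("FAILED", "KILLED")) -> Counter:
    """Ask the RM for applications in the given states and count them."""
    query = urlencode({"states": ",".join(states)})
    with urlopen(f"{rm_url}/ws/v1/cluster/apps?{query}", timeout=10) as resp:
        return count_by_queue_and_state(resp.read().decode("utf-8"))


# Trimmed sample response with made-up applications, used instead of a live RM.
SAMPLE = json.dumps({"apps": {"app": [
    {"id": "application_1423000000000_0001", "queue": "default", "state": "KILLED"},
    {"id": "application_1423000000000_0002", "queue": "default", "state": "FAILED"},
    {"id": "application_1423000000000_0003", "queue": "etl", "state": "KILLED"},
]}})

if __name__ == "__main__":
    for (queue, state), n in sorted(count_by_queue_and_state(SAMPLE).items()):
        print(queue, state, n)
```

Grouping by queue gives numbers that can be compared directly with the per-queue QueueMetrics values reported to Ganglia.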

Thanks & Regards
Rohith Sharma K S

-----Original Message-----
From: Suma Shivaprasad [mailto:sumasai.shivaprasad@gmail.com] 
Sent: 04 February 2015 11:03
To: user@hadoop.apache.org
Cc: yarn-dev@hadoop.apache.org
Subject: Re: QueueMetrics.AppsKilled/Failed metrics and failure reasons

Using Hadoop 2.4.0. The number of applications running on average is small, ~40-60.
The metrics in Ganglia show around 10-30 apps killed every 5 minutes, which is very high with respect to the apps running at any given time (40-60). The RM audit logs, though, show 0 failed apps during that hour.
The RM UI also doesn't show any apps in the Applications->Failed tab. The logs are rolled over at a slower rate, every 1-2 hours. I am searching for "Application Finished - Failed" to find the failed apps. Please let me know if I am missing something here.

Thanks
Suma

On Wed, Feb 4, 2015 at 10:03 AM, Rohith Sharma K S < rohithsharmaks@huawei.com> wrote:

>  Hi
>
>
>
> Could you give more information, which version of hadoop are you using?
>
>
>
> >> QueueMetrics.AppsKilled/Failed metrics show much higher numbers, i.e. ~100.
> >> However, RMAuditLogger shows only 1 or 2 apps as Killed/Failed in the logs.
>
> I suspect the logs might have been rolled over. Are more applications
> running?
>
>
>
> The history of all applications is displayed on the RM web UI (provided
> the RM has not been restarted, or RM recovery is enabled). You can check
> these application lists.
>
>
>
> To find the reasons an application was killed/failed, one way is to
> check the NodeManager logs as well. There you need to search using the
> container_id of the corresponding application.
>
>
>
> Thanks & Regards
>
> Rohith Sharma K S
>
>
>
> *From:* Suma Shivaprasad [mailto:sumasai.shivaprasad@gmail.com]
> *Sent:* 03 February 2015 21:35
> *To:* user@hadoop.apache.org; yarn-dev@hadoop.apache.org
> *Subject:* QueueMetrics.AppsKilled/Failed metrics and failure reasons
>
>
>
> Hello,
>
>
> I was trying to debug the reasons for Killed/Failed apps and was checking
> the RM logs (from RMAuditLogger) for the applications that were killed/failed.
>
> The QueueMetrics.AppsKilled/Failed metrics show much higher numbers, i.e. ~100.
> However, RMAuditLogger shows only 1 or 2 apps as Killed/Failed in the logs.
> Is it possible that some logs are missed by the AuditLogger, or is it the
> other way round and the metrics are being over-reported?
>
> Thanks
>
> Suma
>
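
The NodeManager log check suggested in the quoted reply relies on matching container IDs to an application. A minimal sketch of that mapping follows, assuming the usual container_<clusterTimestamp>_<appSequence>_<attempt>_<container> naming, with an optional e<epoch>_ segment that newer Hadoop releases may add. The helper names and the sample IDs are hypothetical.

```python
import re

# container_<clusterTimestamp>_<appSequence>_<attempt>_<container>,
# optionally with an "e<epoch>_" segment after "container_".
CONTAINER_RE = re.compile(r"container_(?:e\d+_)?(\d+)_(\d+)_\d+_\d+")


def application_id_for(container_id: str) -> str:
    """Derive the application ID that a container ID belongs to."""
    match = CONTAINER_RE.fullmatch(container_id)
    if match is None:
        raise ValueError(f"not a container ID: {container_id!r}")
    cluster_ts, app_seq = match.groups()
    return f"application_{cluster_ts}_{app_seq}"


def containers_for_app(log_lines, app_id: str) -> list:
    """Collect the container IDs in the log lines that belong to app_id."""
    found = set()
    for line in log_lines:
        for match in CONTAINER_RE.finditer(line):
            if f"application_{match.group(1)}_{match.group(2)}" == app_id:
                found.add(match.group(0))
    return sorted(found)


if __name__ == "__main__":
    print(application_id_for("container_1423000000000_0007_01_000002"))
```

Once the container IDs for a failed or killed application are known, they can be used as search keys in the NodeManager logs on the hosts where those containers ran.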

Re: QueueMetrics.AppsKilled/Failed metrics and failure reasons

Posted by Suma Shivaprasad <su...@gmail.com>.

Using Hadoop 2.4.0. The number of applications running on average is
small, ~40-60. The metrics in Ganglia show around 10-30 apps killed every
5 minutes, which is very high relative to the apps running at any given
time (40-60). The RM audit logs, though, show 0 failed apps during that
hour. The RM UI also doesn't show any apps in the Applications->Failed
tab. The logs are rolled over at a slower rate, every 1-2 hours. I am
searching for "Application Finished - Failed" to find the failed apps.
Please let me know if I am missing something here.
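
The log search described above could be sketched in Python roughly as follows. This assumes each audit line carries OPERATION=... and APPID=... fields; the "Killed" and "Succeeded" operation strings are assumed by analogy with the "Application Finished - Failed" string quoted above, and the sample lines are invented, so check both against the actual RM log before relying on the counts.

```python
import re
from collections import Counter

# Assumed audit line shape: "... OPERATION=Application Finished - <Outcome> ... APPID=<id>"
AUDIT_RE = re.compile(
    r"OPERATION=Application Finished - (Succeeded|Failed|Killed)\b"
    r".*?APPID=(application_\d+_\d+)"
)


def tally_finished_apps(lines):
    """Count finished applications per outcome, one vote per application ID."""
    outcomes = {}
    for line in lines:
        match = AUDIT_RE.search(line)
        if match:
            # Keyed by APPID so a line repeated across rolled files counts once.
            outcomes[match.group(2)] = match.group(1)
    return Counter(outcomes.values())


# Invented sample lines; real audit lines may carry more fields.
sample_lines = [
    "USER=alice\tOPERATION=Application Finished - Failed\tTARGET=RMAppManager"
    "\tRESULT=FAILURE\tAPPID=application_1423000000000_0001",
    "USER=bob\tOPERATION=Application Finished - Killed\tTARGET=RMAppManager"
    "\tRESULT=SUCCESS\tAPPID=application_1423000000000_0002",
    "USER=bob\tOPERATION=Application Finished - Killed\tTARGET=RMAppManager"
    "\tRESULT=SUCCESS\tAPPID=application_1423000000000_0002",
    "USER=carol\tOPERATION=Application Finished - Succeeded\tTARGET=RMAppManager"
    "\tRESULT=SUCCESS\tAPPID=application_1423000000000_0003",
    "USER=dave\tOPERATION=Submit Application Request\tTARGET=ClientRMService"
    "\tRESULT=SUCCESS\tAPPID=application_1423000000000_0004",
]
print(tally_finished_apps(sample_lines))
```

Searching only for the "Failed" string would miss killed applications if they are logged under a separate operation, which is why the sketch tallies each outcome separately.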

Thanks
Suma

On Wed, Feb 4, 2015 at 10:03 AM, Rohith Sharma K S <
rohithsharmaks@huawei.com> wrote:

>  Hi
>
>
>
> Could you give more information, which version of hadoop are you using?
>
>
>
> >> QueueMetrics.AppsKilled/Failed metrics shows much higher nos i.e ~100.
> However RMAuditLogger shows 1 or 2 Apps as Killed/Failed in the logs.
>
> May be I suspect that Logs might be rolled out. Does more applications are
> running?
>
>
>
> All the applications history will be displayed  on RM web UI (provided RM
> is not restarted or RM recovery enabled). May be you can check these
> applications lists.
>
>
>
> For finding reasons for application killed/failed, one way is you can
> check in NodeManager logs also. Here  you need to check using container_id
> for corresponding application.
>
>
>
> Thanks & Regards
>
> Rohith Sharma K S
>
>
>
> *From:* Suma Shivaprasad [mailto:sumasai.shivaprasad@gmail.com]
> *Sent:* 03 February 2015 21:35
> *To:* user@hadoop.apache.org; yarn-dev@hadoop.apache.org
> *Subject:* QueueMetrics.AppsKilled/Failed metrics and failure reasons
>
>
>
> Hello,
>
>
> Was trying to debug reasons for Killed/Failed apps and was checking for
> the applications that were killed/failed in RM logs - from RMAuditLogger.
>
>  QueueMetrics.AppsKilled/Failed metrics shows much higher nos i.e ~100.
> However RMAuditLogger shows 1 or 2 Apps as Killed/Failed in the logs. Is it
> possible that some logs are missed by AuditLogger or is it the other way
> round and metrics are being reported higher ?
>
> Thanks
>
> Suma
>

RE: QueueMetrics.AppsKilled/Failed metrics and failure reasons

Posted by Rohith Sharma K S <ro...@huawei.com>.

Hi

Could you give more information? Which version of Hadoop are you using?

>> QueueMetrics.AppsKilled/Failed metrics shows much higher nos i.e ~100. However RMAuditLogger shows 1 or 2 Apps as Killed/Failed in the logs.
I suspect the logs may have been rolled over. Are more applications running?

The full application history is displayed on the RM web UI (provided the RM has not been restarted, or RM recovery is enabled). You can check these application lists.

To find the reasons an application was killed or failed, you can also check the NodeManager logs. There you need to search by the container_id of the corresponding application.
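
A small Python sketch of that lookup, assuming container IDs follow the container_<clusterTimestamp>_<appSeq>_<attempt>_<container> pattern and application IDs follow application_<clusterTimestamp>_<appSeq>. The optional "e<N>_" epoch prefix handled by the regex is an assumption about later Hadoop releases, and the log lines below are invented.

```python
import re

# The optional "e<N>_" epoch prefix is an assumption about newer releases.
CONTAINER_RE = re.compile(r"container_(?:e\d+_)?(\d+)_(\d+)_\d+_\d+")


def app_id_of(container_id):
    """Derive the application ID that a container ID belongs to."""
    match = CONTAINER_RE.fullmatch(container_id)
    if match is None:
        raise ValueError("not a container ID: %r" % container_id)
    return "application_%s_%s" % (match.group(1), match.group(2))


def containers_for_app(lines, app_id):
    """Return the sorted, distinct container IDs in lines that belong to app_id."""
    found = set()
    for line in lines:
        for match in CONTAINER_RE.finditer(line):
            if app_id_of(match.group(0)) == app_id:
                found.add(match.group(0))
    return sorted(found)


# Invented NodeManager-style log lines for illustration only.
nm_lines = [
    "Container container_1423000000000_0001_01_000002 transitioned to KILLING",
    "Container container_1423000000000_0007_01_000001 transitioned to RUNNING",
    "Cleaning up container container_1423000000000_0001_01_000002",
]
print(containers_for_app(nm_lines, "application_1423000000000_0001"))
```

Once the container IDs for a killed or failed application are known, they can be used as search keys in the NodeManager logs on the hosts where those containers ran.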

Thanks & Regards
Rohith Sharma K S

From: Suma Shivaprasad [mailto:sumasai.shivaprasad@gmail.com]
Sent: 03 February 2015 21:35
To: user@hadoop.apache.org; yarn-dev@hadoop.apache.org
Subject: QueueMetrics.AppsKilled/Failed metrics and failure reasons

Hello,

Was trying to debug reasons for Killed/Failed apps and was checking for the applications that were killed/failed in RM logs - from RMAuditLogger.
QueueMetrics.AppsKilled/Failed metrics shows much higher nos i.e ~100. However RMAuditLogger shows 1 or 2 Apps as Killed/Failed in the logs. Is it possible that some logs are missed by AuditLogger or is it the other way round and metrics are being reported higher ?
Thanks
Suma
