Posted to yarn-issues@hadoop.apache.org by "Chackaravarthy (JIRA)" <ji...@apache.org> on 2016/07/28 20:23:20 UTC

[jira] [Commented] (YARN-5445) Log aggregation configured to different namenode can fail fast

    [ https://issues.apache.org/jira/browse/YARN-5445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15398141#comment-15398141 ] 

Chackaravarthy commented on YARN-5445:
--------------------------------------

Environment : HDP-2.4 (hadoop-2.7.1)

The use case is as follows:

The cluster has 1200 nodes and runs around 15k jobs per day on average. Keeping the applogs in the same cluster therefore puts too much pressure on the NN because of the small-files problem: around 5 million files are created per day (normal load), which amounts to 10 million FS objects for a single day of logs. The requirement is to retain at least 1 week of logs, so we decided to move them to a different cluster or a different namespace (NN federation).
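To make the retention figure concrete, here is a small arithmetic sketch using only the numbers quoted above (10 million FS objects per day, 1 week of retention). The class and method names are made up for illustration.

```java
public class LogObjectEstimate {
    // Figures taken from the comment above (normal load).
    static final long FS_OBJECTS_PER_DAY = 10_000_000L;
    static final int RETENTION_DAYS = 7;

    // FS objects the NN must track when logs are kept for the given number of days.
    static long objectsRetained(long objectsPerDay, int days) {
        return objectsPerDay * days;
    }

    public static void main(String[] args) {
        long total = objectsRetained(FS_OBJECTS_PER_DAY, RETENTION_DAYS);
        System.out.println("FS objects held for applogs: " + total);
    }
}
```

At the stated load, one week of logs corresponds to roughly 70 million FS objects on whichever NN hosts the aggregated logs.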

In this setup, we expect minimal added latency on jobs even if the other cluster is completely down (despite being configured with HA); the impact on applications running in this cluster should be minimal. Currently, however, the client makes 15 attempts ({{dfs.client.failover.max.attempts}}) to connect to the NN before giving up. This adds a latency of 2 to 2.5 mins to each container launch (per node manager) and therefore affects overall job completion time.
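A rough way to see why the attempt count dominates the delay is to model the waiting between attempts as a capped exponential backoff. The sketch below is only a simplified model, not Hadoop's actual retry code: the base sleep (500 ms) and cap (15 s) are illustrative assumptions, and it ignores jitter and connection timeouts. The class and method names are hypothetical.

```java
public class FailoverWaitModel {
    /**
     * Total sleep time for a capped exponential backoff without jitter.
     * Assumes one sleep between consecutive attempts, so maxAttempts - 1 sleeps.
     */
    static long totalSleepMillis(int maxAttempts, long baseMillis, long capMillis) {
        long total = 0;
        for (int retry = 0; retry < maxAttempts - 1; retry++) {
            // The delay doubles on each retry until it reaches the cap.
            long delay = Math.min(capMillis, baseMillis << Math.min(retry, 30));
            total += delay;
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed values for illustration only: 500 ms base, 15 s cap.
        System.out.println("15 attempts: " + totalSleepMillis(15, 500, 15_000) + " ms");
        System.out.println(" 3 attempts: " + totalSleepMillis(3, 500, 15_000) + " ms");
    }
}
```

Under these assumed values, 15 attempts spend about 150 seconds sleeping, while 3 attempts spend 1.5 seconds, which is the kind of difference a fail-fast setting for log aggregation would be after.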

(I am aware of YARN-2942, which is still in progress, and MAPREDUCE-6415, which is in 2.8.0.)

Can we have a new config whose value is passed as {{dfs.client.failover.max.attempts}} when the FileSystem instance is created in LogAggregationService, so that log aggregation can be configured to fail fast? Or is there an existing config that already handles this case?

> Log aggregation configured to different namenode can fail fast
> --------------------------------------------------------------
>
>                 Key: YARN-5445
>                 URL: https://issues.apache.org/jira/browse/YARN-5445
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Chackaravarthy
>
> Log aggregation is enabled and configured to write applogs to a different cluster or a different namespace (NN federation). In these cases, we would like to have some configs for attempts or retries so that it fails fast if the other cluster is completely down.
> Currently it uses the default {{dfs.client.failover.max.attempts}} of 15, which adds a latency of 2 to 2.5 mins to each container launch (per node manager).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)