You are viewing a plain text version of this content. The canonical link for it is here.
Posted to yarn-issues@hadoop.apache.org by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org> on 2018/06/16 23:14:00 UTC

[jira] [Commented] (YARN-8302) ATS v2 should handle HBase connection issue properly

    [ https://issues.apache.org/jira/browse/YARN-8302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16514934#comment-16514934 ] 

Vinod Kumar Vavilapalli commented on YARN-8302:
-----------------------------------------------

If HBase is down, TimelineReader should fail the APIs instead of hanging them on HBase. Otherwise in addition to the poor usability that [~yeshavora] already pointed out, with enough hanging calls, we will DOS either the Reader or the machine.

In addition to that, Timeline reader should simply go down if HBase App is down for certain time.

We can do the above by having a separate thread in the TimelineReader which does a {{storageHealthCheck}} every so often and if things have been bad for a while, return error and things are worse for a long time, just shut-down.

> ATS v2 should handle HBase connection issue properly
> ----------------------------------------------------
>
>                 Key: YARN-8302
>                 URL: https://issues.apache.org/jira/browse/YARN-8302
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: ATSv2
>    Affects Versions: 3.1.0
>            Reporter: Yesha Vora
>            Priority: Major
>
> ATS v2 call times out with below error when it can't connect to HBase instance.
> {code}
> bash-4.2$ curl -i -k -s -1  -H 'Content-Type: application/json'  -H 'Accept: application/json' --max-time 5   --negotiate -u : 'https://xxx:8199/ws/v2/timeline/apps/application_1526357251888_0022/entities/YARN_CONTAINER?fields=ALL&_=1526425686092'
> curl: (28) Operation timed out after 5002 milliseconds with 0 bytes received
> {code}
> {code:title=ATS log}
> 2018-05-15 23:10:03,623 INFO  client.RpcRetryingCallerImpl (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=7, retries=7, started=8165 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: xxx/xxx:17020, details=row 'prod.timelineservice.app_flow,
> ,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=xxx,17020,1526348294182, seqNum=-1
> 2018-05-15 23:10:13,651 INFO  client.RpcRetryingCallerImpl (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=8, retries=8, started=18192 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: xxx/xxx:17020, details=row 'prod.timelineservice.app_flow,
> ,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=xxx,17020,1526348294182, seqNum=-1
> 2018-05-15 23:10:23,730 INFO  client.RpcRetryingCallerImpl (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=9, retries=9, started=28272 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: xxx/xxx:17020, details=row 'prod.timelineservice.app_flow,
> ,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=xxx,17020,1526348294182, seqNum=-1
> 2018-05-15 23:10:33,788 INFO  client.RpcRetryingCallerImpl (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=10, retries=10, started=38330 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: xxx/xxx:17020, details=row 'prod.timelineservice.app_flow,
> ,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=xxx,17020,1526348294182, seqNum=-1{code}
> There are two issues here.
> 1) Check why ATS can't connect to HBase
> 2) In case of connection error,  ATS call should not get timeout. It should fail with proper error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org