Posted to dev@knox.apache.org by "Matthew Sharp (JIRA)" <ji...@apache.org> on 2018/09/04 17:08:00 UTC

[jira] [Commented] (KNOX-1093) KNOX Not Handling safemode state of one of the NameNode In HA state

    [ https://issues.apache.org/jira/browse/KNOX-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16603333#comment-16603333 ] 

Matthew Sharp commented on KNOX-1093:
-------------------------------------

Replacing retryRequest() with failoverRequest() does result in the expected behavior mentioned above. I attached a patch that makes this replacement and removes the now-unused retryRequest() method and its related variables.

 

The test cluster shows failover occurring as expected:

2018-09-04 12:01:21,495 INFO knox.gateway (AbstractHdfsHaDispatch.java:executeRequest(85)) - Received an error from a node in SafeMode: org.apache.knox.gateway.hdfs.dispatch.SafeModeException
2018-09-04 12:01:21,496 INFO knox.gateway (AbstractHdfsHaDispatch.java:failoverRequest(115)) - Failing over request to a different server: http://host1.test.com:50070/webhdfs/v1/user/matt/test.txt?op=CREATE&doAs=matt
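
For illustration only, here is a minimal, self-contained sketch of the behavior the log lines show: a SafeModeException from one NameNode causes the next attempt to go to a different NameNode instead of repeating the same one. Every name in it (SafeModeFailoverSketch, Backend, dispatch, attempted) is hypothetical, the exception type is a simplified stand-in, and none of it is the actual Knox dispatch code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an HA dispatch; not the actual Knox classes. */
class SafeModeFailoverSketch {

    /** Illustrative exception type, named after the one seen in the log. */
    static class SafeModeException extends RuntimeException {}

    /** Hypothetical backend call: returns a response body or throws. */
    interface Backend {
        String execute(String nameNodeUrl);
    }

    private final List<String> nameNodes;
    private final int maxFailoverAttempts;
    private int active = 0;

    /** Records every URL that was tried, in order, so the behavior is observable. */
    final List<String> attempted = new ArrayList<>();

    SafeModeFailoverSketch(List<String> nameNodes, int maxFailoverAttempts) {
        this.nameNodes = nameNodes;
        this.maxFailoverAttempts = maxFailoverAttempts;
    }

    String dispatch(Backend backend) {
        for (int attempt = 0; attempt <= maxFailoverAttempts; attempt++) {
            String target = nameNodes.get(active);
            attempted.add(target);
            try {
                return backend.execute(target);
            } catch (SafeModeException e) {
                // Fail over: move to the next NameNode rather than
                // retrying the one that reported safe mode.
                active = (active + 1) % nameNodes.size();
            }
        }
        throw new IllegalStateException("No NameNode accepted the request");
    }
}
```

With two NameNodes where the first is in safe mode, dispatch() tries the first once and then succeeds on the second.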

> KNOX Not Handling safemode state of one of the NameNode In HA state 
> --------------------------------------------------------------------
>
>                 Key: KNOX-1093
>                 URL: https://issues.apache.org/jira/browse/KNOX-1093
>             Project: Apache Knox
>          Issue Type: Bug
>          Components: Server
>    Affects Versions: 0.10.0
>            Reporter: Rajesh Chandramohan
>            Priority: Major
>             Fix For: 1.2.0
>
>         Attachments: KNOX-1093.patch
>
>
> In WebHdfsHaDispatch.java, when a SafeModeException occurs, the code calls the retryRequest() method. Like failoverRequest(), it calls executeRequest() again, but the NameNode that the thread targets does not change across any of its iterations, up to maxRetryAttempts=300 
> with retrySleep=1000 (1 second). 
> That is a maximum of 5 minutes, after which client retries should pick the right NameNode, at least on the next attempt.
> But if we need to copy a set of files within a stipulated time, X% of connections land on this NameNode and fail. Can we handle that better?
> {code:java}
> try {
>          inboundResponse = executeOutboundRequest(outboundRequest);
>          writeOutboundResponse(outboundRequest, inboundRequest, outboundResponse, inboundResponse);
>       } catch (StandbyException e) {
>          LOG.errorReceivedFromStandbyNode(e);
>          failoverRequest(outboundRequest, inboundRequest, outboundResponse, inboundResponse, e);
>       } catch (SafeModeException e) {
>          LOG.errorReceivedFromSafeModeNode(e);
>          retryRequest(outboundRequest, inboundRequest, outboundResponse, inboundResponse, e);
>       } catch (IOException e) {
>          LOG.errorConnectingToServer(outboundRequest.getURI().toString(), e);
>          failoverRequest(outboundRequest, inboundRequest, outboundResponse, inboundResponse, e);
>       }
>    }
> {code}
> The SafeModeException handling in the Knox HA dispatch code needs to change so that it flags the NameNode that is stuck in safe mode, maintains a do-not-try queue, and redirects all further connections only to the healthy active NameNode. That way we can handle the X% of failures. What do we think?
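
The do-not-try idea in the description above could be sketched roughly as follows. This is a hypothetical illustration only: the class and method names (SafeModeExclusionSketch, markSafeMode, pick) and the time-based expiry are assumptions made for the example, not existing Knox APIs. Time is passed in explicitly so the behavior is easy to check.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical do-not-try list for NameNodes that reported safe mode. */
class SafeModeExclusionSketch {

    private final List<String> nameNodes;
    private final long exclusionMillis;

    /** NameNode -> timestamp (ms) until which it should be skipped. */
    private final Map<String, Long> excludedUntil = new ConcurrentHashMap<>();

    SafeModeExclusionSketch(List<String> nameNodes, long exclusionMillis) {
        this.nameNodes = nameNodes;
        this.exclusionMillis = exclusionMillis;
    }

    /** Flag a NameNode as being in safe mode, starting at nowMillis. */
    void markSafeMode(String nameNode, long nowMillis) {
        excludedUntil.put(nameNode, nowMillis + exclusionMillis);
    }

    /** Return the first NameNode that is not currently excluded, if any. */
    Optional<String> pick(long nowMillis) {
        for (String nameNode : nameNodes) {
            Long until = excludedUntil.get(nameNode);
            if (until == null || nowMillis >= until) {
                return Optional.of(nameNode);
            }
        }
        return Optional.empty();
    }
}
```

The expiry lets a flagged NameNode be tried again after a while, since safe mode is normally temporary. A real implementation would also need to decide what to do when every NameNode is flagged.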



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)