Posted to common-issues@hadoop.apache.org by "Bilahari T H (Jira)" <ji...@apache.org> on 2020/07/08 20:00:00 UTC

[jira] [Commented] (HADOOP-17092) ABFS: Long waits and unintended retries when multiple threads try to fetch token using ClientCreds

    [ https://issues.apache.org/jira/browse/HADOOP-17092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17153971#comment-17153971 ] 

Bilahari T H commented on HADOOP-17092:
---------------------------------------

If the token fetch call results in an IOException, the AbfsRestOperation layer retries it. This is currently configured to 30 attempts.
As part of this PR, the token fetch call now has its own retry policy. The fix is that if all of those attempts fail with an IOException, we convert it to an HttpException so that the layer above, AbfsRestOperation, does not attempt a retry.
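
To illustrate the layering described above, here is a minimal sketch in Java. It is not the actual ABFS code: the class and method names (TokenFetchRetrySketch, NonRetriableTokenException, TokenCall, fetchTokenWithRetries, executeWithOuterRetries) are hypothetical, NonRetriableTokenException only stands in for HttpException, and backoff between attempts is left out for brevity. It assumes the outer layer retries a plain IOException but gives up immediately on the converted exception type.

```java
import java.io.IOException;

// Hypothetical sketch; none of these names come from the ABFS driver.
final class TokenFetchRetrySketch {

    /** Stand-in for HttpException: still an IOException, but a type the outer layer does not retry. */
    static final class NonRetriableTokenException extends IOException {
        NonRetriableTokenException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** A call that returns a token or fails with an IOException. */
    interface TokenCall {
        String fetch() throws IOException;
    }

    /** Inner layer: the token fetch runs under its own bounded retry policy. */
    static String fetchTokenWithRetries(TokenCall call, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.fetch();
            } catch (IOException e) {
                last = e; // remember the failure and try again
            }
        }
        // All attempts failed: convert so the outer layer does not retry again.
        throw new NonRetriableTokenException(
            "token fetch failed after " + maxAttempts + " attempts", last);
    }

    /** Outer layer: retries a generic IOException, rethrows the converted type at once. */
    static String executeWithOuterRetries(TokenCall inner, int outerMaxAttempts) throws IOException {
        if (outerMaxAttempts < 1) {
            throw new IllegalArgumentException("outerMaxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= outerMaxAttempts; attempt++) {
            try {
                return inner.fetch();
            } catch (NonRetriableTokenException e) {
                throw e; // the inner policy is already exhausted; do not multiply retries
            } catch (IOException e) {
                last = e;
            }
        }
        throw last;
    }
}
```

In this sketch, a token endpoint that always fails is contacted only as many times as the inner policy allows (for example 3), rather than the inner count multiplied by the 30 outer attempts.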

> ABFS: Long waits and unintended retries when multiple threads try to fetch token using ClientCreds
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-17092
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17092
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/azure
>    Affects Versions: 3.3.0
>            Reporter: Sneha Vijayarajan
>            Assignee: Bilahari T H
>            Priority: Major
>             Fix For: 3.4.0
>
>
> Issue reported by DB:
> We recently experienced some problems with the ABFS driver that highlighted a possible issue with long hangs following synchronized retries when using the _ClientCredsTokenProvider_ and calling _AbfsClient.getAccessToken_. We have seen https://github.com/apache/hadoop/pull/1923, but it does not directly apply since we are not using a custom token provider, but instead _ClientCredsTokenProvider_, which ultimately relies on _AzureADAuthenticator_.
>  
> The problem was that the critical section of getAccessToken, combined with a possibly redundant retry policy, made jobs hang for a very long time, since only one thread at a time could make progress, and that progress amounted to retrying on a failing connection for 30-60 minutes.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
