Posted to common-issues@hadoop.apache.org by "Brandon (Jira)" <ji...@apache.org> on 2021/02/12 19:55:00 UTC

[jira] [Commented] (HADOOP-17377) ABFS: Frequent HTTP429 exceptions with MSI token provider

    [ https://issues.apache.org/jira/browse/HADOOP-17377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283947#comment-17283947 ] 

Brandon commented on HADOOP-17377:
----------------------------------

Another note: very rarely, I've also seen HTTP 410 errors from the Instance Metadata Service, and ABFS currently doesn't retry those either. The Azure documentation says 410 and 500 response codes should be retried: [https://docs.microsoft.com/en-in/azure/virtual-machines/linux/instance-metadata-service?tabs=windows#errors-and-debugging]

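For illustration, here's a rough sketch of a retry predicate that would also cover those codes. This is not the actual ExponentialRetryPolicy source, just my sketch of the idea; the class name and the maxRetryCount parameter are made up for the example.
{code:java}
import java.net.HttpURLConnection;

// Sketch only: a shouldRetry-style check that keeps the usual
// connection-failure/408/5xx cases and adds 410 (per the IMDS docs)
// and 429 (the throttling case from this issue).
public class RetryableStatusSketch {
  // HttpURLConnection has no constant for 429.
  private static final int HTTP_TOO_MANY_REQUESTS = 429;

  public boolean shouldRetry(int retryCount, int statusCode, int maxRetryCount) {
    boolean retryableStatus =
        statusCode == -1                                           // connection-level failure
        || statusCode == HttpURLConnection.HTTP_CLIENT_TIMEOUT     // 408
        || statusCode == HTTP_TOO_MANY_REQUESTS                    // 429, throttling
        || statusCode == HttpURLConnection.HTTP_GONE               // 410, per IMDS guidance
        || (statusCode >= HttpURLConnection.HTTP_INTERNAL_ERROR    // 5xx...
            && statusCode != HttpURLConnection.HTTP_NOT_IMPLEMENTED  // ...except 501
            && statusCode != HttpURLConnection.HTTP_VERSION);        // ...and 505
    return retryCount < maxRetryCount && retryableStatus;
  }
}
{code}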
Here's the full error message and stack trace for reference:
{noformat}
AADToken: HTTP connection failed for getting token from AzureAD. Http response: 410 Gone
Content-Type: text/html Content-Length: 35 Request ID:  Proxies: none
First 1K of Body: The page you requested was removed.
	at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:190)
	at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:125)
	at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:506)
	at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:489)
	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getIsNamespaceEnabled(AzureBlobFileSystemStore.java:208)
	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFileStatus(AzureBlobFileSystemStore.java:473)
	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:437)
	at org.apache.hadoop.fs.FileSystem.isFile(FileSystem.java:1717){noformat}

> ABFS: Frequent HTTP429 exceptions with MSI token provider
> ---------------------------------------------------------
>
>                 Key: HADOOP-17377
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17377
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/azure
>    Affects Versions: 3.2.1
>            Reporter: Brandon
>            Priority: Major
>
> *Summary*
>  The MSI token provider fetches auth tokens from the local instance metadata service.
>  The instance metadata service documentation states a limit of 5 requests per second ([https://docs.microsoft.com/en-us/azure/virtual-machines/windows/instance-metadata-service#error-and-debugging]), which is fairly low.
> When using ABFS with the MSI token provider, especially when there are multiple JVMs running on the same host, ABFS frequently throws HTTP 429 throttled exceptions. The implementation for fetching a token from MSI uses ExponentialRetryPolicy; however, from my read of the code, ExponentialRetryPolicy does not retry on status code 429.
> Perhaps ExponentialRetryPolicy could retry HTTP 429 errors? I'm not sure what other ramifications that would have.
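> To make the suggestion concrete, here's a minimal sketch of an exponential-backoff loop around the token fetch that treats 429 (plus 410/5xx) as retryable. fetchTokenOnce() and the constants are hypothetical placeholders, not the actual ABFS code.
> {code:java}
> import java.io.IOException;
>
> // Sketch only: exponential backoff with jitter around a single MSI
> // token request. fetchTokenOnce() is a hypothetical stand-in for the
> // HTTP call to the instance metadata service.
> public abstract class MsiTokenFetchSketch {
>   private static final int HTTP_TOO_MANY_REQUESTS = 429;
>   private static final int MAX_RETRIES = 5;
>   private static final long BASE_BACKOFF_MS = 500;
>
>   static final class HttpResult {
>     final int status;
>     final String body;
>     HttpResult(int status, String body) { this.status = status; this.body = body; }
>   }
>
>   abstract HttpResult fetchTokenOnce() throws IOException;
>
>   String fetchTokenWithRetries() throws IOException, InterruptedException {
>     for (int attempt = 0; ; attempt++) {
>       HttpResult result = fetchTokenOnce();
>       if (result.status == 200) {
>         return result.body;
>       }
>       boolean retryable = result.status == HTTP_TOO_MANY_REQUESTS
>           || result.status == 410
>           || result.status >= 500;
>       if (attempt >= MAX_RETRIES || !retryable) {
>         throw new IOException("Token fetch failed with HTTP " + result.status);
>       }
>       // Exponential backoff plus jitter, so concurrent JVMs on one VM
>       // don't all hit the 5 req/s IMDS limit again in lockstep.
>       long sleepMs = BASE_BACKOFF_MS * (1L << attempt) + (long) (Math.random() * 100);
>       Thread.sleep(sleepMs);
>     }
>   }
> }
> {code}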
> *Environment*
>  This is in the context of Spark clusters running on Azure Virtual Machine Scale Sets. The Virtual Machine Scale Set is configured with a user-assigned identity. The Spark cluster is configured to download application JARs from an `abfs://` path and to authenticate to the storage account with the MSI token provider. The Spark version is 2.4.4; the Hadoop libraries are version 3.2.1. More details on the Spark configuration: each VM runs 6 executor processes, and each executor process uses 5 cores. FileSystem objects are singletons within each JVM due to the internal cache (see the sketch below), so on each VM I expect my setup makes 6 near-simultaneous requests to the instance metadata service when the executors start up and fetch the JAR, which already exceeds the 5-requests-per-second limit.
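> To make the caching point concrete, here's a minimal illustration (it assumes the hadoop-azure client on the classpath; the container/account URI is made up). Within one JVM, repeated FileSystem.get() calls return the same cached instance, so the token fetch happens roughly once per JVM, but each of the 6 executor JVMs on a VM fetches independently.
> {code:java}
> import java.net.URI;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
>
> public class FsCacheDemo {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     // Hypothetical container/account, for illustration only.
>     URI uri = URI.create("abfs://container@account.dfs.core.windows.net/");
>     FileSystem a = FileSystem.get(uri, conf);
>     FileSystem b = FileSystem.get(uri, conf);
>     // Prints true: instances are cached per (scheme, authority, UGI),
>     // so auth setup happens once per JVM, not once per call.
>     System.out.println(a == b);
>   }
> }
> {code}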
> *Impact*
>  In my particular use case, the download operation itself is wrapped with 3 additional retries (along the lines of the sketch below). I have never seen all the tries be exhausted and the download fail; in the end, the throttling mostly contributes noise and slowness from the retries. However, handling HTTP 429 robustly in the ABFS implementation would help application developers succeed and write cleaner code without wrapping individual ABFS operations in retries.
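> For reference, that workaround looks roughly like this (the retry count and backoff are illustrative, not from my actual job):
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> // Sketch of the application-level workaround: wrap one ABFS download
> // in a few extra tries, because the 429 surfaces as an IOException.
> public class DownloadWithRetries {
>   public static void download(FileSystem fs, Path src, Path dst)
>       throws IOException, InterruptedException {
>     final int tries = 3;
>     for (int i = 1; ; i++) {
>       try {
>         fs.copyToLocalFile(src, dst);
>         return;
>       } catch (IOException e) {
>         if (i >= tries) {
>           throw e;
>         }
>         Thread.sleep(1000L * i); // simple backoff between tries
>       }
>     }
>   }
> }
> {code}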
> *Example*
>  Here's an example error message and stack trace; it's always the same stack trace. This appears in my logs anywhere from a few hundred to a few thousand times a day.
> {noformat}
> AADToken: HTTP connection failed for getting token from AzureAD. Http response: 429 null
> Content-Type: application/json; charset=utf-8 Content-Length: 90 Request ID:  Proxies: none
> First 1K of Body: {"error":"invalid_request","error_description":"Temporarily throttled, too many requests"}
> 	at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:190)
> 	at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:125)
> 	at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:506)
> 	at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:489)
> 	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getIsNamespaceEnabled(AzureBlobFileSystemStore.java:208)
> 	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFileStatus(AzureBlobFileSystemStore.java:473)
> 	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:437)
> 	at org.apache.hadoop.fs.FileSystem.isFile(FileSystem.java:1717)
> 	at org.apache.spark.util.Utils$.fetchHcfsFile(Utils.scala:747)
> 	at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:724)
> 	at org.apache.spark.util.Utils$.fetchFile(Utils.scala:496)
> 	at org.apache.spark.executor.Executor.$anonfun$updateDependencies$7(Executor.scala:812)
> 	at org.apache.spark.executor.Executor.$anonfun$updateDependencies$7$adapted(Executor.scala:803)
> 	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:792)
> 	at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
> 	at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
> 	at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
> 	at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
> 	at scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
> 	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:791)
> 	at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:803)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:375)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748){noformat}
>  CC [~mackrorysd], [~stevel@apache.org]


