Posted to common-issues@hadoop.apache.org by "Ivan Mitic (JIRA)" <ji...@apache.org> on 2015/05/28 07:38:22 UTC

[jira] [Updated] (HADOOP-11959) WASB should configure client side socket timeout in storage client blob request options

     [ https://issues.apache.org/jira/browse/HADOOP-11959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Mitic updated HADOOP-11959:
--------------------------------
    Attachment: HADOOP-11959.patch

Attaching the patch.

The fix is to move to the latest Azure storage client SDK, which internally sets a reasonable socket timeout on the connection. This is the right fix, as it also automatically gives the client SDK a way to retry internally on timeout errors.

Storage client SDK release notes:
https://github.com/Azure/azure-storage-java/releases
_Changed the socket timeout to default to 5 minutes rather than infinite when neither service side timeout or maximum execution time are set._
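For reference, the socket timeout in question is the read timeout on the underlying HttpURLConnection. Below is a minimal sketch of setting it explicitly through the SDK's OperationContext request hook (illustrative only, assuming the azure-storage-java event API; the 5 minute value is just a placeholder), e.g. for deployments that cannot pick up the new SDK right away:

{code}
import java.net.HttpURLConnection;

import com.microsoft.azure.storage.OperationContext;
import com.microsoft.azure.storage.SendingRequestEvent;
import com.microsoft.azure.storage.StorageEvent;

public class SocketTimeoutHook {
    /**
     * Builds an OperationContext that applies a client-side socket (read)
     * timeout to every HTTP connection the storage client opens.
     */
    public static OperationContext withReadTimeout(final int timeoutMillis) {
        OperationContext context = new OperationContext();
        context.getSendingRequestEventHandler().addListener(
                new StorageEvent<SendingRequestEvent>() {
                    @Override
                    public void eventOccurred(SendingRequestEvent event) {
                        // The connection object is the underlying HttpURLConnection.
                        HttpURLConnection connection =
                                (HttpURLConnection) event.getConnectionObject();
                        connection.setReadTimeout(timeoutMillis);
                    }
                });
        return context;
    }
}
{code}

The returned context can then be passed as the OperationContext argument of blob calls (e.g. openInputStream); with the upgraded SDK this is not necessary, since the 5 minute default kicks in automatically.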


> WASB should configure client side socket timeout in storage client blob request options
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-11959
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11959
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: tools
>            Reporter: Ivan Mitic
>            Assignee: Ivan Mitic
>         Attachments: HADOOP-11959.patch
>
>
> On clusters/jobs where {{mapred.task.timeout}} is set to a larger value, we noticed that tasks can sometimes get stuck with the stack below.
> {code}
> Thread 1: (state = IN_NATIVE)
> - java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) @bci=0 (Interpreted frame)
> - java.net.SocketInputStream.read(byte[], int, int, int) @bci=87, line=152 (Interpreted frame)
> - java.net.SocketInputStream.read(byte[], int, int) @bci=11, line=122 (Interpreted frame)
> - java.io.BufferedInputStream.fill() @bci=175, line=235 (Interpreted frame)
> - java.io.BufferedInputStream.read1(byte[], int, int) @bci=44, line=275 (Interpreted frame)
> - java.io.BufferedInputStream.read(byte[], int, int) @bci=49, line=334 (Interpreted frame)
> - sun.net.www.MeteredStream.read(byte[], int, int) @bci=16, line=134 (Interpreted frame)
> - java.io.FilterInputStream.read(byte[], int, int) @bci=7, line=133 (Interpreted frame)
> - sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(byte[], int, int) @bci=4, line=3053 (Interpreted frame)
> - com.microsoft.azure.storage.core.NetworkInputStream.read(byte[], int, int) @bci=7, line=49 (Interpreted frame)
> - com.microsoft.azure.storage.blob.CloudBlob$10.postProcessResponse(java.net.HttpURLConnection, com.microsoft.azure.storage.blob.CloudBlob, com.microsoft.azure.storage.blob.CloudBlobClient, com.microsoft.azure.storage.OperationContext, java.lang.Integer) @bci=204, line=1691 (Interpreted frame)
> - com.microsoft.azure.storage.blob.CloudBlob$10.postProcessResponse(java.net.HttpURLConnection, java.lang.Object, java.lang.Object, com.microsoft.azure.storage.OperationContext, java.lang.Object) @bci=17, line=1613 (Interpreted frame)
> - com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(java.lang.Object, java.lang.Object, com.microsoft.azure.storage.core.StorageRequest, com.microsoft.azure.storage.RetryPolicyFactory, com.microsoft.azure.storage.OperationContext) @bci=352, line=148 (Interpreted frame)
> - com.microsoft.azure.storage.blob.CloudBlob.downloadRangeInternal(long, java.lang.Long, byte[], int, com.microsoft.azure.storage.AccessCondition, com.microsoft.azure.storage.blob.BlobRequestOptions, com.microsoft.azure.storage.OperationContext) @bci=131, line=1468 (Interpreted frame)
> - com.microsoft.azure.storage.blob.BlobInputStream.dispatchRead(int) @bci=31, line=255 (Interpreted frame)
> - com.microsoft.azure.storage.blob.BlobInputStream.readInternal(byte[], int, int) @bci=52, line=448 (Interpreted frame)
> - com.microsoft.azure.storage.blob.BlobInputStream.read(byte[], int, int) @bci=28, line=420 (Interpreted frame)
> - java.io.BufferedInputStream.read1(byte[], int, int) @bci=39, line=273 (Interpreted frame)
> - java.io.BufferedInputStream.read(byte[], int, int) @bci=49, line=334 (Interpreted frame)
> - java.io.DataInputStream.read(byte[], int, int) @bci=7, line=149 (Interpreted frame)
> - org.apache.hadoop.fs.azure.NativeAzureFileSystem$NativeAzureFsInputStream.read(byte[], int, int) @bci=10, line=734 (Interpreted frame)
> - java.io.BufferedInputStream.read1(byte[], int, int) @bci=39, line=273 (Interpreted frame)
> - java.io.BufferedInputStream.read(byte[], int, int) @bci=49, line=334 (Interpreted frame)
> - java.io.DataInputStream.read(byte[]) @bci=8, line=100 (Interpreted frame)
> - org.apache.hadoop.util.LineReader.fillBuffer(java.io.InputStream, byte[], boolean) @bci=2, line=180 (Interpreted frame)
> - org.apache.hadoop.util.LineReader.readDefaultLine(org.apache.hadoop.io.Text, int, int) @bci=64, line=216 (Compiled frame)
> - org.apache.hadoop.util.LineReader.readLine(org.apache.hadoop.io.Text, int, int) @bci=19, line=174 (Interpreted frame)
> - org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue() @bci=108, line=185 (Interpreted frame)
> - org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue() @bci=13, line=553 (Interpreted frame)
> - org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue() @bci=4, line=80 (Interpreted frame)
> - org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue() @bci=4, line=91 (Interpreted frame)
> - org.apache.hadoop.mapreduce.Mapper.run(org.apache.hadoop.mapreduce.Mapper$Context) @bci=6, line=144 (Interpreted frame)
> - org.apache.hadoop.mapred.MapTask.runNewMapper(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.mapreduce.split.JobSplit$TaskSplitIndex, org.apache.hadoop.mapred.TaskUmbilicalProtocol, org.apache.hadoop.mapred.Task$TaskReporter) @bci=228, line=784 (Interpreted frame)
> - org.apache.hadoop.mapred.MapTask.run(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.mapred.TaskUmbilicalProtocol) @bci=148, line=341 (Interpreted frame)
> - org.apache.hadoop.mapred.YarnChild$2.run() @bci=29, line=163 (Interpreted frame)
> - java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction, java.security.AccessControlContext) @bci=0 (Interpreted frame)
> - javax.security.auth.Subject.doAs(javax.security.auth.Subject, java.security.PrivilegedExceptionAction) @bci=42, line=415 (Interpreted frame)
> - org.apache.hadoop.security.UserGroupInformation.doAs(java.security.PrivilegedExceptionAction) @bci=14, line=1628 (Interpreted frame)
> - org.apache.hadoop.mapred.YarnChild.main(java.lang.String[]) @bci=514, line=158 (Interpreted frame)
> {code}
> The issue is that, by default, the storage client does not set a socket timeout on its HTTP connections, so in some (rare) circumstances we end up in a deadlock (e.g. when the server on the other side dies unexpectedly).
> The fix is to configure the maximum operation time on the storage client request options. 
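>
> For illustration, a minimal sketch of what configuring the maximum operation time on the blob request options could look like (assuming the azure-storage-java BlobRequestOptions API; the connection string, container/blob names, and timeout values are placeholders):
> {code}
> import com.microsoft.azure.storage.CloudStorageAccount;
> import com.microsoft.azure.storage.blob.BlobInputStream;
> import com.microsoft.azure.storage.blob.BlobRequestOptions;
> import com.microsoft.azure.storage.blob.CloudBlobClient;
> import com.microsoft.azure.storage.blob.CloudBlobContainer;
> import com.microsoft.azure.storage.blob.CloudBlockBlob;
>
> public class BoundedBlobRead {
>     public static void main(String[] args) throws Exception {
>         // Placeholder: pass a real storage connection string as the first argument.
>         CloudStorageAccount account = CloudStorageAccount.parse(args[0]);
>         CloudBlobClient client = account.createCloudBlobClient();
>
>         // Bound the total client-side time for a single blob operation so a
>         // stuck socket read cannot hang the task indefinitely.
>         BlobRequestOptions options = new BlobRequestOptions();
>         options.setMaximumExecutionTimeInMs(5 * 60 * 1000); // 5 minutes end-to-end
>         options.setTimeoutIntervalInMs(90 * 1000);          // per-attempt server timeout
>
>         CloudBlobContainer container = client.getContainerReference("mycontainer");
>         CloudBlockBlob blob = container.getBlockBlobReference("myblob.txt");
>
>         // Pass the options explicitly on the read path.
>         BlobInputStream in = blob.openInputStream(null, options, null);
>         try {
>             byte[] buffer = new byte[8192];
>             while (in.read(buffer) != -1) {
>                 // consume the stream
>             }
>         } finally {
>             in.close();
>         }
>     }
> }
> {code}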



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)