Posted to common-issues@hadoop.apache.org by "Thomas (JIRA)" <ji...@apache.org> on 2017/06/30 23:24:00 UTC
[jira] [Comment Edited] (HADOOP-14535) Support for random access and seek of block blobs

    [ https://issues.apache.org/jira/browse/HADOOP-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070851#comment-16070851 ] 

Thomas edited comment on HADOOP-14535 at 6/30/17 11:23 PM:
-----------------------------------------------------------

I am attaching the updated patch (0005-Random-access-and-seek-improvements-to-azure-file-system.patch).  Random access is as much as 90% faster for block blobs *without* any regressions.  Unit tests in TestBlockBlobInputStream.java demonstrate the performance improvement for random access and show that sequential reads after a reverse seek do not regress.

However, please note that unit tests and developer machines are not an appropriate environment for measuring performance.  The performance tests in TestBlockBlobInputStream.java merely demonstrate the behavior and prevent regressions.  Many things can skew performance measurements taken over short periods of time, including fluctuations in network traffic and routing, fluctuations in the activity of other processes running on the client, fluctuations in load on the shared stamp that hosts your Azure Storage account, and throttling sometimes performed by enterprise IT departments.  The performance tests included with this change are written to execute quickly, tolerate these fluctuations, and catch regressions in the code.  While implementing and running these unit tests, I also validated the performance improvements by running variations of the code for longer periods, and the results looked favorable.
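
One common way to make a short timing test tolerant of the fluctuations listed above is to compare the two code paths by the median of several paired runs rather than by a single measurement.  The sketch below is only an illustration of that idea; the class RelativeTiming and its methods are hypothetical and are not taken from TestBlockBlobInputStream.java, which may use a different approach.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of the patch: compares two workloads by median speedup. */
final class RelativeTiming {

    /** Returns the median of the samples; the input array is left unmodified. */
    static double median(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Elapsed wall-clock nanoseconds for one run of the task. */
    static long timeNanos(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    /**
     * Runs baseline and candidate back to back for the given number of trials
     * and returns the median of (baseline time / candidate time).
     * A value above 1.0 suggests the candidate is faster.
     */
    static double medianSpeedup(Runnable baseline, Runnable candidate, int trials) {
        double[] ratios = new double[trials];
        for (int i = 0; i < trials; i++) {
            long baselineNanos = timeNanos(baseline);
            long candidateNanos = timeNanos(candidate);
            // Guard against a zero denominator on very fast runs.
            ratios[i] = (double) baselineNanos / Math.max(1L, candidateNanos);
        }
        return median(ratios);
    }
}
```

Pairing the runs means a slow period on the network or the client tends to affect both measurements in a trial, and the median discards the occasional outlier.  This reduces noise but does not remove it, so longer runs are still needed for real measurements.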

My team plans to review and improve the instrumentation (Hadoop Metrics) for the wasb:// file system.  Although this change does not include new metrics, we will be looking into this in the future.

*ALL* tests in *"hadoop-tools/hadoop-azure"* are *passing* with the patch (0005-Random-access-and-seek-improvements-to-azure-file-system.patch).


was (Author: tmarquardt):
I am attaching the updated patch (0005-Random-access-and-seek-improvements-to-azure-file-system.patch).  Random access is as much as 90% faster for block blobs *without* any regressions.  Unit tests in TestBlockBlobInputStream.java demonstrate the performance improvement for random access and show that sequential reads after a reverse seek do not regress.

However, please note that unit tests and developer machines are not an appropriate environment for measuring performance.  The performance tests in TestBlockBlobInputStream.java merely demonstrate the behavior and prevent regressions.  Many things can skew performance measurements taken over short periods of time, including fluctuations in network traffic and routing, fluctuations in the activity of other processes running on the client, fluctuations in load on the shared stamp that hosts your Azure Storage account, and throttling sometimes performed by enterprise IT departments.  The performance tests included with this change are written to execute quickly, tolerate these fluctuations, and catch regressions in the code.  While implementing and running these unit tests, I also validated the performance improvements by running variations of the code for longer periods, and the results looked favorable.

My team plans to review and improve the instrumentation (Hadoop Metrics) for the wasb:// file system.  Although this change does not include new metrics, we will be looking into this in the future.

ALL tests in "hadoop-tools/hadoop-azure" are passing with the patch (0005-Random-access-and-seek-improvements-to-azure-file-system.patch).

> Support for random access and seek of block blobs
> -------------------------------------------------
>
>                 Key: HADOOP-14535
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14535
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/azure
>            Reporter: Thomas
>            Assignee: Thomas
>         Attachments: 0001-Random-access-and-seek-imporvements-to-azure-file-system.patch, 0003-Random-access-and-seek-imporvements-to-azure-file-system.patch, 0004-Random-access-and-seek-imporvements-to-azure-file-system.patch, 0005-Random-access-and-seek-imporvements-to-azure-file-system.patch
>
>
> This change adds a seekable stream for reading block blobs to the wasb:// file system.
> If seek() is not used or if only forward seek() is used, the behavior of read() is unchanged.
> That is, the stream is optimized for sequential reads by reading chunks (over the network) of
> the size specified by "fs.azure.read.request.size" (default is 4 megabytes).
> If reverse seek() is used, the behavior of read() changes in favor of reading the actual number
> of bytes requested in the call to read(), with some constraints.  If the size requested is smaller
> than 16 kilobytes and cannot be satisfied by the internal buffer, the network read will be 16
> kilobytes.  If the size requested is greater than 4 megabytes, it will be satisfied by sequential
> 4 megabyte reads over the network.
> This change improves the performance of FSInputStream.seek() by not closing and re-opening the
> stream, which for block blobs also involves a network operation to read the blob metadata. Now
> NativeAzureFsInputStream.seek() checks if the stream is seekable and moves the read position.
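
The read-size rules and the in-place seek() from the issue description can be sketched as a small planner.  This is a minimal illustration, not the code from the patch: the class ReadSizePlanner and its methods are hypothetical, it models only the size of the next network read (no buffering or I/O), and it assumes that the 4 megabyte cap on random reads is the same value as "fs.azure.read.request.size".

```java
/** Hypothetical sketch of the read-size rules; not the class from the patch. */
final class ReadSizePlanner {
    /** Floor for a random read that the internal buffer cannot satisfy (16 KB). */
    static final int MIN_RANDOM_READ = 16 * 1024;
    /** Default of fs.azure.read.request.size (4 MB), per the issue description. */
    static final int DEFAULT_REQUEST_SIZE = 4 * 1024 * 1024;

    private final int requestSize;
    private long position;
    private boolean reverseSeekSeen;

    ReadSizePlanner(int requestSize) {
        if (requestSize <= 0) {
            throw new IllegalArgumentException("request size must be positive");
        }
        this.requestSize = requestSize;
    }

    /** Moves the read position in place; nothing is closed or re-opened. */
    void seek(long newPosition) {
        if (newPosition < 0) {
            throw new IllegalArgumentException("negative seek position");
        }
        if (newPosition < position) {
            reverseSeekSeen = true; // switch from sequential to random-access sizing
        }
        position = newPosition;
    }

    /** Records that bytes were consumed by a read(). */
    void advance(long bytesRead) {
        position += bytesRead;
    }

    long position() {
        return position;
    }

    /**
     * Size of the next network read for a read() of the given size that the
     * internal buffer cannot satisfy. Requests larger than the cap are served
     * by the caller issuing further reads of at most requestSize bytes.
     */
    int nextNetworkReadSize(int requested) {
        if (requested <= 0) {
            return 0;
        }
        if (!reverseSeekSeen) {
            return requestSize; // sequential mode: always fetch a full chunk
        }
        int atLeastMinimum = Math.max(requested, MIN_RANDOM_READ);
        return Math.min(atLeastMinimum, requestSize);
    }
}
```

With the default request size, a 100-byte read before any reverse seek maps to a 4 megabyte network read.  After a reverse seek, the same 100-byte read maps to 16 kilobytes, a 100,000-byte read maps to exactly 100,000 bytes, and a 5 megabyte read is capped at 4 megabytes per network read.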



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org