
Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2020/04/20 02:19:14 UTC

[GitHub] [hadoop-ozone] runzhiwang opened a new pull request #843: HDDS-3223. Improve s3g read 1GB object efficiency by 100 times

runzhiwang opened a new pull request #843:
URL: https://github.com/apache/hadoop-ozone/pull/843

## What changes were proposed in this pull request?
**What's the problem?**

Reading a 1000 MB object takes about 470 seconds, i.e. 2.2 MB/s, which is too slow.
![image](https://user-images.githubusercontent.com/51938049/79706793-028ee280-82ed-11ea-8bea-ce34e712ff70.png)

**What's the reason?**
Reading a 1000 MB object takes 50 GET requests, each reading 20 MB. For each GET, the call stack is: [IOUtils::copyLarge](https://github.com/apache/hadoop-ozone/blob/master/hadoop-ozone/s3gateway/src/main/java/org/apache/hadoop/ozone/s3/endpoint/ObjectEndpoint.java#L262) -> [IOUtils::skipFully](https://github.com/apache/commons-io/blob/master/src/main/java/org/apache/commons/io/IOUtils.java#L1190) -> [IOUtils::skip](https://github.com/apache/commons-io/blob/master/src/main/java/org/apache/commons/io/IOUtils.java#L2064) -> [InputStream::read](https://github.com/apache/commons-io/blob/master/src/main/java/org/apache/commons/io/IOUtils.java#L1957).

This means that the 50th GET request, which should read only 980 MB-1000 MB, first has to skip 0-980 MB, and it does so by calling [InputStream::read](https://github.com/apache/commons-io/blob/master/src/main/java/org/apache/commons/io/IOUtils.java#L1957) over the whole 0-980 MB. So the 1st GET request reads 0-20 MB, the 2nd reads 0-40 MB, the 3rd reads 0-60 MB, ..., and the 50th reads 0-1000 MB. The GET requests therefore become slower and slower from the 1st to the 50th.
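To put a number on the pattern described above, here is a minimal sketch of the arithmetic. The class and method names (`RangeCost`, `totalUnitsRead`) are hypothetical and not part of Ozone or commons-io; the only inputs are the 50 requests and the 20 MB part size stated above.

```java
// Hypothetical helper for illustration only; not Ozone or commons-io code.
class RangeCost {
    // Total amount read from the underlying stream when request i
    // reads everything from offset 0 up to the end of its own part.
    static long totalUnitsRead(int requests, long partSize) {
        long total = 0;
        for (int i = 1; i <= requests; i++) {
            total += i * partSize; // request i reads [0, i * partSize)
        }
        return total;
    }

    public static void main(String[] args) {
        long totalMb = totalUnitsRead(50, 20);
        // 20 * (1 + 2 + ... + 50) = 20 * 1275 = 25500
        System.out.println(totalMb + " MB read to serve a 1000 MB object");
    }
}
```

Under these assumptions the gateway reads 25500 MB from the underlying stream to deliver 1000 MB, i.e. 25.5 times the object size, and the cost grows quadratically with the number of parts.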

See [IO-203](https://issues.apache.org/jira/browse/IO-203) for why IOUtils implements skip by reading rather than by a real skip, e.g. seek.

**How to improve?**
Replace [IOUtils::skipFully](https://github.com/apache/commons-io/blob/master/src/main/java/org/apache/commons/io/IOUtils.java#L1190) with [S3WrapperInputStream::seek](https://github.com/apache/hadoop-ozone/blob/master/hadoop-ozone/s3gateway/src/main/java/org/apache/hadoop/ozone/s3/io/S3WrapperInputStream.java#L67).
After this change, reading a 1000 MB object takes 4.79 seconds, i.e. 219 MB/s, about 100 times faster.
![image](https://user-images.githubusercontent.com/51938049/79707421-01f74b80-82ef-11ea-9ae4-7bc7bde784e3.png)
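The difference between the two approaches can be illustrated with a small, self-contained sketch. It is not the patch itself: `CountingSeekableStream` and `RangeReadDemo` are hypothetical names, and the in-memory stream only stands in for a seekable object stream such as `S3WrapperInputStream`. It counts how many bytes are actually read to serve one range request.

```java
import java.io.ByteArrayInputStream;

// Hypothetical in-memory stand-in for a seekable object stream.
// It is NOT Ozone's S3WrapperInputStream; it only counts bytes read.
class CountingSeekableStream extends ByteArrayInputStream {
    long bytesRead = 0;

    CountingSeekableStream(byte[] data) {
        super(data);
    }

    @Override
    public synchronized int read(byte[] b, int off, int len) {
        int n = super.read(b, off, len);
        if (n > 0) {
            bytesRead += n;
        }
        return n;
    }

    // Jump straight to the offset without reading the bytes before it.
    synchronized void seek(long position) {
        if (position < 0 || position > count) {
            throw new IllegalArgumentException("position out of range: " + position);
        }
        pos = (int) position;
    }
}

class RangeReadDemo {
    // Reads and discards up to `length` bytes through a scratch buffer.
    static void readAndDiscard(CountingSeekableStream in, int length) {
        byte[] scratch = new byte[4096];
        int remaining = length;
        while (remaining > 0) {
            int n = in.read(scratch, 0, Math.min(scratch.length, remaining));
            if (n < 0) {
                break; // end of stream
            }
            remaining -= n;
        }
    }

    // Skip-by-read: the prefix [0, offset) is read just to be thrown away.
    static long rangeViaSkipByRead(byte[] object, int offset, int length) {
        CountingSeekableStream in = new CountingSeekableStream(object);
        readAndDiscard(in, offset);
        readAndDiscard(in, length);
        return in.bytesRead;
    }

    // Seek: only the requested range is read.
    static long rangeViaSeek(byte[] object, int offset, int length) {
        CountingSeekableStream in = new CountingSeekableStream(object);
        in.seek(offset);
        readAndDiscard(in, length);
        return in.bytesRead;
    }
}
```

For a 1000-byte array with a request for bytes 980-1000, `rangeViaSkipByRead` reads all 1000 bytes while `rangeViaSeek` reads only 20, which mirrors the 50th GET request in the description at a smaller scale.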

## What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-3223

## How was this patch tested?

Existing unit tests (UT) and integration tests (IT).
