You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2020/04/20 02:19:14 UTC

[GitHub] [hadoop-ozone] runzhiwang opened a new pull request #843: HDDS-3223. Improve s3g read 1GB object efficiency by 100 times

runzhiwang opened a new pull request #843:
URL: https://github.com/apache/hadoop-ozone/pull/843


   ## What changes were proposed in this pull request?
   **What's the problem ?**
   
   Read 1000M object, it cost about 470 seconds, i.e. 2.2M/s, which is too slow. 
   ![image](https://user-images.githubusercontent.com/51938049/79706793-028ee280-82ed-11ea-8bea-ce34e712ff70.png)
   
   **What's the reason ?**
   When read 1000M object, there are 50 GET requests, each GET request read 20M. When do GET, the stack is: [IOUtils::copyLarge](https://github.com/apache/hadoop-ozone/blob/master/hadoop-ozone/s3gateway/src/main/java/org/apache/hadoop/ozone/s3/endpoint/ObjectEndpoint.java#L262) -> [IOUtils::skipFully](https://github.com/apache/commons-io/blob/master/src/main/java/org/apache/commons/io/IOUtils.java#L1190) -> [IOUtils::skip](https://github.com/apache/commons-io/blob/master/src/main/java/org/apache/commons/io/IOUtils.java#L2064) -> [InputStream::read](https://github.com/apache/commons-io/blob/master/src/main/java/org/apache/commons/io/IOUtils.java#L1957).
   
   It means, the 50th GET request which should read 980M-1000M, but to skip 0-980M, it also [InputStream::read](https://github.com/apache/commons-io/blob/master/src/main/java/org/apache/commons/io/IOUtils.java#L1957) 0-980M. So the 1st GET request read 0-20M, the 2nd GET request read 0-40M, the 3rd GET request read 0-60M, ..., the 50th GET request read 0-1000M. So the GET  request from 1st-50th become slower and slower.
   
   You can also refer it [here](https://issues.apache.org/jira/browse/IO-203) why IOUtils implement skip by read rather than real skip, e.g. seek.
   
   **How to improve ?**
   Replace [IOUtils::skipFully](https://github.com/apache/commons-io/blob/master/src/main/java/org/apache/commons/io/IOUtils.java#L1190)  with [S3WrapperInputStream::seek](https://github.com/apache/hadoop-ozone/blob/master/hadoop-ozone/s3gateway/src/main/java/org/apache/hadoop/ozone/s3/io/S3WrapperInputStream.java#L67). 
   After improving, read 1000M object cost 4.79 seconds, i.e. 219M/s, about 100 times faster.
   ![image](https://user-images.githubusercontent.com/51938049/79707421-01f74b80-82ef-11ea-9ae4-7bc7bde784e3.png)
   
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-3223
   
   ## How was this patch tested?
   
   Existed UT and IT.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] bharatviswa504 commented on a change in pull request #843: HDDS-3223. Improve s3g read 1GB object efficiency by 100 times

Posted by GitBox <gi...@apache.org>.
bharatviswa504 commented on a change in pull request #843:
URL: https://github.com/apache/hadoop-ozone/pull/843#discussion_r412437219



##########
File path: hadoop-ozone/s3gateway/src/main/java/org/apache/hadoop/ozone/s3/io/S3WrapperInputStream.java
##########
@@ -76,4 +79,87 @@ public long getPos() throws IOException {
   public boolean seekToNewSource(long targetPos) throws IOException {
     return false;
   }
+
+  /**
+   * Copies some or all bytes from a large (over 2GB) <code>InputStream</code>
+   * to an <code>OutputStream</code>, optionally skipping input bytes.
+   * <p>
+   * Copy the method from IOUtils of commons-io to reimplement skip by seek
+   * rather than read. The reason why IOUtils of commons-io implement skip
+   * by read can be found at
+   * <a href="https://issues.apache.org/jira/browse/IO-203">IO-203</a>.
+   * </p>
+   * <p>
+   * This method buffers the input internally, so there is no need to use a
+   * <code>BufferedInputStream</code>.
+   * </p>
+   * The buffer size is given by {@link #DEFAULT_BUFFER_SIZE}.
+   *
+   * @param output the <code>OutputStream</code> to write to
+   * @param inputOffset : number of bytes to skip from input before copying
+   * -ve values are ignored
+   * @param length : number of bytes to copy. -ve means all
+   * @return the number of bytes copied
+   * @throws NullPointerException if the input or output is null
+   * @throws IOException          if an I/O error occurs
+   */
+  public long copyLarge(final OutputStream output, final long inputOffset,
+      final long length) throws IOException {
+    return copyLarge(output, inputOffset, length,
+        new byte[DEFAULT_BUFFER_SIZE]);
+  }
+
+  /**
+   * Copies some or all bytes from a large (over 2GB) <code>InputStream</code>
+   * to an <code>OutputStream</code>, optionally skipping input bytes.
+   * <p>
+   * Copy the method from IOUtils of commons-io to reimplement skip by seek
+   * rather than read. The reason why IOUtils of commons-io implement skip
+   * by read can be found at
+   * <a href="https://issues.apache.org/jira/browse/IO-203">IO-203</a>.
+   * </p>
+   * <p>
+   * This method uses the provided buffer, so there is no need to use a
+   * <code>BufferedInputStream</code>.
+   * </p>
+   *
+   * @param output the <code>OutputStream</code> to write to
+   * @param inputOffset : number of bytes to skip from input before copying
+   * -ve values are ignored
+   * @param length : number of bytes to copy. -ve means all

Review comment:
       Got it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] runzhiwang commented on issue #843: HDDS-3223. Improve s3g read 1GB object efficiency by 100 times

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on issue #843:
URL: https://github.com/apache/hadoop-ozone/pull/843#issuecomment-617816219


   @bharatviswa504 Hi, I move copyLarge into KeyInputStream, and add integration test in TestKeyInputStream for it. And also change all the use of IOUtils.copyLarge. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] bharatviswa504 commented on a change in pull request #843: HDDS-3223. Improve s3g read 1GB object efficiency by 100 times

Posted by GitBox <gi...@apache.org>.
bharatviswa504 commented on a change in pull request #843:
URL: https://github.com/apache/hadoop-ozone/pull/843#discussion_r411883841



##########
File path: hadoop-ozone/s3gateway/src/main/java/org/apache/hadoop/ozone/s3/io/S3WrapperInputStream.java
##########
@@ -76,4 +79,87 @@ public long getPos() throws IOException {
   public boolean seekToNewSource(long targetPos) throws IOException {
     return false;
   }
+
+  /**
+   * Copies some or all bytes from a large (over 2GB) <code>InputStream</code>
+   * to an <code>OutputStream</code>, optionally skipping input bytes.
+   * <p>
+   * Copy the method from IOUtils of commons-io to reimplement skip by seek
+   * rather than read. The reason why IOUtils of commons-io implement skip
+   * by read can be found at
+   * <a href="https://issues.apache.org/jira/browse/IO-203">IO-203</a>.
+   * </p>
+   * <p>
+   * This method buffers the input internally, so there is no need to use a
+   * <code>BufferedInputStream</code>.
+   * </p>
+   * The buffer size is given by {@link #DEFAULT_BUFFER_SIZE}.
+   *
+   * @param output the <code>OutputStream</code> to write to
+   * @param inputOffset : number of bytes to skip from input before copying
+   * -ve values are ignored
+   * @param length : number of bytes to copy. -ve means all
+   * @return the number of bytes copied
+   * @throws NullPointerException if the input or output is null
+   * @throws IOException          if an I/O error occurs
+   */
+  public long copyLarge(final OutputStream output, final long inputOffset,
+      final long length) throws IOException {
+    return copyLarge(output, inputOffset, length,
+        new byte[DEFAULT_BUFFER_SIZE]);
+  }
+
+  /**
+   * Copies some or all bytes from a large (over 2GB) <code>InputStream</code>
+   * to an <code>OutputStream</code>, optionally skipping input bytes.
+   * <p>
+   * Copy the method from IOUtils of commons-io to reimplement skip by seek
+   * rather than read. The reason why IOUtils of commons-io implement skip
+   * by read can be found at
+   * <a href="https://issues.apache.org/jira/browse/IO-203">IO-203</a>.
+   * </p>
+   * <p>
+   * This method uses the provided buffer, so there is no need to use a
+   * <code>BufferedInputStream</code>.
+   * </p>
+   *
+   * @param output the <code>OutputStream</code> to write to
+   * @param inputOffset : number of bytes to skip from input before copying
+   * -ve values are ignored
+   * @param length : number of bytes to copy. -ve means all

Review comment:
       -ve means all. But I don't see that handled in the code.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] runzhiwang commented on a change in pull request #843: HDDS-3223. Improve s3g read 1GB object efficiency by 100 times

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on a change in pull request #843:
URL: https://github.com/apache/hadoop-ozone/pull/843#discussion_r411905664



##########
File path: hadoop-ozone/s3gateway/src/main/java/org/apache/hadoop/ozone/s3/io/S3WrapperInputStream.java
##########
@@ -76,4 +79,87 @@ public long getPos() throws IOException {
   public boolean seekToNewSource(long targetPos) throws IOException {
     return false;
   }
+
+  /**
+   * Copies some or all bytes from a large (over 2GB) <code>InputStream</code>
+   * to an <code>OutputStream</code>, optionally skipping input bytes.
+   * <p>
+   * Copy the method from IOUtils of commons-io to reimplement skip by seek
+   * rather than read. The reason why IOUtils of commons-io implement skip
+   * by read can be found at
+   * <a href="https://issues.apache.org/jira/browse/IO-203">IO-203</a>.
+   * </p>
+   * <p>
+   * This method buffers the input internally, so there is no need to use a
+   * <code>BufferedInputStream</code>.
+   * </p>
+   * The buffer size is given by {@link #DEFAULT_BUFFER_SIZE}.
+   *
+   * @param output the <code>OutputStream</code> to write to
+   * @param inputOffset : number of bytes to skip from input before copying
+   * -ve values are ignored
+   * @param length : number of bytes to copy. -ve means all
+   * @return the number of bytes copied
+   * @throws NullPointerException if the input or output is null
+   * @throws IOException          if an I/O error occurs
+   */
+  public long copyLarge(final OutputStream output, final long inputOffset,
+      final long length) throws IOException {
+    return copyLarge(output, inputOffset, length,
+        new byte[DEFAULT_BUFFER_SIZE]);
+  }
+
+  /**
+   * Copies some or all bytes from a large (over 2GB) <code>InputStream</code>
+   * to an <code>OutputStream</code>, optionally skipping input bytes.
+   * <p>
+   * Copy the method from IOUtils of commons-io to reimplement skip by seek
+   * rather than read. The reason why IOUtils of commons-io implement skip
+   * by read can be found at
+   * <a href="https://issues.apache.org/jira/browse/IO-203">IO-203</a>.
+   * </p>
+   * <p>
+   * This method uses the provided buffer, so there is no need to use a
+   * <code>BufferedInputStream</code>.
+   * </p>
+   *
+   * @param output the <code>OutputStream</code> to write to
+   * @param inputOffset : number of bytes to skip from input before copying
+   * -ve values are ignored
+   * @param length : number of bytes to copy. -ve means all

Review comment:
       @bharatviswa504 Hi, thanks for your comments. I copy the code and java doc from commons-io and only change one line. i.e. `skipFully(input, inputOffset);` -> seek(inputOffset).
    And -ve read entire file till end. If `length < 0`, then bytesToRead equals to bufferLength and never change. So while loop will not break until `read == IOUtils.EOF`. So it call read entire file till end.
   ![image](https://user-images.githubusercontent.com/51938049/79832285-afe02400-83db-11ea-9f36-2ee754659bfc.png)
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] runzhiwang commented on a change in pull request #843: HDDS-3223. Improve s3g read 1GB object efficiency by 100 times

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on a change in pull request #843:
URL: https://github.com/apache/hadoop-ozone/pull/843#discussion_r411905664



##########
File path: hadoop-ozone/s3gateway/src/main/java/org/apache/hadoop/ozone/s3/io/S3WrapperInputStream.java
##########
@@ -76,4 +79,87 @@ public long getPos() throws IOException {
   public boolean seekToNewSource(long targetPos) throws IOException {
     return false;
   }
+
+  /**
+   * Copies some or all bytes from a large (over 2GB) <code>InputStream</code>
+   * to an <code>OutputStream</code>, optionally skipping input bytes.
+   * <p>
+   * Copy the method from IOUtils of commons-io to reimplement skip by seek
+   * rather than read. The reason why IOUtils of commons-io implement skip
+   * by read can be found at
+   * <a href="https://issues.apache.org/jira/browse/IO-203">IO-203</a>.
+   * </p>
+   * <p>
+   * This method buffers the input internally, so there is no need to use a
+   * <code>BufferedInputStream</code>.
+   * </p>
+   * The buffer size is given by {@link #DEFAULT_BUFFER_SIZE}.
+   *
+   * @param output the <code>OutputStream</code> to write to
+   * @param inputOffset : number of bytes to skip from input before copying
+   * -ve values are ignored
+   * @param length : number of bytes to copy. -ve means all
+   * @return the number of bytes copied
+   * @throws NullPointerException if the input or output is null
+   * @throws IOException          if an I/O error occurs
+   */
+  public long copyLarge(final OutputStream output, final long inputOffset,
+      final long length) throws IOException {
+    return copyLarge(output, inputOffset, length,
+        new byte[DEFAULT_BUFFER_SIZE]);
+  }
+
+  /**
+   * Copies some or all bytes from a large (over 2GB) <code>InputStream</code>
+   * to an <code>OutputStream</code>, optionally skipping input bytes.
+   * <p>
+   * Copy the method from IOUtils of commons-io to reimplement skip by seek
+   * rather than read. The reason why IOUtils of commons-io implement skip
+   * by read can be found at
+   * <a href="https://issues.apache.org/jira/browse/IO-203">IO-203</a>.
+   * </p>
+   * <p>
+   * This method uses the provided buffer, so there is no need to use a
+   * <code>BufferedInputStream</code>.
+   * </p>
+   *
+   * @param output the <code>OutputStream</code> to write to
+   * @param inputOffset : number of bytes to skip from input before copying
+   * -ve values are ignored
+   * @param length : number of bytes to copy. -ve means all

Review comment:
       @bharatviswa504 Hi, thanks for your comments. I copy the code and java doc from commons-io and only change one line. i.e. `skipFully(input, inputOffset);` -> `seek(inputOffset)`. And other changes are related to code style.
    And -ve read entire file till end. If `length < 0`, then bytesToRead equals to bufferLength and never change. So while loop will not break until `read == IOUtils.EOF`. So it call read entire file till end.
   
   The follow image shows the code of commons-io.
   ![image](https://user-images.githubusercontent.com/51938049/79832285-afe02400-83db-11ea-9f36-2ee754659bfc.png)
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] runzhiwang commented on a change in pull request #843: HDDS-3223. Improve s3g read 1GB object efficiency by 100 times

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on a change in pull request #843:
URL: https://github.com/apache/hadoop-ozone/pull/843#discussion_r411905664



##########
File path: hadoop-ozone/s3gateway/src/main/java/org/apache/hadoop/ozone/s3/io/S3WrapperInputStream.java
##########
@@ -76,4 +79,87 @@ public long getPos() throws IOException {
   public boolean seekToNewSource(long targetPos) throws IOException {
     return false;
   }
+
+  /**
+   * Copies some or all bytes from a large (over 2GB) <code>InputStream</code>
+   * to an <code>OutputStream</code>, optionally skipping input bytes.
+   * <p>
+   * Copy the method from IOUtils of commons-io to reimplement skip by seek
+   * rather than read. The reason why IOUtils of commons-io implement skip
+   * by read can be found at
+   * <a href="https://issues.apache.org/jira/browse/IO-203">IO-203</a>.
+   * </p>
+   * <p>
+   * This method buffers the input internally, so there is no need to use a
+   * <code>BufferedInputStream</code>.
+   * </p>
+   * The buffer size is given by {@link #DEFAULT_BUFFER_SIZE}.
+   *
+   * @param output the <code>OutputStream</code> to write to
+   * @param inputOffset : number of bytes to skip from input before copying
+   * -ve values are ignored
+   * @param length : number of bytes to copy. -ve means all
+   * @return the number of bytes copied
+   * @throws NullPointerException if the input or output is null
+   * @throws IOException          if an I/O error occurs
+   */
+  public long copyLarge(final OutputStream output, final long inputOffset,
+      final long length) throws IOException {
+    return copyLarge(output, inputOffset, length,
+        new byte[DEFAULT_BUFFER_SIZE]);
+  }
+
+  /**
+   * Copies some or all bytes from a large (over 2GB) <code>InputStream</code>
+   * to an <code>OutputStream</code>, optionally skipping input bytes.
+   * <p>
+   * Copy the method from IOUtils of commons-io to reimplement skip by seek
+   * rather than read. The reason why IOUtils of commons-io implement skip
+   * by read can be found at
+   * <a href="https://issues.apache.org/jira/browse/IO-203">IO-203</a>.
+   * </p>
+   * <p>
+   * This method uses the provided buffer, so there is no need to use a
+   * <code>BufferedInputStream</code>.
+   * </p>
+   *
+   * @param output the <code>OutputStream</code> to write to
+   * @param inputOffset : number of bytes to skip from input before copying
+   * -ve values are ignored
+   * @param length : number of bytes to copy. -ve means all

Review comment:
       @bharatviswa504 Hi, thanks for your comments. I copy the code and java doc from commons-io and only change one line. i.e. `skipFully(input, inputOffset);` -> `seek(inputOffset)`. Other changes are related to code style.
    And -ve read entire file till end. If `length < 0`, then bytesToRead equals to bufferLength and never change. So while loop will not break until `read == IOUtils.EOF`. So it call read entire file till end.
   
   The follow image shows the code of commons-io.
   ![image](https://user-images.githubusercontent.com/51938049/79832285-afe02400-83db-11ea-9f36-2ee754659bfc.png)
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] runzhiwang commented on issue #843: HDDS-3223. Improve s3g read 1GB object efficiency by 100 times

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on issue #843:
URL: https://github.com/apache/hadoop-ozone/pull/843#issuecomment-616290170


   I will fix the failed UT


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] runzhiwang removed a comment on issue #843: HDDS-3223. Improve s3g read 1GB object efficiency by 100 times

Posted by GitBox <gi...@apache.org>.
runzhiwang removed a comment on issue #843:
URL: https://github.com/apache/hadoop-ozone/pull/843#issuecomment-616290170


   I will fix the failed UT


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] runzhiwang commented on a change in pull request #843: HDDS-3223. Improve s3g read 1GB object efficiency by 100 times

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on a change in pull request #843:
URL: https://github.com/apache/hadoop-ozone/pull/843#discussion_r411905664



##########
File path: hadoop-ozone/s3gateway/src/main/java/org/apache/hadoop/ozone/s3/io/S3WrapperInputStream.java
##########
@@ -76,4 +79,87 @@ public long getPos() throws IOException {
   public boolean seekToNewSource(long targetPos) throws IOException {
     return false;
   }
+
+  /**
+   * Copies some or all bytes from a large (over 2GB) <code>InputStream</code>
+   * to an <code>OutputStream</code>, optionally skipping input bytes.
+   * <p>
+   * Copy the method from IOUtils of commons-io to reimplement skip by seek
+   * rather than read. The reason why IOUtils of commons-io implement skip
+   * by read can be found at
+   * <a href="https://issues.apache.org/jira/browse/IO-203">IO-203</a>.
+   * </p>
+   * <p>
+   * This method buffers the input internally, so there is no need to use a
+   * <code>BufferedInputStream</code>.
+   * </p>
+   * The buffer size is given by {@link #DEFAULT_BUFFER_SIZE}.
+   *
+   * @param output the <code>OutputStream</code> to write to
+   * @param inputOffset : number of bytes to skip from input before copying
+   * -ve values are ignored
+   * @param length : number of bytes to copy. -ve means all
+   * @return the number of bytes copied
+   * @throws NullPointerException if the input or output is null
+   * @throws IOException          if an I/O error occurs
+   */
+  public long copyLarge(final OutputStream output, final long inputOffset,
+      final long length) throws IOException {
+    return copyLarge(output, inputOffset, length,
+        new byte[DEFAULT_BUFFER_SIZE]);
+  }
+
+  /**
+   * Copies some or all bytes from a large (over 2GB) <code>InputStream</code>
+   * to an <code>OutputStream</code>, optionally skipping input bytes.
+   * <p>
+   * Copy the method from IOUtils of commons-io to reimplement skip by seek
+   * rather than read. The reason why IOUtils of commons-io implement skip
+   * by read can be found at
+   * <a href="https://issues.apache.org/jira/browse/IO-203">IO-203</a>.
+   * </p>
+   * <p>
+   * This method uses the provided buffer, so there is no need to use a
+   * <code>BufferedInputStream</code>.
+   * </p>
+   *
+   * @param output the <code>OutputStream</code> to write to
+   * @param inputOffset : number of bytes to skip from input before copying
+   * -ve values are ignored
+   * @param length : number of bytes to copy. -ve means all

Review comment:
       @bharatviswa504 Hi, thanks for your comments.  If `length < 0`, then bytesToRead equals to bufferLength and never change. So while loop will not break until `read == IOUtils.EOF`. So it call read entire file till end.
   
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] bharatviswa504 commented on a change in pull request #843: HDDS-3223. Improve s3g read 1GB object efficiency by 100 times

Posted by GitBox <gi...@apache.org>.
bharatviswa504 commented on a change in pull request #843:
URL: https://github.com/apache/hadoop-ozone/pull/843#discussion_r411883841



##########
File path: hadoop-ozone/s3gateway/src/main/java/org/apache/hadoop/ozone/s3/io/S3WrapperInputStream.java
##########
@@ -76,4 +79,87 @@ public long getPos() throws IOException {
   public boolean seekToNewSource(long targetPos) throws IOException {
     return false;
   }
+
+  /**
+   * Copies some or all bytes from a large (over 2GB) <code>InputStream</code>
+   * to an <code>OutputStream</code>, optionally skipping input bytes.
+   * <p>
+   * Copy the method from IOUtils of commons-io to reimplement skip by seek
+   * rather than read. The reason why IOUtils of commons-io implement skip
+   * by read can be found at
+   * <a href="https://issues.apache.org/jira/browse/IO-203">IO-203</a>.
+   * </p>
+   * <p>
+   * This method buffers the input internally, so there is no need to use a
+   * <code>BufferedInputStream</code>.
+   * </p>
+   * The buffer size is given by {@link #DEFAULT_BUFFER_SIZE}.
+   *
+   * @param output the <code>OutputStream</code> to write to
+   * @param inputOffset : number of bytes to skip from input before copying
+   * -ve values are ignored
+   * @param length : number of bytes to copy. -ve means all
+   * @return the number of bytes copied
+   * @throws NullPointerException if the input or output is null
+   * @throws IOException          if an I/O error occurs
+   */
+  public long copyLarge(final OutputStream output, final long inputOffset,
+      final long length) throws IOException {
+    return copyLarge(output, inputOffset, length,
+        new byte[DEFAULT_BUFFER_SIZE]);
+  }
+
+  /**
+   * Copies some or all bytes from a large (over 2GB) <code>InputStream</code>
+   * to an <code>OutputStream</code>, optionally skipping input bytes.
+   * <p>
+   * Copy the method from IOUtils of commons-io to reimplement skip by seek
+   * rather than read. The reason why IOUtils of commons-io implement skip
+   * by read can be found at
+   * <a href="https://issues.apache.org/jira/browse/IO-203">IO-203</a>.
+   * </p>
+   * <p>
+   * This method uses the provided buffer, so there is no need to use a
+   * <code>BufferedInputStream</code>.
+   * </p>
+   *
+   * @param output the <code>OutputStream</code> to write to
+   * @param inputOffset : number of bytes to skip from input before copying
+   * -ve values are ignored
+   * @param length : number of bytes to copy. -ve means all

Review comment:
       -ve means all, so if I say -1/-2, read entire file till end?. 
   But I don't see that handled in the code.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] runzhiwang commented on issue #843: HDDS-3223. Improve s3g read 1GB object efficiency by 100 times

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on issue #843:
URL: https://github.com/apache/hadoop-ozone/pull/843#issuecomment-616981757


   > Thank You @runzhiwang for the analysis and great improvement in read performance.
   > 
   > I see there is one more usage of IOUtils.copyLarge in createMultipartKey, can we update over there also with a similar change.
   > 
   > L534:
   > 
   > ```
   >             if (range != null) {
   >               RangeHeader rangeHeader =
   >                   RangeHeaderParserUtil.parseRangeHeader(range, 0);
   >               IOUtils.copyLarge(sourceObject, ozoneOutputStream,
   >                   rangeHeader.getStartOffset(),
   >                   rangeHeader.getEndOffset() - rangeHeader.getStartOffset());
   > 
   >             }
   > ```
   > 
   > And also can we add test for S3WrapperInputStream which tests a new method with multiple seek values.
   @bharatviswa504 I will handle this. Thanks for your comments.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] runzhiwang commented on a change in pull request #843: HDDS-3223. Improve s3g read 1GB object efficiency by 100 times

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on a change in pull request #843:
URL: https://github.com/apache/hadoop-ozone/pull/843#discussion_r411905664



##########
File path: hadoop-ozone/s3gateway/src/main/java/org/apache/hadoop/ozone/s3/io/S3WrapperInputStream.java
##########
@@ -76,4 +79,87 @@ public long getPos() throws IOException {
   public boolean seekToNewSource(long targetPos) throws IOException {
     return false;
   }
+
+  /**
+   * Copies some or all bytes from a large (over 2GB) <code>InputStream</code>
+   * to an <code>OutputStream</code>, optionally skipping input bytes.
+   * <p>
+   * Copy the method from IOUtils of commons-io to reimplement skip by seek
+   * rather than read. The reason why IOUtils of commons-io implement skip
+   * by read can be found at
+   * <a href="https://issues.apache.org/jira/browse/IO-203">IO-203</a>.
+   * </p>
+   * <p>
+   * This method buffers the input internally, so there is no need to use a
+   * <code>BufferedInputStream</code>.
+   * </p>
+   * The buffer size is given by {@link #DEFAULT_BUFFER_SIZE}.
+   *
+   * @param output the <code>OutputStream</code> to write to
+   * @param inputOffset : number of bytes to skip from input before copying
+   * -ve values are ignored
+   * @param length : number of bytes to copy. -ve means all
+   * @return the number of bytes copied
+   * @throws NullPointerException if the input or output is null
+   * @throws IOException          if an I/O error occurs
+   */
+  public long copyLarge(final OutputStream output, final long inputOffset,
+      final long length) throws IOException {
+    return copyLarge(output, inputOffset, length,
+        new byte[DEFAULT_BUFFER_SIZE]);
+  }
+
+  /**
+   * Copies some or all bytes from a large (over 2GB) <code>InputStream</code>
+   * to an <code>OutputStream</code>, optionally skipping input bytes.
+   * <p>
+   * Copy the method from IOUtils of commons-io to reimplement skip by seek
+   * rather than read. The reason why IOUtils of commons-io implement skip
+   * by read can be found at
+   * <a href="https://issues.apache.org/jira/browse/IO-203">IO-203</a>.
+   * </p>
+   * <p>
+   * This method uses the provided buffer, so there is no need to use a
+   * <code>BufferedInputStream</code>.
+   * </p>
+   *
+   * @param output the <code>OutputStream</code> to write to
+   * @param inputOffset : number of bytes to skip from input before copying
+   * -ve values are ignored
+   * @param length : number of bytes to copy. -ve means all

Review comment:
       @bharatviswa504 Hi, thanks for your comments. I copy the code and java doc from commons-io and only change one line. i.e. `skipFully(input, inputOffset);` -> `seek(inputOffset)`. And other changes are related to code style.
    And -ve read entire file till end. If `length < 0`, then bytesToRead equals to bufferLength and never change. So while loop will not break until `read == IOUtils.EOF`. So it call read entire file till end.
   ![image](https://user-images.githubusercontent.com/51938049/79832285-afe02400-83db-11ea-9f36-2ee754659bfc.png)
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org