You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "bryanck (via GitHub)" <gi...@apache.org> on 2023/04/01 23:56:57 UTC

[GitHub] [iceberg] bryanck opened a new pull request, #7262: AWS: abort S3 input stream on close if not EOS

bryanck opened a new pull request, #7262:
URL: https://github.com/apache/iceberg/pull/7262

   This PR adds a check when closing an S3 input stream if the stream is at end-of-stream. If not, it calls `abort()` on the stream instead of `close()`. This avoids reading to the end of the stream when closing.
   
   When using the URL connection client with AWS, the underlying stream performs this [EOS check](https://github.com/JetBrains/jdk8u_jdk/blob/94318f9185757cc33d2b8d527d36be26ac6b7582/src/share/classes/sun/net/www/protocol/http/HttpURLConnection.java#L3520) on close and thus avoids unnecessarily reading the entire stream. The Apache HTTP connection [does not](https://github.com/apache/httpcomponents-core/blob/2135596b1319910d56efb148566ecbbec0884638/httpcore/src/main/java/org/apache/http/impl/io/ContentLengthInputStream.java#L108) behave this way.
   
   Now that the Apache HTTP client is the default instead of URL connection client for AWS, this change should prevent a performance regression.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1156501895


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -194,9 +195,28 @@ private void openStream() throws IOException {
     }
   }
 
-  private void closeStream() throws IOException {
+  private void closeStream() {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data with the Apache HTTP client
+      abortStream();
+      try {
+        stream.close();
+      } catch (Exception e) {
+        // close quietly, closing an aborted stream will throw a content length

Review Comment:
   [This](https://github.com/apache/httpcomponents-core/blob/2135596b1319910d56efb148566ecbbec0884638/httpcore/src/main/java/org/apache/http/impl/io/ContentLengthInputStream.java#L147)  is what gets thrown. Calling close on the aborted stream will attempt a read which then throws that.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "danielcweeks (via GitHub)" <gi...@apache.org>.
danielcweeks commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1155342460


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -194,9 +195,27 @@ private void openStream() throws IOException {
     }
   }
 
-  private void closeStream() throws IOException {
+  private void closeStream() {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data
+      abortStream();
+      try {
+        stream.close();
+      } catch (Exception e) {
+        // close quietly

Review Comment:
   We might want to log the underlying reason as info or warn here, just so we don't silently swallow issues that may come up.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1156360544


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -194,9 +195,29 @@ private void openStream() throws IOException {
     }
   }
 
-  private void closeStream() throws IOException {
+  private void closeStream() {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data with the Apache HTTP client
+      abortStream();
+      try {
+        stream.close();
+      } catch (Exception e) {
+        // log at trace level as closing an aborted stream will throw a content length
+        // check exception with the Apache HTTP client
+        LOG.trace("Error closing stream", e);

Review Comment:
   That means we will see exceptions for every stream close if trace is enabled. If it is expected to fail with the content length check, can we at least skip that? Then we don't need to log in trace, but we can log in maybe a warning level.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1155348897


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -194,9 +195,27 @@ private void openStream() throws IOException {
     }
   }
 
-  private void closeStream() throws IOException {
+  private void closeStream() {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data
+      abortStream();
+      try {
+        stream.close();
+      } catch (Exception e) {
+        // close quietly

Review Comment:
   I added a trace log



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#issuecomment-1494629800

   @singhpk234 you are right that changing the default HTTP client for a patch release is not appropriate, so we'd leave that out of 1.2.1. We'd like the Apache HTTP client fixes in for 1.2.1, though, so that can be used.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1156512441


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -194,9 +195,28 @@ private void openStream() throws IOException {
     }
   }
 
-  private void closeStream() throws IOException {
+  private void closeStream() {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data with the Apache HTTP client
+      abortStream();
+      try {
+        stream.close();
+      } catch (Exception e) {
+        // close quietly, closing an aborted stream will throw a content length

Review Comment:
   I think `org.apache.http.ConnectionClosedException` is already in the class path given we have some REST client integration in core, but I might be wrong.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1156484851


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -194,9 +195,29 @@ private void openStream() throws IOException {
     }
   }
 
-  private void closeStream() throws IOException {
+  private void closeStream() {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data with the Apache HTTP client
+      abortStream();
+      try {
+        stream.close();
+      } catch (Exception e) {
+        // log at trace level as closing an aborted stream will throw a content length
+        // check exception with the Apache HTTP client
+        LOG.trace("Error closing stream", e);
+      }
+      stream = null;
+    }
+  }
+
+  private void abortStream() {
+    try {
+      if (stream instanceof Abortable && stream.read() != -1) {
+        ((Abortable) stream).abort();
+      }
+    } catch (Exception e) {
+      LOG.debug("An error occurred while aborting the stream", e);

Review Comment:
   Is this expected? If not we should log with warn



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1156363240


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -194,9 +195,29 @@ private void openStream() throws IOException {
     }
   }
 
-  private void closeStream() throws IOException {
+  private void closeStream() {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data with the Apache HTTP client
+      abortStream();
+      try {
+        stream.close();
+      } catch (Exception e) {
+        // log at trace level as closing an aborted stream will throw a content length
+        // check exception with the Apache HTTP client
+        LOG.trace("Error closing stream", e);
+      }
+      stream = null;
+    }
+  }
+
+  private void abortStream() {
+    try {
+      if (stream instanceof Abortable && stream.read() != -1) {

Review Comment:
   If you abort the connection then it will be invalidated and removed from the pool, so it won't be reused. So this is an optimization to attempt to reuse the connection in cases where it has been fully read.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1156483763


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -194,9 +195,29 @@ private void openStream() throws IOException {
     }
   }
 
-  private void closeStream() throws IOException {
+  private void closeStream() {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data with the Apache HTTP client
+      abortStream();
+      try {
+        stream.close();
+      } catch (Exception e) {
+        // log at trace level as closing an aborted stream will throw a content length
+        // check exception with the Apache HTTP client
+        LOG.trace("Error closing stream", e);

Review Comment:
   I see, that's interesting difference, thanks for the explanation!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1156498789


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -194,9 +195,28 @@ private void openStream() throws IOException {
     }
   }
 
-  private void closeStream() throws IOException {
+  private void closeStream() {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data with the Apache HTTP client
+      abortStream();
+      try {
+        stream.close();
+      } catch (Exception e) {
+        // close quietly, closing an aborted stream will throw a content length

Review Comment:
   Ah okay, what does the exception look like? I though there are some error code that we can get out of the exception that would imply this specific error. Is that not there?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "danielcweeks (via GitHub)" <gi...@apache.org>.
danielcweeks commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1155342859


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3OutputStream.java:
##########
@@ -212,9 +212,7 @@ public void write(byte[] b, int off, int len) throws IOException {
   }
 
   private void newStream() throws IOException {
-    if (stream != null) {
-      stream.close();
-    }
+    closeStream();

Review Comment:
   This change seems unrelated and I'm not sure what it really adds.  It looks like we're dereferencing the stream, but we still would throw in any situation where it actually have any affect.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1156378878


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -194,9 +195,29 @@ private void openStream() throws IOException {
     }
   }
 
-  private void closeStream() throws IOException {
+  private void closeStream() {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data with the Apache HTTP client
+      abortStream();
+      try {
+        stream.close();
+      } catch (Exception e) {
+        // log at trace level as closing an aborted stream will throw a content length
+        // check exception with the Apache HTTP client
+        LOG.trace("Error closing stream", e);

Review Comment:
   That would work for the Apache HTTP client, the close doesn't do anything for that. But aborting the stream opened by URL connection client doesn't do anything, so we need to close it in that case. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1156513504


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -194,9 +195,28 @@ private void openStream() throws IOException {
     }
   }
 
-  private void closeStream() throws IOException {
+  private void closeStream() {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data with the Apache HTTP client
+      abortStream();
+      try {
+        stream.close();
+      } catch (Exception e) {
+        // close quietly, closing an aborted stream will throw a content length

Review Comment:
   That is a different version than being used by the AWS SDK (v5 instead of v4). Also, the AWS bundle shades the library.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1156521448


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -196,7 +197,29 @@ private void openStream() throws IOException {
 
   private void closeStream() throws IOException {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data with the Apache HTTP client
+      abortStream();
+      try {
+        stream.close();
+      } catch (IOException e) {
+        // the Apache HTTP client will throw a ConnectionClosedException
+        // when closing an aborted stream, which is expected
+        if (!e.getClass().getSimpleName().equals("ConnectionClosedException")) {

Review Comment:
   Oh okay I see this thread https://github.com/apache/iceberg/pull/7262#discussion_r1156495423, since it's specific to Apache Http client, it's not guaranteed it'll be on the classpath so we can't use instanceof



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1156520423


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -196,7 +197,29 @@ private void openStream() throws IOException {
 
   private void closeStream() throws IOException {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data with the Apache HTTP client
+      abortStream();
+      try {
+        stream.close();
+      } catch (IOException e) {
+        // the Apache HTTP client will throw a ConnectionClosedException
+        // when closing an aborted stream, which is expected
+        if (!e.getClass().getSimpleName().equals("ConnectionClosedException")) {

Review Comment:
   Can we just do `instanceof` here? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1155343093


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -194,9 +195,27 @@ private void openStream() throws IOException {
     }
   }
 
-  private void closeStream() throws IOException {
+  private void closeStream() {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data
+      abortStream();
+      try {
+        stream.close();
+      } catch (Exception e) {
+        // close quietly

Review Comment:
   This will always throw on an aborted stream on a checksum check failure, so it could be very noisy



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1156521658


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -196,7 +197,29 @@ private void openStream() throws IOException {
 
   private void closeStream() throws IOException {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data with the Apache HTTP client
+      abortStream();
+      try {
+        stream.close();
+      } catch (IOException e) {
+        // the Apache HTTP client will throw a ConnectionClosedException
+        // when closing an aborted stream, which is expected
+        if (!e.getClass().getSimpleName().equals("ConnectionClosedException")) {

Review Comment:
   That would add a dependency on the Apache HTTP client. Also, the AWS bundle shades the library, so the classes will be different depending on that (different packages).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1156505661


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -194,9 +195,28 @@ private void openStream() throws IOException {
     }
   }
 
-  private void closeStream() throws IOException {
+  private void closeStream() {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data with the Apache HTTP client
+      abortStream();
+      try {
+        stream.close();
+      } catch (Exception e) {
+        // close quietly, closing an aborted stream will throw a content length

Review Comment:
   Checking for the class name `ConnectionClosedException` might be the best option we have...? We can throw other IOExceptions like before, and skip if the name is that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1156352749


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -194,9 +195,29 @@ private void openStream() throws IOException {
     }
   }
 
-  private void closeStream() throws IOException {
+  private void closeStream() {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data with the Apache HTTP client
+      abortStream();
+      try {
+        stream.close();
+      } catch (Exception e) {
+        // log at trace level as closing an aborted stream will throw a content length
+        // check exception with the Apache HTTP client
+        LOG.trace("Error closing stream", e);
+      }
+      stream = null;
+    }
+  }
+
+  private void abortStream() {
+    try {
+      if (stream instanceof Abortable && stream.read() != -1) {

Review Comment:
   why do we want to read one more byte here? It might cause one more request. I think it does not hurt to even abort when it's already fully read.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1156366214


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -194,9 +195,29 @@ private void openStream() throws IOException {
     }
   }
 
-  private void closeStream() throws IOException {
+  private void closeStream() {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data with the Apache HTTP client
+      abortStream();
+      try {
+        stream.close();
+      } catch (Exception e) {
+        // log at trace level as closing an aborted stream will throw a content length
+        // check exception with the Apache HTTP client
+        LOG.trace("Error closing stream", e);

Review Comment:
   @singhpk234 @danielcweeks are you ok if I revert this back to a no-op?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#issuecomment-1493400555

   @singhpk234 The `abort()` is a no-op with the URL connection client, so it should behave mostly the same as before when using that client (except with the added read for the EOS check).
   
   I thought we wanted to get https://github.com/apache/iceberg/pull/7119 into 1.2.1, is that still the case @danielcweeks @nastra ? Without that, it becomes more cumbersome to configure, as including the URL connection client on the classpath then requires the default AWS http client be set also (e.g. via system properties).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#issuecomment-1493425941

   I ran the TPC-DS benchmark with and without the changes in this PR using the Apache HTTP client. Without this change, the result was ~2x slower (as a result of reading far more data). With this change, the result was the same as with URL connection client (and the amount of data transferred was also the same).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1156510477


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -194,9 +195,28 @@ private void openStream() throws IOException {
     }
   }
 
-  private void closeStream() throws IOException {
+  private void closeStream() {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data with the Apache HTTP client
+      abortStream();
+      try {
+        stream.close();
+      } catch (Exception e) {
+        // close quietly, closing an aborted stream will throw a content length

Review Comment:
   Checking `ConnectionClosedException` seems good enough to me, because it makes sense that if the connection is already closed then we can omit the error.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#issuecomment-1495178367

   > Another option is we check which HTTP client is being used and change the close behavior based on that.
   
   
   Actually that's a potentially better approach, since we only have 2 officially supported client types `apache` and `urlconnection`, and their behaviors are quite different based on this exploration. 
   
   I am a bit worried about the 1 additional byte read, seems to only benefit `urlconnection` but could be safe or worse for `apache`. Switch behavior by client type would solve the issue.
   
   We could get the value from `awsProperties`, but one issue is that for users with a custom client factory, they might overwrote the entire client factory and thus have a HTTP client type mismatch. But even in that case, user can explicitly set the HTTP client to change the behavior, so it seems to be a safe approach.
   
   @bryanck @amogh-jahagirdar any thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1156373894


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -194,9 +195,29 @@ private void openStream() throws IOException {
     }
   }
 
-  private void closeStream() throws IOException {
+  private void closeStream() {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data with the Apache HTTP client
+      abortStream();
+      try {
+        stream.close();
+      } catch (Exception e) {
+        // log at trace level as closing an aborted stream will throw a content length
+        // check exception with the Apache HTTP client
+        LOG.trace("Error closing stream", e);

Review Comment:
   I guess the question to ask before this is, why do we need to close it again if it's already aborted? If you see in the Trino logic I referenced, it's an if-else. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1156495423


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -194,9 +195,28 @@ private void openStream() throws IOException {
     }
   }
 
-  private void closeStream() throws IOException {
+  private void closeStream() {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data with the Apache HTTP client
+      abortStream();
+      try {
+        stream.close();
+      } catch (Exception e) {
+        // close quietly, closing an aborted stream will throw a content length

Review Comment:
   I see. `ContentLengthInputStream` is an Apache HTTP class, so referring to that will tie S3InputStream to that client.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#issuecomment-1494827655

   Nice catch. 
   
   Sorry I totally forgot we did exactly the same abort thing for Trino: https://github.com/trinodb/trino/blob/master/plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3/TrinoS3FileSystem.java#L1579, 
   
   and also in my old PR https://github.com/apache/iceberg/pull/4912/files#diff-0b632866a3b10fac55c442b08178ec0ac72b3b600878243e15d788a8bd031054R299.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1155348432


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3OutputStream.java:
##########
@@ -212,9 +212,7 @@ public void write(byte[] b, int off, int len) throws IOException {
   }
 
   private void newStream() throws IOException {
-    if (stream != null) {
-      stream.close();
-    }
+    closeStream();

Review Comment:
   I rolled this back



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1156521658


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -196,7 +197,29 @@ private void openStream() throws IOException {
 
   private void closeStream() throws IOException {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data with the Apache HTTP client
+      abortStream();
+      try {
+        stream.close();
+      } catch (IOException e) {
+        // the Apache HTTP client will throw a ConnectionClosedException
+        // when closing an aborted stream, which is expected
+        if (!e.getClass().getSimpleName().equals("ConnectionClosedException")) {

Review Comment:
   That would add a dependency on the AWS-specific Apache HTTP client (v4). Also, the AWS bundle shades the library, so the classes will be different depending on that (different packages).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1155343566


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3OutputStream.java:
##########
@@ -212,9 +212,7 @@ public void write(byte[] b, int off, int len) throws IOException {
   }
 
   private void newStream() throws IOException {
-    if (stream != null) {
-      stream.close();
-    }
+    closeStream();

Review Comment:
   `stream.close()` was being called without the null check in one place, so this was added to prevent a potential NPE, however this is unrelated to the performance regression.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1155343075


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3OutputStream.java:
##########
@@ -212,9 +212,7 @@ public void write(byte[] b, int off, int len) throws IOException {
   }
 
   private void newStream() throws IOException {
-    if (stream != null) {
-      stream.close();
-    }
+    closeStream();

Review Comment:
   This will always throw on an aborted stream on a checksum check failure, so it could be very noisy.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#issuecomment-1493425388

   @singhpk234 `abort()` is being called on the stream, in this case it is a `ResponseInputStream`, rather than on the client. You are correct that calling `abort()` will render the connection not reusable, so there is a trade off. I tried to balance that by checking for EOS and only aborting if we aren't at the EOS, so in cases we are we can reuse the connection. The trade off is reading a bunch of data unnecessarily vs reusing the connection.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1156493085


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -194,9 +195,28 @@ private void openStream() throws IOException {
     }
   }
 
-  private void closeStream() throws IOException {
+  private void closeStream() {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data with the Apache HTTP client
+      abortStream();
+      try {
+        stream.close();
+      } catch (Exception e) {
+        // close quietly, closing an aborted stream will throw a content length

Review Comment:
   Sorry I guess I did not express it clearly. What I meant was that, can we check for the exception that if it is a content length check exception then we can skip, otherwise we log a warning?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1156495827


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -194,9 +195,28 @@ private void openStream() throws IOException {
     }
   }
 
-  private void closeStream() throws IOException {
+  private void closeStream() {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data with the Apache HTTP client
+      abortStream();
+      try {
+        stream.close();
+      } catch (Exception e) {
+        // close quietly, closing an aborted stream will throw a content length

Review Comment:
   I could check the name of the class, but that seems like a little bit of a hack.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] singhpk234 commented on pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "singhpk234 (via GitHub)" <gi...@apache.org>.
singhpk234 commented on PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#issuecomment-1493398586

   Thanks @bryanck, this is great find and fix !
   
   one doubt will it cause issue when we switch back the client from apache to url-connection (let's say via table props) considering this was not a issue with url-connection client ? 
   
   Also do we need this in 1.2.1 considering 1.2.0 was still shipped with url-connection as default http client ? The regression commit is still in master [commit](https://github.com/apache/iceberg/commit/053028ceb6d3bc04c223a521600b053235bf344a). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1156512487


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -194,9 +195,28 @@ private void openStream() throws IOException {
     }
   }
 
-  private void closeStream() throws IOException {
+  private void closeStream() {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data with the Apache HTTP client
+      abortStream();
+      try {
+        stream.close();
+      } catch (Exception e) {
+        // close quietly, closing an aborted stream will throw a content length

Review Comment:
   Sounds good, I pushed that change



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] singhpk234 commented on pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "singhpk234 (via GitHub)" <gi...@apache.org>.
singhpk234 commented on PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#issuecomment-1493408082

   @bryanck apologies, as per my understanding, I thought it's other way round, based on this file [UrlConnectionHttpClient.java](https://github.com/aws/aws-sdk-java-v2/blob/master/http-clients/url-connection-client/src/main/java/software/amazon/awssdk/http/urlconnection/UrlConnectionHttpClient.java#L432-L434) `abort()` actually disconnects the connection and in the close it's a no-op [close()](https://github.com/aws/aws-sdk-java-v2/blob/master/http-clients/url-connection-client/src/main/java/software/amazon/awssdk/http/urlconnection/UrlConnectionHttpClient.java#L143-L145)
   
   I am also very curious about this statement from the [doc](https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/core/ResponseInputStream.html) as well :
    
   "Input stream that provides access to the unmarshalled POJO response returned by the service in addition to the streamed contents. This input stream should be closed to release the underlying connection back to the connection pool.
   
   If it is not desired to read remaining data from the stream, you can explicitly abort the connection via [abort()](https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/core/ResponseInputStream.html#abort()). Note that this will close the underlying connection and require establishing an HTTP connection which may outweigh the cost of reading the additional data."
   
   If we go by the last paragraph and IIUC, it says calling abort will not let the connection being re-used and the connection pool will have to establish a new http connection, can it cause regression in some scenario ?
   
   I am just trying to clear my understanding here. 
   
   
   -----
   
   > I thought we wanted to get https://github.com/apache/iceberg/pull/7119 into 1.2.1, is that still the case
   
   My understanding was 1.2.1 was only for bug fixes for 1.2.0 release.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#issuecomment-1495180918

   I was hoping we could get it from S3Client but I didn't see a way, I'll look some more tonight


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#issuecomment-1495207413

   I don't see a good way to get that, without more extensive changes, though I'd be happy to be proven wrong.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#issuecomment-1495590624

   The only way I could find to determine the HTTP client type on the S3 client is to use reflection. So we could do that which is not ideal. Another options is to read the AwsProperties to get the type, but that may not match the actual HTTP client for custom factories. I feel like the current solution is the least invasive for the 1.2.1 release and achieves the desired outcome of performance parity between the two HTTP clients.
   
   There are alternatives to detecting the EOS also. We could get the content length of the response and track the number of bytes read from the stream, but I feel like that is heavier than the one byte read. Generaly that byte will be buffered if not EOS so shouldn't trigger any network I/O.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks merged pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "danielcweeks (via GitHub)" <gi...@apache.org>.
danielcweeks merged PR #7262:
URL: https://github.com/apache/iceberg/pull/7262


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1156489594


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -194,9 +195,29 @@ private void openStream() throws IOException {
     }
   }
 
-  private void closeStream() throws IOException {
+  private void closeStream() {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data with the Apache HTTP client
+      abortStream();
+      try {
+        stream.close();
+      } catch (Exception e) {
+        // log at trace level as closing an aborted stream will throw a content length
+        // check exception with the Apache HTTP client
+        LOG.trace("Error closing stream", e);
+      }
+      stream = null;
+    }
+  }
+
+  private void abortStream() {
+    try {
+      if (stream instanceof Abortable && stream.read() != -1) {
+        ((Abortable) stream).abort();
+      }
+    } catch (Exception e) {
+      LOG.debug("An error occurred while aborting the stream", e);

Review Comment:
   This shouldn't happen, I went ahead and changed this to `warn`. I also went ahead and removed the `trace` log for `close()`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1156515036


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -194,9 +195,28 @@ private void openStream() throws IOException {
     }
   }
 
-  private void closeStream() throws IOException {
+  private void closeStream() {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data with the Apache HTTP client
+      abortStream();
+      try {
+        stream.close();
+      } catch (Exception e) {
+        // close quietly, closing an aborted stream will throw a content length

Review Comment:
   Another option is we check which HTTP client is being used and change the close behavior based on that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on a diff in pull request #7262: AWS: abort S3 input stream on close if not EOS

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on code in PR #7262:
URL: https://github.com/apache/iceberg/pull/7262#discussion_r1156361904


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java:
##########
@@ -194,9 +195,29 @@ private void openStream() throws IOException {
     }
   }
 
-  private void closeStream() throws IOException {
+  private void closeStream() {
     if (stream != null) {
-      stream.close();
+      // if we aren't at the end of the stream, and the stream is abortable, then
+      // call abort() so we don't read the remaining data with the Apache HTTP client
+      abortStream();
+      try {
+        stream.close();
+      } catch (Exception e) {
+        // log at trace level as closing an aborted stream will throw a content length
+        // check exception with the Apache HTTP client
+        LOG.trace("Error closing stream", e);

Review Comment:
   That is how I originally had it but the feedback was to log something. I'd be happy to remove that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org