Posted to common-issues@hadoop.apache.org by GitBox <gi...@apache.org> on 2022/12/01 17:19:12 UTC

[GitHub] [hadoop] steveloughran commented on a diff in pull request #5172: HADOOP-18543. AliyunOSSFileSystem#open(Path path, int bufferSize) use buffer size as its downloadPartSize

steveloughran commented on code in PR #5172:
URL: https://github.com/apache/hadoop/pull/5172#discussion_r1037378725


##########
hadoop-tools/hadoop-aliyun/src/main/java/org/apache/hadoop/fs/aliyun/oss/AliyunOSSInputStream.java:
##########
@@ -57,18 +57,21 @@ public class AliyunOSSInputStream extends FSInputStream {
   private ExecutorService readAheadExecutorService;
   private Queue<ReadBuffer> readBufferQueue = new ArrayDeque<>();
 
-  public AliyunOSSInputStream(Configuration conf,
-      ExecutorService readAheadExecutorService, int maxReadAheadPartNumber,
-      AliyunOSSFileSystemStore store, String key, Long contentLength,
-      Statistics statistics) throws IOException {
+  public AliyunOSSInputStream(
+          long downloadPartSize,
+          ExecutorService readAheadExecutorService,
+          int maxReadAheadPartNumber,
+          AliyunOSSFileSystemStore store,
+          String key,
+          Long contentLength,
+          Statistics statistics) throws IOException {
     this.readAheadExecutorService =
-        MoreExecutors.listeningDecorator(readAheadExecutorService);
+            MoreExecutors.listeningDecorator(readAheadExecutorService);
     this.store = store;
     this.key = key;
     this.statistics = statistics;
     this.contentLength = contentLength;
-    downloadPartSize = conf.getLong(MULTIPART_DOWNLOAD_SIZE_KEY,
-        MULTIPART_DOWNLOAD_SIZE_DEFAULT);
+    this.downloadPartSize = downloadPartSize;

Review Comment:
   it's about the efficiency of each GET call, the overhead of creating HTTPS connections etc., which comes down to:
   
   * reading a whole file is the wrong strategy for random-IO formats (ORC, Parquet)
   * random IO/small ranged GETs are the wrong strategy for reading whole files
   * even with random IO, a part size in the KB range is way too small (see the arithmetic below)
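   
   To put rough numbers on that last point (my own illustration, not from the original comment): reading a 1 GiB object with 4 KiB ranged GETs means 262,144 HTTPS requests, and 64 KiB parts still need 16,384; the connector's default download part size of 512 KiB brings that down to 2,048. Request setup cost dominates long before bandwidth does.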
   
   This is why the stores tend to do adaptive policy switching ("the first backwards seek or big forward seek means random IO"), and, since HADOOP-16202, let the caller declare a read policy when opening the file.
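   
   For reference, this is roughly what declaring intent looks like from the caller's side, using the HADOOP-16202 openFile() builder (a sketch: the bucket, path and the single positioned read are made up for illustration; stores which don't understand the option simply ignore it):
   
   ```java
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FSDataInputStream;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   
   public class ReadPolicyDemo {
     public static void main(String[] args) throws Exception {
       // made-up bucket/object, just to show the call sequence
       Path path = new Path("oss://examplebucket/data/part-0000.orc");
       FileSystem fs = path.getFileSystem(new Configuration());
       try (FSDataInputStream in = fs.openFile(path)
           .opt("fs.option.openfile.read.policy", "random")  // declare the IO pattern
           .build()     // returns CompletableFuture<FSDataInputStream>
           .get()) {
         byte[] buf = new byte[1024];
         in.readFully(0, buf);  // stand-in for a real footer/stripe read
       }
     }
   }
   ```
   
   That keeps the intent in an explicit, well-defined option rather than overloading the meaning of bufferSize.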
   
   If the existing code were changed to say "we set the GET range to be the buffer size you passed in on open()", then everyone's existing code is going to suffer really badly on performance, because that bufferSize usually comes from io.file.buffer.size and is only a few KB.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

