Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/06/24 07:31:34 UTC

[GitHub] [hudi] BruceKellan opened a new pull request, #5953: [WIP][HUDI-4314] Improve the performance of reading from the specified ins…

BruceKellan opened a new pull request, #5953:
URL: https://github.com/apache/hudi/pull/5953

   …tant when the Flink streaming read application starts
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] BruceKellan commented on pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
BruceKellan commented on PR #5953:
URL: https://github.com/apache/hudi/pull/5953#issuecomment-1170023135

   @hudi-bot run azure




[GitHub] [hudi] danny0405 commented on a diff in pull request #5953: [WIP][HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
danny0405 commented on code in PR #5953:
URL: https://github.com/apache/hudi/pull/5953#discussion_r906609490


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java:
##########
@@ -216,11 +220,29 @@ public Result inputSplits(
     final String endInstant = instantToIssue.getTimestamp();
     final AtomicInteger cnt = new AtomicInteger(0);
     final String mergeType = this.conf.getString(FlinkOptions.MERGE_TYPE);
+    final FileSystem fs = FSUtils.getFs(path.toString(), hadoopConf);
     List<MergeOnReadInputSplit> inputSplits = writePartitions.stream()
         .map(relPartitionPath -> fsView.getLatestMergedFileSlicesBeforeOrOn(relPartitionPath, endInstant)
+            .filter(fileSlice -> {
+              Option<String> basePath = fileSlice.getBaseFile().map(BaseFile::getPath);
+              try {
+                return basePath.isPresent() && fs.exists(new org.apache.hadoop.fs.Path(basePath.get()));
+              } catch (IOException e) {
+                LOG.error("Checking exists of base path: {} error", basePath);

Review Comment:
   Can we write a utility method for `FileSlice` that checks file existence and filters out the non-existent files in one place? Maybe name it `filterFileSliceWithValidFiles`.





[GitHub] [hudi] BruceKellan commented on pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
BruceKellan commented on PR #5953:
URL: https://github.com/apache/hudi/pull/5953#issuecomment-1172187660

   @hudi-bot run azure




[GitHub] [hudi] BruceKellan closed pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
BruceKellan closed pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…
URL: https://github.com/apache/hudi/pull/5953




[GitHub] [hudi] hudi-bot commented on pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5953:
URL: https://github.com/apache/hudi/pull/5953#issuecomment-1173918594

   ## CI report:
   
   * 310e19fbe83b54383f59976194697ecad69d6895 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9706) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] danny0405 commented on a diff in pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
danny0405 commented on code in PR #5953:
URL: https://github.com/apache/hudi/pull/5953#discussion_r912974418


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java:
##########
@@ -302,6 +312,51 @@ private Stream<HoodieInstant> maySkipCompaction(Stream<HoodieInstant> instants)
         : instants;
   }
 
+  private Stream<FileSlice> filterFileSliceWithValidFiles(FileSystem fs, Stream<FileSlice> fileSlices) {
+    // we need to filter out the base file and log file that does not exist
+    return fileSlices.map(fileSlice -> {
+      List<HoodieLogFile> logFiles = fileSlice.getLogFiles()
+          .filter(logFile -> {
+            try {
+              return fs.exists(logFile.getPath());
+            } catch (IOException e) {
+              LOG.error("Checking exists of log file path: {} error", logFile.getPath().toString());
+              throw new HoodieException(e);
+            }
+          }).collect(Collectors.toList());
+      return generateFileSlice(fileSlice.getPartitionPath(),
+          fileSlice.getBaseInstantTime(),
+          fileSlice.getFileId(),
+          fileSlice.getBaseFile().orElse(null),
+          logFiles);
+    }).filter(fileSlice -> {
+      // we should keep the file slice if any base/log file exists
+      if (fileSlice.getLatestLogFile().isPresent()) {
+        return true;
+      }
+      Option<String> basePath = fileSlice.getBaseFile().map(BaseFile::getPath);
+      try {
+        return basePath.isPresent() && fs.exists(new org.apache.hadoop.fs.Path(basePath.get()));
+      } catch (IOException e) {
+        LOG.error("Checking exists of base path: {} error", basePath);
+        throw new HoodieException(e);
+      }
+    });
+  }
+
+  private FileSlice generateFileSlice(String partitionPath,
+                                      String baseInstant,

Review Comment:
   We may need to consider a fallback mechanism, such as scanning the storage directly, when we find a file that does not exist. Let's first see the effect of removing the check entirely.
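
One way to picture the fallback idea, as a minimal sketch rather than Hudi's implementation: read the files named by a cached listing and, only if one of them turns out to be missing, list the directory directly and retry. The sketch uses plain `java.nio.file` instead of Hadoop's `FileSystem`, and `ListingFallbackSketch`, `readWithFallback`, `listDirectly`, and `totalBytes` are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only; none of these names exist in Hudi.
final class ListingFallbackSketch {

  /** Reads every file and returns the total number of bytes read. */
  static long totalBytes(List<Path> files) throws IOException {
    long total = 0;
    for (Path file : files) {
      total += Files.readAllBytes(file).length; // throws NoSuchFileException if the file is gone
    }
    return total;
  }

  /** Lists the regular files of a directory straight from storage. */
  static List<Path> listDirectly(Path dir) throws IOException {
    try (Stream<Path> listed = Files.list(dir)) {
      return listed.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
    }
  }

  /** Trusts the cached listing first; re-lists the directory only after a missing-file failure. */
  static long readWithFallback(Path partitionDir, List<Path> cachedFiles) {
    try {
      try {
        return totalBytes(cachedFiles);
      } catch (NoSuchFileException missing) {
        // The cached listing was stale: fall back to scanning the storage directly.
        return totalBytes(listDirectly(partitionDir));
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("partition");
    Path a = Files.write(dir.resolve("a.log"), new byte[] {1, 2, 3});
    Files.write(dir.resolve("b.log"), new byte[] {4, 5});
    Path gone = dir.resolve("gone.log"); // named by the stale listing, never created
    System.out.println(readWithFallback(dir, List.of(a, gone)));
  }
}
```

The common path pays no extra existence check per file; the directory scan only happens after a read has already failed.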





[GitHub] [hudi] danny0405 commented on a diff in pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
danny0405 commented on code in PR #5953:
URL: https://github.com/apache/hudi/pull/5953#discussion_r912432257


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java:
##########
@@ -302,6 +311,32 @@ private Stream<HoodieInstant> maySkipCompaction(Stream<HoodieInstant> instants)
         : instants;
   }
 
+  private Stream<FileSlice> filterFileSliceWithValidFiles(FileSystem fs, Stream<FileSlice> fileSlices) {
+    return fileSlices.filter(fileSlice -> {
+      Option<String> basePath = fileSlice.getBaseFile().map(BaseFile::getPath);
+      try {
+        if (!basePath.isPresent()) {
+          return true;
+        }
+        return fs.exists(new org.apache.hadoop.fs.Path(basePath.get()));
+      } catch (IOException e) {
+        LOG.error("Checking exists of base path: {} error", basePath);
+        throw new HoodieException(e);
+      }

Review Comment:
   The logic is still incorrect: when a file slice has a base file that does not exist but its log files do exist, we should keep it.
   In other words, we should keep the file slice if any base/log file exists :)
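
The rule stated in this comment can be sketched as a small predicate. This is a simplified illustration under stated assumptions, not the code from the PR: `Slice` is a hypothetical stand-in for Hudi's `FileSlice`, paths are plain strings, and the `exists` predicate stands in for a call such as `fs.exists(...)`.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical stand-in types; these are NOT Hudi's FileSlice or FileSystem APIs.
final class FileSliceFilterSketch {

  /** Simplified file slice: an optional base file path plus zero or more log file paths. */
  static final class Slice {
    final Optional<String> baseFile;
    final List<String> logFiles;

    Slice(Optional<String> baseFile, List<String> logFiles) {
      this.baseFile = baseFile;
      this.logFiles = logFiles;
    }
  }

  /** True when the base file exists or at least one log file exists. */
  static boolean hasAnyExistingFile(Slice slice, Predicate<String> exists) {
    boolean baseExists = slice.baseFile.filter(exists).isPresent();
    boolean anyLogExists = slice.logFiles.stream().anyMatch(exists);
    return baseExists || anyLogExists;
  }

  /** Drops only the slices for which no base file and no log file exists. */
  static List<Slice> filterSlicesWithValidFiles(List<Slice> slices, Predicate<String> exists) {
    return slices.stream()
        .filter(slice -> hasAnyExistingFile(slice, exists))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // The set plays the role of the storage layer: only these paths "exist".
    Set<String> storage = Set.of("p/log.1", "p/base2.parquet");
    List<Slice> slices = List.of(
        new Slice(Optional.of("p/base1.parquet"), List.of("p/log.1")), // base missing, log present
        new Slice(Optional.of("p/base2.parquet"), List.of()),          // base present, no logs
        new Slice(Optional.of("p/base3.parquet"), List.of("p/log.3")), // nothing present
        new Slice(Optional.empty(), List.of()));                       // no files at all
    System.out.println(filterSlicesWithValidFiles(slices, storage::contains).size());
  }
}
```

The first slice in `main` is the case the comment calls out: its base file is gone, yet it is kept because a log file still exists.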





[GitHub] [hudi] hudi-bot commented on pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5953:
URL: https://github.com/apache/hudi/pull/5953#issuecomment-1172214284

   ## CI report:
   
   * 167fc6a9792aef5818bcf5d6f92c993d0a5c8352 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9654) 
   * 80640cedd8f087691b209c85c857c72ebf8fd855 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9666) 
   


[GitHub] [hudi] hudi-bot commented on pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5953:
URL: https://github.com/apache/hudi/pull/5953#issuecomment-1172207806

   ## CI report:
   
   * 167fc6a9792aef5818bcf5d6f92c993d0a5c8352 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9654) 
   


[GitHub] [hudi] hudi-bot commented on pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5953:
URL: https://github.com/apache/hudi/pull/5953#issuecomment-1170081089

   ## CI report:
   
   * a1549c7833f8deb53e506f6dcd295cfb270461b2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9627) 
   


[GitHub] [hudi] hudi-bot commented on pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5953:
URL: https://github.com/apache/hudi/pull/5953#issuecomment-1171208652

   ## CI report:
   
   * 167fc6a9792aef5818bcf5d6f92c993d0a5c8352 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9654) 
   


[GitHub] [hudi] danny0405 commented on a diff in pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
danny0405 commented on code in PR #5953:
URL: https://github.com/apache/hudi/pull/5953#discussion_r912871566


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java:
##########
@@ -302,6 +312,51 @@ private Stream<HoodieInstant> maySkipCompaction(Stream<HoodieInstant> instants)
         : instants;
   }
 
+  private Stream<FileSlice> filterFileSliceWithValidFiles(FileSystem fs, Stream<FileSlice> fileSlices) {
+    // we need to filter out the base file and log file that does not exist
+    return fileSlices.map(fileSlice -> {
+      List<HoodieLogFile> logFiles = fileSlice.getLogFiles()
+          .filter(logFile -> {
+            try {
+              return fs.exists(logFile.getPath());
+            } catch (IOException e) {
+              LOG.error("Checking exists of log file path: {} error", logFile.getPath().toString());
+              throw new HoodieException(e);
+            }
+          }).collect(Collectors.toList());
+      return generateFileSlice(fileSlice.getPartitionPath(),
+          fileSlice.getBaseInstantTime(),
+          fileSlice.getFileId(),
+          fileSlice.getBaseFile().orElse(null),
+          logFiles);
+    }).filter(fileSlice -> {
+      // we should keep the file slice if any base/log file exists
+      if (fileSlice.getLatestLogFile().isPresent()) {
+        return true;
+      }
+      Option<String> basePath = fileSlice.getBaseFile().map(BaseFile::getPath);
+      try {
+        return basePath.isPresent() && fs.exists(new org.apache.hadoop.fs.Path(basePath.get()));
+      } catch (IOException e) {
+        LOG.error("Checking exists of base path: {} error", basePath);
+        throw new HoodieException(e);
+      }
+    });
+  }
+
+  private FileSlice generateFileSlice(String partitionPath,
+                                      String baseInstant,

Review Comment:
   I have thought about the patch for a few days, and maybe the best way is simply to remove the existence check; the fs view and timeline should keep the layout complete:
   
   1. we always read up to the latest commit for streaming read
   2. for batch read with a specific end commit, the user should ensure that the version exists.
   
   So, just remove the existence check and throw directly if a file disappears for some reason.
   [3953.patch.zip](https://github.com/apache/hudi/files/9038692/3953.patch.zip)
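
The fail-fast behaviour proposed here can be pictured with a minimal sketch. It is an assumption-laden illustration, not the attached patch: it uses `java.nio.file` rather than Hadoop's `FileSystem`, and `FailFastReadSketch` and `totalBytes` are hypothetical names. The point is only that no `exists` call precedes the read, so a file that has disappeared surfaces as an exception.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Hudi.
final class FailFastReadSketch {

  /** Reads every listed file without any existence pre-check and returns the bytes read. */
  static long totalBytes(List<Path> files) {
    long total = 0;
    for (Path file : files) {
      try {
        total += Files.readAllBytes(file).length;
      } catch (IOException e) {
        // A missing file is not filtered out silently; the failure is propagated.
        throw new UncheckedIOException("Failed to read " + file, e);
      }
    }
    return total;
  }

  public static void main(String[] args) throws IOException {
    Path present = Files.createTempFile("slice", ".log");
    Files.write(present, new byte[] {1, 2, 3});
    Path missing = present.resolveSibling("missing-" + present.getFileName());

    System.out.println(totalBytes(List.of(present)));
    try {
      totalBytes(List.of(present, missing));
    } catch (UncheckedIOException e) {
      System.out.println("failed fast: " + (e.getCause() instanceof NoSuchFileException));
    }
    Files.deleteIfExists(present);
  }
}
```

Compared with filtering, this trades tolerance of missing files for one fewer storage call per file, which matches the reasoning in the comment that the file system view and timeline are expected to be complete.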
   
   





[GitHub] [hudi] hudi-bot commented on pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5953:
URL: https://github.com/apache/hudi/pull/5953#issuecomment-1169631098

   ## CI report:
   
   * a1549c7833f8deb53e506f6dcd295cfb270461b2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9627) 
   


[GitHub] [hudi] hudi-bot commented on pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5953:
URL: https://github.com/apache/hudi/pull/5953#issuecomment-1170025869

   ## CI report:
   
   * a1549c7833f8deb53e506f6dcd295cfb270461b2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9627) 
   


[GitHub] [hudi] BruceKellan commented on a diff in pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
BruceKellan commented on code in PR #5953:
URL: https://github.com/apache/hudi/pull/5953#discussion_r912566288


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java:
##########
@@ -302,6 +311,32 @@ private Stream<HoodieInstant> maySkipCompaction(Stream<HoodieInstant> instants)
         : instants;
   }
 
+  private Stream<FileSlice> filterFileSliceWithValidFiles(FileSystem fs, Stream<FileSlice> fileSlices) {
+    return fileSlices.filter(fileSlice -> {
+      Option<String> basePath = fileSlice.getBaseFile().map(BaseFile::getPath);
+      try {
+        if (!basePath.isPresent()) {
+          return true;
+        }
+        return fs.exists(new org.apache.hadoop.fs.Path(basePath.get()));
+      } catch (IOException e) {
+        LOG.error("Checking exists of base path: {} error", basePath);
+        throw new HoodieException(e);
+      }

Review Comment:
   Yes, I get it.





[GitHub] [hudi] hudi-bot commented on pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5953:
URL: https://github.com/apache/hudi/pull/5953#issuecomment-1172270186

   ## CI report:
   
   * 80640cedd8f087691b209c85c857c72ebf8fd855 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9666) 
   


[GitHub] [hudi] hudi-bot commented on pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5953:
URL: https://github.com/apache/hudi/pull/5953#issuecomment-1173810539

   ## CI report:
   
   * 8c4392fef351bac9a477788bf6c202fd9684ed27 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9694) 
   * 310e19fbe83b54383f59976194697ecad69d6895 UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5953:
URL: https://github.com/apache/hudi/pull/5953#issuecomment-1173856844

   ## CI report:
   
   * 8c4392fef351bac9a477788bf6c202fd9684ed27 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9694) 
   * 310e19fbe83b54383f59976194697ecad69d6895 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9706) 
   


[GitHub] [hudi] hudi-bot commented on pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5953:
URL: https://github.com/apache/hudi/pull/5953#issuecomment-1171153745

   ## CI report:
   
   * a1549c7833f8deb53e506f6dcd295cfb270461b2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9627) 
   * 167fc6a9792aef5818bcf5d6f92c993d0a5c8352 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9654) 
   


[GitHub] [hudi] BruceKellan commented on a diff in pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
BruceKellan commented on code in PR #5953:
URL: https://github.com/apache/hudi/pull/5953#discussion_r912915029


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java:
##########
@@ -302,6 +312,51 @@ private Stream<HoodieInstant> maySkipCompaction(Stream<HoodieInstant> instants)
         : instants;
   }
 
+  private Stream<FileSlice> filterFileSliceWithValidFiles(FileSystem fs, Stream<FileSlice> fileSlices) {
+    // we need to filter out the base file and log file that does not exist
+    return fileSlices.map(fileSlice -> {
+      List<HoodieLogFile> logFiles = fileSlice.getLogFiles()
+          .filter(logFile -> {
+            try {
+              return fs.exists(logFile.getPath());
+            } catch (IOException e) {
+              LOG.error("Checking exists of log file path: {} error", logFile.getPath().toString());
+              throw new HoodieException(e);
+            }
+          }).collect(Collectors.toList());
+      return generateFileSlice(fileSlice.getPartitionPath(),
+          fileSlice.getBaseInstantTime(),
+          fileSlice.getFileId(),
+          fileSlice.getBaseFile().orElse(null),
+          logFiles);
+    }).filter(fileSlice -> {
+      // we should keep the file slice if any base/log file exists
+      if (fileSlice.getLatestLogFile().isPresent()) {
+        return true;
+      }
+      Option<String> basePath = fileSlice.getBaseFile().map(BaseFile::getPath);
+      try {
+        return basePath.isPresent() && fs.exists(new org.apache.hadoop.fs.Path(basePath.get()));
+      } catch (IOException e) {
+        LOG.error("Checking exists of base path: {} error", basePath);
+        throw new HoodieException(e);
+      }
+    });
+  }
+
+  private FileSlice generateFileSlice(String partitionPath,
+                                      String baseInstant,

Review Comment:
   I have read the patch. You mean we skip the existence check entirely: because we use snapshot reads, the set of snapshot files we maintain from the scan step of split_monitor should already be complete.
   
   Even if we did the existence check, the file set could still end up incomplete because of the cleaner, so we don't need this step now, right?
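
   The point about the cleaner can be illustrated with a small sketch. It assumes plain `java.nio.file` instead of Hadoop's `FileSystem`, and the class `ExistenceCheckRaceSketch`, its `Outcome` enum, and the `concurrentCleaner` callback are hypothetical names used only for illustration; none of them exist in Hudi. The callback stands in for a cleaner that removes the file between the check and the read.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Hypothetical illustration of a check-then-read race; not Hudi code.
final class ExistenceCheckRaceSketch {

  enum Outcome { SKIPPED, READ, MISSING_AFTER_CHECK }

  // Checks existence first, lets the simulated cleaner run, then reads the file.
  static Outcome checkThenRead(Path file, Runnable concurrentCleaner) throws IOException {
    if (!Files.exists(file)) {
      return Outcome.SKIPPED;            // the filter would silently drop this file
    }
    concurrentCleaner.run();             // stands in for a cleaner running in between
    try {
      Files.readAllBytes(file);
      return Outcome.READ;
    } catch (NoSuchFileException e) {
      // The check passed, yet the read still failed: the check gave no guarantee.
      return Outcome.MISSING_AFTER_CHECK;
    }
  }
}
```

   With a no-op callback the read succeeds; with a callback that deletes the file the read fails even though the check passed, which is why the check adds a file system call without adding safety.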





[GitHub] [hudi] hudi-bot commented on pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5953:
URL: https://github.com/apache/hudi/pull/5953#issuecomment-1173265606

   ## CI report:
   
   * 80640cedd8f087691b209c85c857c72ebf8fd855 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9666) 
   * 8c4392fef351bac9a477788bf6c202fd9684ed27 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9694) 
   


[GitHub] [hudi] hudi-bot commented on pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5953:
URL: https://github.com/apache/hudi/pull/5953#issuecomment-1173262778

   ## CI report:
   
   * 80640cedd8f087691b209c85c857c72ebf8fd855 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9666) 
   * 8c4392fef351bac9a477788bf6c202fd9684ed27 UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5953:
URL: https://github.com/apache/hudi/pull/5953#issuecomment-1169627280

   ## CI report:
   
   * a1549c7833f8deb53e506f6dcd295cfb270461b2 UNKNOWN
   


[GitHub] [hudi] danny0405 commented on a diff in pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
danny0405 commented on code in PR #5953:
URL: https://github.com/apache/hudi/pull/5953#discussion_r912871566


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java:
##########
@@ -302,6 +312,51 @@ private Stream<HoodieInstant> maySkipCompaction(Stream<HoodieInstant> instants)
         : instants;
   }
 
+  private Stream<FileSlice> filterFileSliceWithValidFiles(FileSystem fs, Stream<FileSlice> fileSlices) {
+    // we need to filter out the base file and log file that does not exist
+    return fileSlices.map(fileSlice -> {
+      List<HoodieLogFile> logFiles = fileSlice.getLogFiles()
+          .filter(logFile -> {
+            try {
+              return fs.exists(logFile.getPath());
+            } catch (IOException e) {
+              LOG.error("Checking exists of log file path: {} error", logFile.getPath().toString());
+              throw new HoodieException(e);
+            }
+          }).collect(Collectors.toList());
+      return generateFileSlice(fileSlice.getPartitionPath(),
+          fileSlice.getBaseInstantTime(),
+          fileSlice.getFileId(),
+          fileSlice.getBaseFile().orElse(null),
+          logFiles);
+    }).filter(fileSlice -> {
+      // we should keep the file slice if any base/log file exists
+      if (fileSlice.getLatestLogFile().isPresent()) {
+        return true;
+      }
+      Option<String> basePath = fileSlice.getBaseFile().map(BaseFile::getPath);
+      try {
+        return basePath.isPresent() && fs.exists(new org.apache.hadoop.fs.Path(basePath.get()));
+      } catch (IOException e) {
+        LOG.error("Checking exists of base path: {} error", basePath);
+        throw new HoodieException(e);
+      }
+    });
+  }
+
+  private FileSlice generateFileSlice(String partitionPath,
+                                      String baseInstant,

Review Comment:
   I have thought about the patch for a few days, and maybe the best approach is simply to remove the existence check; the fs view and timeline should keep the layout complete:
   
   1. for streaming read, we always read up to the latest commit
   2. for batch read with a specific end commit, the user should ensure that the version exists.
   
   So, just remove the existence check and throw directly if a file disappears for some reason.
   [3953.patch.zip](https://github.com/apache/hudi/files/9038692/3953.patch.zip)
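
   A minimal sketch of the "no existence check, throw directly" idea, under stated assumptions: it uses plain `java.nio.file` rather than Hadoop's `FileSystem`, and `DirectReadSketch`, `readSlice`, and `readAll` are hypothetical names that do not come from the attached patch or from Hudi. The list passed in plays the role of the files reported by the fs view.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration; not Hudi code.
final class DirectReadSketch {

  // Reads one file from the snapshot listing without a prior exists() call.
  static String readSlice(Path file) {
    try {
      return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    } catch (NoSuchFileException e) {
      // The listing promised this file, so report the inconsistency instead of hiding it.
      throw new IllegalStateException("File listed in the snapshot is missing: " + file, e);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Trusts the listing as-is: one read per file, no extra round trip per file to check existence.
  static List<String> readAll(List<Path> snapshotFiles) {
    return snapshotFiles.stream()
        .map(DirectReadSketch::readSlice)
        .collect(Collectors.toList());
  }
}
```

   Compared with filtering on `exists`, this variant fails loudly when the listing and the storage disagree, rather than silently returning a partial result.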
   
   





[GitHub] [hudi] hudi-bot commented on pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5953:
URL: https://github.com/apache/hudi/pull/5953#issuecomment-1171149281

   ## CI report:
   
   * a1549c7833f8deb53e506f6dcd295cfb270461b2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9627) 
   * 167fc6a9792aef5818bcf5d6f92c993d0a5c8352 UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5953:
URL: https://github.com/apache/hudi/pull/5953#issuecomment-1169752036

   ## CI report:
   
   * a1549c7833f8deb53e506f6dcd295cfb270461b2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9627) 
   


[GitHub] [hudi] hudi-bot commented on pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5953:
URL: https://github.com/apache/hudi/pull/5953#issuecomment-1172211262

   ## CI report:
   
   * 167fc6a9792aef5818bcf5d6f92c993d0a5c8352 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9654) 
   * 80640cedd8f087691b209c85c857c72ebf8fd855 UNKNOWN
   


[GitHub] [hudi] BruceKellan commented on a diff in pull request #5953: [WIP][HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
BruceKellan commented on code in PR #5953:
URL: https://github.com/apache/hudi/pull/5953#discussion_r909135592


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java:
##########
@@ -216,11 +220,29 @@ public Result inputSplits(
     final String endInstant = instantToIssue.getTimestamp();
     final AtomicInteger cnt = new AtomicInteger(0);
     final String mergeType = this.conf.getString(FlinkOptions.MERGE_TYPE);
+    final FileSystem fs = FSUtils.getFs(path.toString(), hadoopConf);
     List<MergeOnReadInputSplit> inputSplits = writePartitions.stream()
         .map(relPartitionPath -> fsView.getLatestMergedFileSlicesBeforeOrOn(relPartitionPath, endInstant)
+            .filter(fileSlice -> {
+              Option<String> basePath = fileSlice.getBaseFile().map(BaseFile::getPath);
+              try {
+                return basePath.isPresent() && fs.exists(new org.apache.hadoop.fs.Path(basePath.get()));
+              } catch (IOException e) {
+                LOG.error("Checking exists of base path: {} error", basePath);

Review Comment:
   Yes, it's a better way.





[GitHub] [hudi] hudi-bot commented on pull request #5953: [HUDI-4314] Improve the performance of reading from the specified ins…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5953:
URL: https://github.com/apache/hudi/pull/5953#issuecomment-1173322769

   ## CI report:
   
   * 8c4392fef351bac9a477788bf6c202fd9684ed27 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9694) 
   