Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/07/19 07:15:13 UTC

[GitHub] [hudi] boneanxs opened a new pull request, #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

boneanxs opened a new pull request, #6141:
URL: https://github.com/apache/hudi/pull/6141

   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
     - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   
   *(or)*
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
     - *Added integration tests for end-to-end.*
     - *Added HoodieClientWriteTest to verify the change.*
     - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1193925338

   ## CI report:
   
   * 9920828682bf32a18f0bc5455d113afc71c09820 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10309) 
   
   Bot commands: @hudi-bot supports the following commands:

    - `@hudi-bot run azure` re-run the last Azure build




[GitHub] [hudi] boneanxs commented on a diff in pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6141:
URL: https://github.com/apache/hudi/pull/6141#discussion_r924288916


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadIncrementalRelation.scala:
##########
@@ -141,14 +145,47 @@ trait HoodieIncrementalRelationTrait extends HoodieBaseRelation {
   // Validate this Incremental implementation is properly configured
   validate()
 
-  protected lazy val includedCommits: immutable.Seq[HoodieInstant] = timeline.getInstants.iterator().asScala.toList
+  protected def startTimestamp: String = optParams(DataSourceReadOptions.BEGIN_INSTANTTIME.key)
+  protected def endTimestamp: String = optParams.getOrElse(DataSourceReadOptions.END_INSTANTTIME.key, super.timeline.lastInstant().get.getTimestamp)
+
+  protected def startOutOfRange: Boolean = super.timeline.isBeforeTimelineStarts(startTimestamp)
+  protected def endOutOfRange: Boolean = super.timeline.isBeforeTimelineStarts(endTimestamp)
+
+  // Fall back to full table scan if any of the following conditions holds:
+  //   1. the start commit is archived
+  //   2. the end commit is archived
+  //   3. files referenced in the metadata have been deleted
+  protected lazy val fullTableScan: Boolean = {
+    val fallbackToFullTableScan = optParams.getOrElse(DataSourceReadOptions.INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN_FOR_NON_EXISTING_FILES.key,

Review Comment:
   Should we align with the Flink side here, which does not introduce a new param to control whether to enable fullTableScan?
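   For context, a minimal sketch of how a reader would opt into this fallback on the Spark read
   path, if the param stays (option keys are taken from the diff above; the table path and the
   instant times are placeholders, not values from this PR):

       // Hypothetical incremental read that enables the full-table-scan fallback.
       val incDF = spark.read.format("org.apache.hudi")
         .option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
         .option(DataSourceReadOptions.BEGIN_INSTANTTIME.key(), "20220101000000")
         .option(DataSourceReadOptions.END_INSTANTTIME.key(), "20220201000000")
         .option(DataSourceReadOptions.INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN_FOR_NON_EXISTING_FILES.key(), "true")
         .load(basePath)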





[GitHub] [hudi] danny0405 commented on a diff in pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
danny0405 commented on code in PR #6141:
URL: https://github.com/apache/hudi/pull/6141#discussion_r929662544


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadIncrementalRelation.scala:
##########
@@ -124,14 +128,48 @@ trait HoodieIncrementalRelationTrait extends HoodieBaseRelation {
   // Validate this Incremental implementation is properly configured
   validate()
 
-  protected lazy val includedCommits: immutable.Seq[HoodieInstant] = timeline.getInstants.iterator().asScala.toList
+  protected def startTimestamp: String = optParams(DataSourceReadOptions.BEGIN_INSTANTTIME.key)
+  protected def endTimestamp: String = optParams.getOrElse(DataSourceReadOptions.END_INSTANTTIME.key, super.timeline.lastInstant().get.getTimestamp)
+
+  protected def startOutOfRange: Boolean = super.timeline.isBeforeTimelineStarts(startTimestamp)
+  protected def endOutOfRange: Boolean = super.timeline.isBeforeTimelineStarts(endTimestamp)
+
+  // Fall back to full table scan if any of the following conditions holds:
+  //   1. the start commit is archived
+  //   2. the end commit is archived
+  //   3. files referenced in the metadata have been deleted
+  protected lazy val fullTableScan: Boolean = {
+    val fallbackToFullTableScan = optParams.getOrElse(DataSourceReadOptions.INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN_FOR_NON_EXISTING_FILES.key,
+      DataSourceReadOptions.INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN_FOR_NON_EXISTING_FILES.defaultValue).toBoolean
+
+    fallbackToFullTableScan && (startOutOfRange || endOutOfRange || affectedFilesInCommits.exists(fileStatus => !metaClient.getFs.exists(fileStatus.getPath)))
+  }
+
+  protected lazy val includedCommits: immutable.Seq[HoodieInstant] = {
+    if (!endOutOfRange) {
+      // If the endTimestamp commit is not archived, keep only the instants
+      // within the requested [startTimestamp, endTimestamp] range.
+      super.timeline.findInstantsInRange(startTimestamp, endTimestamp).getInstants.iterator().asScala.toList
+    } else {
+      super.timeline.getInstants.iterator().asScala.toList
+    }

Review Comment:
   Is this right? Why not filter the instants with the range?
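   A sketch of the alternative this question points at (names follow the diff above; whether
   `findInstantsInRange` behaves sensibly when the end instant is archived is an assumption):

       // Always bound the included instants by the requested range, instead of
       // returning the whole active timeline when the end commit is archived.
       protected lazy val includedCommits: immutable.Seq[HoodieInstant] =
         super.timeline.findInstantsInRange(startTimestamp, endTimestamp)
           .getInstants.iterator().asScala.toList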





[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1197973305

   ## CI report:
   
   * 9920828682bf32a18f0bc5455d113afc71c09820 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10309) 
   * ece3f0c7a06af08feda421f183c984fd75ef9526 UNKNOWN
   




[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1203550484

   ## CI report:
   
   * ece3f0c7a06af08feda421f183c984fd75ef9526 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10421) 
   * 23f96b3ecc8812ffae7f9e692e883cdabba03eb0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10541) 
   




[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1203558115

   ## CI report:
   
   * 23f96b3ecc8812ffae7f9e692e883cdabba03eb0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10541) 
   




[GitHub] [hudi] danny0405 commented on a diff in pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
danny0405 commented on code in PR #6141:
URL: https://github.com/apache/hudi/pull/6141#discussion_r929616286


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/IncrementalRelation.scala:
##########
@@ -188,71 +196,90 @@ class IncrementalRelation(val sqlContext: SQLContext,
         case HoodieFileFormat.ORC => "orc"
       }
       sqlContext.sparkContext.hadoopConfiguration.unset("mapreduce.input.pathFilter.class")
+
+      // Fall back to full table scan if any of the following conditions holds:
+      //   1. the start commit is archived
+      //   2. the end commit is archived
+      //   3. files referenced in the metadata have been deleted
+      val fallbackToFullTableScan = optParams.getOrElse(DataSourceReadOptions.INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN_FOR_NON_EXISTING_FILES.key,
+        DataSourceReadOptions.INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN_FOR_NON_EXISTING_FILES.defaultValue).toBoolean
+
       val sOpts = optParams.filter(p => !p._1.equalsIgnoreCase("path"))
-      if (filteredRegularFullPaths.isEmpty && filteredMetaBootstrapFullPaths.isEmpty) {
-        sqlContext.sparkContext.emptyRDD[Row]
-      } else {
-        log.info("Additional Filters to be applied to incremental source are :" + filters.mkString("Array(", ", ", ")"))
 
-        var df: DataFrame = sqlContext.createDataFrame(sqlContext.sparkContext.emptyRDD[Row], usedSchema)
+      val startInstantTime = optParams(DataSourceReadOptions.BEGIN_INSTANTTIME.key)
+      val startInstantArchived = startInstantTime.compareTo(commitTimeline.firstInstant().get().getTimestamp) < 0 // True if startInstantTime < activeTimeline.first

Review Comment:
   Does `HoodieTimeline#isBeforeTimelineStarts` work here?
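   A sketch of that suggestion (assuming `commitTimeline` here is a `HoodieTimeline`):

       // Let the timeline helper decide, instead of comparing against
       // the first active instant by hand.
       val startInstantArchived = commitTimeline.isBeforeTimelineStarts(startInstantTime)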





[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1197977726

   ## CI report:
   
   * 9920828682bf32a18f0bc5455d113afc71c09820 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10309) 
   * ece3f0c7a06af08feda421f183c984fd75ef9526 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10421) 
   




[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1207759214

   ## CI report:
   
   * 2a493fcafb42e21cbfcae3787ab30853319f4bf3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10596) 
   * 2de4df9e16f88a4813d404ba2111a9b4db19c03b UNKNOWN
   




[GitHub] [hudi] danny0405 commented on a diff in pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
danny0405 commented on code in PR #6141:
URL: https://github.com/apache/hudi/pull/6141#discussion_r929614662


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/IncrementalRelation.scala:
##########
@@ -71,6 +71,14 @@ class IncrementalRelation(val sqlContext: SQLContext,
     throw new HoodieException(s"Specify the begin instant time to pull from using " +
       s"option ${DataSourceReadOptions.BEGIN_INSTANTTIME.key}")
   }
+
+  if (optParams.contains(DataSourceReadOptions.END_INSTANTTIME.key())) {
+    val startInstantTime = optParams(DataSourceReadOptions.BEGIN_INSTANTTIME.key)
+    val endInstantTime = optParams(DataSourceReadOptions.END_INSTANTTIME.key())
+    if (endInstantTime.compareTo(startInstantTime) < 0) {
+      throw new HoodieException("The begin instant time cannot be larger than the end instant time")
+    }
+  }

Review Comment:
   This should be a valid case that just returns an empty data set.
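   A sketch of that behavior, reusing the empty-RDD path this relation already has (putting the
   check on the scan path rather than in the constructor is an assumption):

       // Treat an inverted range (end < begin) as an empty result instead of an error.
       if (endInstantTime.compareTo(startInstantTime) < 0) {
         sqlContext.sparkContext.emptyRDD[Row]
       } else {
         // ... proceed with the incremental scan as before
       }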





[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1214556609

   ## CI report:
   
   * c192b29f176c4d861360fd9c70728a57b8ef2926 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10733) 
   * 60c68b0bd40a9a681f2865426e9b5bd2152e9931 UNKNOWN
   




[GitHub] [hudi] boneanxs commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
boneanxs commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1214652833

   @danny0405 the CI passed~




[GitHub] [hudi] boneanxs commented on a diff in pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6141:
URL: https://github.com/apache/hudi/pull/6141#discussion_r925330861


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadIncrementalRelation.scala:
##########
@@ -141,14 +145,47 @@ trait HoodieIncrementalRelationTrait extends HoodieBaseRelation {
   // Validate this Incremental implementation is properly configured
   validate()
 
-  protected lazy val includedCommits: immutable.Seq[HoodieInstant] = timeline.getInstants.iterator().asScala.toList
+  protected def startTimestamp: String = optParams(DataSourceReadOptions.BEGIN_INSTANTTIME.key)
+  protected def endTimestamp: String = optParams.getOrElse(DataSourceReadOptions.END_INSTANTTIME.key, super.timeline.lastInstant().get.getTimestamp)
+
+  protected def startOutOfRange: Boolean = super.timeline.isBeforeTimelineStarts(startTimestamp)
+  protected def endOutOfRange: Boolean = super.timeline.isBeforeTimelineStarts(endTimestamp)
+
+  // Fall back to full table scan if any of the following conditions holds:
+  //   1. the start commit is archived
+  //   2. the end commit is archived
+  //   3. files referenced in the metadata have been deleted
+  protected lazy val fullTableScan: Boolean = {
+    val fallbackToFullTableScan = optParams.getOrElse(DataSourceReadOptions.INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN_FOR_NON_EXISTING_FILES.key,

Review Comment:
   > And even if end commit is out of range, the case that the end commit is greater than the latest commit is a valid case
   
   Yeah, it looks like `IncrementalRelation` doesn't support this; I'll fix it as well...
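   One hypothetical shape for that fix (an assumption, not necessarily the PR's actual change):
   clamp the effective end instant to the latest active commit when the requested end is newer.

       // If the requested end is beyond the latest commit, read up to the latest commit.
       val lastInstantTime = commitTimeline.lastInstant().get().getTimestamp
       val effectiveEndInstantTime =
         if (endInstantTime.compareTo(lastInstantTime) > 0) lastInstantTime else endInstantTime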





[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1189211544

   ## CI report:
   
   * ce0b46f4460ee1f9c80cdfdef9824b5c5711135c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10060) 
   




[GitHub] [hudi] boneanxs commented on a diff in pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6141:
URL: https://github.com/apache/hudi/pull/6141#discussion_r936278070


##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestIncrementalReadWithFullTableScan.scala:
##########
@@ -0,0 +1,187 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.functional
+
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.model.HoodieTableType
+import org.apache.hudi.{DataSourceReadOptions, DataSourceWriteOptions}
+import org.apache.hudi.common.table.HoodieTableMetaClient
+import org.apache.hudi.common.table.timeline.{HoodieInstant, HoodieInstantTimeGenerator, HoodieTimeline}
+import org.apache.hudi.common.table.timeline.HoodieTimeline.GREATER_THAN
+import org.apache.hudi.common.testutils.RawTripTestPayload.recordsToStrings
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.testutils.HoodieClientTestBase
+import org.apache.log4j.LogManager
+import org.apache.spark.SparkException
+import org.apache.spark.sql.{AnalysisException, SaveMode, SparkSession}
+import org.junit.jupiter.api.Assertions.{assertEquals, assertThrows, assertTrue}
+import org.junit.jupiter.api.{AfterEach, BeforeEach}
+import org.junit.jupiter.api.function.Executable
+import org.junit.jupiter.params.ParameterizedTest
+import org.junit.jupiter.params.provider.EnumSource
+
+import scala.collection.JavaConversions.asScalaBuffer
+
+class TestIncrementalReadWithFullTableScan extends HoodieClientTestBase {
+
+  var spark: SparkSession = null
+  private val log = LogManager.getLogger(classOf[TestIncrementalReadWithFullTableScan])
+
+  private val perBatchSize = 100
+
+  val commonOpts = Map(
+    "hoodie.insert.shuffle.parallelism" -> "4",
+    "hoodie.upsert.shuffle.parallelism" -> "4",
+    DataSourceWriteOptions.RECORDKEY_FIELD.key -> "_row_key",
+    DataSourceWriteOptions.PARTITIONPATH_FIELD.key -> "partition",
+    DataSourceWriteOptions.PRECOMBINE_FIELD.key -> "timestamp",
+    HoodieWriteConfig.TBL_NAME.key -> "hoodie_test",
+    HoodieMetadataConfig.COMPACT_NUM_DELTA_COMMITS.key -> "1"
+  )
+
+
+  val verificationCol: String = "driver"
+  val updatedVerificationVal: String = "driver_update"
+
+  @BeforeEach override def setUp() {
+    setTableName("hoodie_test")
+    initPath()
+    initSparkContexts()
+    spark = sqlContext.sparkSession
+    initTestDataGenerator()
+    initFileSystem()
+  }
+
+  @AfterEach override def tearDown() = {
+    cleanupSparkContexts()
+    cleanupTestDataGenerator()
+    cleanupFileSystem()
+  }
+
+  @ParameterizedTest
+  @EnumSource(value = classOf[HoodieTableType])
+  def testFailEarlyForIncrViewQueryForNonExistingFiles(tableType: HoodieTableType): Unit = {
+    // Create 10 commits
+    for (i <- 1 to 10) {
+      val records = recordsToStrings(dataGen.generateInserts("%05d".format(i), perBatchSize)).toList
+      val inputDF = spark.read.json(spark.sparkContext.parallelize(records, 2))
+      inputDF.write.format("org.apache.hudi")
+        .options(commonOpts)
+        .option(DataSourceWriteOptions.TABLE_TYPE.key, tableType.name())
+        .option("hoodie.cleaner.commits.retained", "3")
+        .option("hoodie.keep.min.commits", "4")
+        .option("hoodie.keep.max.commits", "5")
+        .option(DataSourceWriteOptions.OPERATION.key(), DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+        .mode(SaveMode.Append)
+        .save(basePath)
+    }
+
+    val hoodieMetaClient = HoodieTableMetaClient.builder().setConf(spark.sparkContext.hadoopConfiguration).setBasePath(basePath).setLoadActiveTimelineOnLoad(true).build()
+    /**
+     * State of timeline after 10 commits
+     * +------------------+--------------------------------------+
+     * |     Archived     |            Active Timeline           |
+     * +------------------+--------------+-----------------------+
+     * | C0   C1   C2  C3 |    C4   C5   |   C6    C7   C8   C9  |
+     * +------------------+--------------+-----------------------+
+     * |          Data cleaned           |  Data exists in table |
+     * +---------------------------------+-----------------------+
+     */
+
+    val completedCommits = hoodieMetaClient.getCommitsTimeline.filterCompletedInstants() // C4 to C9
+    val archivedInstants = hoodieMetaClient.getArchivedTimeline.filterCompletedInstants()
+      .getInstants.distinct().toArray // C0 to C3
+
+    // Anything less than 2 is a valid commit, in the sense that no cleanup has been done for those commit files
+    val startUnarchivedCommitTs = completedCommits.nthInstant(0).get().getTimestamp //C4
+    val endUnarchivedCommitTs = completedCommits.nthInstant(1).get().getTimestamp //C5
+
+    val startArchivedCommitTs = archivedInstants(0).asInstanceOf[HoodieInstant].getTimestamp //C0
+    val endArchivedCommitTs = archivedInstants(1).asInstanceOf[HoodieInstant].getTimestamp //C1
+
+    val startOutOfRangeCommitTs = HoodieInstantTimeGenerator.createNewInstantTime(0)
+    val endOutOfRangeCommitTs = HoodieInstantTimeGenerator.createNewInstantTime(0)
+
+    assertTrue(HoodieTimeline.compareTimestamps(startOutOfRangeCommitTs, GREATER_THAN, completedCommits.lastInstant().get().getTimestamp))
+    assertTrue(HoodieTimeline.compareTimestamps(endOutOfRangeCommitTs, GREATER_THAN, completedCommits.lastInstant().get().getTimestamp))
+
+    // Test both start and end commits are archived
+    runIncrementalQueryAndCompare(startArchivedCommitTs, endArchivedCommitTs, 1, true)
+
+    // Test start commit is archived, end commit is not archived
+    shouldThrowIfFallbackIsFalse(tableType,
+      () => runIncrementalQueryAndCompare(startArchivedCommitTs, endUnarchivedCommitTs, 5, false))
+    runIncrementalQueryAndCompare(startArchivedCommitTs, endUnarchivedCommitTs, 5, true)
+
+    // Test both start commit and end commits are not archived but got cleaned
+    shouldThrowIfFallbackIsFalse(tableType,
+      () => runIncrementalQueryAndCompare(startUnarchivedCommitTs, endUnarchivedCommitTs, 1, false))
+    runIncrementalQueryAndCompare(startUnarchivedCommitTs, endUnarchivedCommitTs, 1, true)
+
+    // Test start commit is not archived, end commit is out of the timeline
+    runIncrementalQueryAndCompare(startUnarchivedCommitTs, endOutOfRangeCommitTs, 5, true)
+
+    // Test both start commit and end commits are out of the timeline
+    runIncrementalQueryAndCompare(startOutOfRangeCommitTs, endOutOfRangeCommitTs, 0, false)
+    runIncrementalQueryAndCompare(startOutOfRangeCommitTs, endOutOfRangeCommitTs, 0, true)
+
+    // Test end commit is smaller than the start commit
+    runIncrementalQueryAndCompare(endUnarchivedCommitTs, startUnarchivedCommitTs, 0, false)
+    runIncrementalQueryAndCompare(endUnarchivedCommitTs, startUnarchivedCommitTs, 0, true)
+
+    // Test both start and end commits are not archived and not cleaned
+    val reversedCommits = completedCommits.getReverseOrderedInstants.toArray
+    val startUncleanedCommitTs = reversedCommits.apply(1).asInstanceOf[HoodieInstant].getTimestamp
+    val endUncleanedCommitTs = reversedCommits.apply(0).asInstanceOf[HoodieInstant].getTimestamp
+    runIncrementalQueryAndCompare(startUncleanedCommitTs, endUncleanedCommitTs, 1, true)
+    runIncrementalQueryAndCompare(startUncleanedCommitTs, endUncleanedCommitTs, 1, false)
+  }
+
+  private def runIncrementalQueryAndCompare(
+      startTs: String,
+      endTs: String,
+      batchNum: Int,
+      fallBackFullTableScan: Boolean): Unit = {
+    val hoodieIncViewDF = spark.read.format("org.apache.hudi")
+      .option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
+      .option(DataSourceReadOptions.BEGIN_INSTANTTIME.key(), startTs)
+      .option(DataSourceReadOptions.END_INSTANTTIME.key(), endTs)
+      .option(DataSourceReadOptions.INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN_FOR_NON_EXISTING_FILES.key(), fallBackFullTableScan)
+      .load(basePath)
+    assertEquals(perBatchSize * batchNum, hoodieIncViewDF.count())
+  }
+
+  private def shouldThrowIfFallbackIsFalse(tableType: HoodieTableType, fn: () => Unit): Unit = {
+    val msg = "Should fail with Path does not exist"
+    tableType match {
+      case HoodieTableType.COPY_ON_WRITE =>
+        assertThrows(classOf[AnalysisException], new Executable {
+          override def execute(): Unit = {
+            fn()
+          }
+        }, msg)
+      case HoodieTableType.MERGE_ON_READ =>
+        val exp = assertThrows(classOf[SparkException], new Executable {
+          override def execute(): Unit = {
+            fn()
+          }
+        }, msg)
+        assertTrue(exp.getMessage.contains("FileNotFoundException"))
+    }
+  }

Review Comment:
   It will throw two different exception types for MOR tables and COW tables:
   
   - for a COW table it throws `AnalysisException`
   - for a MOR table it throws `SparkException`





[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1206007248

   ## CI report:
   
   * 23f96b3ecc8812ffae7f9e692e883cdabba03eb0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10541) 
   * 2a493fcafb42e21cbfcae3787ab30853319f4bf3 UNKNOWN
   




[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1188719472

   ## CI report:
   
   * ce0b46f4460ee1f9c80cdfdef9824b5c5711135c UNKNOWN
   




[GitHub] [hudi] boneanxs commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
boneanxs commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1194072015

   @danny0405 @nsivabalan could you please review this?




[GitHub] [hudi] danny0405 commented on a diff in pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
danny0405 commented on code in PR #6141:
URL: https://github.com/apache/hudi/pull/6141#discussion_r925285786


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadIncrementalRelation.scala:
##########
@@ -141,14 +145,47 @@ trait HoodieIncrementalRelationTrait extends HoodieBaseRelation {
   // Validate this Incremental implementation is properly configured
   validate()
 
-  protected lazy val includedCommits: immutable.Seq[HoodieInstant] = timeline.getInstants.iterator().asScala.toList
+  protected def startTimestamp: String = optParams(DataSourceReadOptions.BEGIN_INSTANTTIME.key)
+  protected def endTimestamp: String = optParams.getOrElse(DataSourceReadOptions.END_INSTANTTIME.key, super.timeline.lastInstant().get.getTimestamp)
+
+  protected def startOutOfRange: Boolean = super.timeline.isBeforeTimelineStarts(startTimestamp)
+  protected def endOutOfRange: Boolean = super.timeline.isBeforeTimelineStarts(endTimestamp)
+
+  // Fall back to full table scan if any of the following conditions holds:
+  //   1. the start commit is archived
+  //   2. the end commit is archived
+  //   3. files referenced in the metadata have been deleted
+  protected lazy val fullTableScan: Boolean = {
+    val fallbackToFullTableScan = optParams.getOrElse(DataSourceReadOptions.INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN_FOR_NON_EXISTING_FILES.key,

Review Comment:
   The main change seems to add an end-commit out-of-range check, so maybe we should add a
   constraint that the end commit must be greater than the start commit.
   
   And even if the end commit is out of range, the case where the end commit is greater than
   the latest commit is a valid case.
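   A sketch of that constraint (where it lives, e.g. in `validate()`, and its exact wording are
   assumptions):

       // Reject only an inverted range; an end commit newer than the latest
       // active commit stays valid.
       if (HoodieTimeline.compareTimestamps(endTimestamp, HoodieTimeline.LESSER_THAN, startTimestamp)) {
         throw new HoodieException(
           s"End instant time ($endTimestamp) must not be earlier than begin instant time ($startTimestamp)")
       }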





[GitHub] [hudi] danny0405 commented on a diff in pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
danny0405 commented on code in PR #6141:
URL: https://github.com/apache/hudi/pull/6141#discussion_r935203531


##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestIncrementalReadWithFullTableScan.scala:
##########
@@ -0,0 +1,187 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.functional
+
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.model.HoodieTableType
+import org.apache.hudi.{DataSourceReadOptions, DataSourceWriteOptions}
+import org.apache.hudi.common.table.HoodieTableMetaClient
+import org.apache.hudi.common.table.timeline.{HoodieInstant, HoodieInstantTimeGenerator, HoodieTimeline}
+import org.apache.hudi.common.table.timeline.HoodieTimeline.GREATER_THAN
+import org.apache.hudi.common.testutils.RawTripTestPayload.recordsToStrings
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.testutils.HoodieClientTestBase
+import org.apache.log4j.LogManager
+import org.apache.spark.SparkException
+import org.apache.spark.sql.{AnalysisException, SaveMode, SparkSession}
+import org.junit.jupiter.api.Assertions.{assertEquals, assertThrows, assertTrue}
+import org.junit.jupiter.api.{AfterEach, BeforeEach}
+import org.junit.jupiter.api.function.Executable
+import org.junit.jupiter.params.ParameterizedTest
+import org.junit.jupiter.params.provider.EnumSource
+
+import scala.collection.JavaConversions.asScalaBuffer
+
+class TestIncrementalReadWithFullTableScan extends HoodieClientTestBase {
+
+  var spark: SparkSession = null
+  private val log = LogManager.getLogger(classOf[TestIncrementalReadWithFullTableScan])
+
+  private val perBatchSize = 100
+
+  val commonOpts = Map(
+    "hoodie.insert.shuffle.parallelism" -> "4",
+    "hoodie.upsert.shuffle.parallelism" -> "4",
+    DataSourceWriteOptions.RECORDKEY_FIELD.key -> "_row_key",
+    DataSourceWriteOptions.PARTITIONPATH_FIELD.key -> "partition",
+    DataSourceWriteOptions.PRECOMBINE_FIELD.key -> "timestamp",
+    HoodieWriteConfig.TBL_NAME.key -> "hoodie_test",
+    HoodieMetadataConfig.COMPACT_NUM_DELTA_COMMITS.key -> "1"
+  )
+
+
+  val verificationCol: String = "driver"
+  val updatedVerificationVal: String = "driver_update"
+
+  @BeforeEach override def setUp() {
+    setTableName("hoodie_test")
+    initPath()
+    initSparkContexts()
+    spark = sqlContext.sparkSession
+    initTestDataGenerator()
+    initFileSystem()
+  }
+
+  @AfterEach override def tearDown() = {
+    cleanupSparkContexts()
+    cleanupTestDataGenerator()
+    cleanupFileSystem()
+  }
+
+  @ParameterizedTest
+  @EnumSource(value = classOf[HoodieTableType])
+  def testFailEarlyForIncrViewQueryForNonExistingFiles(tableType: HoodieTableType): Unit = {
+    // Create 10 commits
+    for (i <- 1 to 10) {
+      val records = recordsToStrings(dataGen.generateInserts("%05d".format(i), perBatchSize)).toList
+      val inputDF = spark.read.json(spark.sparkContext.parallelize(records, 2))
+      inputDF.write.format("org.apache.hudi")
+        .options(commonOpts)
+        .option(DataSourceWriteOptions.TABLE_TYPE.key, tableType.name())
+        .option("hoodie.cleaner.commits.retained", "3")
+        .option("hoodie.keep.min.commits", "4")
+        .option("hoodie.keep.max.commits", "5")
+        .option(DataSourceWriteOptions.OPERATION.key(), DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+        .mode(SaveMode.Append)
+        .save(basePath)
+    }
+
+    val hoodieMetaClient = HoodieTableMetaClient.builder().setConf(spark.sparkContext.hadoopConfiguration).setBasePath(basePath).setLoadActiveTimelineOnLoad(true).build()
+    /**
+     * State of timeline after 10 commits
+     * +------------------+--------------------------------------+
+     * |     Archived     |            Active Timeline           |
+     * +------------------+--------------+-----------------------+
+     * | C0   C1   C2  C3 |    C4   C5   |   C6    C7   C8   C9  |
+     * +------------------+--------------+-----------------------+
+     * |          Data cleaned           |  Data exists in table |
+     * +---------------------------------+-----------------------+
+     */
+
+    val completedCommits = hoodieMetaClient.getCommitsTimeline.filterCompletedInstants() // C4 to C9
+    val archivedInstants = hoodieMetaClient.getArchivedTimeline.filterCompletedInstants()
+      .getInstants.distinct().toArray // C0 to C3
+
+    // C4 and C5 are still on the active timeline, but their data files have already been cleaned
+    val startUnarchivedCommitTs = completedCommits.nthInstant(0).get().getTimestamp //C4
+    val endUnarchivedCommitTs = completedCommits.nthInstant(1).get().getTimestamp //C5
+
+    val startArchivedCommitTs = archivedInstants(0).asInstanceOf[HoodieInstant].getTimestamp //C0
+    val endArchivedCommitTs = archivedInstants(1).asInstanceOf[HoodieInstant].getTimestamp //C1
+
+    val startOutOfRangeCommitTs = HoodieInstantTimeGenerator.createNewInstantTime(0)
+    val endOutOfRangeCommitTs = HoodieInstantTimeGenerator.createNewInstantTime(0)
+
+    assertTrue(HoodieTimeline.compareTimestamps(startOutOfRangeCommitTs, GREATER_THAN, completedCommits.lastInstant().get().getTimestamp))
+    assertTrue(HoodieTimeline.compareTimestamps(endOutOfRangeCommitTs, GREATER_THAN, completedCommits.lastInstant().get().getTimestamp))
+
+    // Test both start and end commits are archived
+    runIncrementalQueryAndCompare(startArchivedCommitTs, endArchivedCommitTs, 1, true)
+
+    // Test start commit is archived, end commit is not archived
+    shouldThrowIfFallbackIsFalse(tableType,
+      () => runIncrementalQueryAndCompare(startArchivedCommitTs, endUnarchivedCommitTs, 5, false))
+    runIncrementalQueryAndCompare(startArchivedCommitTs, endUnarchivedCommitTs, 5, true)
+
+    // Test both start and end commits are not archived, but their data files have been cleaned
+    shouldThrowIfFallbackIsFalse(tableType,
+      () => runIncrementalQueryAndCompare(startUnarchivedCommitTs, endUnarchivedCommitTs, 1, false))
+    runIncrementalQueryAndCompare(startUnarchivedCommitTs, endUnarchivedCommitTs, 1, true)
+
+    // Test start commit is not archived, end commit is out of the timeline
+    runIncrementalQueryAndCompare(startUnarchivedCommitTs, endOutOfRangeCommitTs, 5, true)
+
+    // Test both start and end commits are out of the timeline
+    runIncrementalQueryAndCompare(startOutOfRangeCommitTs, endOutOfRangeCommitTs, 0, false)
+    runIncrementalQueryAndCompare(startOutOfRangeCommitTs, endOutOfRangeCommitTs, 0, true)
+
+    // Test end commit is smaller than the start commit
+    runIncrementalQueryAndCompare(endUnarchivedCommitTs, startUnarchivedCommitTs, 0, false)
+    runIncrementalQueryAndCompare(endUnarchivedCommitTs, startUnarchivedCommitTs, 0, true)
+
+    // Test both start and end commits are neither archived nor cleaned
+    val reversedCommits = completedCommits.getReverseOrderedInstants.toArray
+    val startUncleanedCommitTs = reversedCommits.apply(1).asInstanceOf[HoodieInstant].getTimestamp
+    val endUncleanedCommitTs = reversedCommits.apply(0).asInstanceOf[HoodieInstant].getTimestamp
+    runIncrementalQueryAndCompare(startUncleanedCommitTs, endUncleanedCommitTs, 1, true)
+    runIncrementalQueryAndCompare(startUncleanedCommitTs, endUncleanedCommitTs, 1, false)
+  }
+
+  private def runIncrementalQueryAndCompare(
+      startTs: String,
+      endTs: String,
+      batchNum: Int,
+      fallBackFullTableScan: Boolean): Unit = {
+    val hoodieIncViewDF = spark.read.format("org.apache.hudi")
+      .option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
+      .option(DataSourceReadOptions.BEGIN_INSTANTTIME.key(), startTs)
+      .option(DataSourceReadOptions.END_INSTANTTIME.key(), endTs)
+      .option(DataSourceReadOptions.INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN_FOR_NON_EXISTING_FILES.key(), fallBackFullTableScan)
+      .load(basePath)
+    assertEquals(perBatchSize * batchNum, hoodieIncViewDF.count())
+  }
+
+  private def shouldThrowIfFallbackIsFalse(tableType: HoodieTableType, fn: () => Unit): Unit = {
+    val msg = "Should fail with Path does not exist"
+    tableType match {
+      case HoodieTableType.COPY_ON_WRITE =>
+        assertThrows(classOf[AnalysisException], new Executable {
+          override def execute(): Unit = {
+            fn()
+          }
+        }, msg)
+      case HoodieTableType.MERGE_ON_READ =>
+        val exp = assertThrows(classOf[SparkException], new Executable {
+          override def execute(): Unit = {
+            fn()
+          }
+        }, msg)
+        assertTrue(exp.getMessage.contains("FileNotFoundException"))
+    }
+  }

Review Comment:
   Can we just execute the code block first:
   ```scala
           assertThrows(classOf[AnalysisException], new Executable {
             override def execute(): Unit = {
               fn()
             }
           }, msg)
   ```
   
   Then check the exception message, especially for the MERGE_ON_READ table?
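   
   A minimal sketch of that shape, assuming a shared `Exception` supertype works for both table types (hypothetical, not the code in this PR):
   
   ```scala
   // Execute the block once, asserting only the common supertype...
   val exp = assertThrows(classOf[Exception], new Executable {
     override def execute(): Unit = fn()
   }, msg)
   // ...then narrow the check per table type
   tableType match {
     case HoodieTableType.COPY_ON_WRITE =>
       assertTrue(exp.isInstanceOf[AnalysisException])
     case HoodieTableType.MERGE_ON_READ =>
       assertTrue(exp.isInstanceOf[SparkException])
       assertTrue(exp.getMessage.contains("FileNotFoundException"))
   }
   ```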



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] danny0405 commented on a diff in pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
danny0405 commented on code in PR #6141:
URL: https://github.com/apache/hudi/pull/6141#discussion_r935199186


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadIncrementalRelation.scala:
##########
@@ -124,14 +128,48 @@ trait HoodieIncrementalRelationTrait extends HoodieBaseRelation {
   // Validate this Incremental implementation is properly configured
   validate()
 
-  protected lazy val includedCommits: immutable.Seq[HoodieInstant] = timeline.getInstants.iterator().asScala.toList
+  protected def startTimestamp: String = optParams(DataSourceReadOptions.BEGIN_INSTANTTIME.key)
+  protected def endTimestamp: String = optParams.getOrElse(DataSourceReadOptions.END_INSTANTTIME.key, super.timeline.lastInstant().get.getTimestamp)
+
+  protected def startOutOfRange: Boolean = super.timeline.isBeforeTimelineStarts(startTimestamp)
+  protected def endOutOfRange: Boolean = super.timeline.isBeforeTimelineStarts(endTimestamp)
+
+  // Fall back to full table scan if any of the following conditions is met:
+  //   1. the start commit is archived
+  //   2. the end commit is archived
+  //   3. some files referenced in the commit metadata have been deleted
+  protected lazy val fullTableScan: Boolean = {
+    val fallbackToFullTableScan = optParams.getOrElse(DataSourceReadOptions.INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN_FOR_NON_EXISTING_FILES.key,
+      DataSourceReadOptions.INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN_FOR_NON_EXISTING_FILES.defaultValue).toBoolean
+
+    fallbackToFullTableScan && (startOutOfRange || endOutOfRange || affectedFilesInCommits.exists(fileStatus => !metaClient.getFs.exists(fileStatus.getPath)))
+  }
+
+  protected lazy val includedCommits: immutable.Seq[HoodieInstant] = {
+    if (!endOutOfRange) {
+      // If the endTimestamp commit is not archived, only keep instants
+      // up to and including endTimestamp.
+      super.timeline.findInstantsInRange(startTimestamp, endTimestamp).getInstants.iterator().asScala.toList
+    } else {
+      super.timeline.getInstants.iterator().asScala.toList
+    }

Review Comment:
   But we should also take the start instant into consideration.
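   
   One hedged sketch of that adjustment, reusing the `startOutOfRange` / `endOutOfRange` flags from this hunk (not the final code in this PR):
   
   ```scala
   protected lazy val includedCommits: immutable.Seq[HoodieInstant] = {
     if (!startOutOfRange && !endOutOfRange) {
       // Both boundaries are still on the active timeline, so the range
       // filter is safe to apply
       super.timeline.findInstantsInRange(startTimestamp, endTimestamp).getInstants.iterator().asScala.toList
     } else {
       // Either boundary is archived; keep the whole active timeline so the
       // latest commit stays available for the full-table-scan fallback
       super.timeline.getInstants.iterator().asScala.toList
     }
   }
   ```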



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] boneanxs commented on a diff in pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6141:
URL: https://github.com/apache/hudi/pull/6141#discussion_r932028988


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/IncrementalRelation.scala:
##########
@@ -188,71 +196,90 @@ class IncrementalRelation(val sqlContext: SQLContext,
         case HoodieFileFormat.ORC => "orc"
       }
       sqlContext.sparkContext.hadoopConfiguration.unset("mapreduce.input.pathFilter.class")
+
+      // Fall back to full table scan if any of the following conditions is met:
+      //   1. the start commit is archived
+      //   2. the end commit is archived
+      //   3. some files referenced in the commit metadata have been deleted
+      val fallbackToFullTableScan = optParams.getOrElse(DataSourceReadOptions.INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN_FOR_NON_EXISTING_FILES.key,
+        DataSourceReadOptions.INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN_FOR_NON_EXISTING_FILES.defaultValue).toBoolean
+
       val sOpts = optParams.filter(p => !p._1.equalsIgnoreCase("path"))
-      if (filteredRegularFullPaths.isEmpty && filteredMetaBootstrapFullPaths.isEmpty) {
-        sqlContext.sparkContext.emptyRDD[Row]
-      } else {
-        log.info("Additional Filters to be applied to incremental source are :" + filters.mkString("Array(", ", ", ")"))
 
-        var df: DataFrame = sqlContext.createDataFrame(sqlContext.sparkContext.emptyRDD[Row], usedSchema)
+      val startInstantTime = optParams(DataSourceReadOptions.BEGIN_INSTANTTIME.key)
+      val startInstantArchived = startInstantTime.compareTo(commitTimeline.firstInstant().get().getTimestamp) < 0 // True if startInstantTime < activeTimeline.first

Review Comment:
   done
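   
   For reference, a sketch of the timeline-based form of this check, assuming the `isBeforeTimelineStarts` helper used elsewhere in this PR applies here as well:
   
   ```scala
   // True if startInstantTime has already been archived off the active timeline
   val startInstantArchived = commitTimeline.isBeforeTimelineStarts(startInstantTime)
   ```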



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1188723946

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10060",
       "triggerID" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ce0b46f4460ee1f9c80cdfdef9824b5c5711135c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10060) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1189873165

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10060",
       "triggerID" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10081",
       "triggerID" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 3d4947ef1741947e8dcd5e5125837e451dda6049 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10081) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1206063311

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10060",
       "triggerID" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10081",
       "triggerID" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9920828682bf32a18f0bc5455d113afc71c09820",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10309",
       "triggerID" : "9920828682bf32a18f0bc5455d113afc71c09820",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ece3f0c7a06af08feda421f183c984fd75ef9526",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10421",
       "triggerID" : "ece3f0c7a06af08feda421f183c984fd75ef9526",
       "triggerType" : "PUSH"
     }, {
       "hash" : "23f96b3ecc8812ffae7f9e692e883cdabba03eb0",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10541",
       "triggerID" : "23f96b3ecc8812ffae7f9e692e883cdabba03eb0",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2a493fcafb42e21cbfcae3787ab30853319f4bf3",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10596",
       "triggerID" : "2a493fcafb42e21cbfcae3787ab30853319f4bf3",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 2a493fcafb42e21cbfcae3787ab30853319f4bf3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10596) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] danny0405 commented on a diff in pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
danny0405 commented on code in PR #6141:
URL: https://github.com/apache/hudi/pull/6141#discussion_r929666631


##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestIncrementalReadWithFullTableScan.scala:
##########
@@ -0,0 +1,191 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.functional
+
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.model.HoodieTableType
+import org.apache.hudi.{DataSourceReadOptions, DataSourceWriteOptions}
+import org.apache.hudi.common.table.HoodieTableMetaClient
+import org.apache.hudi.common.table.timeline.HoodieInstant
+import org.apache.hudi.common.testutils.RawTripTestPayload.recordsToStrings
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.testutils.HoodieClientTestBase
+import org.apache.log4j.LogManager
+import org.apache.spark.SparkException
+import org.apache.spark.sql.{AnalysisException, SaveMode, SparkSession}
+import org.junit.jupiter.api.Assertions.{assertEquals, assertThrows, assertTrue}
+import org.junit.jupiter.api.{AfterEach, BeforeEach}
+import org.junit.jupiter.api.function.Executable
+import org.junit.jupiter.params.ParameterizedTest
+import org.junit.jupiter.params.provider.EnumSource
+
+import scala.collection.JavaConversions.asScalaBuffer
+
+class TestIncrementalReadWithFullTableScan extends HoodieClientTestBase {
+
+  var spark: SparkSession = null
+  private val log = LogManager.getLogger(classOf[TestIncrementalReadWithFullTableScan])
+  val commonOpts = Map(
+    "hoodie.insert.shuffle.parallelism" -> "4",
+    "hoodie.upsert.shuffle.parallelism" -> "4",
+    DataSourceWriteOptions.RECORDKEY_FIELD.key -> "_row_key",
+    DataSourceWriteOptions.PARTITIONPATH_FIELD.key -> "partition",
+    DataSourceWriteOptions.PRECOMBINE_FIELD.key -> "timestamp",
+    HoodieWriteConfig.TBL_NAME.key -> "hoodie_test",
+    HoodieMetadataConfig.COMPACT_NUM_DELTA_COMMITS.key -> "1"
+  )
+
+
+  val verificationCol: String = "driver"
+  val updatedVerificationVal: String = "driver_update"
+
+  @BeforeEach override def setUp() {
+    setTableName("hoodie_test")
+    initPath()
+    initSparkContexts()
+    spark = sqlContext.sparkSession
+    initTestDataGenerator()
+    initFileSystem()
+  }
+
+  @AfterEach override def tearDown() = {
+    cleanupSparkContexts()
+    cleanupTestDataGenerator()
+    cleanupFileSystem()
+  }
+
+  @ParameterizedTest
+  @EnumSource(value = classOf[HoodieTableType])
+  def testFailEarlyForIncrViewQueryForNonExistingFiles(tableType: HoodieTableType): Unit = {
+    // Create 10 commits
+    for (i <- 1 to 10) {
+      val records = recordsToStrings(dataGen.generateInserts("%05d".format(i), 100)).toList
+      val inputDF = spark.read.json(spark.sparkContext.parallelize(records, 2))
+      inputDF.write.format("org.apache.hudi")
+        .options(commonOpts)
+        .option(DataSourceWriteOptions.TABLE_TYPE.key, tableType.name())
+        .option("hoodie.cleaner.commits.retained", "3")
+        .option("hoodie.keep.min.commits", "4")
+        .option("hoodie.keep.max.commits", "5")
+        .option(DataSourceWriteOptions.OPERATION.key(), DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+        .mode(SaveMode.Append)
+        .save(basePath)
+    }
+
+    val hoodieMetaClient = HoodieTableMetaClient.builder().setConf(spark.sparkContext.hadoopConfiguration).setBasePath(basePath).setLoadActiveTimelineOnLoad(true).build()
+    /**
+     * State of timeline after 10 commits
+     * +------------------+--------------------------------------+
+     * |     Archived     |            Active Timeline           |
+     * +------------------+--------------+-----------------------+
+     * | C0   C1   C2  C3 |    C4   C5   |   C6    C7   C8   C9  |
+     * +------------------+--------------+-----------------------+
+     * |          Data cleaned           |  Data exists in table |
+     * +---------------------------------+-----------------------+
+     */
+
+    val completedCommits = hoodieMetaClient.getCommitsTimeline.filterCompletedInstants() // C4 to C9
+    // C4 and C5 are still on the active timeline, but their data files have already been cleaned
+    var startTs = completedCommits.nthInstant(0).get().getTimestamp //C4
+    var endTs = completedCommits.nthInstant(1).get().getTimestamp //C5
+
+    // Calling without the fallback should fail with "Path does not exist"
+    var hoodieIncViewDF = spark.read.format("org.apache.hudi")
+      .option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
+      .option(DataSourceReadOptions.BEGIN_INSTANTTIME.key(), startTs)
+      .option(DataSourceReadOptions.END_INSTANTTIME.key(), endTs)
+      .load(basePath)
+
+    val msg = "Should fail with Path does not exist"
+    tableType match {
+      case HoodieTableType.COPY_ON_WRITE =>
+        assertThrows(classOf[AnalysisException], new Executable {
+          override def execute(): Unit = {
+            hoodieIncViewDF.count()
+          }
+        }, msg)
+      case HoodieTableType.MERGE_ON_READ =>
+        val exp = assertThrows(classOf[SparkException], new Executable {
+          override def execute(): Unit = {
+            hoodieIncViewDF.count()
+          }
+        }, msg)
+        assertTrue(exp.getMessage.contains("FileNotFoundException"))
+    }
+
+
+    //Should work with fallback enabled
+    hoodieIncViewDF = spark.read.format("org.apache.hudi")
+      .option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
+      .option(DataSourceReadOptions.BEGIN_INSTANTTIME.key(), startTs)
+      .option(DataSourceReadOptions.END_INSTANTTIME.key(), endTs)
+      .option(DataSourceReadOptions.INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN_FOR_NON_EXISTING_FILES.key(), "true")
+      .load(basePath)
+    assertEquals(100, hoodieIncViewDF.count())
+
+    //Test out for archived commits
+    val archivedInstants = hoodieMetaClient.getArchivedTimeline.filterCompletedInstants().getInstants.distinct().toArray
+    startTs = archivedInstants(0).asInstanceOf[HoodieInstant].getTimestamp //C0
+    endTs = completedCommits.nthInstant(1).get().getTimestamp //C5
+
+    // Calling without the fallback should fail with "Path does not exist"
+    hoodieIncViewDF = spark.read.format("org.apache.hudi")
+      .option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
+      .option(DataSourceReadOptions.BEGIN_INSTANTTIME.key(), startTs)
+      .option(DataSourceReadOptions.END_INSTANTTIME.key(), endTs)
+      .load(basePath)
+
+    tableType match {
+      case HoodieTableType.COPY_ON_WRITE =>
+        assertThrows(classOf[AnalysisException], new Executable {
+          override def execute(): Unit = {
+            hoodieIncViewDF.count()
+          }
+        }, msg)
+      case HoodieTableType.MERGE_ON_READ =>
+        val exp = assertThrows(classOf[SparkException], new Executable {
+          override def execute(): Unit = {
+            hoodieIncViewDF.count()
+          }
+        }, msg)
+        assertTrue(exp.getMessage.contains("FileNotFoundException"))
+    }
+
+    //Should work with fallback enabled
+    hoodieIncViewDF = spark.read.format("org.apache.hudi")
+      .option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
+      .option(DataSourceReadOptions.BEGIN_INSTANTTIME.key(), startTs)
+      .option(DataSourceReadOptions.END_INSTANTTIME.key(), endTs)
+      .option(DataSourceReadOptions.INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN_FOR_NON_EXISTING_FILES.key(), "true")
+      .load(basePath)
+    assertEquals(500, hoodieIncViewDF.count())

Review Comment:
   Should test the following cases:
   1. start commit is archived / end commit is archived
   2. start commit is archived / end commit is not archived
   3. start and end commits are both active
   4. start commit is active / end commit is out of range (greater than the latest commit)
   5. start and end commits are both greater than the latest commit
   6. end commit is smaller than start commit (returns an empty result directly)
   
   For cases where any commit is archived, we should also test that some files are cleaned and the query falls back to a full table scan (see the sketch below).
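   
   For reference, a condensed, table-driven sketch of these cases, reusing the timestamps and the `runIncrementalQueryAndCompare` helper from the updated test (expected batch counts follow the timeline diagram above; illustrative only, not the code in this PR):
   
   ```scala
   Seq(
     (startArchivedCommitTs,   endArchivedCommitTs,     1), // 1. both archived
     (startArchivedCommitTs,   endUnarchivedCommitTs,   5), // 2. start archived, end active
     (startUncleanedCommitTs,  endUncleanedCommitTs,    1), // 3. both active
     (startUnarchivedCommitTs, endOutOfRangeCommitTs,   5), // 4. end beyond the latest commit
     (startOutOfRangeCommitTs, endOutOfRangeCommitTs,   0), // 5. both beyond the latest commit
     (endUnarchivedCommitTs,   startUnarchivedCommitTs, 0)  // 6. end smaller than start
   ).foreach { case (start, end, batches) =>
     runIncrementalQueryAndCompare(start, end, batches, fallBackFullTableScan = true)
   }
   ```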



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] danny0405 commented on a diff in pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
danny0405 commented on code in PR #6141:
URL: https://github.com/apache/hudi/pull/6141#discussion_r925286068


##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala:
##########
@@ -857,89 +857,6 @@ class TestCOWDataSource extends HoodieClientTestBase {
     assertEquals(numRecords - numRecordsToDelete, snapshotDF2.count())
   }
 
-  @Test def testFailEarlyForIncrViewQueryForNonExistingFiles(): Unit = {
-    // Create 10 commits
-    for (i <- 1 to 10) {
-      val records = recordsToStrings(dataGen.generateInserts("%05d".format(i), 100)).toList
-      val inputDF = spark.read.json(spark.sparkContext.parallelize(records, 2))
-      inputDF.write.format("org.apache.hudi")
-        .options(commonOpts)
-        .option("hoodie.cleaner.commits.retained", "3")
-        .option("hoodie.keep.min.commits", "4")
-        .option("hoodie.keep.max.commits", "5")
-        .option(DataSourceWriteOptions.OPERATION.key(), DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
-        .mode(SaveMode.Append)
-        .save(basePath)
-    }
-

Review Comment:
   Why move the test around?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1214558458

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10060",
       "triggerID" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10081",
       "triggerID" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9920828682bf32a18f0bc5455d113afc71c09820",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10309",
       "triggerID" : "9920828682bf32a18f0bc5455d113afc71c09820",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ece3f0c7a06af08feda421f183c984fd75ef9526",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10421",
       "triggerID" : "ece3f0c7a06af08feda421f183c984fd75ef9526",
       "triggerType" : "PUSH"
     }, {
       "hash" : "23f96b3ecc8812ffae7f9e692e883cdabba03eb0",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10541",
       "triggerID" : "23f96b3ecc8812ffae7f9e692e883cdabba03eb0",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2a493fcafb42e21cbfcae3787ab30853319f4bf3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10596",
       "triggerID" : "2a493fcafb42e21cbfcae3787ab30853319f4bf3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2de4df9e16f88a4813d404ba2111a9b4db19c03b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10666",
       "triggerID" : "2de4df9e16f88a4813d404ba2111a9b4db19c03b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c192b29f176c4d861360fd9c70728a57b8ef2926",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10733",
       "triggerID" : "c192b29f176c4d861360fd9c70728a57b8ef2926",
       "triggerType" : "PUSH"
     }, {
       "hash" : "60c68b0bd40a9a681f2865426e9b5bd2152e9931",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10761",
       "triggerID" : "60c68b0bd40a9a681f2865426e9b5bd2152e9931",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * c192b29f176c4d861360fd9c70728a57b8ef2926 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10733) 
   * 60c68b0bd40a9a681f2865426e9b5bd2152e9931 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10761) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] boneanxs commented on a diff in pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6141:
URL: https://github.com/apache/hudi/pull/6141#discussion_r930923425


##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestIncrementalReadWithFullTableScan.scala:
##########
@@ -0,0 +1,191 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.functional
+
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.model.HoodieTableType
+import org.apache.hudi.{DataSourceReadOptions, DataSourceWriteOptions}
+import org.apache.hudi.common.table.HoodieTableMetaClient
+import org.apache.hudi.common.table.timeline.HoodieInstant
+import org.apache.hudi.common.testutils.RawTripTestPayload.recordsToStrings
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.testutils.HoodieClientTestBase
+import org.apache.log4j.LogManager
+import org.apache.spark.SparkException
+import org.apache.spark.sql.{AnalysisException, SaveMode, SparkSession}
+import org.junit.jupiter.api.Assertions.{assertEquals, assertThrows, assertTrue}
+import org.junit.jupiter.api.{AfterEach, BeforeEach}
+import org.junit.jupiter.api.function.Executable
+import org.junit.jupiter.params.ParameterizedTest
+import org.junit.jupiter.params.provider.EnumSource
+
+import scala.collection.JavaConversions.asScalaBuffer
+
+class TestIncrementalReadWithFullTableScan extends HoodieClientTestBase {
+
+  var spark: SparkSession = null
+  private val log = LogManager.getLogger(classOf[TestIncrementalReadWithFullTableScan])
+  val commonOpts = Map(
+    "hoodie.insert.shuffle.parallelism" -> "4",
+    "hoodie.upsert.shuffle.parallelism" -> "4",
+    DataSourceWriteOptions.RECORDKEY_FIELD.key -> "_row_key",
+    DataSourceWriteOptions.PARTITIONPATH_FIELD.key -> "partition",
+    DataSourceWriteOptions.PRECOMBINE_FIELD.key -> "timestamp",
+    HoodieWriteConfig.TBL_NAME.key -> "hoodie_test",
+    HoodieMetadataConfig.COMPACT_NUM_DELTA_COMMITS.key -> "1"
+  )
+
+
+  val verificationCol: String = "driver"
+  val updatedVerificationVal: String = "driver_update"
+
+  @BeforeEach override def setUp() {
+    setTableName("hoodie_test")
+    initPath()
+    initSparkContexts()
+    spark = sqlContext.sparkSession
+    initTestDataGenerator()
+    initFileSystem()
+  }
+
+  @AfterEach override def tearDown() = {
+    cleanupSparkContexts()
+    cleanupTestDataGenerator()
+    cleanupFileSystem()
+  }
+
+  @ParameterizedTest
+  @EnumSource(value = classOf[HoodieTableType])
+  def testFailEarlyForIncrViewQueryForNonExistingFiles(tableType: HoodieTableType): Unit = {
+    // Create 10 commits
+    for (i <- 1 to 10) {
+      val records = recordsToStrings(dataGen.generateInserts("%05d".format(i), 100)).toList
+      val inputDF = spark.read.json(spark.sparkContext.parallelize(records, 2))
+      inputDF.write.format("org.apache.hudi")
+        .options(commonOpts)
+        .option(DataSourceWriteOptions.TABLE_TYPE.key, tableType.name())
+        .option("hoodie.cleaner.commits.retained", "3")
+        .option("hoodie.keep.min.commits", "4")
+        .option("hoodie.keep.max.commits", "5")
+        .option(DataSourceWriteOptions.OPERATION.key(), DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+        .mode(SaveMode.Append)
+        .save(basePath)
+    }
+
+    val hoodieMetaClient = HoodieTableMetaClient.builder().setConf(spark.sparkContext.hadoopConfiguration).setBasePath(basePath).setLoadActiveTimelineOnLoad(true).build()
+    /**
+     * State of timeline after 10 commits
+     * +------------------+--------------------------------------+
+     * |     Archived     |            Active Timeline           |
+     * +------------------+--------------+-----------------------+
+     * | C0   C1   C2  C3 |    C4   C5   |   C6    C7   C8   C9  |
+     * +------------------+--------------+-----------------------+
+     * |          Data cleaned           |  Data exists in table |
+     * +---------------------------------+-----------------------+
+     */
+
+    val completedCommits = hoodieMetaClient.getCommitsTimeline.filterCompletedInstants() // C4 to C9
+    // C4 and C5 are still on the active timeline, but their data files have already been cleaned
+    var startTs = completedCommits.nthInstant(0).get().getTimestamp //C4
+    var endTs = completedCommits.nthInstant(1).get().getTimestamp //C5
+
+    // Calling without the fallback should fail with "Path does not exist"
+    var hoodieIncViewDF = spark.read.format("org.apache.hudi")
+      .option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
+      .option(DataSourceReadOptions.BEGIN_INSTANTTIME.key(), startTs)
+      .option(DataSourceReadOptions.END_INSTANTTIME.key(), endTs)
+      .load(basePath)
+
+    val msg = "Should fail with Path does not exist"
+    tableType match {
+      case HoodieTableType.COPY_ON_WRITE =>
+        assertThrows(classOf[AnalysisException], new Executable {
+          override def execute(): Unit = {
+            hoodieIncViewDF.count()
+          }
+        }, msg)
+      case HoodieTableType.MERGE_ON_READ =>
+        val exp = assertThrows(classOf[SparkException], new Executable {
+          override def execute(): Unit = {
+            hoodieIncViewDF.count()
+          }
+        }, msg)
+        assertTrue(exp.getMessage.contains("FileNotFoundException"))
+    }
+
+
+    //Should work with fallback enabled
+    hoodieIncViewDF = spark.read.format("org.apache.hudi")
+      .option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
+      .option(DataSourceReadOptions.BEGIN_INSTANTTIME.key(), startTs)
+      .option(DataSourceReadOptions.END_INSTANTTIME.key(), endTs)
+      .option(DataSourceReadOptions.INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN_FOR_NON_EXISTING_FILES.key(), "true")
+      .load(basePath)
+    assertEquals(100, hoodieIncViewDF.count())
+
+    //Test out for archived commits
+    val archivedInstants = hoodieMetaClient.getArchivedTimeline.filterCompletedInstants().getInstants.distinct().toArray
+    startTs = archivedInstants(0).asInstanceOf[HoodieInstant].getTimestamp //C0
+    endTs = completedCommits.nthInstant(1).get().getTimestamp //C5
+
+    // Calling without the fallback should fail with "Path does not exist"
+    hoodieIncViewDF = spark.read.format("org.apache.hudi")
+      .option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
+      .option(DataSourceReadOptions.BEGIN_INSTANTTIME.key(), startTs)
+      .option(DataSourceReadOptions.END_INSTANTTIME.key(), endTs)
+      .load(basePath)
+
+    tableType match {
+      case HoodieTableType.COPY_ON_WRITE =>
+        assertThrows(classOf[AnalysisException], new Executable {
+          override def execute(): Unit = {
+            hoodieIncViewDF.count()
+          }
+        }, msg)
+      case HoodieTableType.MERGE_ON_READ =>
+        val exp = assertThrows(classOf[SparkException], new Executable {
+          override def execute(): Unit = {
+            hoodieIncViewDF.count()
+          }
+        }, msg)
+        assertTrue(exp.getMessage.contains("FileNotFoundException"))
+    }
+
+    //Should work with fallback enabled
+    hoodieIncViewDF = spark.read.format("org.apache.hudi")
+      .option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
+      .option(DataSourceReadOptions.BEGIN_INSTANTTIME.key(), startTs)
+      .option(DataSourceReadOptions.END_INSTANTTIME.key(), endTs)
+      .option(DataSourceReadOptions.INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN_FOR_NON_EXISTING_FILES.key(), "true")
+      .load(basePath)
+    assertEquals(500, hoodieIncViewDF.count())

Review Comment:
   Sure, I will cover these scenarios in the coming days.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1193728726

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10060",
       "triggerID" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10081",
       "triggerID" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9920828682bf32a18f0bc5455d113afc71c09820",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10309",
       "triggerID" : "9920828682bf32a18f0bc5455d113afc71c09820",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 3d4947ef1741947e8dcd5e5125837e451dda6049 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10081) 
   * 9920828682bf32a18f0bc5455d113afc71c09820 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10309) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1193723820

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10060",
       "triggerID" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10081",
       "triggerID" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9920828682bf32a18f0bc5455d113afc71c09820",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "9920828682bf32a18f0bc5455d113afc71c09820",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 3d4947ef1741947e8dcd5e5125837e451dda6049 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10081) 
   * 9920828682bf32a18f0bc5455d113afc71c09820 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1207764503

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10060",
       "triggerID" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10081",
       "triggerID" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9920828682bf32a18f0bc5455d113afc71c09820",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10309",
       "triggerID" : "9920828682bf32a18f0bc5455d113afc71c09820",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ece3f0c7a06af08feda421f183c984fd75ef9526",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10421",
       "triggerID" : "ece3f0c7a06af08feda421f183c984fd75ef9526",
       "triggerType" : "PUSH"
     }, {
       "hash" : "23f96b3ecc8812ffae7f9e692e883cdabba03eb0",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10541",
       "triggerID" : "23f96b3ecc8812ffae7f9e692e883cdabba03eb0",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2a493fcafb42e21cbfcae3787ab30853319f4bf3",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10596",
       "triggerID" : "2a493fcafb42e21cbfcae3787ab30853319f4bf3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2de4df9e16f88a4813d404ba2111a9b4db19c03b",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10666",
       "triggerID" : "2de4df9e16f88a4813d404ba2111a9b4db19c03b",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 2a493fcafb42e21cbfcae3787ab30853319f4bf3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10596) 
   * 2de4df9e16f88a4813d404ba2111a9b4db19c03b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10666) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1203546981

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10060",
       "triggerID" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10081",
       "triggerID" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9920828682bf32a18f0bc5455d113afc71c09820",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10309",
       "triggerID" : "9920828682bf32a18f0bc5455d113afc71c09820",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ece3f0c7a06af08feda421f183c984fd75ef9526",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10421",
       "triggerID" : "ece3f0c7a06af08feda421f183c984fd75ef9526",
       "triggerType" : "PUSH"
     }, {
       "hash" : "23f96b3ecc8812ffae7f9e692e883cdabba03eb0",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "23f96b3ecc8812ffae7f9e692e883cdabba03eb0",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ece3f0c7a06af08feda421f183c984fd75ef9526 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10421) 
   * 23f96b3ecc8812ffae7f9e692e883cdabba03eb0 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1206029481

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10060",
       "triggerID" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10081",
       "triggerID" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9920828682bf32a18f0bc5455d113afc71c09820",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10309",
       "triggerID" : "9920828682bf32a18f0bc5455d113afc71c09820",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ece3f0c7a06af08feda421f183c984fd75ef9526",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10421",
       "triggerID" : "ece3f0c7a06af08feda421f183c984fd75ef9526",
       "triggerType" : "PUSH"
     }, {
       "hash" : "23f96b3ecc8812ffae7f9e692e883cdabba03eb0",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10541",
       "triggerID" : "23f96b3ecc8812ffae7f9e692e883cdabba03eb0",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2a493fcafb42e21cbfcae3787ab30853319f4bf3",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10596",
       "triggerID" : "2a493fcafb42e21cbfcae3787ab30853319f4bf3",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 23f96b3ecc8812ffae7f9e692e883cdabba03eb0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10541) 
   * 2a493fcafb42e21cbfcae3787ab30853319f4bf3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10596) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] boneanxs commented on a diff in pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6141:
URL: https://github.com/apache/hudi/pull/6141#discussion_r932028449


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadIncrementalRelation.scala:
##########
@@ -124,14 +128,48 @@ trait HoodieIncrementalRelationTrait extends HoodieBaseRelation {
   // Validate this Incremental implementation is properly configured
   validate()
 
-  protected lazy val includedCommits: immutable.Seq[HoodieInstant] = timeline.getInstants.iterator().asScala.toList
+  protected def startTimestamp: String = optParams(DataSourceReadOptions.BEGIN_INSTANTTIME.key)
+  protected def endTimestamp: String = optParams.getOrElse(DataSourceReadOptions.END_INSTANTTIME.key, super.timeline.lastInstant().get.getTimestamp)
+
+  protected def startOutOfRange: Boolean = super.timeline.isBeforeTimelineStarts(startTimestamp)
+  protected def endOutOfRange: Boolean = super.timeline.isBeforeTimelineStarts(endTimestamp)
+
+  // Fall back to full table scan if any of the following conditions is met:
+  //   1. the start commit is archived
+  //   2. the end commit is archived
+  //   3. some files referenced in the commit metadata have been deleted
+  protected lazy val fullTableScan: Boolean = {
+    val fallbackToFullTableScan = optParams.getOrElse(DataSourceReadOptions.INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN_FOR_NON_EXISTING_FILES.key,
+      DataSourceReadOptions.INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN_FOR_NON_EXISTING_FILES.defaultValue).toBoolean
+
+    fallbackToFullTableScan && (startOutOfRange || endOutOfRange || affectedFilesInCommits.exists(fileStatus => !metaClient.getFs.exists(fileStatus.getPath)))
+  }
+
+  protected lazy val includedCommits: immutable.Seq[HoodieInstant] = {
+    if (!endOutOfRange) {
+      // If the endTimestamp commit is not archived, only keep instants
+      // up to and including endTimestamp.
+      super.timeline.findInstantsInRange(startTimestamp, endTimestamp).getInstants.iterator().asScala.toList
+    } else {
+      super.timeline.getInstants.iterator().asScala.toList
+    }

Review Comment:
   If the end instant is archived, we get all the instants from the timeline; this is used to pick the latest commit, matching the behavior on the Flink side:
   
   ```java
   // Step3: decides the read end commit
       final String endInstant = fullTableScan
           ? commitTimeline.lastInstant().get().getTimestamp()
           : instants.get(instants.size() - 1).getTimestamp();
   ```
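   
   On the Spark side, the counterpart would be picking the end boundary the same way; a hedged sketch under the same assumption (hypothetical, not the code in this hunk):
   
   ```scala
   // When falling back to a full table scan, read up to the latest commit on
   // the active timeline instead of the (archived) requested end instant
   val endInstant = if (fullTableScan) {
     super.timeline.lastInstant().get.getTimestamp
   } else {
     includedCommits.last.getTimestamp
   }
   ```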



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or achived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1198098023

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10060",
       "triggerID" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10081",
       "triggerID" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9920828682bf32a18f0bc5455d113afc71c09820",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10309",
       "triggerID" : "9920828682bf32a18f0bc5455d113afc71c09820",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ece3f0c7a06af08feda421f183c984fd75ef9526",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10421",
       "triggerID" : "ece3f0c7a06af08feda421f183c984fd75ef9526",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ece3f0c7a06af08feda421f183c984fd75ef9526 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10421) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or achived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1211618489

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10060",
       "triggerID" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10081",
       "triggerID" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9920828682bf32a18f0bc5455d113afc71c09820",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10309",
       "triggerID" : "9920828682bf32a18f0bc5455d113afc71c09820",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ece3f0c7a06af08feda421f183c984fd75ef9526",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10421",
       "triggerID" : "ece3f0c7a06af08feda421f183c984fd75ef9526",
       "triggerType" : "PUSH"
     }, {
       "hash" : "23f96b3ecc8812ffae7f9e692e883cdabba03eb0",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10541",
       "triggerID" : "23f96b3ecc8812ffae7f9e692e883cdabba03eb0",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2a493fcafb42e21cbfcae3787ab30853319f4bf3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10596",
       "triggerID" : "2a493fcafb42e21cbfcae3787ab30853319f4bf3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2de4df9e16f88a4813d404ba2111a9b4db19c03b",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10666",
       "triggerID" : "2de4df9e16f88a4813d404ba2111a9b4db19c03b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c192b29f176c4d861360fd9c70728a57b8ef2926",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "c192b29f176c4d861360fd9c70728a57b8ef2926",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 2de4df9e16f88a4813d404ba2111a9b4db19c03b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10666) 
   * c192b29f176c4d861360fd9c70728a57b8ef2926 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or achived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1211622555

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10060",
       "triggerID" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10081",
       "triggerID" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9920828682bf32a18f0bc5455d113afc71c09820",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10309",
       "triggerID" : "9920828682bf32a18f0bc5455d113afc71c09820",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ece3f0c7a06af08feda421f183c984fd75ef9526",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10421",
       "triggerID" : "ece3f0c7a06af08feda421f183c984fd75ef9526",
       "triggerType" : "PUSH"
     }, {
       "hash" : "23f96b3ecc8812ffae7f9e692e883cdabba03eb0",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10541",
       "triggerID" : "23f96b3ecc8812ffae7f9e692e883cdabba03eb0",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2a493fcafb42e21cbfcae3787ab30853319f4bf3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10596",
       "triggerID" : "2a493fcafb42e21cbfcae3787ab30853319f4bf3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2de4df9e16f88a4813d404ba2111a9b4db19c03b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10666",
       "triggerID" : "2de4df9e16f88a4813d404ba2111a9b4db19c03b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c192b29f176c4d861360fd9c70728a57b8ef2926",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10733",
       "triggerID" : "c192b29f176c4d861360fd9c70728a57b8ef2926",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * c192b29f176c4d861360fd9c70728a57b8ef2926 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10733) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] boneanxs commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or achived for MOR table

Posted by GitBox <gi...@apache.org>.
boneanxs commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1189732427

   Another thought: should we also let `IncrementalRelation` extend `HoodieBaseRelation` and `HoodieIncrementalRelationTrait`? There is a lot of common code across these classes.
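
   A rough sketch of what that refactor could look like; the constructor parameters and member layout here are hypothetical, only the trait and class names come from the discussion above:

   ```scala
   // Hypothetical sketch only: reuse the shared incremental plumbing
   // (startTimestamp, endTimestamp, includedCommits, fullTableScan)
   // from HoodieIncrementalRelationTrait instead of re-implementing
   // it in the COW incremental relation.
   class IncrementalRelation(val sqlContext: SQLContext,
                             optParams: Map[String, String],
                             userSchema: Option[StructType],
                             metaClient: HoodieTableMetaClient)
     extends HoodieBaseRelation(sqlContext, metaClient, optParams, userSchema)
       with HoodieIncrementalRelationTrait {
     // COW-specific file-split collection and RDD composition would
     // stay here; the timeline-range and fallback checks would come
     // from the trait.
   }
   ```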


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or achived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1189745533

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10060",
       "triggerID" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ce0b46f4460ee1f9c80cdfdef9824b5c5711135c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10060) 
   * 3d4947ef1741947e8dcd5e5125837e451dda6049 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or achived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1189747619

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10060",
       "triggerID" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10081",
       "triggerID" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ce0b46f4460ee1f9c80cdfdef9824b5c5711135c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10060) 
   * 3d4947ef1741947e8dcd5e5125837e451dda6049 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10081) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] boneanxs commented on a diff in pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or achived for MOR table

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6141:
URL: https://github.com/apache/hudi/pull/6141#discussion_r925328888


##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala:
##########
@@ -857,89 +857,6 @@ class TestCOWDataSource extends HoodieClientTestBase {
     assertEquals(numRecords - numRecordsToDelete, snapshotDF2.count())
   }
 
-  @Test def testFailEarlyForIncrViewQueryForNonExistingFiles(): Unit = {
-    // Create 10 commits
-    for (i <- 1 to 10) {
-      val records = recordsToStrings(dataGen.generateInserts("%05d".format(i), 100)).toList
-      val inputDF = spark.read.json(spark.sparkContext.parallelize(records, 2))
-      inputDF.write.format("org.apache.hudi")
-        .options(commonOpts)
-        .option("hoodie.cleaner.commits.retained", "3")
-        .option("hoodie.keep.min.commits", "4")
-        .option("hoodie.keep.max.commits", "5")
-        .option(DataSourceWriteOptions.OPERATION.key(), DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
-        .mode(SaveMode.Append)
-        .save(basePath)
-    }
-

Review Comment:
   Since we don't need to duplicate the same data-generation code in both `TestCOWDataSource` and `TestMORDataSource`, all of it is moved to the new test `TestIncrementalReadWithFullTableScan`. A sketch of the shared setup is below.
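
   A sketch of the shared setup, parameterized by table type; the helper name and the table-type option are illustrative, based on the removed test body above rather than the actual new test:

   ```scala
   // Illustrative sketch: generate the commits once for either table
   // type, so TestCOWDataSource and TestMORDataSource no longer carry
   // duplicate data-preparation code. `commonOpts`, `dataGen`,
   // `recordsToStrings`, and `basePath` are assumed from the existing
   // test harness.
   def writeCommits(tableType: String, numCommits: Int): Unit = {
     for (i <- 1 to numCommits) {
       val records = recordsToStrings(dataGen.generateInserts("%05d".format(i), 100)).toList
       val inputDF = spark.read.json(spark.sparkContext.parallelize(records, 2))
       inputDF.write.format("org.apache.hudi")
         .options(commonOpts)
         .option(DataSourceWriteOptions.TABLE_TYPE.key(), tableType)
         .option("hoodie.cleaner.commits.retained", "3")
         .option("hoodie.keep.min.commits", "4")
         .option("hoodie.keep.max.commits", "5")
         .mode(SaveMode.Append)
         .save(basePath)
     }
   }
   ```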



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or achived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1214622522

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10060",
       "triggerID" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10081",
       "triggerID" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9920828682bf32a18f0bc5455d113afc71c09820",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10309",
       "triggerID" : "9920828682bf32a18f0bc5455d113afc71c09820",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ece3f0c7a06af08feda421f183c984fd75ef9526",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10421",
       "triggerID" : "ece3f0c7a06af08feda421f183c984fd75ef9526",
       "triggerType" : "PUSH"
     }, {
       "hash" : "23f96b3ecc8812ffae7f9e692e883cdabba03eb0",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10541",
       "triggerID" : "23f96b3ecc8812ffae7f9e692e883cdabba03eb0",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2a493fcafb42e21cbfcae3787ab30853319f4bf3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10596",
       "triggerID" : "2a493fcafb42e21cbfcae3787ab30853319f4bf3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2de4df9e16f88a4813d404ba2111a9b4db19c03b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10666",
       "triggerID" : "2de4df9e16f88a4813d404ba2111a9b4db19c03b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c192b29f176c4d861360fd9c70728a57b8ef2926",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10733",
       "triggerID" : "c192b29f176c4d861360fd9c70728a57b8ef2926",
       "triggerType" : "PUSH"
     }, {
       "hash" : "60c68b0bd40a9a681f2865426e9b5bd2152e9931",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10761",
       "triggerID" : "60c68b0bd40a9a681f2865426e9b5bd2152e9931",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 60c68b0bd40a9a681f2865426e9b5bd2152e9931 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10761) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] danny0405 merged pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or achived for MOR table

Posted by GitBox <gi...@apache.org>.
danny0405 merged PR #6141:
URL: https://github.com/apache/hudi/pull/6141


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or achived for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1207956950

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10060",
       "triggerID" : "ce0b46f4460ee1f9c80cdfdef9824b5c5711135c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10081",
       "triggerID" : "3d4947ef1741947e8dcd5e5125837e451dda6049",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9920828682bf32a18f0bc5455d113afc71c09820",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10309",
       "triggerID" : "9920828682bf32a18f0bc5455d113afc71c09820",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ece3f0c7a06af08feda421f183c984fd75ef9526",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10421",
       "triggerID" : "ece3f0c7a06af08feda421f183c984fd75ef9526",
       "triggerType" : "PUSH"
     }, {
       "hash" : "23f96b3ecc8812ffae7f9e692e883cdabba03eb0",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10541",
       "triggerID" : "23f96b3ecc8812ffae7f9e692e883cdabba03eb0",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2a493fcafb42e21cbfcae3787ab30853319f4bf3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10596",
       "triggerID" : "2a493fcafb42e21cbfcae3787ab30853319f4bf3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2de4df9e16f88a4813d404ba2111a9b4db19c03b",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10666",
       "triggerID" : "2de4df9e16f88a4813d404ba2111a9b4db19c03b",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 2de4df9e16f88a4813d404ba2111a9b4db19c03b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10666) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] boneanxs commented on a diff in pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or achived for MOR table

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6141:
URL: https://github.com/apache/hudi/pull/6141#discussion_r936274199


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadIncrementalRelation.scala:
##########
@@ -124,14 +128,48 @@ trait HoodieIncrementalRelationTrait extends HoodieBaseRelation {
   // Validate this Incremental implementation is properly configured
   validate()
 
-  protected lazy val includedCommits: immutable.Seq[HoodieInstant] = timeline.getInstants.iterator().asScala.toList
+  protected def startTimestamp: String = optParams(DataSourceReadOptions.BEGIN_INSTANTTIME.key)
+  protected def endTimestamp: String = optParams.getOrElse(DataSourceReadOptions.END_INSTANTTIME.key, super.timeline.lastInstant().get.getTimestamp)
+
+  protected def startOutOfRange: Boolean = super.timeline.isBeforeTimelineStarts(startTimestamp)
+  protected def endOutOfRange: Boolean = super.timeline.isBeforeTimelineStarts(endTimestamp)
+
+  // Fall back to a full table scan if any of the following conditions is met:
+  //   1. the start commit is archived
+  //   2. the end commit is archived
+  //   3. files referenced in the commit metadata have been deleted
+  protected lazy val fullTableScan: Boolean = {
+    val fallbackToFullTableScan = optParams.getOrElse(DataSourceReadOptions.INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN_FOR_NON_EXISTING_FILES.key,
+      DataSourceReadOptions.INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN_FOR_NON_EXISTING_FILES.defaultValue).toBoolean
+
+    fallbackToFullTableScan && (startOutOfRange || endOutOfRange || affectedFilesInCommits.exists(fileStatus => !metaClient.getFs.exists(fileStatus.getPath)))
+  }
+
+  protected lazy val includedCommits: immutable.Seq[HoodieInstant] = {
+    if (!endOutOfRange) {
+      // If the endTimestamp commit is not archived, filter to the
+      // instants up to endTimestamp.
+      super.timeline.findInstantsInRange(startTimestamp, endTimestamp).getInstants.iterator().asScala.toList
+    } else {
+      super.timeline.getInstants.iterator().asScala.toList
+    }

Review Comment:
   Yeah, there could be a situation where the start instant is not archived while the end instant is archived; in that case we should return empty commits. A sketch of that guard is below.
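
   A minimal sketch of how `includedCommits` could guard that case, assuming the fields defined in the diff above:

   ```scala
   // Sketch under the assumptions of the diff above: when the end
   // commit is archived but the start commit is not, the requested
   // range cannot be served from the active timeline, so return no
   // commits (the full-table-scan fallback covers the query instead).
   protected lazy val includedCommits: immutable.Seq[HoodieInstant] = {
     if (!endOutOfRange) {
       // End commit is still on the active timeline: filter the range.
       super.timeline.findInstantsInRange(startTimestamp, endTimestamp)
         .getInstants.iterator().asScala.toList
     } else if (startOutOfRange) {
       // Both ends archived: take the whole active timeline so the
       // latest commit can be located, mirroring the Flink behavior.
       super.timeline.getInstants.iterator().asScala.toList
     } else {
       // Start is active but the end is archived: inconsistent range.
       List.empty
     }
   }
   ```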



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org