Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/10/11 07:31:45 UTC

[GitHub] [hudi] boneanxs opened a new pull request, #6921: [HUDI-5008] Avoid unset HoodieROTablePathFilter in IncrementalRelation

boneanxs opened a new pull request, #6921:
URL: https://github.com/apache/hudi/pull/6921

   ### Change Logs
   
   If a user creates an `IncrementalRelation` while joining against another existing Hive Hudi table, the path filter is unset inside `IncrementalRelation`, so all files under the Hive Hudi table are selected.
   
   `HoodieROTablePathFilter` can now accept `as.of.instant` for time travel, so instead we pass `as.of.instant` to the DataFrame (without changing the Spark Hadoop configuration globally) to avoid this issue.
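
The scoping problem described above can be illustrated with a small model. This is a sketch in plain Scala, not Hudi or Spark code: the object and method names are hypothetical, and a mutable map stands in for the shared Hadoop configuration. Only the two keys (`mapreduce.input.pathFilter.class` and `as.of.instant`) and the filter class name come from this PR.

```scala
import scala.collection.mutable

// Simplified model of the issue; every name in this object is hypothetical.
object PathFilterScopeSketch {
  val PathFilterKey = "mapreduce.input.pathFilter.class"
  val FilterClass = "org.apache.hudi.hadoop.HoodieROTablePathFilter"
  val AsOfInstantKey = "as.of.instant"

  // Old behaviour (modelled): the incremental read removes the filter from
  // the configuration that every other read in the session shares.
  def incrementalReadUnsettingFilter(shared: mutable.Map[String, String]): Unit =
    shared.remove(PathFilterKey)

  // New behaviour (modelled): the shared configuration is not touched; the
  // instant travels with this one read as a per-read option.
  def incrementalReadOptions(instant: String): Map[String, String] =
    Map(AsOfInstantKey -> instant)

  // A read of another table is only filtered if the key is still set.
  def otherTableIsFiltered(shared: mutable.Map[String, String]): Boolean =
    shared.get(PathFilterKey).contains(FilterClass)
}
```

In this model, calling `incrementalReadUnsettingFilter` makes `otherTableIsFiltered` return `false` for every later read, while `incrementalReadOptions` leaves the shared map unchanged.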
   
   ### Impact
   
   **Risk level: low**
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
     ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make
     changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6921: [HUDI-5008] Avoid unset HoodieROTablePathFilter in IncrementalRelation

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #6921:
URL: https://github.com/apache/hudi/pull/6921#discussion_r1020557283


##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSourceStorage.scala:
##########
@@ -167,12 +167,15 @@ class TestCOWDataSourceStorage extends SparkClientFunctionalTestHarness {
     // Read Incremental Query
     // we have 2 commits, try pulling the first commit (which is not the latest)
     val firstCommit = HoodieDataSourceHelpers.listCommitsSince(fs, basePath, "000").get(0)
+    // Setting HoodieROTablePathFilter here to test whether pathFilter can filter out correctly for IncrementalRelation
+    spark.sparkContext.hadoopConfiguration.set("mapreduce.input.pathFilter.class", "org.apache.hudi.hadoop.HoodieROTablePathFilter")

Review Comment:
   Why are we setting a filter from the test? Isn't it supposed to be set by the Relation?





[GitHub] [hudi] boneanxs commented on a diff in pull request #6921: [HUDI-5008] Avoid unset HoodieROTablePathFilter in IncrementalRelation

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6921:
URL: https://github.com/apache/hudi/pull/6921#discussion_r992188630


##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSourceStorage.scala:
##########
@@ -167,12 +167,15 @@ class TestCOWDataSourceStorage extends SparkClientFunctionalTestHarness {
     // Read Incremental Query
     // we have 2 commits, try pulling the first commit (which is not the latest)
     val firstCommit = HoodieDataSourceHelpers.listCommitsSince(fs, basePath, "000").get(0)
+    // Setting HoodieROTablePathFilter here to test whether pathFilter can filter out correctly for IncrementalRelation
+    spark.sparkContext.hadoopConfiguration.set("mapreduce.input.pathFilter.class", "org.apache.hudi.hadoop.HoodieROTablePathFilter")

Review Comment:
   This fixes the test added in https://github.com/apache/hudi/pull/458





[GitHub] [hudi] boneanxs commented on a diff in pull request #6921: [HUDI-5008] Avoid unset HoodieROTablePathFilter in IncrementalRelation

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6921:
URL: https://github.com/apache/hudi/pull/6921#discussion_r1023483789


##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSourceStorage.scala:
##########
@@ -167,12 +167,15 @@ class TestCOWDataSourceStorage extends SparkClientFunctionalTestHarness {
     // Read Incremental Query
     // we have 2 commits, try pulling the first commit (which is not the latest)
     val firstCommit = HoodieDataSourceHelpers.listCommitsSince(fs, basePath, "000").get(0)
+    // Setting HoodieROTablePathFilter here to test whether pathFilter can filter out correctly for IncrementalRelation
+    spark.sparkContext.hadoopConfiguration.set("mapreduce.input.pathFilter.class", "org.apache.hudi.hadoop.HoodieROTablePathFilter")

Review Comment:
   Yes, the test `org.apache.hudi.functional.TestCOWDataSource#testReadPathsOnCopyOnWriteTable` (https://github.com/apache/hudi/blob/7e7b3a866b549021c5a7ad9f89f0da90aff7da68/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala#L341) actually covers this. `toHadoopRelation` adds the `HoodieROTablePathFilter`, and this test also contains old-version files.





[GitHub] [hudi] boneanxs commented on pull request #6921: [HUDI-5008] Avoid unset HoodieROTablePathFilter in IncrementalRelation

Posted by GitBox <gi...@apache.org>.
boneanxs commented on PR #6921:
URL: https://github.com/apache/hudi/pull/6921#issuecomment-1291656413

   Gentle ping @alexeykudinkin




[GitHub] [hudi] boneanxs commented on pull request #6921: [HUDI-5008] Avoid unset HoodieROTablePathFilter in IncrementalRelation

Posted by GitBox <gi...@apache.org>.
boneanxs commented on PR #6921:
URL: https://github.com/apache/hudi/pull/6921#issuecomment-1277295327

   Hi @alexeykudinkin, could you please help review this PR?
   




[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6921: [HUDI-5008] Avoid unset HoodieROTablePathFilter in IncrementalRelation

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #6921:
URL: https://github.com/apache/hudi/pull/6921#discussion_r1021987922


##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSourceStorage.scala:
##########
@@ -167,12 +167,15 @@ class TestCOWDataSourceStorage extends SparkClientFunctionalTestHarness {
     // Read Incremental Query
     // we have 2 commits, try pulling the first commit (which is not the latest)
     val firstCommit = HoodieDataSourceHelpers.listCommitsSince(fs, basePath, "000").get(0)
+    // Setting HoodieROTablePathFilter here to test whether pathFilter can filter out correctly for IncrementalRelation
+    spark.sparkContext.hadoopConfiguration.set("mapreduce.input.pathFilter.class", "org.apache.hudi.hadoop.HoodieROTablePathFilter")

Review Comment:
   I see now. Thanks for clarifying!
   Shouldn't we write a test that would set `HoodieROTablePathFilter` (by using globbing, for example)?





[GitHub] [hudi] hudi-bot commented on pull request #6921: [HUDI-5008] Avoid unset HoodieROTablePathFilter in IncrementalRelation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6921:
URL: https://github.com/apache/hudi/pull/6921#issuecomment-1274241183

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "bc10332052af54438fc19268405856b63bce34f7",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12126",
       "triggerID" : "bc10332052af54438fc19268405856b63bce34f7",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * bc10332052af54438fc19268405856b63bce34f7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12126) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6921: [HUDI-5008] Avoid unset HoodieROTablePathFilter in IncrementalRelation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6921:
URL: https://github.com/apache/hudi/pull/6921#issuecomment-1274732257

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "bc10332052af54438fc19268405856b63bce34f7",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12126",
       "triggerID" : "bc10332052af54438fc19268405856b63bce34f7",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * bc10332052af54438fc19268405856b63bce34f7 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12126) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6921: [HUDI-5008] Avoid unset HoodieROTablePathFilter in IncrementalRelation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6921:
URL: https://github.com/apache/hudi/pull/6921#issuecomment-1274233443

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "bc10332052af54438fc19268405856b63bce34f7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "bc10332052af54438fc19268405856b63bce34f7",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * bc10332052af54438fc19268405856b63bce34f7 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] alexeykudinkin merged pull request #6921: [HUDI-5008] Avoid unset HoodieROTablePathFilter in IncrementalRelation

Posted by GitBox <gi...@apache.org>.
alexeykudinkin merged PR #6921:
URL: https://github.com/apache/hudi/pull/6921




[GitHub] [hudi] boneanxs commented on a diff in pull request #6921: [HUDI-5008] Avoid unset HoodieROTablePathFilter in IncrementalRelation

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6921:
URL: https://github.com/apache/hudi/pull/6921#discussion_r1020707598


##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSourceStorage.scala:
##########
@@ -167,12 +167,15 @@ class TestCOWDataSourceStorage extends SparkClientFunctionalTestHarness {
     // Read Incremental Query
     // we have 2 commits, try pulling the first commit (which is not the latest)
     val firstCommit = HoodieDataSourceHelpers.listCommitsSince(fs, basePath, "000").get(0)
+    // Setting HoodieROTablePathFilter here to test whether pathFilter can filter out correctly for IncrementalRelation
+    spark.sparkContext.hadoopConfiguration.set("mapreduce.input.pathFilter.class", "org.apache.hudi.hadoop.HoodieROTablePathFilter")

Review Comment:
   I think the intent of the test is to check that, when `HoodieROTablePathFilter` is set, the incremental relation can still read the old data correctly. But the test doesn't work as expected, because `HoodieROTablePathFilter` is not set by default.
   
   You can see this by running the test without setting the path filter explicitly:
   <img width="1508" alt="Screen Shot 2022-11-12 at 14 59 41" src="https://user-images.githubusercontent.com/10115332/201461922-b546e09d-eeaa-44c0-acdb-01f89ac830ec.png">
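
To make the point about old data concrete, here is a rough sketch of why a latest-version-only path filter matters for an incremental read of an earlier commit. It assumes the filter keeps one base file per file group (the newest, or the newest at or before a given instant); that is a simplification for illustration, and `BaseFile` and `latestPerGroup` are hypothetical names, not Hudi APIs.

```scala
// Simplified model; names are hypothetical and do not mirror Hudi classes.
object LatestVersionFilterSketch {
  final case class BaseFile(fileGroup: String, commitTime: String)

  // Keep one file per file group: the newest overall, or the newest whose
  // commit time is at or before `asOfInstant` when an instant is supplied.
  def latestPerGroup(files: Seq[BaseFile], asOfInstant: Option[String] = None): Set[BaseFile] =
    files
      .filter(f => asOfInstant.forall(instant => f.commitTime.compareTo(instant) <= 0))
      .groupBy(_.fileGroup)
      .values
      .map(_.maxBy(_.commitTime))
      .toSet
}
```

With two versions of the same file group, the unrestricted filter in this model drops the file written by the first commit, which is exactly the file an incremental pull of the first commit needs; supplying the first commit as the instant keeps it.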
   


