Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/09/28 08:40:12 UTC

[GitHub] [hudi] codope opened a new pull request, #6817: [HUDI-4942] Fix RowSource schema provider

codope opened a new pull request, #6817:
URL: https://github.com/apache/hudi/pull/6817

   ### Change Logs
   
   The default value supplied by the user-specified schema provider is lost because RowSource sets a RowBasedSchemaProvider on the InputBatch. This PR fixes that by passing the user-specified schema provider to the InputBatch instead.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6817: [HUDI-4942] Fix RowSource schema provider

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6817:
URL: https://github.com/apache/hudi/pull/6817#issuecomment-1262945708

   ## CI report:
   
   * e1589ebfa7aea943040a85de3b93a4613b365d83 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11908) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


[GitHub] [hudi] codope closed pull request #6817: [HUDI-4942] Fix RowSource schema provider

Posted by GitBox <gi...@apache.org>.
codope closed pull request #6817: [HUDI-4942] Fix RowSource schema provider
URL: https://github.com/apache/hudi/pull/6817


[GitHub] [hudi] hudi-bot commented on pull request #6817: [HUDI-4942] Fix RowSource schema provider

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6817:
URL: https://github.com/apache/hudi/pull/6817#issuecomment-1261280560

   ## CI report:
   
   * faf303f57b6a1b5e554ec17f2373bbaf3d81ee1d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11835) 
   


[GitHub] [hudi] hudi-bot commented on pull request #6817: [HUDI-4942] Fix RowSource schema provider

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6817:
URL: https://github.com/apache/hudi/pull/6817#issuecomment-1260633933

   ## CI report:
   
   * faf303f57b6a1b5e554ec17f2373bbaf3d81ee1d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11835) 
   


[GitHub] [hudi] hudi-bot commented on pull request #6817: [HUDI-4942] Fix RowSource schema provider

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6817:
URL: https://github.com/apache/hudi/pull/6817#issuecomment-1262604571

   ## CI report:
   
   * faf303f57b6a1b5e554ec17f2373bbaf3d81ee1d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11835) 
   * e1589ebfa7aea943040a85de3b93a4613b365d83 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11908) 
   


[GitHub] [hudi] codope commented on pull request #6817: [HUDI-4942] Fix RowSource schema provider

Posted by GitBox <gi...@apache.org>.
codope commented on PR #6817:
URL: https://github.com/apache/hudi/pull/6817#issuecomment-1300025121

   Closing the PR. We still need to root-cause the issue; something more is happening here.


[GitHub] [hudi] codope commented on a diff in pull request #6817: [HUDI-4942] Fix RowSource schema provider

Posted by GitBox <gi...@apache.org>.
codope commented on code in PR #6817:
URL: https://github.com/apache/hudi/pull/6817#discussion_r983165496


##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/RowSource.java:
##########
@@ -41,6 +41,9 @@ public RowSource(TypedProperties props, JavaSparkContext sparkContext, SparkSess
   @Override
   protected final InputBatch<Dataset<Row>> fetchNewData(Option<String> lastCkptStr, long sourceLimit) {
     Pair<Option<Dataset<Row>>, String> res = fetchNextBatch(lastCkptStr, sourceLimit);
+    if (overriddenSchemaProvider != null) {
+      return new InputBatch<>(res.getKey(), res.getValue(), overriddenSchemaProvider);
+    }

Review Comment:
   That's a valid point. I wrote a unit test with an evolving schema, but it passes even without this change. I think we can hold off on landing this PR. Let me investigate more.



[GitHub] [hudi] hudi-bot commented on pull request #6817: [HUDI-4942] Fix RowSource schema provider

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6817:
URL: https://github.com/apache/hudi/pull/6817#issuecomment-1260626960

   ## CI report:
   
   * faf303f57b6a1b5e554ec17f2373bbaf3d81ee1d UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #6817: [HUDI-4942] Fix RowSource schema provider

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6817:
URL: https://github.com/apache/hudi/pull/6817#issuecomment-1262597501

   ## CI report:
   
   * faf303f57b6a1b5e554ec17f2373bbaf3d81ee1d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11835) 
   * e1589ebfa7aea943040a85de3b93a4613b365d83 UNKNOWN
   


[GitHub] [hudi] xushiyan commented on a diff in pull request #6817: [HUDI-4942] Fix RowSource schema provider

Posted by GitBox <gi...@apache.org>.
xushiyan commented on code in PR #6817:
URL: https://github.com/apache/hudi/pull/6817#discussion_r983086847


##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/RowSource.java:
##########
@@ -41,6 +41,9 @@ public RowSource(TypedProperties props, JavaSparkContext sparkContext, SparkSess
   @Override
   protected final InputBatch<Dataset<Row>> fetchNewData(Option<String> lastCkptStr, long sourceLimit) {
     Pair<Option<Dataset<Row>>, String> res = fetchNextBatch(lastCkptStr, sourceLimit);
+    if (overriddenSchemaProvider != null) {
+      return new InputBatch<>(res.getKey(), res.getValue(), overriddenSchemaProvider);
+    }

Review Comment:
   `org.apache.hudi.utilities.sources.Source#fetchNext` already checks for and uses `overriddenSchemaProvider`,
   and `fetchNewData()` is only called from `fetchNext()`. I think a misconfiguration caused the issue.
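
The call path described in this review comment can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual Hudi source: `SchemaProviderOverrideSketch`, `RowLikeSource`, `effectiveProvider`, and the simplified `SchemaProvider` and `InputBatch` classes are hypothetical stand-ins, and `java.util.Optional` replaces Hudi's `Option`. Assuming `fetchNext()` wraps `fetchNewData()` as the comment says, the override would already win without any change in `RowSource`.

```java
import java.util.Optional;

/** Simplified stand-ins for illustration only; these are NOT the real Hudi classes. */
final class SchemaProviderOverrideSketch {

  /** Hypothetical stand-in for a schema provider; only its name matters here. */
  static final class SchemaProvider {
    final String name;

    SchemaProvider(String name) {
      this.name = name;
    }
  }

  /** Hypothetical stand-in for InputBatch: payload, next checkpoint, schema provider. */
  static final class InputBatch<T> {
    final Optional<T> batch;
    final String checkpoint;
    final SchemaProvider schemaProvider;

    InputBatch(Optional<T> batch, String checkpoint, SchemaProvider schemaProvider) {
      this.batch = batch;
      this.checkpoint = checkpoint;
      this.schemaProvider = schemaProvider;
    }
  }

  /** Base source: fetchNext() wraps fetchNewData() and applies the override, if any. */
  abstract static class Source<T> {
    // null when the user did not specify a schema provider
    protected final SchemaProvider overriddenSchemaProvider;

    Source(SchemaProvider overriddenSchemaProvider) {
      this.overriddenSchemaProvider = overriddenSchemaProvider;
    }

    protected abstract InputBatch<T> fetchNewData(Optional<String> lastCkptStr, long sourceLimit);

    final InputBatch<T> fetchNext(Optional<String> lastCkptStr, long sourceLimit) {
      InputBatch<T> fetched = fetchNewData(lastCkptStr, sourceLimit);
      if (overriddenSchemaProvider == null) {
        return fetched; // no override: keep whatever the subclass attached
      }
      // override present: same data and checkpoint, but the user-specified provider
      return new InputBatch<>(fetched.batch, fetched.checkpoint, overriddenSchemaProvider);
    }
  }

  /** Row-style source that always attaches a row-derived provider in fetchNewData(). */
  static final class RowLikeSource extends Source<String> {
    RowLikeSource(SchemaProvider overridden) {
      super(overridden);
    }

    @Override
    protected InputBatch<String> fetchNewData(Optional<String> lastCkptStr, long sourceLimit) {
      return new InputBatch<>(Optional.of("rows"), "ckpt-1", new SchemaProvider("row-based"));
    }
  }

  /** Returns the name of the provider that a caller of fetchNext() would see. */
  static String effectiveProvider(SchemaProvider overridden) {
    return new RowLikeSource(overridden).fetchNext(Optional.empty(), 100L).schemaProvider.name;
  }

  public static void main(String[] args) {
    System.out.println(effectiveProvider(null)); // prints "row-based"
    System.out.println(effectiveProvider(new SchemaProvider("user-specified"))); // prints "user-specified"
  }
}
```

In this sketch the row-derived provider only survives when no override is configured, which is consistent with the suggestion that the reported behaviour came from configuration rather than from `RowSource` itself.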



[GitHub] [hudi] codope commented on pull request #6817: [HUDI-4942] Fix RowSource schema provider

Posted by GitBox <gi...@apache.org>.
codope commented on PR #6817:
URL: https://github.com/apache/hudi/pull/6817#issuecomment-1261122948

   @nsivabalan Can you please review this? I have yet to add a unit test, but I have tested it with my local Confluent Schema Registry setup. The main issue is that when a schema provider is overridden, RowSource does not take it into account. It simply derives the schema from the `Row`.


[GitHub] [hudi] nsivabalan commented on pull request #6817: [HUDI-4942] Fix RowSource schema provider

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on PR #6817:
URL: https://github.com/apache/hudi/pull/6817#issuecomment-1283595060

   @codope: What's the status of this PR? Do we still need it? If not, do we still need to root-cause the original issue?

