Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/10/26 08:32:49 UTC

[GitHub] [hudi] KnightChess opened a new pull request, #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

KnightChess opened a new pull request, #6824:
URL: https://github.com/apache/hudi/pull/6824

   ### Change Logs
   
   The write operation for an insert-only `merge into` (no matched action) is now consistent with the case that has a matched action: if the table has a preCombineField, the operation is upsert; otherwise it is insert.
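
   The rule above can be sketched roughly as follows. This is only an illustration under stated assumptions: `MergeIntoOperationSketch`, `chooseOperation`, and the `hasMatchedAction` flag are hypothetical names for this note, not the identifiers used in `MergeIntoHoodieTableCommand`.

   ```scala
   // Hypothetical sketch; the names here are illustrative and not Hudi's API.
   object MergeIntoOperationSketch {
     // Picks the write operation as described in the change log:
     // "upsert" when the table defines a preCombineField, "insert" otherwise.
     // hasMatchedAction is accepted but deliberately ignored, to show that the
     // presence of a matched action no longer changes the chosen operation.
     def chooseOperation(preCombineField: Option[String], hasMatchedAction: Boolean): String =
       if (preCombineField.exists(_.nonEmpty)) "upsert" else "insert"

     def main(args: Array[String]): Unit = {
       println(chooseOperation(Some("ts"), hasMatchedAction = false))
       println(chooseOperation(None, hasMatchedAction = true))
     }
   }
   ```

   With this shape, an insert-only statement and a statement with a matched action resolve to the same operation for the same table.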
   
   ### Impact
   
   Spark SQL `merge into`
   
   **Risk level: none | low | medium | high**
   
   low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
     ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make
     changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6824:
URL: https://github.com/apache/hudi/pull/6824#issuecomment-1293320620

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "3cec3f8cea64be28f341c1e8a9eaf60142d5e037",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "3cec3f8cea64be28f341c1e8a9eaf60142d5e037",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 3cec3f8cea64be28f341c1e8a9eaf60142d5e037 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] xushiyan commented on a diff in pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

Posted by GitBox <gi...@apache.org>.
xushiyan commented on code in PR #6824:
URL: https://github.com/apache/hudi/pull/6824#discussion_r1000114134


##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala:
##########
@@ -160,7 +167,7 @@ case class MergeIntoHoodieTableCommand(mergeInto: MergeIntoTable) extends Hoodie
 
       // column order changed after left anti join , we should keep column order of source dataframe
       val cols = removeMetaFields(sourceDF).columns
-      executeInsertOnly(insertSourceDF.select(cols.head, cols.tail:_*), parameters)
+      executeInsertOnly(insertSourceDF.select(cols.head, cols.tail:_*), writeParam)

Review Comment:
   @KnightChess I think that, in the case of merge into, we can set `hoodie.combine.before.insert` to true when the precombine field is set, and keep the operation type as `insert` to align with "when not matched then insert *", where `insert` was used. Changing the operation type leads to an implementation inconsistency, since the method is even called `executeInsertOnly()`.
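
   A minimal sketch of the alternative suggested in this comment, assuming the write options are carried as a plain `Map[String, String]`. `InsertWithCombineSketch` and `insertOnlyOptions` are hypothetical names; `hoodie.datasource.write.operation` is the Hudi datasource option for the operation type and is not quoted in this thread. This is not necessarily what the PR ended up doing.

   ```scala
   // Hypothetical sketch of the reviewer's suggestion; not the code in the PR.
   object InsertWithCombineSketch {
     // Keeps the operation as "insert" and only enables pre-insert combining
     // when a non-empty precombine field is configured for the table.
     def insertOnlyOptions(base: Map[String, String],
                           preCombineField: Option[String]): Map[String, String] = {
       val withOperation = base + ("hoodie.datasource.write.operation" -> "insert")
       preCombineField match {
         case Some(field) if field.nonEmpty =>
           withOperation + ("hoodie.combine.before.insert" -> "true")
         case _ =>
           withOperation
       }
     }
   }
   ```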





[GitHub] [hudi] hudi-bot commented on pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6824:
URL: https://github.com/apache/hudi/pull/6824#issuecomment-1293410770

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "3cec3f8cea64be28f341c1e8a9eaf60142d5e037",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12613",
       "triggerID" : "3cec3f8cea64be28f341c1e8a9eaf60142d5e037",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 3cec3f8cea64be28f341c1e8a9eaf60142d5e037 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12613) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6824:
URL: https://github.com/apache/hudi/pull/6824#issuecomment-1261721269

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "3222f2c9d2e200ade4c3fa7f6b333e79cc298775",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11871",
       "triggerID" : "3222f2c9d2e200ade4c3fa7f6b333e79cc298775",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 3222f2c9d2e200ade4c3fa7f6b333e79cc298775 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11871) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6824:
URL: https://github.com/apache/hudi/pull/6824#issuecomment-1293135782

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "3222f2c9d2e200ade4c3fa7f6b333e79cc298775",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11871",
       "triggerID" : "3222f2c9d2e200ade4c3fa7f6b333e79cc298775",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4340e3a6cff3dde74c22911eafadfe346b95f8cd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12593",
       "triggerID" : "4340e3a6cff3dde74c22911eafadfe346b95f8cd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3cec3f8cea64be28f341c1e8a9eaf60142d5e037",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12613",
       "triggerID" : "3cec3f8cea64be28f341c1e8a9eaf60142d5e037",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 3cec3f8cea64be28f341c1e8a9eaf60142d5e037 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12613) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] KnightChess commented on a diff in pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

Posted by GitBox <gi...@apache.org>.
KnightChess commented on code in PR #6824:
URL: https://github.com/apache/hudi/pull/6824#discussion_r1002828098


##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala:
##########
@@ -160,7 +167,7 @@ case class MergeIntoHoodieTableCommand(mergeInto: MergeIntoTable) extends Hoodie
 
       // column order changed after left anti join , we should keep column order of source dataframe
       val cols = removeMetaFields(sourceDF).columns
-      executeInsertOnly(insertSourceDF.select(cols.head, cols.tail:_*), parameters)
+      executeInsertOnly(insertSourceDF.select(cols.head, cols.tail:_*), writeParam)

Review Comment:
   @YannByron Sorry, I will add these in the next few days.





[GitHub] [hudi] hudi-bot commented on pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6824:
URL: https://github.com/apache/hudi/pull/6824#issuecomment-1293509557

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "3cec3f8cea64be28f341c1e8a9eaf60142d5e037",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12613",
       "triggerID" : "3cec3f8cea64be28f341c1e8a9eaf60142d5e037",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 3cec3f8cea64be28f341c1e8a9eaf60142d5e037 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12613) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6824:
URL: https://github.com/apache/hudi/pull/6824#issuecomment-1292944401

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "3222f2c9d2e200ade4c3fa7f6b333e79cc298775",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11871",
       "triggerID" : "3222f2c9d2e200ade4c3fa7f6b333e79cc298775",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4340e3a6cff3dde74c22911eafadfe346b95f8cd",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12593",
       "triggerID" : "4340e3a6cff3dde74c22911eafadfe346b95f8cd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3cec3f8cea64be28f341c1e8a9eaf60142d5e037",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12613",
       "triggerID" : "3cec3f8cea64be28f341c1e8a9eaf60142d5e037",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 4340e3a6cff3dde74c22911eafadfe346b95f8cd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12593) 
   * 3cec3f8cea64be28f341c1e8a9eaf60142d5e037 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12613) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] KnightChess closed pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

Posted by GitBox <gi...@apache.org>.
KnightChess closed pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…
URL: https://github.com/apache/hudi/pull/6824




[GitHub] [hudi] KnightChess commented on a diff in pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

Posted by GitBox <gi...@apache.org>.
KnightChess commented on code in PR #6824:
URL: https://github.com/apache/hudi/pull/6824#discussion_r1002827833


##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala:
##########
@@ -160,7 +167,7 @@ case class MergeIntoHoodieTableCommand(mergeInto: MergeIntoTable) extends Hoodie
 
       // column order changed after left anti join , we should keep column order of source dataframe
       val cols = removeMetaFields(sourceDF).columns
-      executeInsertOnly(insertSourceDF.select(cols.head, cols.tail:_*), parameters)
+      executeInsertOnly(insertSourceDF.select(cols.head, cols.tail:_*), writeParam)

Review Comment:
   @xushiyan I think `executeInsertOnly` and `executeUpsert` are different from the Hudi operations `insert` and `upsert`; they are just condition branches for the `merge into` SQL. As for the SQL semantics, I think `merge into` should only use the `upsert` operation, and should not even follow the Hudi `precombineKey`, because `merge into` SQL has a lot of flexibility to update whichever records we want.





[GitHub] [hudi] hudi-bot commented on pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6824:
URL: https://github.com/apache/hudi/pull/6824#issuecomment-1292941556

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "3222f2c9d2e200ade4c3fa7f6b333e79cc298775",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11871",
       "triggerID" : "3222f2c9d2e200ade4c3fa7f6b333e79cc298775",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4340e3a6cff3dde74c22911eafadfe346b95f8cd",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12593",
       "triggerID" : "4340e3a6cff3dde74c22911eafadfe346b95f8cd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3cec3f8cea64be28f341c1e8a9eaf60142d5e037",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "3cec3f8cea64be28f341c1e8a9eaf60142d5e037",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 4340e3a6cff3dde74c22911eafadfe346b95f8cd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12593) 
   * 3cec3f8cea64be28f341c1e8a9eaf60142d5e037 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] YannByron commented on a diff in pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

Posted by GitBox <gi...@apache.org>.
YannByron commented on code in PR #6824:
URL: https://github.com/apache/hudi/pull/6824#discussion_r999614340


##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala:
##########
@@ -160,7 +167,7 @@ case class MergeIntoHoodieTableCommand(mergeInto: MergeIntoTable) extends Hoodie
 
       // column order changed after left anti join , we should keep column order of source dataframe
       val cols = removeMetaFields(sourceDF).columns
-      executeInsertOnly(insertSourceDF.select(cols.head, cols.tail:_*), parameters)
+      executeInsertOnly(insertSourceDF.select(cols.head, cols.tail:_*), writeParam)

Review Comment:
   @KnightChess I think this PR wants to guarantee write consistency regardless of whether the `when matched then` clause is present.
   So it may be better to split this UT into two: one with `preCombineField` configured and one without. Each UT should then contain two cases, one with `when matched then` and one without.
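
   The suggested split amounts to four scenarios, which can be enumerated as in the sketch below. `MergeIntoTestMatrixSketch` and `Scenario` are hypothetical names used only to lay out the matrix; the real tests would run `merge into` statements against Hudi tables.

   ```scala
   // Hypothetical sketch of the test matrix; it does not run any Spark or Hudi code.
   object MergeIntoTestMatrixSketch {
     final case class Scenario(hasPreCombineField: Boolean, hasMatchedClause: Boolean)

     // Every combination of "preCombineField configured or not" and
     // "`when matched then` clause present or not".
     val scenarios: Seq[Scenario] =
       for {
         preCombine <- Seq(true, false)
         matched <- Seq(true, false)
       } yield Scenario(preCombine, matched)

     // Groups the scenarios into the two suggested UTs, keyed by whether
     // preCombineField is configured.
     val byUnitTest: Map[Boolean, Seq[Scenario]] =
       scenarios.groupBy(_.hasPreCombineField)
   }
   ```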





[GitHub] [hudi] hudi-bot commented on pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6824:
URL: https://github.com/apache/hudi/pull/6824#issuecomment-1261681863

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "3222f2c9d2e200ade4c3fa7f6b333e79cc298775",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "3222f2c9d2e200ade4c3fa7f6b333e79cc298775",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 3222f2c9d2e200ade4c3fa7f6b333e79cc298775 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] xushiyan commented on a diff in pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

Posted by GitBox <gi...@apache.org>.
xushiyan commented on code in PR #6824:
URL: https://github.com/apache/hudi/pull/6824#discussion_r1002955844


##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala:
##########
@@ -160,7 +167,7 @@ case class MergeIntoHoodieTableCommand(mergeInto: MergeIntoTable) extends Hoodie
 
       // column order changed after left anti join , we should keep column order of source dataframe
       val cols = removeMetaFields(sourceDF).columns
-      executeInsertOnly(insertSourceDF.select(cols.head, cols.tail:_*), parameters)
+      executeInsertOnly(insertSourceDF.select(cols.head, cols.tail:_*), writeParam)

Review Comment:
   @KnightChess OK, I agree with the merge into semantics of doing upsert. To improve code readability, can we actually merge `executeUpsert` and `executeInsertOnly`? We're actually separating the condition flows within `executeUpsert`.





[GitHub] [hudi] xushiyan merged pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

Posted by GitBox <gi...@apache.org>.
xushiyan merged PR #6824:
URL: https://github.com/apache/hudi/pull/6824




[GitHub] [hudi] hudi-bot commented on pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6824:
URL: https://github.com/apache/hudi/pull/6824#issuecomment-1291692207

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "3222f2c9d2e200ade4c3fa7f6b333e79cc298775",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11871",
       "triggerID" : "3222f2c9d2e200ade4c3fa7f6b333e79cc298775",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4340e3a6cff3dde74c22911eafadfe346b95f8cd",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "4340e3a6cff3dde74c22911eafadfe346b95f8cd",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 3222f2c9d2e200ade4c3fa7f6b333e79cc298775 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11871) 
   * 4340e3a6cff3dde74c22911eafadfe346b95f8cd UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6824:
URL: https://github.com/apache/hudi/pull/6824#issuecomment-1291699769

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "3222f2c9d2e200ade4c3fa7f6b333e79cc298775",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11871",
       "triggerID" : "3222f2c9d2e200ade4c3fa7f6b333e79cc298775",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4340e3a6cff3dde74c22911eafadfe346b95f8cd",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12593",
       "triggerID" : "4340e3a6cff3dde74c22911eafadfe346b95f8cd",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 3222f2c9d2e200ade4c3fa7f6b333e79cc298775 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11871) 
   * 4340e3a6cff3dde74c22911eafadfe346b95f8cd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12593) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6824:
URL: https://github.com/apache/hudi/pull/6824#issuecomment-1293430181

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "3cec3f8cea64be28f341c1e8a9eaf60142d5e037",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "3cec3f8cea64be28f341c1e8a9eaf60142d5e037",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 3cec3f8cea64be28f341c1e8a9eaf60142d5e037 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6824:
URL: https://github.com/apache/hudi/pull/6824#issuecomment-1293327478

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "3cec3f8cea64be28f341c1e8a9eaf60142d5e037",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12613",
       "triggerID" : "3cec3f8cea64be28f341c1e8a9eaf60142d5e037",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 3cec3f8cea64be28f341c1e8a9eaf60142d5e037 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12613) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6824:
URL: https://github.com/apache/hudi/pull/6824#issuecomment-1261836840

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "3222f2c9d2e200ade4c3fa7f6b333e79cc298775",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11871",
       "triggerID" : "3222f2c9d2e200ade4c3fa7f6b333e79cc298775",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 3222f2c9d2e200ade4c3fa7f6b333e79cc298775 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11871) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6824:
URL: https://github.com/apache/hudi/pull/6824#issuecomment-1292290189

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "3222f2c9d2e200ade4c3fa7f6b333e79cc298775",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11871",
       "triggerID" : "3222f2c9d2e200ade4c3fa7f6b333e79cc298775",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4340e3a6cff3dde74c22911eafadfe346b95f8cd",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12593",
       "triggerID" : "4340e3a6cff3dde74c22911eafadfe346b95f8cd",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 4340e3a6cff3dde74c22911eafadfe346b95f8cd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12593) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] KnightChess closed pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

Posted by GitBox <gi...@apache.org>.
KnightChess closed pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…
URL: https://github.com/apache/hudi/pull/6824




[GitHub] [hudi] hudi-bot commented on pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6824:
URL: https://github.com/apache/hudi/pull/6824#issuecomment-1293502379

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "3cec3f8cea64be28f341c1e8a9eaf60142d5e037",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12613",
       "triggerID" : "3cec3f8cea64be28f341c1e8a9eaf60142d5e037",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 3cec3f8cea64be28f341c1e8a9eaf60142d5e037 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12613) 
   




[GitHub] [hudi] xushiyan commented on a diff in pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

Posted by GitBox <gi...@apache.org>.
xushiyan commented on code in PR #6824:
URL: https://github.com/apache/hudi/pull/6824#discussion_r999328075


##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala:
##########
@@ -160,7 +167,7 @@ case class MergeIntoHoodieTableCommand(mergeInto: MergeIntoTable) extends Hoodie
 
       // column order changed after left anti join , we should keep column order of source dataframe
       val cols = removeMetaFields(sourceDF).columns
-      executeInsertOnly(insertSourceDF.select(cols.head, cols.tail:_*), parameters)
+      executeInsertOnly(insertSourceDF.select(cols.head, cols.tail:_*), writeParam)

Review Comment:
   This is basically saying that if the user sets a precombine field, we always upsert, even if there's no match. I don't think this is the right semantics. If you want to de-duplicate the incoming records, use `hoodie.combine.before.insert`.
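
To illustrate the difference this comment points at, here is a minimal sketch in plain Scala. It is a simplified model and not Hudi code: `Rec`, `InsertOnlyMergeSketch`, `notMatched`, `combineByKey` and `insertOnly` are hypothetical names, and the `combineBeforeInsert` flag only stands in for the effect one would expect from `hoodie.combine.before.insert`, assuming the precombine field is a timestamp-like column.

```scala
// Hypothetical, simplified model of an insert-only MERGE INTO; not Hudi's actual classes.
final case class Rec(key: Int, ts: Long, value: String)

object InsertOnlyMergeSketch {
  // Source rows whose key is absent from the target (the left anti join in the diff).
  def notMatched(source: Seq[Rec], target: Seq[Rec]): Seq[Rec] = {
    val existing = target.map(_.key).toSet
    source.filterNot(r => existing.contains(r.key))
  }

  // Keep one row per key: the one with the largest precombine value (ts here).
  def combineByKey(rows: Seq[Rec]): Seq[Rec] =
    rows.groupBy(_.key).values.map(_.maxBy(_.ts)).toSeq.sortBy(_.key)

  // A plain insert appends every incoming row; with combineBeforeInsert
  // the incoming rows are de-duplicated by key first.
  def insertOnly(source: Seq[Rec], target: Seq[Rec], combineBeforeInsert: Boolean): Seq[Rec] = {
    val incoming = notMatched(source, target)
    target ++ (if (combineBeforeInsert) combineByKey(incoming) else incoming)
  }

  def main(args: Array[String]): Unit = {
    val target = Seq(Rec(1, 10L, "a"))
    // Two source rows share key 2, and neither matches the target.
    val source = Seq(Rec(2, 10L, "b"), Rec(2, 11L, "b2"))
    println(insertOnly(source, target, combineBeforeInsert = false))
    println(insertOnly(source, target, combineBeforeInsert = true))
  }
}
```

In this model the first call keeps both rows for key 2, while the second keeps only the row with the larger `ts`. Rows already in the target are left untouched in both cases, which is the insert semantics the comment argues for.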





[GitHub] [hudi] KnightChess commented on a diff in pull request #6824: [HUDI-4946] fix merge into with no preCombineField has dup row by onl…

Posted by GitBox <gi...@apache.org>.
KnightChess commented on code in PR #6824:
URL: https://github.com/apache/hudi/pull/6824#discussion_r999401951


##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala:
##########
@@ -160,7 +167,7 @@ case class MergeIntoHoodieTableCommand(mergeInto: MergeIntoTable) extends Hoodie
 
       // column order changed after left anti join , we should keep column order of source dataframe
       val cols = removeMetaFields(sourceDF).columns
-      executeInsertOnly(insertSourceDF.select(cols.head, cols.tail:_*), parameters)
+      executeInsertOnly(insertSourceDF.select(cols.head, cols.tail:_*), writeParam)

Review Comment:
   Yes, setting `hoodie.combine.before.insert` will de-duplicate, but this is not user-friendly.
   When a user creates a table with a precombine field and uses MERGE INTO SQL to upsert data, duplicate records may be produced depending on how the merge statement is written. To avoid that, the user has to set `hoodie.combine.before.insert` only in the one case where the statement has just a not-matched branch. This will confuse users: for a table with a precombine key, MERGE INTO sometimes behaves as `upsert` and sometimes as `insert`.
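
The inconsistency described here can be sketched as a small decision function. This is only a reading of the behaviour as discussed in this thread, not Hudi's implementation: `MergeOperationSketch`, `WriteOp`, `currentOp` and `proposedOp` are hypothetical names, and the rule in `currentOp` is an assumption taken from the PR description (a statement with a matched action upserts when the table has a precombine field, while an insert-only statement always inserts).

```scala
// Hypothetical model of the write operation chosen for MERGE INTO; names are illustrative only.
object MergeOperationSketch {
  sealed trait WriteOp
  case object Upsert extends WriteOp
  case object Insert extends WriteOp

  // Behaviour before this PR, as described in the thread: the operation depends
  // on which branches the statement has, not only on the table definition.
  def currentOp(hasMatchedClause: Boolean, hasPreCombineField: Boolean): WriteOp =
    if (hasMatchedClause && hasPreCombineField) Upsert else Insert

  // Behaviour proposed by this PR: the operation depends only on whether
  // the table defines a precombine field.
  def proposedOp(hasPreCombineField: Boolean): WriteOp =
    if (hasPreCombineField) Upsert else Insert

  def main(args: Array[String]): Unit = {
    // Same table (with a precombine field), two differently written statements.
    println(currentOp(hasMatchedClause = true, hasPreCombineField = true))
    println(currentOp(hasMatchedClause = false, hasPreCombineField = true))
    println(proposedOp(hasPreCombineField = true))
  }
}
```

Under these assumptions the first two calls differ for the same table, which is the source of the confusion, while `proposedOp` gives one answer per table regardless of how the statement is written.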


