You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/04/24 12:19:21 UTC

[GitHub] [incubator-hudi] pratyakshsharma opened a new pull request #1558: [HUDI-796]: added deduping logic for upserts case

pratyakshsharma opened a new pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a pull request.*
   
   ## What is the purpose of the pull request
   
   Added deduping logic for upserts case. 
   
   ## Brief change log
   
   1. Added logic in DedupeSparkJob for upserts case
   2. Made dryRun and useCommitTimeForDedupe parameters configurable in RepairsCommand. 
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
     - *Added integration tests for end-to-end.*
     - *Added HoodieClientWriteTest to verify the change.*
     - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] pratyakshsharma commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-629688282


   @yanghua I am unable to run integration tests defined in hudi-cli package on my local. One of the tests from ITTestRepairsCommand is continuously failing in travis build. Need help here. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] codecov-commenter edited a comment on pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-630856735


   # [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=h1) Report
   > Merging [#1558](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=desc) into [master](https://codecov.io/gh/apache/incubator-hudi/commit/a64afdfd17ac974e451bceb877f3d40a9c775253&el=desc) will **decrease** coverage by `55.15%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-hudi/pull/1558/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #1558       +/-   ##
   =============================================
   - Coverage     71.75%   16.59%   -55.16%     
   + Complexity     1089      799      -290     
   =============================================
     Files           385      344       -41     
     Lines         16599    15172     -1427     
     Branches       1668     1512      -156     
   =============================================
   - Hits          11910     2518     -9392     
   - Misses         3962    12322     +8360     
   + Partials        727      332      -395     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...n/java/org/apache/hudi/io/AppendHandleFactory.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vQXBwZW5kSGFuZGxlRmFjdG9yeS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/client/HoodieReadClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVJlYWRDbGllbnQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/metrics/MetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzUmVwb3J0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/common/model/ActionType.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0FjdGlvblR5cGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...java/org/apache/hudi/io/HoodieRangeInfoHandle.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllUmFuZ2VJbmZvSGFuZGxlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/hadoop/InputPathHandler.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0lucHV0UGF0aEhhbmRsZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...a/org/apache/hudi/exception/HoodieIOException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUlPRXhjZXB0aW9uLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...org/apache/hudi/table/action/commit/SmallFile.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9TbWFsbEZpbGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...rg/apache/hudi/index/bloom/KeyRangeLookupTree.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vS2V5UmFuZ2VMb29rdXBUcmVlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...g/apache/hudi/exception/HoodieInsertException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUluc2VydEV4Y2VwdGlvbi5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | ... and [309 more](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=footer). Last update [a64afdf...5dcdd41](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] codecov-io edited a comment on pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-629699444


   # [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=h1) Report
   > Merging [#1558](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=desc) into [master](https://codecov.io/gh/apache/incubator-hudi/commit/a64afdfd17ac974e451bceb877f3d40a9c775253&el=desc) will **decrease** coverage by `55.03%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-hudi/pull/1558/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #1558       +/-   ##
   =============================================
   - Coverage     71.75%   16.71%   -55.04%     
   + Complexity     1089      795      -294     
   =============================================
     Files           385      340       -45     
     Lines         16599    15030     -1569     
     Branches       1668     1499      -169     
   =============================================
   - Hits          11910     2512     -9398     
   - Misses         3962    12188     +8226     
   + Partials        727      330      -397     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...n/java/org/apache/hudi/io/AppendHandleFactory.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vQXBwZW5kSGFuZGxlRmFjdG9yeS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/client/HoodieReadClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVJlYWRDbGllbnQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/metrics/MetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzUmVwb3J0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/common/model/ActionType.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0FjdGlvblR5cGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...java/org/apache/hudi/io/HoodieRangeInfoHandle.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllUmFuZ2VJbmZvSGFuZGxlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/hadoop/InputPathHandler.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0lucHV0UGF0aEhhbmRsZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...a/org/apache/hudi/exception/HoodieIOException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUlPRXhjZXB0aW9uLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...org/apache/hudi/table/action/commit/SmallFile.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9TbWFsbEZpbGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...rg/apache/hudi/index/bloom/KeyRangeLookupTree.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vS2V5UmFuZ2VMb29rdXBUcmVlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...g/apache/hudi/exception/HoodieInsertException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUluc2VydEV4Y2VwdGlvbi5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | ... and [305 more](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=footer). Last update [a64afdf...838382a](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] yanghua commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
yanghua commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-628985363


   > I will let @yanghua see this home
   
   OK, and @pratyakshsharma first of all, please fix all the conflicting files.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] pratyakshsharma commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-629700671


   Wondering why is this codecov report all red. Why is it into account files which are not modified in this PR? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] yanghua commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
yanghua commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-629263967


   @pratyakshsharma still conflicting files


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] vinothchandar commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-632813623


   you can hop onto travis and click restart.. CI has been pretty stable for a while now.. Do you know what the error is 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#discussion_r416469107



##########
File path: hudi-cli/src/main/scala/org/apache/hudi/cli/DedupeSparkJob.scala
##########
@@ -103,24 +105,51 @@ class DedupeSparkJob(basePath: String,
     // Mark all files except the one with latest commits for deletion
     dupeMap.foreach(rt => {
       val (key, rows) = rt
-      var maxCommit = -1L
-
-      rows.foreach(r => {
-        val c = r(3).asInstanceOf[String].toLong
-        if (c > maxCommit)
-          maxCommit = c
-      })
-
-      rows.foreach(r => {
-        val c = r(3).asInstanceOf[String].toLong
-        if (c != maxCommit) {
-          val f = r(2).asInstanceOf[String].split("_")(0)
-          if (!fileToDeleteKeyMap.contains(f)) {
-            fileToDeleteKeyMap(f) = HashSet[String]()
+
+      if (useCommitTimeForDedupe) {
+        /*
+        This corresponds to the case where duplicates got created due to INSERT and have never been updated.
+         */
+        var maxCommit = -1L
+
+        rows.foreach(r => {
+          val c = r(3).asInstanceOf[String].toLong
+          if (c > maxCommit)
+            maxCommit = c
+        })
+        rows.foreach(r => {
+          val c = r(3).asInstanceOf[String].toLong
+          if (c != maxCommit) {
+            val f = r(2).asInstanceOf[String].split("_")(0)
+            if (!fileToDeleteKeyMap.contains(f)) {
+              fileToDeleteKeyMap(f) = HashSet[String]()
+            }
+            fileToDeleteKeyMap(f).add(key)
           }
-          fileToDeleteKeyMap(f).add(key)
+        })
+      } else {
+        /*
+        This corresponds to the case where duplicates have been updated at least once.
+        Once updated, duplicates are bound to have same commit time unless forcefully modified.
+         */
+        val size = rows.size - 1
+        var i = 0
+        val loop = new Breaks
+        loop.breakable {

Review comment:
       Right. Thank you for suggesting this, I am not hands on at scala properly. :) 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] codecov-commenter commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-630856735


   # [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=h1) Report
   > Merging [#1558](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=desc) into [master](https://codecov.io/gh/apache/incubator-hudi/commit/a64afdfd17ac974e451bceb877f3d40a9c775253&el=desc) will **decrease** coverage by `55.15%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-hudi/pull/1558/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #1558       +/-   ##
   =============================================
   - Coverage     71.75%   16.59%   -55.16%     
   + Complexity     1089      799      -290     
   =============================================
     Files           385      344       -41     
     Lines         16599    15172     -1427     
     Branches       1668     1512      -156     
   =============================================
   - Hits          11910     2518     -9392     
   - Misses         3962    12322     +8360     
   + Partials        727      332      -395     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...n/java/org/apache/hudi/io/AppendHandleFactory.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vQXBwZW5kSGFuZGxlRmFjdG9yeS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/client/HoodieReadClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVJlYWRDbGllbnQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/metrics/MetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzUmVwb3J0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/common/model/ActionType.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0FjdGlvblR5cGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...java/org/apache/hudi/io/HoodieRangeInfoHandle.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllUmFuZ2VJbmZvSGFuZGxlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/hadoop/InputPathHandler.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0lucHV0UGF0aEhhbmRsZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...a/org/apache/hudi/exception/HoodieIOException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUlPRXhjZXB0aW9uLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...org/apache/hudi/table/action/commit/SmallFile.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9TbWFsbEZpbGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...rg/apache/hudi/index/bloom/KeyRangeLookupTree.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vS2V5UmFuZ2VMb29rdXBUcmVlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...g/apache/hudi/exception/HoodieInsertException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUluc2VydEV4Y2VwdGlvbi5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | ... and [309 more](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=footer). Last update [a64afdf...5dcdd41](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] pratyakshsharma commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-632858646


   > Do you know what the error is
   
   There is some error while forking as below -> 
   
   [ERROR] Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
   [ERROR] The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
   [ERROR] Command was /bin/sh -c cd /home/travis/build/apache/incubator-hudi/hudi-client && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xmx2g -jar /home/travis/build/apache/incubator-hudi/hudi-client/target/surefire/surefirebooter4462331339368597651.jar /home/travis/build/apache/incubator-hudi/hudi-client/target/surefire 2020-05-22T14-44-50_350-jvmRun1 surefire3441656546543475644tmp surefire_33480398663035356346tmp
   [ERROR] Error occurred in starting fork, check output in log
   [ERROR] Process Exit Code: 134
   [ERROR] Crashed tests:
   [ERROR] org.apache.hudi.client.TestTableSchemaEvolution
   [ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
   [ERROR] Command was /bin/sh -c cd /home/travis/build/apache/incubator-hudi/hudi-client && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xmx2g -jar /home/travis/build/apache/incubator-hudi/hudi-client/target/surefire/surefirebooter4462331339368597651.jar /home/travis/build/apache/incubator-hudi/hudi-client/target/surefire 2020-05-22T14-44-50_350-jvmRun1 surefire3441656546543475644tmp surefire_33480398663035356346tmp
   [ERROR] Error occurred in starting fork, check output in log
   [ERROR] Process Exit Code: 134
   
   @vinothchandar 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] hddong commented on a change in pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
hddong commented on a change in pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#discussion_r416299956



##########
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/RepairsCommand.java
##########
@@ -64,11 +64,15 @@ public String deduplicate(
       @CliOption(key = {"repairedOutputPath"}, help = "Location to place the repaired files",
           mandatory = true) final String repairedOutputPath,
       @CliOption(key = {"sparkProperties"}, help = "Spark Properties File Path",
-          mandatory = true) final String sparkPropertiesPath)
+          mandatory = true) final String sparkPropertiesPath,
+      @CliOption(key = {"useCommitTimeForDedupe"}, help = "Set it to true if duplicates have never been updated",
+        unspecifiedDefaultValue = "true") final boolean useCommitTimeForDedupe,
+      @CliOption(key = {"dryrun"}, help = "Should we actually add or just print what would be done",

Review comment:
       Let us modify help string of dryrun, statements are inaccurate :)

##########
File path: hudi-cli/src/main/scala/org/apache/hudi/cli/DedupeSparkJob.scala
##########
@@ -103,24 +105,51 @@ class DedupeSparkJob(basePath: String,
     // Mark all files except the one with latest commits for deletion
     dupeMap.foreach(rt => {
       val (key, rows) = rt
-      var maxCommit = -1L
-
-      rows.foreach(r => {
-        val c = r(3).asInstanceOf[String].toLong
-        if (c > maxCommit)
-          maxCommit = c
-      })
-
-      rows.foreach(r => {
-        val c = r(3).asInstanceOf[String].toLong
-        if (c != maxCommit) {
-          val f = r(2).asInstanceOf[String].split("_")(0)
-          if (!fileToDeleteKeyMap.contains(f)) {
-            fileToDeleteKeyMap(f) = HashSet[String]()
+
+      if (useCommitTimeForDedupe) {
+        /*
+        This corresponds to the case where duplicates got created due to INSERT and have never been updated.
+         */
+        var maxCommit = -1L
+
+        rows.foreach(r => {
+          val c = r(3).asInstanceOf[String].toLong
+          if (c > maxCommit)
+            maxCommit = c
+        })
+        rows.foreach(r => {
+          val c = r(3).asInstanceOf[String].toLong
+          if (c != maxCommit) {
+            val f = r(2).asInstanceOf[String].split("_")(0)
+            if (!fileToDeleteKeyMap.contains(f)) {
+              fileToDeleteKeyMap(f) = HashSet[String]()
+            }
+            fileToDeleteKeyMap(f).add(key)
           }
-          fileToDeleteKeyMap(f).add(key)
+        })
+      } else {
+        /*
+        This corresponds to the case where duplicates have been updated at least once.
+        Once updated, duplicates are bound to have same commit time unless forcefully modified.
+         */
+        val size = rows.size - 1
+        var i = 0
+        val loop = new Breaks
+        loop.breakable {

Review comment:
       It's better not use break here, `rows.init` also can get the rows will be delete.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] hddong commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
hddong commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-630131586


   @prashantwason : Had send you message in slack, can I hava a look of your log?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] hddong commented on a change in pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
hddong commented on a change in pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#discussion_r428056297



##########
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
##########
@@ -263,13 +265,26 @@ private static int compact(JavaSparkContext jsc, String basePath, String tableNa
   }
 
   private static int deduplicatePartitionPath(JavaSparkContext jsc, String duplicatedPartitionPath,
-      String repairedOutputPath, String basePath, String dryRun) {
+      String repairedOutputPath, String basePath, boolean dryRun, String dedupeType) {
     DedupeSparkJob job = new DedupeSparkJob(basePath, duplicatedPartitionPath, repairedOutputPath, new SQLContext(jsc),
-        FSUtils.getFs(basePath, jsc.hadoopConfiguration()));
-    job.fixDuplicates(Boolean.parseBoolean(dryRun));
+        FSUtils.getFs(basePath, jsc.hadoopConfiguration()), getDedupeType(dedupeType));
+    job.fixDuplicates(dryRun);
     return 0;
   }
 
+  private static Enumeration.Value getDedupeType(String type) {
+    switch (type) {
+      case "insertType":
+        return DeDupeType.insertType();
+      case "updateType":
+        return DeDupeType.updateType();
+      case "upsertType":
+        return DeDupeType.upsertType();
+      default:
+        throw new IllegalArgumentException("Please provide valid dedupe type!");
+    }
+  }
+

Review comment:
       Can use `DeDupeType.withName("insertType")` instead.

##########
File path: hudi-cli/src/main/scala/org/apache/hudi/cli/DeDupeType.scala
##########
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli
+
+object DeDupeType extends Enumeration {
+
+  type dedupeType = Value
+
+  val insertType = Value("insertType")
+  val updateType = Value("updateType")
+  val upsertType = Value("upsertType")

Review comment:
       Can we make it all uppercase to keep the format uniform
   https://github.com/apache/incubator-hudi/blob/74ecc27e920c70fa4598d8e5a696954203a5b127/hudi-common/src/main/java/org/apache/hudi/common/model/WriteOperationType.java#L30-L34

##########
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
##########
@@ -263,13 +265,26 @@ private static int compact(JavaSparkContext jsc, String basePath, String tableNa
   }
 
   private static int deduplicatePartitionPath(JavaSparkContext jsc, String duplicatedPartitionPath,
-      String repairedOutputPath, String basePath, String dryRun) {
+      String repairedOutputPath, String basePath, boolean dryRun, String dedupeType) {
     DedupeSparkJob job = new DedupeSparkJob(basePath, duplicatedPartitionPath, repairedOutputPath, new SQLContext(jsc),
-        FSUtils.getFs(basePath, jsc.hadoopConfiguration()));
-    job.fixDuplicates(Boolean.parseBoolean(dryRun));
+        FSUtils.getFs(basePath, jsc.hadoopConfiguration()), getDedupeType(dedupeType));
+    job.fixDuplicates(dryRun);
     return 0;
   }
 
+  private static Enumeration.Value getDedupeType(String type) {
+    switch (type) {
+      case "insertType":
+        return DeDupeType.insertType();
+      case "updateType":
+        return DeDupeType.updateType();
+      case "upsertType":
+        return DeDupeType.upsertType();
+      default:
+        throw new IllegalArgumentException("Please provide valid dedupe type!");
+    }
+  }
+

Review comment:
       Can use `DeDupeType.withName("insertType")` instead?

##########
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/RepairsCommand.java
##########
@@ -77,7 +77,9 @@ public String deduplicate(
           help = "Spark executor memory") final String sparkMemory,
       @CliOption(key = {"dryrun"},
           help = "Should we actually remove duplicates or just run and store result to repairedOutputPath",
-          unspecifiedDefaultValue = "true") final boolean dryRun)
+          unspecifiedDefaultValue = "true") final boolean dryRun,
+      @CliOption(key = {"dedupeType"}, help = "Check DeDupeType.scala for valid values",
+          unspecifiedDefaultValue = "insertType") final String dedupeType)

Review comment:
       It's better to show the three types in help string and have a type check at first line of command. 

##########
File path: hudi-cli/src/main/scala/org/apache/hudi/cli/DedupeSparkJob.scala
##########
@@ -98,34 +97,92 @@ class DedupeSparkJob(basePath: String,
         ON h.`_hoodie_record_key` = d.dupe_key
                       """
     val dupeMap = sqlContext.sql(dupeDataSql).collectAsList().groupBy(r => r.getString(0))
-    val fileToDeleteKeyMap = new HashMap[String, HashSet[String]]()
+    getDedupePlan(dupeMap)
+  }
 
-    // Mark all files except the one with latest commits for deletion
+  private def getDedupePlan(dupeMap: Map[String, Buffer[Row]]): HashMap[String, HashSet[String]] = {
+    val fileToDeleteKeyMap = new HashMap[String, HashSet[String]]()
     dupeMap.foreach(rt => {
       val (key, rows) = rt
-      var maxCommit = -1L
-
-      rows.foreach(r => {
-        val c = r(3).asInstanceOf[String].toLong
-        if (c > maxCommit)
-          maxCommit = c
-      })
-
-      rows.foreach(r => {
-        val c = r(3).asInstanceOf[String].toLong
-        if (c != maxCommit) {
-          val f = r(2).asInstanceOf[String].split("_")(0)
-          if (!fileToDeleteKeyMap.contains(f)) {
-            fileToDeleteKeyMap(f) = HashSet[String]()
-          }
-          fileToDeleteKeyMap(f).add(key)
-        }
-      })
+
+      dedupeType match {
+        case DeDupeType.updateType =>
+          /*
+          This corresponds to the case where all duplicates have been updated at least once.
+          Once updated, duplicates are bound to have same commit time unless forcefully modified.
+          */
+          rows.init.foreach(r => {
+            val f = r(2).asInstanceOf[String].split("_")(0)
+            if (!fileToDeleteKeyMap.contains(f)) {
+              fileToDeleteKeyMap(f) = HashSet[String]()
+            }
+            fileToDeleteKeyMap(f).add(key)
+          })
+        case DeDupeType.insertType =>
+          /*
+          This corresponds to the case where duplicates got created due to INSERT and have never been updated.
+          */
+          var maxCommit = -1L
+
+          rows.foreach(r => {
+            val c = r(3).asInstanceOf[String].toLong
+            if (c > maxCommit)
+              maxCommit = c
+          })
+          rows.foreach(r => {
+            val c = r(3).asInstanceOf[String].toLong
+            if (c != maxCommit) {
+              val f = r(2).asInstanceOf[String].split("_")(0)
+              if (!fileToDeleteKeyMap.contains(f)) {
+                fileToDeleteKeyMap(f) = HashSet[String]()
+              }
+              fileToDeleteKeyMap(f).add(key)
+            }
+          })
+
+        case DeDupeType.upsertType =>
+          /*
+          This corresponds to the case where duplicates got created as a result of inserts as well as updates,
+          i.e few duplicate records have been updated, while others were never updated.
+           */
+          var maxCommit = -1L
+
+          rows.foreach(r => {
+            val c = r(3).asInstanceOf[String].toLong
+            if (c > maxCommit)
+              maxCommit = c
+          })
+          val rowsWithMaxCommit = new ListBuffer[Row]()
+          rows.foreach(r => {
+            val c = r(3).asInstanceOf[String].toLong
+            if (c != maxCommit) {
+              val f = r(2).asInstanceOf[String].split("_")(0)
+              if (!fileToDeleteKeyMap.contains(f)) {
+                fileToDeleteKeyMap(f) = HashSet[String]()
+              }
+              fileToDeleteKeyMap(f).add(key)
+            } else {
+              rowsWithMaxCommit += r
+            }
+          })
+
+          rowsWithMaxCommit.toList.init.foreach(r => {
+            val f = r(2).asInstanceOf[String].split("_")(0)
+            if (!fileToDeleteKeyMap.contains(f)) {
+              fileToDeleteKeyMap(f) = HashSet[String]()
+            }
+            fileToDeleteKeyMap(f).add(key)
+          })
+
+        case _ => throw new IllegalArgumentException("Please provide valid type for deduping!")
+      }
     })
+    LOG.debug("fileToDeleteKeyMap size : " + fileToDeleteKeyMap.size + ", map: " + fileToDeleteKeyMap)

Review comment:
       Can we use `$` to get value? like:
   https://github.com/apache/incubator-hudi/blob/74ecc27e920c70fa4598d8e5a696954203a5b127/hudi-cli/src/main/scala/org/apache/hudi/cli/DedupeSparkJob.scala#L144




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] vinothchandar commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-620226161


   cc @yanghua @hddong are you able to review this? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] pratyakshsharma commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-621253677


   > @pratyakshsharma : Can we do two kinds of de-duplicate together? First, de-duplicate by different commits, then de-duplicate same commit, they are not incompatible, we can remove all the duplicate data at once.
   
   Yes it is possible to do both types of deduplicate together. But I feel in most of the cases, only one of the 2 possible cases (de-duplicate by different commits or de-duplicate same commit) would be needed for doing deduping. In essence, that would mean one of the 2 flows would be running unnecessarily. That was the sole reason I introduced the boolean useCommitTimeForDedupe.
   So I prefer to have the code flow in its current form. If you strongly feel, I would do the changes that you are suggesting. Please let me know your thoughts on this. :) 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#discussion_r429177537



##########
File path: hudi-cli/src/main/scala/org/apache/hudi/cli/DedupeSparkJob.scala
##########
@@ -98,34 +97,92 @@ class DedupeSparkJob(basePath: String,
         ON h.`_hoodie_record_key` = d.dupe_key
                       """
     val dupeMap = sqlContext.sql(dupeDataSql).collectAsList().groupBy(r => r.getString(0))
-    val fileToDeleteKeyMap = new HashMap[String, HashSet[String]]()
+    getDedupePlan(dupeMap)
+  }
 
-    // Mark all files except the one with latest commits for deletion
+  private def getDedupePlan(dupeMap: Map[String, Buffer[Row]]): HashMap[String, HashSet[String]] = {
+    val fileToDeleteKeyMap = new HashMap[String, HashSet[String]]()
     dupeMap.foreach(rt => {
       val (key, rows) = rt
-      var maxCommit = -1L
-
-      rows.foreach(r => {
-        val c = r(3).asInstanceOf[String].toLong
-        if (c > maxCommit)
-          maxCommit = c
-      })
-
-      rows.foreach(r => {
-        val c = r(3).asInstanceOf[String].toLong
-        if (c != maxCommit) {
-          val f = r(2).asInstanceOf[String].split("_")(0)
-          if (!fileToDeleteKeyMap.contains(f)) {
-            fileToDeleteKeyMap(f) = HashSet[String]()
-          }
-          fileToDeleteKeyMap(f).add(key)
-        }
-      })
+
+      dedupeType match {
+        case DeDupeType.updateType =>
+          /*
+          This corresponds to the case where all duplicates have been updated at least once.
+          Once updated, duplicates are bound to have same commit time unless forcefully modified.
+          */
+          rows.init.foreach(r => {
+            val f = r(2).asInstanceOf[String].split("_")(0)
+            if (!fileToDeleteKeyMap.contains(f)) {
+              fileToDeleteKeyMap(f) = HashSet[String]()
+            }
+            fileToDeleteKeyMap(f).add(key)
+          })
+        case DeDupeType.insertType =>
+          /*
+          This corresponds to the case where duplicates got created due to INSERT and have never been updated.
+          */
+          var maxCommit = -1L
+
+          rows.foreach(r => {
+            val c = r(3).asInstanceOf[String].toLong
+            if (c > maxCommit)
+              maxCommit = c
+          })
+          rows.foreach(r => {
+            val c = r(3).asInstanceOf[String].toLong
+            if (c != maxCommit) {
+              val f = r(2).asInstanceOf[String].split("_")(0)
+              if (!fileToDeleteKeyMap.contains(f)) {
+                fileToDeleteKeyMap(f) = HashSet[String]()
+              }
+              fileToDeleteKeyMap(f).add(key)
+            }
+          })
+
+        case DeDupeType.upsertType =>
+          /*
+          This corresponds to the case where duplicates got created as a result of inserts as well as updates,
+          i.e few duplicate records have been updated, while others were never updated.
+           */
+          var maxCommit = -1L
+
+          rows.foreach(r => {
+            val c = r(3).asInstanceOf[String].toLong
+            if (c > maxCommit)
+              maxCommit = c
+          })
+          val rowsWithMaxCommit = new ListBuffer[Row]()
+          rows.foreach(r => {
+            val c = r(3).asInstanceOf[String].toLong
+            if (c != maxCommit) {
+              val f = r(2).asInstanceOf[String].split("_")(0)
+              if (!fileToDeleteKeyMap.contains(f)) {
+                fileToDeleteKeyMap(f) = HashSet[String]()
+              }
+              fileToDeleteKeyMap(f).add(key)
+            } else {
+              rowsWithMaxCommit += r
+            }
+          })
+
+          rowsWithMaxCommit.toList.init.foreach(r => {
+            val f = r(2).asInstanceOf[String].split("_")(0)
+            if (!fileToDeleteKeyMap.contains(f)) {
+              fileToDeleteKeyMap(f) = HashSet[String]()
+            }
+            fileToDeleteKeyMap(f).add(key)
+          })
+
+        case _ => throw new IllegalArgumentException("Please provide valid type for deduping!")
+      }
     })
+    LOG.debug("fileToDeleteKeyMap size : " + fileToDeleteKeyMap.size + ", map: " + fileToDeleteKeyMap)

Review comment:
       Done. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] codecov-commenter edited a comment on pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-630856735


   # [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=h1) Report
   > Merging [#1558](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=desc) into [master](https://codecov.io/gh/apache/incubator-hudi/commit/a64afdfd17ac974e451bceb877f3d40a9c775253&el=desc) will **decrease** coverage by `53.41%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-hudi/pull/1558/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #1558       +/-   ##
   =============================================
   - Coverage     71.75%   18.33%   -53.42%     
   + Complexity     1089      855      -234     
   =============================================
     Files           385      344       -41     
     Lines         16599    15167     -1432     
     Branches       1668     1512      -156     
   =============================================
   - Hits          11910     2781     -9129     
   - Misses         3962    12033     +8071     
   + Partials        727      353      -374     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...n/java/org/apache/hudi/io/AppendHandleFactory.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vQXBwZW5kSGFuZGxlRmFjdG9yeS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/client/HoodieReadClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVJlYWRDbGllbnQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/metrics/MetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzUmVwb3J0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/common/model/ActionType.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0FjdGlvblR5cGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...java/org/apache/hudi/io/HoodieRangeInfoHandle.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllUmFuZ2VJbmZvSGFuZGxlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/hadoop/InputPathHandler.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0lucHV0UGF0aEhhbmRsZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...a/org/apache/hudi/exception/HoodieIOException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUlPRXhjZXB0aW9uLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...org/apache/hudi/table/action/commit/SmallFile.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9TbWFsbEZpbGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...rg/apache/hudi/index/bloom/KeyRangeLookupTree.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vS2V5UmFuZ2VMb29rdXBUcmVlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...g/apache/hudi/exception/HoodieInsertException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUluc2VydEV4Y2VwdGlvbi5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | ... and [307 more](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=footer). Last update [a64afdf...72f850b](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] codecov-io commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-629699444


   # [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=h1) Report
   > Merging [#1558](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=desc) into [master](https://codecov.io/gh/apache/incubator-hudi/commit/a64afdfd17ac974e451bceb877f3d40a9c775253&el=desc) will **decrease** coverage by `55.03%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-hudi/pull/1558/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #1558       +/-   ##
   =============================================
   - Coverage     71.75%   16.71%   -55.04%     
   + Complexity     1089      795      -294     
   =============================================
     Files           385      340       -45     
     Lines         16599    15030     -1569     
     Branches       1668     1499      -169     
   =============================================
   - Hits          11910     2512     -9398     
   - Misses         3962    12188     +8226     
   + Partials        727      330      -397     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...n/java/org/apache/hudi/io/AppendHandleFactory.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vQXBwZW5kSGFuZGxlRmFjdG9yeS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/client/HoodieReadClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVJlYWRDbGllbnQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/metrics/MetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzUmVwb3J0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/common/model/ActionType.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0FjdGlvblR5cGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...java/org/apache/hudi/io/HoodieRangeInfoHandle.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllUmFuZ2VJbmZvSGFuZGxlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/hadoop/InputPathHandler.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0lucHV0UGF0aEhhbmRsZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...a/org/apache/hudi/exception/HoodieIOException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUlPRXhjZXB0aW9uLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...org/apache/hudi/table/action/commit/SmallFile.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9TbWFsbEZpbGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...rg/apache/hudi/index/bloom/KeyRangeLookupTree.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vS2V5UmFuZ2VMb29rdXBUcmVlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...g/apache/hudi/exception/HoodieInsertException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUluc2VydEV4Y2VwdGlvbi5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | ... and [305 more](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=footer). Last update [a64afdf...838382a](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] vinothchandar commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-628761300


   I will let @yanghua see this home 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#discussion_r416462293



##########
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/RepairsCommand.java
##########
@@ -64,11 +64,15 @@ public String deduplicate(
       @CliOption(key = {"repairedOutputPath"}, help = "Location to place the repaired files",
           mandatory = true) final String repairedOutputPath,
       @CliOption(key = {"sparkProperties"}, help = "Spark Properties File Path",
-          mandatory = true) final String sparkPropertiesPath)
+          mandatory = true) final String sparkPropertiesPath,
+      @CliOption(key = {"useCommitTimeForDedupe"}, help = "Set it to true if duplicates have never been updated",
+        unspecifiedDefaultValue = "true") final boolean useCommitTimeForDedupe,
+      @CliOption(key = {"dryrun"}, help = "Should we actually add or just print what would be done",

Review comment:
       Done. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] pratyakshsharma commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-630299786


   @hddong shared the logs with you over slack. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#discussion_r429174118



##########
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/RepairsCommand.java
##########
@@ -77,7 +77,9 @@ public String deduplicate(
           help = "Spark executor memory") final String sparkMemory,
       @CliOption(key = {"dryrun"},
           help = "Should we actually remove duplicates or just run and store result to repairedOutputPath",
-          unspecifiedDefaultValue = "true") final boolean dryRun)
+          unspecifiedDefaultValue = "true") final boolean dryRun,
+      @CliOption(key = {"dedupeType"}, help = "Check DeDupeType.scala for valid values",
+          unspecifiedDefaultValue = "insertType") final String dedupeType)

Review comment:
       Done. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#discussion_r429176562



##########
File path: hudi-cli/src/main/scala/org/apache/hudi/cli/DeDupeType.scala
##########
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli
+
+object DeDupeType extends Enumeration {
+
+  type dedupeType = Value
+
+  val insertType = Value("insertType")
+  val updateType = Value("updateType")
+  val upsertType = Value("upsertType")

Review comment:
       Done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] hddong commented on a change in pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
hddong commented on a change in pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#discussion_r429251126



##########
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
##########
@@ -263,13 +265,26 @@ private static int compact(JavaSparkContext jsc, String basePath, String tableNa
   }
 
   private static int deduplicatePartitionPath(JavaSparkContext jsc, String duplicatedPartitionPath,
-      String repairedOutputPath, String basePath, String dryRun) {
+      String repairedOutputPath, String basePath, boolean dryRun, String dedupeType) {
     DedupeSparkJob job = new DedupeSparkJob(basePath, duplicatedPartitionPath, repairedOutputPath, new SQLContext(jsc),
-        FSUtils.getFs(basePath, jsc.hadoopConfiguration()));
-    job.fixDuplicates(Boolean.parseBoolean(dryRun));
+        FSUtils.getFs(basePath, jsc.hadoopConfiguration()), getDedupeType(dedupeType));
+    job.fixDuplicates(dryRun);
     return 0;
   }
 
+  private static Enumeration.Value getDedupeType(String type) {
+    switch (type) {
+      case "insertType":
+        return DeDupeType.insertType();
+      case "updateType":
+        return DeDupeType.updateType();
+      case "upsertType":
+        return DeDupeType.upsertType();
+      default:
+        throw new IllegalArgumentException("Please provide valid dedupe type!");
+    }
+  }
+

Review comment:
       @pratyakshsharma : I mean that we can use `DeDupeType.withName("insertType")` to convert `String` to `Enum`.  `getDedupeType` Function may not need here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#discussion_r429171990



##########
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
##########
@@ -263,13 +265,26 @@ private static int compact(JavaSparkContext jsc, String basePath, String tableNa
   }
 
   private static int deduplicatePartitionPath(JavaSparkContext jsc, String duplicatedPartitionPath,
-      String repairedOutputPath, String basePath, String dryRun) {
+      String repairedOutputPath, String basePath, boolean dryRun, String dedupeType) {
     DedupeSparkJob job = new DedupeSparkJob(basePath, duplicatedPartitionPath, repairedOutputPath, new SQLContext(jsc),
-        FSUtils.getFs(basePath, jsc.hadoopConfiguration()));
-    job.fixDuplicates(Boolean.parseBoolean(dryRun));
+        FSUtils.getFs(basePath, jsc.hadoopConfiguration()), getDedupeType(dedupeType));
+    job.fixDuplicates(dryRun);
     return 0;
   }
 
+  private static Enumeration.Value getDedupeType(String type) {
+    switch (type) {
+      case "insertType":
+        return DeDupeType.insertType();
+      case "updateType":
+        return DeDupeType.updateType();
+      case "upsertType":
+        return DeDupeType.upsertType();
+      default:
+        throw new IllegalArgumentException("Please provide valid dedupe type!");
+    }
+  }
+

Review comment:
       But what difference does it create? 
   
   `DeDupeType.insertType()` and `DeDupeType.withName("insertType")` - both return the same Value. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] pratyakshsharma commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-618977148


   @vinothchandar Please have a look. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] pratyakshsharma commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-630860816


   @yanghua @hddong Please take a pass. This is ready for review. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] yanghua commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
yanghua commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-629915198


   > @yanghua I am unable to run integration tests defined in hudi-cli package on my local. One of the tests from ITTestRepairsCommand is continuously failing in travis build. Need help here.
   
   @hddong Can you help to verify it?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] hddong commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
hddong commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-629921612


   @yanghua : Sure, I'll discuss with @pratyakshsharma to make it success.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] hddong commented on a change in pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
hddong commented on a change in pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#discussion_r428056297



##########
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
##########
@@ -263,13 +265,26 @@ private static int compact(JavaSparkContext jsc, String basePath, String tableNa
   }
 
   private static int deduplicatePartitionPath(JavaSparkContext jsc, String duplicatedPartitionPath,
-      String repairedOutputPath, String basePath, String dryRun) {
+      String repairedOutputPath, String basePath, boolean dryRun, String dedupeType) {
     DedupeSparkJob job = new DedupeSparkJob(basePath, duplicatedPartitionPath, repairedOutputPath, new SQLContext(jsc),
-        FSUtils.getFs(basePath, jsc.hadoopConfiguration()));
-    job.fixDuplicates(Boolean.parseBoolean(dryRun));
+        FSUtils.getFs(basePath, jsc.hadoopConfiguration()), getDedupeType(dedupeType));
+    job.fixDuplicates(dryRun);
     return 0;
   }
 
+  private static Enumeration.Value getDedupeType(String type) {
+    switch (type) {
+      case "insertType":
+        return DeDupeType.insertType();
+      case "updateType":
+        return DeDupeType.updateType();
+      case "upsertType":
+        return DeDupeType.upsertType();
+      default:
+        throw new IllegalArgumentException("Please provide valid dedupe type!");
+    }
+  }
+

Review comment:
       Can use `DeDupeType.withName("insertType")` instead.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] codecov-commenter edited a comment on pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-630856735


   # [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=h1) Report
   > Merging [#1558](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=desc) into [master](https://codecov.io/gh/apache/incubator-hudi/commit/a64afdfd17ac974e451bceb877f3d40a9c775253&el=desc) will **decrease** coverage by `53.41%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-hudi/pull/1558/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #1558       +/-   ##
   =============================================
   - Coverage     71.75%   18.33%   -53.42%     
   + Complexity     1089      855      -234     
   =============================================
     Files           385      344       -41     
     Lines         16599    15167     -1432     
     Branches       1668     1512      -156     
   =============================================
   - Hits          11910     2781     -9129     
   - Misses         3962    12033     +8071     
   + Partials        727      353      -374     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...n/java/org/apache/hudi/io/AppendHandleFactory.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vQXBwZW5kSGFuZGxlRmFjdG9yeS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/client/HoodieReadClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVJlYWRDbGllbnQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/metrics/MetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzUmVwb3J0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/common/model/ActionType.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0FjdGlvblR5cGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...java/org/apache/hudi/io/HoodieRangeInfoHandle.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllUmFuZ2VJbmZvSGFuZGxlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/hadoop/InputPathHandler.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0lucHV0UGF0aEhhbmRsZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...a/org/apache/hudi/exception/HoodieIOException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUlPRXhjZXB0aW9uLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...org/apache/hudi/table/action/commit/SmallFile.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9TbWFsbEZpbGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...rg/apache/hudi/index/bloom/KeyRangeLookupTree.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vS2V5UmFuZ2VMb29rdXBUcmVlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...g/apache/hudi/exception/HoodieInsertException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUluc2VydEV4Y2VwdGlvbi5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | ... and [307 more](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=footer). Last update [a64afdf...72f850b](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] pratyakshsharma commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-632793624


   Wondering why are the tests crashing today in between. This happened twice with me today. Is there any way to re-trigger travis build apart from re-pushing the code? @vinothchandar @yanghua 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org