You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/04/24 12:19:21 UTC
[GitHub] [incubator-hudi] pratyakshsharma opened a new pull request #1558: [HUDI-796]: added deduping logic for upserts case
pratyakshsharma opened a new pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558
## *Tips*
- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contributing.html before opening a pull request.*
## What is the purpose of the pull request
Added deduping logic for upserts case.
## Brief change log
1. Added logic in DedupeSparkJob for upserts case
2. Made dryRun and useCommitTimeForDedupe parameters configurable in RepairsCommand.
## Verify this pull request
*(Please pick either of the following options)*
This pull request is a trivial rework / code cleanup without any test coverage.
*(or)*
This pull request is already covered by existing tests, such as *(please describe tests)*.
(or)
This change added tests and can be verified as follows:
*(example:)*
- *Added integration tests for end-to-end.*
- *Added HoodieClientWriteTest to verify the change.*
- *Manually verified the change by running a job locally.*
## Committer checklist
- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] pratyakshsharma commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-629688282
@yanghua I am unable to run integration tests defined in hudi-cli package on my local. One of the tests from ITTestRepairsCommand is continuously failing in travis build. Need help here.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] codecov-commenter edited a comment on pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-630856735
# [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=h1) Report
> Merging [#1558](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=desc) into [master](https://codecov.io/gh/apache/incubator-hudi/commit/a64afdfd17ac974e451bceb877f3d40a9c775253&el=desc) will **decrease** coverage by `55.15%`.
> The diff coverage is `n/a`.
[![Impacted file tree graph](https://codecov.io/gh/apache/incubator-hudi/pull/1558/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #1558 +/- ##
=============================================
- Coverage 71.75% 16.59% -55.16%
+ Complexity 1089 799 -290
=============================================
Files 385 344 -41
Lines 16599 15172 -1427
Branches 1668 1512 -156
=============================================
- Hits 11910 2518 -9392
- Misses 3962 12322 +8360
+ Partials 727 332 -395
```
| [Impacted Files](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...n/java/org/apache/hudi/io/AppendHandleFactory.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vQXBwZW5kSGFuZGxlRmFjdG9yeS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [.../java/org/apache/hudi/client/HoodieReadClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVJlYWRDbGllbnQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [.../java/org/apache/hudi/metrics/MetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzUmVwb3J0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [.../java/org/apache/hudi/common/model/ActionType.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0FjdGlvblR5cGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...java/org/apache/hudi/io/HoodieRangeInfoHandle.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllUmFuZ2VJbmZvSGFuZGxlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [.../java/org/apache/hudi/hadoop/InputPathHandler.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0lucHV0UGF0aEhhbmRsZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...a/org/apache/hudi/exception/HoodieIOException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUlPRXhjZXB0aW9uLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...org/apache/hudi/table/action/commit/SmallFile.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9TbWFsbEZpbGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...rg/apache/hudi/index/bloom/KeyRangeLookupTree.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vS2V5UmFuZ2VMb29rdXBUcmVlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...g/apache/hudi/exception/HoodieInsertException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUluc2VydEV4Y2VwdGlvbi5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| ... and [309 more](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree-more) | |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=footer). Last update [a64afdf...5dcdd41](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] codecov-io edited a comment on pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-629699444
# [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=h1) Report
> Merging [#1558](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=desc) into [master](https://codecov.io/gh/apache/incubator-hudi/commit/a64afdfd17ac974e451bceb877f3d40a9c775253&el=desc) will **decrease** coverage by `55.03%`.
> The diff coverage is `n/a`.
[![Impacted file tree graph](https://codecov.io/gh/apache/incubator-hudi/pull/1558/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #1558 +/- ##
=============================================
- Coverage 71.75% 16.71% -55.04%
+ Complexity 1089 795 -294
=============================================
Files 385 340 -45
Lines 16599 15030 -1569
Branches 1668 1499 -169
=============================================
- Hits 11910 2512 -9398
- Misses 3962 12188 +8226
+ Partials 727 330 -397
```
| [Impacted Files](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...n/java/org/apache/hudi/io/AppendHandleFactory.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vQXBwZW5kSGFuZGxlRmFjdG9yeS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [.../java/org/apache/hudi/client/HoodieReadClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVJlYWRDbGllbnQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [.../java/org/apache/hudi/metrics/MetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzUmVwb3J0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [.../java/org/apache/hudi/common/model/ActionType.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0FjdGlvblR5cGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...java/org/apache/hudi/io/HoodieRangeInfoHandle.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllUmFuZ2VJbmZvSGFuZGxlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [.../java/org/apache/hudi/hadoop/InputPathHandler.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0lucHV0UGF0aEhhbmRsZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...a/org/apache/hudi/exception/HoodieIOException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUlPRXhjZXB0aW9uLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...org/apache/hudi/table/action/commit/SmallFile.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9TbWFsbEZpbGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...rg/apache/hudi/index/bloom/KeyRangeLookupTree.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vS2V5UmFuZ2VMb29rdXBUcmVlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...g/apache/hudi/exception/HoodieInsertException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUluc2VydEV4Y2VwdGlvbi5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| ... and [305 more](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree-more) | |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=footer). Last update [a64afdf...838382a](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] yanghua commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
yanghua commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-628985363
> I will let @yanghua see this home
OK, and @pratyakshsharma first of all, please fix all the conflicting files.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] pratyakshsharma commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-629700671
Wondering why is this codecov report all red. Why is it into account files which are not modified in this PR?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] yanghua commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
yanghua commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-629263967
@pratyakshsharma still conflicting files
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] vinothchandar commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-632813623
you can hop onto travis and click restart.. CI has been pretty stable for a while now.. Do you know what the error is
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#discussion_r416469107
##########
File path: hudi-cli/src/main/scala/org/apache/hudi/cli/DedupeSparkJob.scala
##########
@@ -103,24 +105,51 @@ class DedupeSparkJob(basePath: String,
// Mark all files except the one with latest commits for deletion
dupeMap.foreach(rt => {
val (key, rows) = rt
- var maxCommit = -1L
-
- rows.foreach(r => {
- val c = r(3).asInstanceOf[String].toLong
- if (c > maxCommit)
- maxCommit = c
- })
-
- rows.foreach(r => {
- val c = r(3).asInstanceOf[String].toLong
- if (c != maxCommit) {
- val f = r(2).asInstanceOf[String].split("_")(0)
- if (!fileToDeleteKeyMap.contains(f)) {
- fileToDeleteKeyMap(f) = HashSet[String]()
+
+ if (useCommitTimeForDedupe) {
+ /*
+ This corresponds to the case where duplicates got created due to INSERT and have never been updated.
+ */
+ var maxCommit = -1L
+
+ rows.foreach(r => {
+ val c = r(3).asInstanceOf[String].toLong
+ if (c > maxCommit)
+ maxCommit = c
+ })
+ rows.foreach(r => {
+ val c = r(3).asInstanceOf[String].toLong
+ if (c != maxCommit) {
+ val f = r(2).asInstanceOf[String].split("_")(0)
+ if (!fileToDeleteKeyMap.contains(f)) {
+ fileToDeleteKeyMap(f) = HashSet[String]()
+ }
+ fileToDeleteKeyMap(f).add(key)
}
- fileToDeleteKeyMap(f).add(key)
+ })
+ } else {
+ /*
+ This corresponds to the case where duplicates have been updated at least once.
+ Once updated, duplicates are bound to have same commit time unless forcefully modified.
+ */
+ val size = rows.size - 1
+ var i = 0
+ val loop = new Breaks
+ loop.breakable {
Review comment:
Right. Thank you for suggesting this, I am not hands on at scala properly. :)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] codecov-commenter commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-630856735
# [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=h1) Report
> Merging [#1558](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=desc) into [master](https://codecov.io/gh/apache/incubator-hudi/commit/a64afdfd17ac974e451bceb877f3d40a9c775253&el=desc) will **decrease** coverage by `55.15%`.
> The diff coverage is `n/a`.
[![Impacted file tree graph](https://codecov.io/gh/apache/incubator-hudi/pull/1558/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #1558 +/- ##
=============================================
- Coverage 71.75% 16.59% -55.16%
+ Complexity 1089 799 -290
=============================================
Files 385 344 -41
Lines 16599 15172 -1427
Branches 1668 1512 -156
=============================================
- Hits 11910 2518 -9392
- Misses 3962 12322 +8360
+ Partials 727 332 -395
```
| [Impacted Files](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...n/java/org/apache/hudi/io/AppendHandleFactory.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vQXBwZW5kSGFuZGxlRmFjdG9yeS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [.../java/org/apache/hudi/client/HoodieReadClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVJlYWRDbGllbnQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [.../java/org/apache/hudi/metrics/MetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzUmVwb3J0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [.../java/org/apache/hudi/common/model/ActionType.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0FjdGlvblR5cGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...java/org/apache/hudi/io/HoodieRangeInfoHandle.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllUmFuZ2VJbmZvSGFuZGxlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [.../java/org/apache/hudi/hadoop/InputPathHandler.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0lucHV0UGF0aEhhbmRsZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...a/org/apache/hudi/exception/HoodieIOException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUlPRXhjZXB0aW9uLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...org/apache/hudi/table/action/commit/SmallFile.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9TbWFsbEZpbGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...rg/apache/hudi/index/bloom/KeyRangeLookupTree.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vS2V5UmFuZ2VMb29rdXBUcmVlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...g/apache/hudi/exception/HoodieInsertException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUluc2VydEV4Y2VwdGlvbi5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| ... and [309 more](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree-more) | |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=footer). Last update [a64afdf...5dcdd41](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] pratyakshsharma commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-632858646
> Do you know what the error is
There is some error while forking as below ->
[ERROR] Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
[ERROR] The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
[ERROR] Command was /bin/sh -c cd /home/travis/build/apache/incubator-hudi/hudi-client && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xmx2g -jar /home/travis/build/apache/incubator-hudi/hudi-client/target/surefire/surefirebooter4462331339368597651.jar /home/travis/build/apache/incubator-hudi/hudi-client/target/surefire 2020-05-22T14-44-50_350-jvmRun1 surefire3441656546543475644tmp surefire_33480398663035356346tmp
[ERROR] Error occurred in starting fork, check output in log
[ERROR] Process Exit Code: 134
[ERROR] Crashed tests:
[ERROR] org.apache.hudi.client.TestTableSchemaEvolution
[ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
[ERROR] Command was /bin/sh -c cd /home/travis/build/apache/incubator-hudi/hudi-client && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xmx2g -jar /home/travis/build/apache/incubator-hudi/hudi-client/target/surefire/surefirebooter4462331339368597651.jar /home/travis/build/apache/incubator-hudi/hudi-client/target/surefire 2020-05-22T14-44-50_350-jvmRun1 surefire3441656546543475644tmp surefire_33480398663035356346tmp
[ERROR] Error occurred in starting fork, check output in log
[ERROR] Process Exit Code: 134
@vinothchandar
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] hddong commented on a change in pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
hddong commented on a change in pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#discussion_r416299956
##########
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/RepairsCommand.java
##########
@@ -64,11 +64,15 @@ public String deduplicate(
@CliOption(key = {"repairedOutputPath"}, help = "Location to place the repaired files",
mandatory = true) final String repairedOutputPath,
@CliOption(key = {"sparkProperties"}, help = "Spark Properties File Path",
- mandatory = true) final String sparkPropertiesPath)
+ mandatory = true) final String sparkPropertiesPath,
+ @CliOption(key = {"useCommitTimeForDedupe"}, help = "Set it to true if duplicates have never been updated",
+ unspecifiedDefaultValue = "true") final boolean useCommitTimeForDedupe,
+ @CliOption(key = {"dryrun"}, help = "Should we actually add or just print what would be done",
Review comment:
Let us modify help string of dryrun, statements are inaccurate :)
##########
File path: hudi-cli/src/main/scala/org/apache/hudi/cli/DedupeSparkJob.scala
##########
@@ -103,24 +105,51 @@ class DedupeSparkJob(basePath: String,
// Mark all files except the one with latest commits for deletion
dupeMap.foreach(rt => {
val (key, rows) = rt
- var maxCommit = -1L
-
- rows.foreach(r => {
- val c = r(3).asInstanceOf[String].toLong
- if (c > maxCommit)
- maxCommit = c
- })
-
- rows.foreach(r => {
- val c = r(3).asInstanceOf[String].toLong
- if (c != maxCommit) {
- val f = r(2).asInstanceOf[String].split("_")(0)
- if (!fileToDeleteKeyMap.contains(f)) {
- fileToDeleteKeyMap(f) = HashSet[String]()
+
+ if (useCommitTimeForDedupe) {
+ /*
+ This corresponds to the case where duplicates got created due to INSERT and have never been updated.
+ */
+ var maxCommit = -1L
+
+ rows.foreach(r => {
+ val c = r(3).asInstanceOf[String].toLong
+ if (c > maxCommit)
+ maxCommit = c
+ })
+ rows.foreach(r => {
+ val c = r(3).asInstanceOf[String].toLong
+ if (c != maxCommit) {
+ val f = r(2).asInstanceOf[String].split("_")(0)
+ if (!fileToDeleteKeyMap.contains(f)) {
+ fileToDeleteKeyMap(f) = HashSet[String]()
+ }
+ fileToDeleteKeyMap(f).add(key)
}
- fileToDeleteKeyMap(f).add(key)
+ })
+ } else {
+ /*
+ This corresponds to the case where duplicates have been updated at least once.
+ Once updated, duplicates are bound to have same commit time unless forcefully modified.
+ */
+ val size = rows.size - 1
+ var i = 0
+ val loop = new Breaks
+ loop.breakable {
Review comment:
It's better not use break here, `rows.init` also can get the rows will be delete.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] hddong commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
hddong commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-630131586
@prashantwason : Had send you message in slack, can I hava a look of your log?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] hddong commented on a change in pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
hddong commented on a change in pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#discussion_r428056297
##########
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
##########
@@ -263,13 +265,26 @@ private static int compact(JavaSparkContext jsc, String basePath, String tableNa
}
private static int deduplicatePartitionPath(JavaSparkContext jsc, String duplicatedPartitionPath,
- String repairedOutputPath, String basePath, String dryRun) {
+ String repairedOutputPath, String basePath, boolean dryRun, String dedupeType) {
DedupeSparkJob job = new DedupeSparkJob(basePath, duplicatedPartitionPath, repairedOutputPath, new SQLContext(jsc),
- FSUtils.getFs(basePath, jsc.hadoopConfiguration()));
- job.fixDuplicates(Boolean.parseBoolean(dryRun));
+ FSUtils.getFs(basePath, jsc.hadoopConfiguration()), getDedupeType(dedupeType));
+ job.fixDuplicates(dryRun);
return 0;
}
+ private static Enumeration.Value getDedupeType(String type) {
+ switch (type) {
+ case "insertType":
+ return DeDupeType.insertType();
+ case "updateType":
+ return DeDupeType.updateType();
+ case "upsertType":
+ return DeDupeType.upsertType();
+ default:
+ throw new IllegalArgumentException("Please provide valid dedupe type!");
+ }
+ }
+
Review comment:
Can use `DeDupeType.withName("insertType")` instead.
##########
File path: hudi-cli/src/main/scala/org/apache/hudi/cli/DeDupeType.scala
##########
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli
+
+object DeDupeType extends Enumeration {
+
+ type dedupeType = Value
+
+ val insertType = Value("insertType")
+ val updateType = Value("updateType")
+ val upsertType = Value("upsertType")
Review comment:
Can we make it all uppercase to keep the format uniform
https://github.com/apache/incubator-hudi/blob/74ecc27e920c70fa4598d8e5a696954203a5b127/hudi-common/src/main/java/org/apache/hudi/common/model/WriteOperationType.java#L30-L34
##########
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
##########
@@ -263,13 +265,26 @@ private static int compact(JavaSparkContext jsc, String basePath, String tableNa
}
private static int deduplicatePartitionPath(JavaSparkContext jsc, String duplicatedPartitionPath,
- String repairedOutputPath, String basePath, String dryRun) {
+ String repairedOutputPath, String basePath, boolean dryRun, String dedupeType) {
DedupeSparkJob job = new DedupeSparkJob(basePath, duplicatedPartitionPath, repairedOutputPath, new SQLContext(jsc),
- FSUtils.getFs(basePath, jsc.hadoopConfiguration()));
- job.fixDuplicates(Boolean.parseBoolean(dryRun));
+ FSUtils.getFs(basePath, jsc.hadoopConfiguration()), getDedupeType(dedupeType));
+ job.fixDuplicates(dryRun);
return 0;
}
+ private static Enumeration.Value getDedupeType(String type) {
+ switch (type) {
+ case "insertType":
+ return DeDupeType.insertType();
+ case "updateType":
+ return DeDupeType.updateType();
+ case "upsertType":
+ return DeDupeType.upsertType();
+ default:
+ throw new IllegalArgumentException("Please provide valid dedupe type!");
+ }
+ }
+
Review comment:
Can use `DeDupeType.withName("insertType")` instead?
##########
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/RepairsCommand.java
##########
@@ -77,7 +77,9 @@ public String deduplicate(
help = "Spark executor memory") final String sparkMemory,
@CliOption(key = {"dryrun"},
help = "Should we actually remove duplicates or just run and store result to repairedOutputPath",
- unspecifiedDefaultValue = "true") final boolean dryRun)
+ unspecifiedDefaultValue = "true") final boolean dryRun,
+ @CliOption(key = {"dedupeType"}, help = "Check DeDupeType.scala for valid values",
+ unspecifiedDefaultValue = "insertType") final String dedupeType)
Review comment:
It's better to show the three types in help string and have a type check at first line of command.
##########
File path: hudi-cli/src/main/scala/org/apache/hudi/cli/DedupeSparkJob.scala
##########
@@ -98,34 +97,92 @@ class DedupeSparkJob(basePath: String,
ON h.`_hoodie_record_key` = d.dupe_key
"""
val dupeMap = sqlContext.sql(dupeDataSql).collectAsList().groupBy(r => r.getString(0))
- val fileToDeleteKeyMap = new HashMap[String, HashSet[String]]()
+ getDedupePlan(dupeMap)
+ }
- // Mark all files except the one with latest commits for deletion
+ private def getDedupePlan(dupeMap: Map[String, Buffer[Row]]): HashMap[String, HashSet[String]] = {
+ val fileToDeleteKeyMap = new HashMap[String, HashSet[String]]()
dupeMap.foreach(rt => {
val (key, rows) = rt
- var maxCommit = -1L
-
- rows.foreach(r => {
- val c = r(3).asInstanceOf[String].toLong
- if (c > maxCommit)
- maxCommit = c
- })
-
- rows.foreach(r => {
- val c = r(3).asInstanceOf[String].toLong
- if (c != maxCommit) {
- val f = r(2).asInstanceOf[String].split("_")(0)
- if (!fileToDeleteKeyMap.contains(f)) {
- fileToDeleteKeyMap(f) = HashSet[String]()
- }
- fileToDeleteKeyMap(f).add(key)
- }
- })
+
+ dedupeType match {
+ case DeDupeType.updateType =>
+ /*
+ This corresponds to the case where all duplicates have been updated at least once.
+ Once updated, duplicates are bound to have same commit time unless forcefully modified.
+ */
+ rows.init.foreach(r => {
+ val f = r(2).asInstanceOf[String].split("_")(0)
+ if (!fileToDeleteKeyMap.contains(f)) {
+ fileToDeleteKeyMap(f) = HashSet[String]()
+ }
+ fileToDeleteKeyMap(f).add(key)
+ })
+ case DeDupeType.insertType =>
+ /*
+ This corresponds to the case where duplicates got created due to INSERT and have never been updated.
+ */
+ var maxCommit = -1L
+
+ rows.foreach(r => {
+ val c = r(3).asInstanceOf[String].toLong
+ if (c > maxCommit)
+ maxCommit = c
+ })
+ rows.foreach(r => {
+ val c = r(3).asInstanceOf[String].toLong
+ if (c != maxCommit) {
+ val f = r(2).asInstanceOf[String].split("_")(0)
+ if (!fileToDeleteKeyMap.contains(f)) {
+ fileToDeleteKeyMap(f) = HashSet[String]()
+ }
+ fileToDeleteKeyMap(f).add(key)
+ }
+ })
+
+ case DeDupeType.upsertType =>
+ /*
+ This corresponds to the case where duplicates got created as a result of inserts as well as updates,
+ i.e few duplicate records have been updated, while others were never updated.
+ */
+ var maxCommit = -1L
+
+ rows.foreach(r => {
+ val c = r(3).asInstanceOf[String].toLong
+ if (c > maxCommit)
+ maxCommit = c
+ })
+ val rowsWithMaxCommit = new ListBuffer[Row]()
+ rows.foreach(r => {
+ val c = r(3).asInstanceOf[String].toLong
+ if (c != maxCommit) {
+ val f = r(2).asInstanceOf[String].split("_")(0)
+ if (!fileToDeleteKeyMap.contains(f)) {
+ fileToDeleteKeyMap(f) = HashSet[String]()
+ }
+ fileToDeleteKeyMap(f).add(key)
+ } else {
+ rowsWithMaxCommit += r
+ }
+ })
+
+ rowsWithMaxCommit.toList.init.foreach(r => {
+ val f = r(2).asInstanceOf[String].split("_")(0)
+ if (!fileToDeleteKeyMap.contains(f)) {
+ fileToDeleteKeyMap(f) = HashSet[String]()
+ }
+ fileToDeleteKeyMap(f).add(key)
+ })
+
+ case _ => throw new IllegalArgumentException("Please provide valid type for deduping!")
+ }
})
+ LOG.debug("fileToDeleteKeyMap size : " + fileToDeleteKeyMap.size + ", map: " + fileToDeleteKeyMap)
Review comment:
Can we use `$` to get value? like:
https://github.com/apache/incubator-hudi/blob/74ecc27e920c70fa4598d8e5a696954203a5b127/hudi-cli/src/main/scala/org/apache/hudi/cli/DedupeSparkJob.scala#L144
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] vinothchandar commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-620226161
cc @yanghua @hddong are you able to review this?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] pratyakshsharma commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-621253677
> @pratyakshsharma : Can we do two kinds of de-duplicate together? First, de-duplicate by different commits, then de-duplicate same commit, they are not incompatible, we can remove all the duplicate data at once.
Yes it is possible to do both types of deduplicate together. But I feel in most of the cases, only one of the 2 possible cases (de-duplicate by different commits or de-duplicate same commit) would be needed for doing deduping. In essence, that would mean one of the 2 flows would be running unnecessarily. That was the sole reason I introduced the boolean useCommitTimeForDedupe.
So I prefer to have the code flow in its current form. If you strongly feel, I would do the changes that you are suggesting. Please let me know your thoughts on this. :)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#discussion_r429177537
##########
File path: hudi-cli/src/main/scala/org/apache/hudi/cli/DedupeSparkJob.scala
##########
@@ -98,34 +97,92 @@ class DedupeSparkJob(basePath: String,
ON h.`_hoodie_record_key` = d.dupe_key
"""
val dupeMap = sqlContext.sql(dupeDataSql).collectAsList().groupBy(r => r.getString(0))
- val fileToDeleteKeyMap = new HashMap[String, HashSet[String]]()
+ getDedupePlan(dupeMap)
+ }
- // Mark all files except the one with latest commits for deletion
+ private def getDedupePlan(dupeMap: Map[String, Buffer[Row]]): HashMap[String, HashSet[String]] = {
+ val fileToDeleteKeyMap = new HashMap[String, HashSet[String]]()
dupeMap.foreach(rt => {
val (key, rows) = rt
- var maxCommit = -1L
-
- rows.foreach(r => {
- val c = r(3).asInstanceOf[String].toLong
- if (c > maxCommit)
- maxCommit = c
- })
-
- rows.foreach(r => {
- val c = r(3).asInstanceOf[String].toLong
- if (c != maxCommit) {
- val f = r(2).asInstanceOf[String].split("_")(0)
- if (!fileToDeleteKeyMap.contains(f)) {
- fileToDeleteKeyMap(f) = HashSet[String]()
- }
- fileToDeleteKeyMap(f).add(key)
- }
- })
+
+ dedupeType match {
+ case DeDupeType.updateType =>
+ /*
+ This corresponds to the case where all duplicates have been updated at least once.
+ Once updated, duplicates are bound to have same commit time unless forcefully modified.
+ */
+ rows.init.foreach(r => {
+ val f = r(2).asInstanceOf[String].split("_")(0)
+ if (!fileToDeleteKeyMap.contains(f)) {
+ fileToDeleteKeyMap(f) = HashSet[String]()
+ }
+ fileToDeleteKeyMap(f).add(key)
+ })
+ case DeDupeType.insertType =>
+ /*
+ This corresponds to the case where duplicates got created due to INSERT and have never been updated.
+ */
+ var maxCommit = -1L
+
+ rows.foreach(r => {
+ val c = r(3).asInstanceOf[String].toLong
+ if (c > maxCommit)
+ maxCommit = c
+ })
+ rows.foreach(r => {
+ val c = r(3).asInstanceOf[String].toLong
+ if (c != maxCommit) {
+ val f = r(2).asInstanceOf[String].split("_")(0)
+ if (!fileToDeleteKeyMap.contains(f)) {
+ fileToDeleteKeyMap(f) = HashSet[String]()
+ }
+ fileToDeleteKeyMap(f).add(key)
+ }
+ })
+
+ case DeDupeType.upsertType =>
+ /*
+ This corresponds to the case where duplicates got created as a result of inserts as well as updates,
+ i.e few duplicate records have been updated, while others were never updated.
+ */
+ var maxCommit = -1L
+
+ rows.foreach(r => {
+ val c = r(3).asInstanceOf[String].toLong
+ if (c > maxCommit)
+ maxCommit = c
+ })
+ val rowsWithMaxCommit = new ListBuffer[Row]()
+ rows.foreach(r => {
+ val c = r(3).asInstanceOf[String].toLong
+ if (c != maxCommit) {
+ val f = r(2).asInstanceOf[String].split("_")(0)
+ if (!fileToDeleteKeyMap.contains(f)) {
+ fileToDeleteKeyMap(f) = HashSet[String]()
+ }
+ fileToDeleteKeyMap(f).add(key)
+ } else {
+ rowsWithMaxCommit += r
+ }
+ })
+
+ rowsWithMaxCommit.toList.init.foreach(r => {
+ val f = r(2).asInstanceOf[String].split("_")(0)
+ if (!fileToDeleteKeyMap.contains(f)) {
+ fileToDeleteKeyMap(f) = HashSet[String]()
+ }
+ fileToDeleteKeyMap(f).add(key)
+ })
+
+ case _ => throw new IllegalArgumentException("Please provide valid type for deduping!")
+ }
})
+ LOG.debug("fileToDeleteKeyMap size : " + fileToDeleteKeyMap.size + ", map: " + fileToDeleteKeyMap)
Review comment:
Done.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] codecov-commenter edited a comment on pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-630856735
# [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=h1) Report
> Merging [#1558](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=desc) into [master](https://codecov.io/gh/apache/incubator-hudi/commit/a64afdfd17ac974e451bceb877f3d40a9c775253&el=desc) will **decrease** coverage by `53.41%`.
> The diff coverage is `n/a`.
[![Impacted file tree graph](https://codecov.io/gh/apache/incubator-hudi/pull/1558/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #1558 +/- ##
=============================================
- Coverage 71.75% 18.33% -53.42%
+ Complexity 1089 855 -234
=============================================
Files 385 344 -41
Lines 16599 15167 -1432
Branches 1668 1512 -156
=============================================
- Hits 11910 2781 -9129
- Misses 3962 12033 +8071
+ Partials 727 353 -374
```
| [Impacted Files](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...n/java/org/apache/hudi/io/AppendHandleFactory.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vQXBwZW5kSGFuZGxlRmFjdG9yeS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [.../java/org/apache/hudi/client/HoodieReadClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVJlYWRDbGllbnQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [.../java/org/apache/hudi/metrics/MetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzUmVwb3J0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [.../java/org/apache/hudi/common/model/ActionType.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0FjdGlvblR5cGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...java/org/apache/hudi/io/HoodieRangeInfoHandle.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllUmFuZ2VJbmZvSGFuZGxlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [.../java/org/apache/hudi/hadoop/InputPathHandler.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0lucHV0UGF0aEhhbmRsZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...a/org/apache/hudi/exception/HoodieIOException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUlPRXhjZXB0aW9uLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...org/apache/hudi/table/action/commit/SmallFile.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9TbWFsbEZpbGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...rg/apache/hudi/index/bloom/KeyRangeLookupTree.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vS2V5UmFuZ2VMb29rdXBUcmVlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...g/apache/hudi/exception/HoodieInsertException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUluc2VydEV4Y2VwdGlvbi5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| ... and [307 more](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree-more) | |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=footer). Last update [a64afdf...72f850b](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] codecov-io commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-629699444
# [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=h1) Report
> Merging [#1558](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=desc) into [master](https://codecov.io/gh/apache/incubator-hudi/commit/a64afdfd17ac974e451bceb877f3d40a9c775253&el=desc) will **decrease** coverage by `55.03%`.
> The diff coverage is `n/a`.
[![Impacted file tree graph](https://codecov.io/gh/apache/incubator-hudi/pull/1558/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #1558 +/- ##
=============================================
- Coverage 71.75% 16.71% -55.04%
+ Complexity 1089 795 -294
=============================================
Files 385 340 -45
Lines 16599 15030 -1569
Branches 1668 1499 -169
=============================================
- Hits 11910 2512 -9398
- Misses 3962 12188 +8226
+ Partials 727 330 -397
```
| [Impacted Files](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...n/java/org/apache/hudi/io/AppendHandleFactory.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vQXBwZW5kSGFuZGxlRmFjdG9yeS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [.../java/org/apache/hudi/client/HoodieReadClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVJlYWRDbGllbnQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [.../java/org/apache/hudi/metrics/MetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzUmVwb3J0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [.../java/org/apache/hudi/common/model/ActionType.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0FjdGlvblR5cGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...java/org/apache/hudi/io/HoodieRangeInfoHandle.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllUmFuZ2VJbmZvSGFuZGxlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [.../java/org/apache/hudi/hadoop/InputPathHandler.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0lucHV0UGF0aEhhbmRsZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...a/org/apache/hudi/exception/HoodieIOException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUlPRXhjZXB0aW9uLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...org/apache/hudi/table/action/commit/SmallFile.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9TbWFsbEZpbGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...rg/apache/hudi/index/bloom/KeyRangeLookupTree.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vS2V5UmFuZ2VMb29rdXBUcmVlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...g/apache/hudi/exception/HoodieInsertException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUluc2VydEV4Y2VwdGlvbi5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| ... and [305 more](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree-more) | |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=footer). Last update [a64afdf...838382a](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] vinothchandar commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-628761300
I will let @yanghua see this home
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#discussion_r416462293
##########
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/RepairsCommand.java
##########
@@ -64,11 +64,15 @@ public String deduplicate(
@CliOption(key = {"repairedOutputPath"}, help = "Location to place the repaired files",
mandatory = true) final String repairedOutputPath,
@CliOption(key = {"sparkProperties"}, help = "Spark Properties File Path",
- mandatory = true) final String sparkPropertiesPath)
+ mandatory = true) final String sparkPropertiesPath,
+ @CliOption(key = {"useCommitTimeForDedupe"}, help = "Set it to true if duplicates have never been updated",
+ unspecifiedDefaultValue = "true") final boolean useCommitTimeForDedupe,
+ @CliOption(key = {"dryrun"}, help = "Should we actually add or just print what would be done",
Review comment:
Done.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] pratyakshsharma commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-630299786
@hddong shared the logs with you over slack.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#discussion_r429174118
##########
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/RepairsCommand.java
##########
@@ -77,7 +77,9 @@ public String deduplicate(
help = "Spark executor memory") final String sparkMemory,
@CliOption(key = {"dryrun"},
help = "Should we actually remove duplicates or just run and store result to repairedOutputPath",
- unspecifiedDefaultValue = "true") final boolean dryRun)
+ unspecifiedDefaultValue = "true") final boolean dryRun,
+ @CliOption(key = {"dedupeType"}, help = "Check DeDupeType.scala for valid values",
+ unspecifiedDefaultValue = "insertType") final String dedupeType)
Review comment:
Done.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#discussion_r429176562
##########
File path: hudi-cli/src/main/scala/org/apache/hudi/cli/DeDupeType.scala
##########
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli
+
+object DeDupeType extends Enumeration {
+
+ type dedupeType = Value
+
+ val insertType = Value("insertType")
+ val updateType = Value("updateType")
+ val upsertType = Value("upsertType")
Review comment:
Done
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] hddong commented on a change in pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
hddong commented on a change in pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#discussion_r429251126
##########
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
##########
@@ -263,13 +265,26 @@ private static int compact(JavaSparkContext jsc, String basePath, String tableNa
}
private static int deduplicatePartitionPath(JavaSparkContext jsc, String duplicatedPartitionPath,
- String repairedOutputPath, String basePath, String dryRun) {
+ String repairedOutputPath, String basePath, boolean dryRun, String dedupeType) {
DedupeSparkJob job = new DedupeSparkJob(basePath, duplicatedPartitionPath, repairedOutputPath, new SQLContext(jsc),
- FSUtils.getFs(basePath, jsc.hadoopConfiguration()));
- job.fixDuplicates(Boolean.parseBoolean(dryRun));
+ FSUtils.getFs(basePath, jsc.hadoopConfiguration()), getDedupeType(dedupeType));
+ job.fixDuplicates(dryRun);
return 0;
}
+ private static Enumeration.Value getDedupeType(String type) {
+ switch (type) {
+ case "insertType":
+ return DeDupeType.insertType();
+ case "updateType":
+ return DeDupeType.updateType();
+ case "upsertType":
+ return DeDupeType.upsertType();
+ default:
+ throw new IllegalArgumentException("Please provide valid dedupe type!");
+ }
+ }
+
Review comment:
@pratyakshsharma : I mean that we can use `DeDupeType.withName("insertType")` to convert `String` to `Enum`. `getDedupeType` Function may not need here.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#discussion_r429171990
##########
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
##########
@@ -263,13 +265,26 @@ private static int compact(JavaSparkContext jsc, String basePath, String tableNa
}
private static int deduplicatePartitionPath(JavaSparkContext jsc, String duplicatedPartitionPath,
- String repairedOutputPath, String basePath, String dryRun) {
+ String repairedOutputPath, String basePath, boolean dryRun, String dedupeType) {
DedupeSparkJob job = new DedupeSparkJob(basePath, duplicatedPartitionPath, repairedOutputPath, new SQLContext(jsc),
- FSUtils.getFs(basePath, jsc.hadoopConfiguration()));
- job.fixDuplicates(Boolean.parseBoolean(dryRun));
+ FSUtils.getFs(basePath, jsc.hadoopConfiguration()), getDedupeType(dedupeType));
+ job.fixDuplicates(dryRun);
return 0;
}
+ private static Enumeration.Value getDedupeType(String type) {
+ switch (type) {
+ case "insertType":
+ return DeDupeType.insertType();
+ case "updateType":
+ return DeDupeType.updateType();
+ case "upsertType":
+ return DeDupeType.upsertType();
+ default:
+ throw new IllegalArgumentException("Please provide valid dedupe type!");
+ }
+ }
+
Review comment:
But what difference does it create?
`DeDupeType.insertType()` and `DeDupeType.withName("insertType")` - both return the same Value.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] pratyakshsharma commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-618977148
@vinothchandar Please have a look.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] pratyakshsharma commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-630860816
@yanghua @hddong Please take a pass. This is ready for review.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] yanghua commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
yanghua commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-629915198
> @yanghua I am unable to run integration tests defined in hudi-cli package on my local. One of the tests from ITTestRepairsCommand is continuously failing in travis build. Need help here.
@hddong Can you help to verify it?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] hddong commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
hddong commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-629921612
@yanghua : Sure, I'll discuss with @pratyakshsharma to make it success.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] hddong commented on a change in pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
hddong commented on a change in pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#discussion_r428056297
##########
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
##########
@@ -263,13 +265,26 @@ private static int compact(JavaSparkContext jsc, String basePath, String tableNa
}
private static int deduplicatePartitionPath(JavaSparkContext jsc, String duplicatedPartitionPath,
- String repairedOutputPath, String basePath, String dryRun) {
+ String repairedOutputPath, String basePath, boolean dryRun, String dedupeType) {
DedupeSparkJob job = new DedupeSparkJob(basePath, duplicatedPartitionPath, repairedOutputPath, new SQLContext(jsc),
- FSUtils.getFs(basePath, jsc.hadoopConfiguration()));
- job.fixDuplicates(Boolean.parseBoolean(dryRun));
+ FSUtils.getFs(basePath, jsc.hadoopConfiguration()), getDedupeType(dedupeType));
+ job.fixDuplicates(dryRun);
return 0;
}
+ private static Enumeration.Value getDedupeType(String type) {
+ switch (type) {
+ case "insertType":
+ return DeDupeType.insertType();
+ case "updateType":
+ return DeDupeType.updateType();
+ case "upsertType":
+ return DeDupeType.upsertType();
+ default:
+ throw new IllegalArgumentException("Please provide valid dedupe type!");
+ }
+ }
+
Review comment:
Can use `DeDupeType.withName("insertType")` instead.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] codecov-commenter edited a comment on pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-630856735
# [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=h1) Report
> Merging [#1558](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=desc) into [master](https://codecov.io/gh/apache/incubator-hudi/commit/a64afdfd17ac974e451bceb877f3d40a9c775253&el=desc) will **decrease** coverage by `53.41%`.
> The diff coverage is `n/a`.
[![Impacted file tree graph](https://codecov.io/gh/apache/incubator-hudi/pull/1558/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #1558 +/- ##
=============================================
- Coverage 71.75% 18.33% -53.42%
+ Complexity 1089 855 -234
=============================================
Files 385 344 -41
Lines 16599 15167 -1432
Branches 1668 1512 -156
=============================================
- Hits 11910 2781 -9129
- Misses 3962 12033 +8071
+ Partials 727 353 -374
```
| [Impacted Files](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...n/java/org/apache/hudi/io/AppendHandleFactory.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vQXBwZW5kSGFuZGxlRmFjdG9yeS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [.../java/org/apache/hudi/client/HoodieReadClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVJlYWRDbGllbnQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [.../java/org/apache/hudi/metrics/MetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzUmVwb3J0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [.../java/org/apache/hudi/common/model/ActionType.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0FjdGlvblR5cGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...java/org/apache/hudi/io/HoodieRangeInfoHandle.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllUmFuZ2VJbmZvSGFuZGxlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [.../java/org/apache/hudi/hadoop/InputPathHandler.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0lucHV0UGF0aEhhbmRsZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...a/org/apache/hudi/exception/HoodieIOException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUlPRXhjZXB0aW9uLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...org/apache/hudi/table/action/commit/SmallFile.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9TbWFsbEZpbGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...rg/apache/hudi/index/bloom/KeyRangeLookupTree.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vS2V5UmFuZ2VMb29rdXBUcmVlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| [...g/apache/hudi/exception/HoodieInsertException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUluc2VydEV4Y2VwdGlvbi5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
| ... and [307 more](https://codecov.io/gh/apache/incubator-hudi/pull/1558/diff?src=pr&el=tree-more) | |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=footer). Last update [a64afdf...72f850b](https://codecov.io/gh/apache/incubator-hudi/pull/1558?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-hudi] pratyakshsharma commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-632793624
Wondering why are the tests crashing today in between. This happened twice with me today. Is there any way to re-trigger travis build apart from re-pushing the code? @vinothchandar @yanghua
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org