You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/01/28 04:17:43 UTC
[GitHub] [hudi] satishkotha opened a new pull request #2502: [HUDI-1555] Remove isEmpty to improve clustering execution performance
satishkotha opened a new pull request #2502:
URL: https://github.com/apache/hudi/pull/2502
## What is the purpose of the pull request
Fix 2 issues found during large scale testing of clustering
## Brief change log
* writeStatusRDD.isEmpty() call is taking 15-25% empty and limiting parallelism for clustering.
* bug using 'int' during plan generation instead of long is limiting clustering max group size.
## Verify this pull request
This pull request is already covered by existing tests, such as.
1)TestHoodieClientOnCopyOnWriteStorage#testClusteringWithSortColumns, testSimpleClustering
2) TestHoodieMergeOnReadTable#testSimpleClusteringWithUpdates, testSimpleClusteringNoUpdates
I repeated this test on large scale dataset and verified this improves runtime of clustering operation by 20%
We can add more specific tests for measuring performance and validating clustering plan. I plan to take it up as separate item. This probably requires updating 0.7 release because performance improvement is substantial
## Committer checklist
- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] satishkotha commented on a change in pull request #2502: [HUDI-1555] Remove isEmpty to improve clustering execution performance
Posted by GitBox <gi...@apache.org>.
satishkotha commented on a change in pull request #2502:
URL: https://github.com/apache/hudi/pull/2502#discussion_r566378842
##########
File path: hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestStructuredStreaming.scala
##########
@@ -243,17 +243,24 @@ class TestStructuredStreaming extends HoodieClientTestBase {
val f2 = Future {
inputDF1.coalesce(1).write.mode(SaveMode.Append).json(sourcePath)
// wait for spark streaming to process one microbatch
- val currNumCommits = waitTillAtleastNCommits(fs, destPath, 1, 120, 5)
+ var currNumCommits = waitTillAtleastNCommits(fs, destPath, 1, 120, 5)
assertTrue(HoodieDataSourceHelpers.hasNewCommits(fs, destPath, "000"))
inputDF2.coalesce(1).write.mode(SaveMode.Append).json(sourcePath)
// wait for spark streaming to process second microbatch
- waitTillAtleastNCommits(fs, destPath, currNumCommits + 1, 120, 5)
- assertEquals(2, HoodieDataSourceHelpers.listCommitsSince(fs, destPath, "000").size())
-
- // check have more than one file group
- this.metaClient = new HoodieTableMetaClient(fs.getConf, destPath, true)
- assertTrue(getLatestFileGroupsFileId(partitionOfRecords).size > 1)
+ currNumCommits = waitTillAtleastNCommits(fs, destPath, currNumCommits + 1, 120, 5)
+ // for inline clustering, clustering may be complete along with 2nd commit
+ if (HoodieDataSourceHelpers.allCompletedCommitsCompactions(fs, destPath).getCompletedReplaceTimeline().countInstants() > 0) {
Review comment:
@lw309637554 It seems like this test has a race condition. PTAL and confirm my fix is reasonable.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] codecov-io edited a comment on pull request #2502: [HUDI-1555] Remove isEmpty to improve clustering execution performance
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2502:
URL: https://github.com/apache/hudi/pull/2502#issuecomment-769063418
# [Codecov](https://codecov.io/gh/apache/hudi/pull/2502?src=pr&el=h1) Report
> Merging [#2502](https://codecov.io/gh/apache/hudi/pull/2502?src=pr&el=desc) (f903c85) into [master](https://codecov.io/gh/apache/hudi/commit/c8ee40f8ae34607072a27d4e7ccb21fc4df13ca1?el=desc) (c8ee40f) will **decrease** coverage by `40.49%`.
> The diff coverage is `n/a`.
[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2502/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2502?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #2502 +/- ##
============================================
- Coverage 50.18% 9.68% -40.50%
+ Complexity 3051 48 -3003
============================================
Files 419 53 -366
Lines 18931 1930 -17001
Branches 1948 230 -1718
============================================
- Hits 9501 187 -9314
+ Misses 8656 1730 -6926
+ Partials 774 13 -761
```
| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `9.68% <ø> (-59.80%)` | `0.00 <ø> (ø)` | |
Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2502?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
| [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
| [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
| [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
| [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| ... and [395 more](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree-more) | |
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] satishkotha commented on a change in pull request #2502: [HUDI-1555] Remove isEmpty to improve clustering execution performance
Posted by GitBox <gi...@apache.org>.
satishkotha commented on a change in pull request #2502:
URL: https://github.com/apache/hudi/pull/2502#discussion_r565810597
##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/plan/strategy/SparkRecentDaysClusteringPlanStrategy.java
##########
@@ -71,20 +71,26 @@ public SparkRecentDaysClusteringPlanStrategy(HoodieSparkMergeOnReadTable<T> tabl
protected Stream<HoodieClusteringGroup> buildClusteringGroupsForPartition(String partitionPath, List<FileSlice> fileSlices) {
List<Pair<List<FileSlice>, Integer>> fileSliceGroups = new ArrayList<>();
List<FileSlice> currentGroup = new ArrayList<>();
- int totalSizeSoFar = 0;
+ long totalSizeSoFar = 0;
for (FileSlice currentSlice : fileSlices) {
// assume each filegroup size is ~= parquet.max.file.size
totalSizeSoFar += currentSlice.getBaseFile().isPresent() ? currentSlice.getBaseFile().get().getFileSize() : getWriteConfig().getParquetMaxFileSize();
// check if max size is reached and create new group, if needed.
if (totalSizeSoFar >= getWriteConfig().getClusteringMaxBytesInGroup() && !currentGroup.isEmpty()) {
- fileSliceGroups.add(Pair.of(currentGroup, getNumberOfOutputFileGroups(totalSizeSoFar, getWriteConfig().getClusteringTargetFileMaxBytes())));
+ int numOutputGroups = getNumberOfOutputFileGroups(totalSizeSoFar, getWriteConfig().getClusteringTargetFileMaxBytes());
Review comment:
this is just for logging improvement
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] satishkotha merged pull request #2502: [HUDI-1555] Remove isEmpty to improve clustering execution performance
Posted by GitBox <gi...@apache.org>.
satishkotha merged pull request #2502:
URL: https://github.com/apache/hudi/pull/2502
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] lw309637554 commented on a change in pull request #2502: [HUDI-1555] Remove isEmpty to improve clustering execution performance
Posted by GitBox <gi...@apache.org>.
lw309637554 commented on a change in pull request #2502:
URL: https://github.com/apache/hudi/pull/2502#discussion_r566695943
##########
File path: hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestStructuredStreaming.scala
##########
@@ -243,17 +243,24 @@ class TestStructuredStreaming extends HoodieClientTestBase {
val f2 = Future {
inputDF1.coalesce(1).write.mode(SaveMode.Append).json(sourcePath)
// wait for spark streaming to process one microbatch
- val currNumCommits = waitTillAtleastNCommits(fs, destPath, 1, 120, 5)
+ var currNumCommits = waitTillAtleastNCommits(fs, destPath, 1, 120, 5)
assertTrue(HoodieDataSourceHelpers.hasNewCommits(fs, destPath, "000"))
inputDF2.coalesce(1).write.mode(SaveMode.Append).json(sourcePath)
// wait for spark streaming to process second microbatch
- waitTillAtleastNCommits(fs, destPath, currNumCommits + 1, 120, 5)
- assertEquals(2, HoodieDataSourceHelpers.listCommitsSince(fs, destPath, "000").size())
-
- // check have more than one file group
- this.metaClient = new HoodieTableMetaClient(fs.getConf, destPath, true)
- assertTrue(getLatestFileGroupsFileId(partitionOfRecords).size > 1)
+ currNumCommits = waitTillAtleastNCommits(fs, destPath, currNumCommits + 1, 120, 5)
+ // for inline clustering, clustering may be complete along with 2nd commit
+ if (HoodieDataSourceHelpers.allCompletedCommitsCompactions(fs, destPath).getCompletedReplaceTimeline().countInstants() > 0) {
Review comment:
it's reasonable
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] lw309637554 commented on pull request #2502: [HUDI-1555] Remove isEmpty to improve clustering execution performance
Posted by GitBox <gi...@apache.org>.
lw309637554 commented on pull request #2502:
URL: https://github.com/apache/hudi/pull/2502#issuecomment-769697170
LGTM
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] codecov-io commented on pull request #2502: [HUDI-1555] Remove isEmpty to improve clustering execution performance
Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #2502:
URL: https://github.com/apache/hudi/pull/2502#issuecomment-769063418
# [Codecov](https://codecov.io/gh/apache/hudi/pull/2502?src=pr&el=h1) Report
> Merging [#2502](https://codecov.io/gh/apache/hudi/pull/2502?src=pr&el=desc) (434a7d7) into [master](https://codecov.io/gh/apache/hudi/commit/c8ee40f8ae34607072a27d4e7ccb21fc4df13ca1?el=desc) (c8ee40f) will **decrease** coverage by `40.49%`.
> The diff coverage is `n/a`.
[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2502/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2502?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #2502 +/- ##
============================================
- Coverage 50.18% 9.68% -40.50%
+ Complexity 3051 48 -3003
============================================
Files 419 53 -366
Lines 18931 1930 -17001
Branches 1948 230 -1718
============================================
- Hits 9501 187 -9314
+ Misses 8656 1730 -6926
+ Partials 774 13 -761
```
| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `9.68% <ø> (-59.80%)` | `0.00 <ø> (ø)` | |
Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2502?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
| [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
| [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
| [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
| [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| ... and [395 more](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree-more) | |
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] codecov-io edited a comment on pull request #2502: [HUDI-1555] Remove isEmpty to improve clustering execution performance
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2502:
URL: https://github.com/apache/hudi/pull/2502#issuecomment-769063418
# [Codecov](https://codecov.io/gh/apache/hudi/pull/2502?src=pr&el=h1) Report
> Merging [#2502](https://codecov.io/gh/apache/hudi/pull/2502?src=pr&el=desc) (f903c85) into [master](https://codecov.io/gh/apache/hudi/commit/c8ee40f8ae34607072a27d4e7ccb21fc4df13ca1?el=desc) (c8ee40f) will **increase** coverage by `0.08%`.
> The diff coverage is `0.00%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2502/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2502?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #2502 +/- ##
============================================
+ Coverage 50.18% 50.27% +0.08%
- Complexity 3051 3120 +69
============================================
Files 419 430 +11
Lines 18931 19565 +634
Branches 1948 2004 +56
============================================
+ Hits 9501 9836 +335
- Misses 8656 8925 +269
- Partials 774 804 +30
```
| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `37.21% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudicommon | `51.49% <ø> (+<0.01%)` | `0.00 <ø> (ø)` | |
| hudiflink | `33.03% <0.00%> (+33.03%)` | `0.00 <0.00> (ø)` | |
| hudihadoopmr | `33.16% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudisparkdatasource | `65.85% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudisync | `48.61% <ø> (ø)` | `0.00 <ø> (ø)` | |
| huditimelineservice | `66.49% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudiutilities | `69.48% <ø> (ø)` | `0.00 <ø> (ø)` | |
Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2502?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [.../apache/hudi/operator/InstantGenerateOperator.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9JbnN0YW50R2VuZXJhdGVPcGVyYXRvci5qYXZh) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
| [...i/common/model/OverwriteWithLatestAvroPayload.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL092ZXJ3cml0ZVdpdGhMYXRlc3RBdnJvUGF5bG9hZC5qYXZh) | `60.00% <0.00%> (-4.71%)` | `10.00% <0.00%> (ø%)` | |
| [...e/hudi/common/table/log/HoodieLogFormatWriter.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGb3JtYXRXcml0ZXIuamF2YQ==) | `78.12% <0.00%> (-1.57%)` | `26.00% <0.00%> (ø%)` | |
| [...ain/java/org/apache/hudi/avro/HoodieAvroUtils.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvYXZyby9Ib29kaWVBdnJvVXRpbHMuamF2YQ==) | `56.09% <0.00%> (ø)` | `38.00% <0.00%> (+1.00%)` | |
| [...main/java/org/apache/hudi/HoodieFlinkStreamer.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9Ib29kaWVGbGlua1N0cmVhbWVyLmphdmE=) | | | |
| [...ache/hudi/operator/StreamWriteOperatorFactory.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9TdHJlYW1Xcml0ZU9wZXJhdG9yRmFjdG9yeS5qYXZh) | `0.00% <0.00%> (ø)` | `0.00% <0.00%> (?%)` | |
| [...he/hudi/operator/event/BatchWriteSuccessEvent.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9ldmVudC9CYXRjaFdyaXRlU3VjY2Vzc0V2ZW50LmphdmE=) | `100.00% <0.00%> (ø)` | `4.00% <0.00%> (?%)` | |
| [.../org/apache/hudi/operator/StreamWriteFunction.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9TdHJlYW1Xcml0ZUZ1bmN0aW9uLmphdmE=) | `80.55% <0.00%> (ø)` | `14.00% <0.00%> (?%)` | |
| [.../org/apache/hudi/util/RowDataToAvroConverters.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS91dGlsL1Jvd0RhdGFUb0F2cm9Db252ZXJ0ZXJzLmphdmE=) | `42.05% <0.00%> (ø)` | `8.00% <0.00%> (?%)` | |
| [...java/org/apache/hudi/util/AvroSchemaConverter.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS91dGlsL0F2cm9TY2hlbWFDb252ZXJ0ZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | `0.00% <0.00%> (?%)` | |
| ... and [11 more](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree-more) | |
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] codecov-io edited a comment on pull request #2502: [HUDI-1555] Remove isEmpty to improve clustering execution performance
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2502:
URL: https://github.com/apache/hudi/pull/2502#issuecomment-769063418
# [Codecov](https://codecov.io/gh/apache/hudi/pull/2502?src=pr&el=h1) Report
> Merging [#2502](https://codecov.io/gh/apache/hudi/pull/2502?src=pr&el=desc) (434a7d7) into [master](https://codecov.io/gh/apache/hudi/commit/c8ee40f8ae34607072a27d4e7ccb21fc4df13ca1?el=desc) (c8ee40f) will **increase** coverage by `19.24%`.
> The diff coverage is `n/a`.
[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2502/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2502?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #2502 +/- ##
=============================================
+ Coverage 50.18% 69.43% +19.24%
+ Complexity 3051 357 -2694
=============================================
Files 419 53 -366
Lines 18931 1930 -17001
Branches 1948 230 -1718
=============================================
- Hits 9501 1340 -8161
+ Misses 8656 456 -8200
+ Partials 774 134 -640
```
| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `69.43% <ø> (-0.06%)` | `0.00 <ø> (ø)` | |
Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2502?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `70.50% <0.00%> (-0.36%)` | `50.00% <0.00%> (-1.00%)` | |
| [...apache/hudi/common/engine/TaskContextSupplier.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2VuZ2luZS9UYXNrQ29udGV4dFN1cHBsaWVyLmphdmE=) | | | |
| [...sioning/clean/CleanMetadataV1MigrationHandler.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL3ZlcnNpb25pbmcvY2xlYW4vQ2xlYW5NZXRhZGF0YVYxTWlncmF0aW9uSGFuZGxlci5qYXZh) | | | |
| [...e/hudi/exception/HoodieDeltaStreamerException.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZURlbHRhU3RyZWFtZXJFeGNlcHRpb24uamF2YQ==) | | | |
| [...e/hudi/common/util/collection/RocksDBBasedMap.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvY29sbGVjdGlvbi9Sb2Nrc0RCQmFzZWRNYXAuamF2YQ==) | | | |
| [...apache/hudi/common/util/ParquetReaderIterator.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvUGFycXVldFJlYWRlckl0ZXJhdG9yLmphdmE=) | | | |
| [...a/org/apache/hudi/common/bloom/InternalFilter.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2Jsb29tL0ludGVybmFsRmlsdGVyLmphdmE=) | | | |
| [.../org/apache/hudi/io/storage/HoodieHFileReader.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vc3RvcmFnZS9Ib29kaWVIRmlsZVJlYWRlci5qYXZh) | | | |
| [...ache/hudi/cli/commands/ArchivedCommitsCommand.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL0FyY2hpdmVkQ29tbWl0c0NvbW1hbmQuamF2YQ==) | | | |
| [...e/hudi/exception/HoodieRecordMissingException.java](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZVJlY29yZE1pc3NpbmdFeGNlcHRpb24uamF2YQ==) | | | |
| ... and [356 more](https://codecov.io/gh/apache/hudi/pull/2502/diff?src=pr&el=tree-more) | |
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] satishkotha commented on a change in pull request #2502: [HUDI-1555] Remove isEmpty to improve clustering execution performance
Posted by GitBox <gi...@apache.org>.
satishkotha commented on a change in pull request #2502:
URL: https://github.com/apache/hudi/pull/2502#discussion_r565810597
##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/plan/strategy/SparkRecentDaysClusteringPlanStrategy.java
##########
@@ -71,20 +71,26 @@ public SparkRecentDaysClusteringPlanStrategy(HoodieSparkMergeOnReadTable<T> tabl
protected Stream<HoodieClusteringGroup> buildClusteringGroupsForPartition(String partitionPath, List<FileSlice> fileSlices) {
List<Pair<List<FileSlice>, Integer>> fileSliceGroups = new ArrayList<>();
List<FileSlice> currentGroup = new ArrayList<>();
- int totalSizeSoFar = 0;
+ long totalSizeSoFar = 0;
for (FileSlice currentSlice : fileSlices) {
// assume each filegroup size is ~= parquet.max.file.size
totalSizeSoFar += currentSlice.getBaseFile().isPresent() ? currentSlice.getBaseFile().get().getFileSize() : getWriteConfig().getParquetMaxFileSize();
// check if max size is reached and create new group, if needed.
if (totalSizeSoFar >= getWriteConfig().getClusteringMaxBytesInGroup() && !currentGroup.isEmpty()) {
- fileSliceGroups.add(Pair.of(currentGroup, getNumberOfOutputFileGroups(totalSizeSoFar, getWriteConfig().getClusteringTargetFileMaxBytes())));
+ int numOutputGroups = getNumberOfOutputFileGroups(totalSizeSoFar, getWriteConfig().getClusteringTargetFileMaxBytes());
Review comment:
These are just logging improvements
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org