You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/05/08 01:43:34 UTC

[GitHub] [incubator-hudi] garyli1019 opened a new pull request #1602: HUDI-494 fix incorrect record size estimation

garyli1019 opened a new pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602


   ## What is the purpose of the pull request
   
   JIRA: https://issues.apache.org/jira/browse/HUDI-494
   
   This PR fixes an edge case: 
   the previous commit is very small but the bloom filter is large, this will cause the `totalWriteBytes` in the commit metadata not representing the record size. For example, commit1 only wrote 1 record but the parquet file is 20MB. This will lead to incorrect estimate record size and a very small number of [recordsToAppend](https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java#L155). Then Hudi will create many small files.
   
   ## Brief change log
   
     - changed `averageBytesPerRecord` function in two locations
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] codecov-commenter commented on pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-632410484


   # [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=h1) Report
   > Merging [#1602](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=desc) into [master](https://codecov.io/gh/apache/incubator-hudi/commit/802d16c8c9793156ef7fef0c59088040800fe025&el=desc) will **increase** coverage by `0.01%`.
   > The diff coverage is `50.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-hudi/pull/1602/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #1602      +/-   ##
   ============================================
   + Coverage     18.33%   18.35%   +0.01%     
     Complexity      855      855              
   ============================================
     Files           344      344              
     Lines         15167    15149      -18     
     Branches       1512     1509       -3     
   ============================================
   - Hits           2781     2780       -1     
   + Misses        12033    12016      -17     
     Partials        353      353              
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [.../org/apache/hudi/table/HoodieCopyOnWriteTable.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllQ29weU9uV3JpdGVUYWJsZS5qYXZh) | `7.14% <ø> (+1.39%)` | `4.00 <0.00> (ø)` | |
   | [...he/hudi/table/action/commit/UpsertPartitioner.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9VcHNlcnRQYXJ0aXRpb25lci5qYXZh) | `55.07% <50.00%> (-0.33%)` | `15.00 <1.00> (ø)` | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=footer). Last update [802d16c...df6e291](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] nsivabalan commented on pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-630126880


   I have started going thru the code base, but a naive question in the mean time. If we could log bytesOccupied by actual records separately and then bytes including bloom separately, it might solve the issue and it would be simpler isn't? 
   
   Also, can one of you point me to the code where this size is estimated(specifically where the bloom is getting added in the size). I am going through HoodieWriteStat and HoodieAppendHandle for how estimatedNumberOfBytesWritten is calculated and I don't see the bloom filter is getting added. am I looking at the wrong place?
   
    


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] vinothchandar commented on pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-629898227


   Let’s push this down the line for 0.6.0
   
   @garyli1019 we probably need an alternative strategy here that is more aggressive. But I see bloom filter as part of the per record overhead.. let me review your latest change..
   
   We need to introduce some sizing strategy abstractions in the code ultimately.. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] garyli1019 commented on pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
garyli1019 commented on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-628934924


   > commit1 only wrote 1 record but the parquet file is 20MB 
   
   @vinothchandar Sorry this example is bad... Let's say 8MB(2M entries) bloom filter + 200 records producing a 10MB parquet. If these 200 records are assigned to an existing partition, it will be very likely inserted into the existing file, so no problem. But if it goes to a new partition, then this small file is inevitable. 
   
   In the next run, we consider this 10MB file as a small file. We calculate `averageRecordSize = 10MB/200 = 50KB`, let's say we set max parquet size to 100MB, `(100MB - 10MB)/50KB = 1800` records to file this small file. For other files, each will be assigned 2000 records. 
   
   Yes, we can never get a pure record size. Even we deduct the bloom filter size in this case, with other metadata, we still have `(2MB/200) = 10KB`, which will produce a bit larger small files... 
   
   Another idea in my mind:
   
   - if the `totalBytesWritten` is less than the `DEFAULT_PARQUET_SMALL_FILE_LIMIT_BYTES`, then skip calculating size from this commit.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] garyli1019 commented on pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
garyli1019 commented on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-626385758


   Hello @bvaradar , I'd like to get your opinion about how to fix this issue, because if we change the way to calculate the record size, it will impact many places in the codebase. 
   The issue basically is:
   The `totalBytesWritten` in metadata included bloom filter, so `totalBytesWritten/totalRecordsWritten` will be off if the number of records is small but bloom filter is large.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#discussion_r428959265



##########
File path: hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
##########
@@ -70,7 +70,9 @@
 public class HoodieTestDataGenerator {
 
   // based on examination of sample file, the schema produces the following per record size
-  public static final int SIZE_PER_RECORD = 50 * 1024;
+  public static final int SIZE_PER_RECORD = (int) (1.2 * 1024);

Review comment:
       since you are at it, can we fix the naming. can you add suffix the units for size. whether its bits or bytes. 

##########
File path: hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
##########
@@ -70,7 +70,9 @@
 public class HoodieTestDataGenerator {
 
   // based on examination of sample file, the schema produces the following per record size
-  public static final int SIZE_PER_RECORD = 50 * 1024;
+  public static final int SIZE_PER_RECORD = (int) (1.2 * 1024);

Review comment:
       same for BLOOM_FILTER_SIZE




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] nsivabalan commented on pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-632389785


   @garyli1019 : thanks for clarifying. that was helpful. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] nsivabalan edited a comment on pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-630757894


   I was looking at HoodieAppendHandle. thanks for the link @garyli1019 . Now I get it. We could make (not sure if it makes sense), HoodieStorageWriter(bcoz, the storage writer is the one which has access to bloom) expose size occupied by Bloom, then we should be ok. Do you think it's doable? 
   
   ```
   long fileSizeInBytes = FSUtils.getFileSize(fs, newFilePath); 
   long bloomFilterSizeInBytes = storageWriter.getBloomFilterSize();
   stat.setTotalWriteBytes(fileSizeInBytes - bloomFilterSizeInBytes);
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] garyli1019 commented on pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
garyli1019 commented on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-628817124


   @vinothchandar I definitely agree a statistical table would be a better approach, but it will take a while I believe. I am happy to contribute to this topic as well. 
   Any other recommendation for a short term fix for this issue? I believe this bug could happen again. When something upstream goes wrong, like Kafka or HDFS goes down in the production for a short period of time, Hudi will have a chance to make an abnormal small commit.
   Regarding the bloom filter size, I think they all use the bloom filter entries and FP rate to calculate the size, for simple, dynamic, local, and global. Once we switch to the parquet native approach, we can change the way of the estimation. I think the calculation could be accurate. HBASE index is not covered in this PR though.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] codecov-io edited a comment on pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-628183877


   # [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=h1) Report
   > Merging [#1602](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=desc) into [master](https://codecov.io/gh/apache/incubator-hudi/commit/32bada29dc95f1d5910713ae6b4f4a4ef39677c9&el=desc) will **increase** coverage by `0.00%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-hudi/pull/1602/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff            @@
   ##             master    #1602   +/-   ##
   =========================================
     Coverage     71.76%   71.76%           
   - Complexity     1087     1090    +3     
   =========================================
     Files           385      385           
     Lines         16584    16585    +1     
     Branches       1668     1666    -2     
   =========================================
   + Hits          11902    11903    +1     
   + Misses         3954     3953    -1     
   - Partials        728      729    +1     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [.../org/apache/hudi/table/HoodieCopyOnWriteTable.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllQ29weU9uV3JpdGVUYWJsZS5qYXZh) | `57.14% <ø> (-4.93%)` | `4.00 <0.00> (ø)` | |
   | [...he/hudi/table/action/commit/UpsertPartitioner.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9VcHNlcnRQYXJ0aXRpb25lci5qYXZh) | `95.65% <100.00%> (+0.68%)` | `15.00 <1.00> (ø)` | |
   | [...ache/hudi/common/fs/inline/InMemoryFileSystem.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL2lubGluZS9Jbk1lbW9yeUZpbGVTeXN0ZW0uamF2YQ==) | `79.31% <0.00%> (-10.35%)` | `0.00% <0.00%> (ø%)` | |
   | [...i/utilities/keygen/TimestampBasedKeyGenerator.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2tleWdlbi9UaW1lc3RhbXBCYXNlZEtleUdlbmVyYXRvci5qYXZh) | `58.82% <0.00%> (-0.16%)` | `7.00% <0.00%> (+2.00%)` | :arrow_down: |
   | [...ain/java/org/apache/hudi/avro/HoodieAvroUtils.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvYXZyby9Ib29kaWVBdnJvVXRpbHMuamF2YQ==) | `80.95% <0.00%> (+1.12%)` | `22.00% <0.00%> (ø%)` | |
   | [...src/main/java/org/apache/hudi/metrics/Metrics.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzLmphdmE=) | `67.56% <0.00%> (+10.81%)` | `0.00% <0.00%> (ø%)` | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=footer). Last update [32bada2...88e2177](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] garyli1019 commented on pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
garyli1019 commented on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-632241626


   ![image](https://user-images.githubusercontent.com/23007841/82587156-85ee5e80-9b4d-11ea-839f-798633fdd1a4.png)
   Hi @vinothchandar , unfortunately this bug happened to me again. Do you think the current patch make sense?
   The only change was to make `totalBytesWritten > 0` to `totalBytesWritten > hoodieWriteConfig.getParquetSmallFileLimit()` so we can skip using small commit to do the estimation.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] garyli1019 commented on pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
garyli1019 commented on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-631067632


   @nsivabalan There are a few concerns about using the bloom filter size:
   
   - Even we deduct the size of the bloom filter, there will be other metadata and the `totalWriteBytes` is still not representing the total record size. When the situation we discussed above happens, it is possible that the small files will still be produced.
   - This will increase the complexity when we handle other indexing like HbaseIndexing.
   
   I think this estimation will work either we have enough samples or we have an accurate total record size. At least one file larger than a small file size limit could give us enough samples. With enough samples, the metadata size could be neglected. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] codecov-io edited a comment on pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-628183877


   # [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=h1) Report
   > Merging [#1602](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=desc) into [master](https://codecov.io/gh/apache/incubator-hudi/commit/32bada29dc95f1d5910713ae6b4f4a4ef39677c9&el=desc) will **increase** coverage by `0.07%`.
   > The diff coverage is `96.15%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-hudi/pull/1602/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #1602      +/-   ##
   ============================================
   + Coverage     71.76%   71.84%   +0.07%     
   + Complexity     1087     1085       -2     
   ============================================
     Files           385      385              
     Lines         16584    16592       +8     
     Branches       1668     1674       +6     
   ============================================
   + Hits          11902    11920      +18     
   + Misses         3954     3946       -8     
   + Partials        728      726       -2     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [.../org/apache/hudi/table/HoodieCopyOnWriteTable.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllQ29weU9uV3JpdGVUYWJsZS5qYXZh) | `57.14% <ø> (-4.93%)` | `4.00 <0.00> (ø)` | |
   | [...org/apache/hudi/common/bloom/BloomFilterUtils.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2Jsb29tL0Jsb29tRmlsdGVyVXRpbHMuamF2YQ==) | `75.00% <0.00%> (ø)` | `3.00 <0.00> (ø)` | |
   | [...he/hudi/table/action/commit/UpsertPartitioner.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9VcHNlcnRQYXJ0aXRpb25lci5qYXZh) | `96.07% <100.00%> (+1.11%)` | `16.00 <2.00> (+1.00)` | |
   | [...apache/hudi/common/model/HoodieCommitMetadata.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZUNvbW1pdE1ldGFkYXRhLmphdmE=) | `57.65% <100.00%> (+2.61%)` | `27.00 <0.00> (-3.00)` | :arrow_up: |
   | [...g/apache/hudi/metrics/InMemoryMetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9Jbk1lbW9yeU1ldHJpY3NSZXBvcnRlci5qYXZh) | `40.00% <0.00%> (-40.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...ain/java/org/apache/hudi/avro/HoodieAvroUtils.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvYXZyby9Ib29kaWVBdnJvVXRpbHMuamF2YQ==) | `80.95% <0.00%> (+1.12%)` | `22.00% <0.00%> (ø%)` | |
   | [...java/org/apache/hudi/config/HoodieIndexConfig.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZUluZGV4Q29uZmlnLmphdmE=) | `69.84% <0.00%> (+6.34%)` | `3.00% <0.00%> (ø%)` | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=footer). Last update [32bada2...2f8cf11](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] vinothchandar commented on pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-628753707


   with dynamic bloom filter, we are going to bound the bloom filter size anyway.. And finally, parquet may have bloom filter support on its own.. I am not comfortable introducing complexity to solve this edge case..  
   
   It feels better to invest the energy to building a better statistically sound approach overall, than patch this,.. WDYT? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] nsivabalan commented on pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-630757894


   I was looking at HoodieAppendHandle. thanks for the link @garyli1019 . But looks like The MergeHandle does have info on totalRecordswritten, totalInserted, totalUpdated, totalDeleted. So, if we could make (not sure if it makes sense), HoodieStorageWriter expose size occupied by Bloom, then we should be ok. Do you think it's doable? 
   
   ```
   long fileSizeInBytes = FSUtils.getFileSize(fs, newFilePath); 
   long bloomFilterSizeInBytes = storageWriter.getBloomFilterSize();
   stat.setTotalWriteBytes(fileSizeInBytes - bloomFilterSizeInBytes);
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] vinothchandar commented on pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-628910318


   @garyli1019 thinking about it, even today without the bloom filters, the parquet size include additional stats and metadata contained internally.. So, it's never going to be pure record size, right? 
   
   >commit1 only wrote 1 record but the parquet file is 20MB
   This feels like a misconfigured bloom filter.. 20MB bloom filter is just too much.. Like I mentioned, the dynamic bloom filter approach has a capping on the footer size and hopefully something like this does not happen.. Also a future write will overlook at 20MB as small file, only if you explicitly lowered the default (100MB) to less than 20MB right?  Overall, I am saying - this configuration seems to be asking for trouble :P 
   
   cc @nsivabalan who implemented the dynamic bloom filters.. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] codecov-io edited a comment on pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-628183877


   # [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=h1) Report
   > Merging [#1602](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=desc) into [master](https://codecov.io/gh/apache/incubator-hudi/commit/32bada29dc95f1d5910713ae6b4f4a4ef39677c9&el=desc) will **decrease** coverage by `0.03%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-hudi/pull/1602/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #1602      +/-   ##
   ============================================
   - Coverage     71.76%   71.73%   -0.04%     
   - Complexity     1087     1090       +3     
   ============================================
     Files           385      385              
     Lines         16584    16585       +1     
     Branches       1668     1666       -2     
   ============================================
   - Hits          11902    11897       -5     
   - Misses         3954     3960       +6     
     Partials        728      728              
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [.../org/apache/hudi/table/HoodieCopyOnWriteTable.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllQ29weU9uV3JpdGVUYWJsZS5qYXZh) | `57.14% <ø> (-4.93%)` | `4.00 <0.00> (ø)` | |
   | [...he/hudi/table/action/commit/UpsertPartitioner.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9VcHNlcnRQYXJ0aXRpb25lci5qYXZh) | `95.65% <100.00%> (+0.68%)` | `15.00 <1.00> (ø)` | |
   | [...g/apache/hudi/metrics/InMemoryMetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9Jbk1lbW9yeU1ldHJpY3NSZXBvcnRlci5qYXZh) | `40.00% <0.00%> (-40.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...ache/hudi/common/fs/inline/InMemoryFileSystem.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL2lubGluZS9Jbk1lbW9yeUZpbGVTeXN0ZW0uamF2YQ==) | `79.31% <0.00%> (-10.35%)` | `0.00% <0.00%> (ø%)` | |
   | [...i/utilities/keygen/TimestampBasedKeyGenerator.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2tleWdlbi9UaW1lc3RhbXBCYXNlZEtleUdlbmVyYXRvci5qYXZh) | `58.82% <0.00%> (-0.16%)` | `7.00% <0.00%> (+2.00%)` | :arrow_down: |
   | [...ain/java/org/apache/hudi/avro/HoodieAvroUtils.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvYXZyby9Ib29kaWVBdnJvVXRpbHMuamF2YQ==) | `80.95% <0.00%> (+1.12%)` | `22.00% <0.00%> (ø%)` | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=footer). Last update [32bada2...88e2177](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] nsivabalan edited a comment on pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-630757894


   I was looking at HoodieAppendHandle. thanks for the link @garyli1019 . Now I get it. We could make (not sure if it makes sense), HoodieStorageWriter(bcoz, the storage writer is the one which has access to bloom) expose size occupied by Bloom, then we should be ok. Do you think it's doable? 
   
   ```
   long bloomFilterSizeInBytes = storageWriter.getBloomFilterSize();
   stat.setBloomFilterSizeInBytes(fileSizeInBytes - bloomFilterSizeInBytes);
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] codecov-commenter edited a comment on pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-632410484


   # [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=h1) Report
   > Merging [#1602](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=desc) into [master](https://codecov.io/gh/apache/incubator-hudi/commit/802d16c8c9793156ef7fef0c59088040800fe025&el=desc) will **increase** coverage by `0.01%`.
   > The diff coverage is `50.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-hudi/pull/1602/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #1602      +/-   ##
   ============================================
   + Coverage     18.33%   18.35%   +0.01%     
     Complexity      855      855              
   ============================================
     Files           344      344              
     Lines         15167    15149      -18     
     Branches       1512     1509       -3     
   ============================================
   - Hits           2781     2780       -1     
   + Misses        12033    12016      -17     
     Partials        353      353              
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [.../org/apache/hudi/table/HoodieCopyOnWriteTable.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllQ29weU9uV3JpdGVUYWJsZS5qYXZh) | `7.14% <ø> (+1.39%)` | `4.00 <0.00> (ø)` | |
   | [...he/hudi/table/action/commit/UpsertPartitioner.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9VcHNlcnRQYXJ0aXRpb25lci5qYXZh) | `55.07% <50.00%> (-0.33%)` | `15.00 <1.00> (ø)` | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=footer). Last update [802d16c...df6e291](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] vinothchandar edited a comment on pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
vinothchandar edited a comment on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-628910318


   @garyli1019 thinking about it, even today without the bloom filters, the parquet size include additional stats and metadata contained internally.. So, it's never going to be pure record size, right? 
   
   >commit1 only wrote 1 record but the parquet file is 20MB
   
   This feels like a misconfigured bloom filter.. 20MB bloom filter is just too much.. Like I mentioned, the dynamic bloom filter approach has a capping on the footer size and hopefully something like this does not happen.. Also a future write will overlook at 20MB as small file, only if you explicitly lowered the default (100MB) to less than 20MB right?  Overall, I am saying - this configuration seems to be asking for trouble :P 
   
   cc @nsivabalan who implemented the dynamic bloom filters.. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] garyli1019 commented on pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
garyli1019 commented on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-629005508


   switched to `totalBytesWritten > hoodieWriteConfig.getParquetSmallFileLimit()`. I think this way would have minimal impact and handle this bug.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] garyli1019 commented on pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
garyli1019 commented on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-630364085


   Hi @nsivabalan 
   > If we could log bytesOccupied by actual records separately and then bytes including bloom separately, it might solve the issue and it would be simpler isn't?
   
   Yes, at this point we don't log `bytesOccupied` anywhere. As Vinoth mentioned, we can build a separate statistical table in the long term. Or we can add another column to store this.
   
   In HoodieMergeHandle, the `fileSizeInBytes` was calculated by getting the file size directly, so we don't know how many bytes are from the bloom filter. 
   https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java#L284
   
   If you are looking for how the bloom filter was calculated:
   https://github.com/apache/incubator-hudi/blob/fa36082554373dd4dce3e3d3159ab87300a4601d/hudi-client/src/main/java/org/apache/hudi/io/storage/HoodieStorageWriterFactory.java#L57
   
   https://github.com/apache/incubator-hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/bloom/BloomFilterFactory.java#L36


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] codecov-io edited a comment on pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-628183877


   # [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=h1) Report
   > Merging [#1602](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=desc) into [master](https://codecov.io/gh/apache/incubator-hudi/commit/32bada29dc95f1d5910713ae6b4f4a4ef39677c9&el=desc) will **increase** coverage by `0.07%`.
   > The diff coverage is `96.15%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-hudi/pull/1602/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #1602      +/-   ##
   ============================================
   + Coverage     71.76%   71.84%   +0.07%     
   + Complexity     1087     1085       -2     
   ============================================
     Files           385      385              
     Lines         16584    16592       +8     
     Branches       1668     1674       +6     
   ============================================
   + Hits          11902    11920      +18     
   + Misses         3954     3946       -8     
   + Partials        728      726       -2     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [.../org/apache/hudi/table/HoodieCopyOnWriteTable.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllQ29weU9uV3JpdGVUYWJsZS5qYXZh) | `57.14% <ø> (-4.93%)` | `4.00 <0.00> (ø)` | |
   | [...org/apache/hudi/common/bloom/BloomFilterUtils.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2Jsb29tL0Jsb29tRmlsdGVyVXRpbHMuamF2YQ==) | `75.00% <0.00%> (ø)` | `3.00 <0.00> (ø)` | |
   | [...he/hudi/table/action/commit/UpsertPartitioner.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9VcHNlcnRQYXJ0aXRpb25lci5qYXZh) | `96.07% <100.00%> (+1.11%)` | `16.00 <2.00> (+1.00)` | |
   | [...apache/hudi/common/model/HoodieCommitMetadata.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZUNvbW1pdE1ldGFkYXRhLmphdmE=) | `57.65% <100.00%> (+2.61%)` | `27.00 <0.00> (-3.00)` | :arrow_up: |
   | [...g/apache/hudi/metrics/InMemoryMetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9Jbk1lbW9yeU1ldHJpY3NSZXBvcnRlci5qYXZh) | `40.00% <0.00%> (-40.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...ain/java/org/apache/hudi/avro/HoodieAvroUtils.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvYXZyby9Ib29kaWVBdnJvVXRpbHMuamF2YQ==) | `80.95% <0.00%> (+1.12%)` | `22.00% <0.00%> (ø%)` | |
   | [...java/org/apache/hudi/config/HoodieIndexConfig.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZUluZGV4Q29uZmlnLmphdmE=) | `69.84% <0.00%> (+6.34%)` | `3.00% <0.00%> (ø%)` | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=footer). Last update [32bada2...2f8cf11](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] garyli1019 commented on a change in pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
garyli1019 commented on a change in pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#discussion_r428971840



##########
File path: hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
##########
@@ -70,7 +70,9 @@
 public class HoodieTestDataGenerator {
 
   // based on examination of sample file, the schema produces the following per record size
-  public static final int SIZE_PER_RECORD = 50 * 1024;
+  public static final int SIZE_PER_RECORD = (int) (1.2 * 1024);

Review comment:
       thanks for reviewing. comments addressed. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] nsivabalan commented on pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-629884272


   @vinothchandar : are we proceeding with the patch. I haven't started looking at it yet, but if we plan to get it into 0.5.3, we need to get this resolved asap. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] garyli1019 commented on pull request #1602: [HUDI-494] fix incorrect record size estimation

Posted by GitBox <gi...@apache.org>.
garyli1019 commented on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-629887094


   We can push this to 0.6.0 if you guys prefer to have more discussions. If there is anything I can help please let me know.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org