You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2019/12/20 09:43:14 UTC

[GitHub] [incubator-hudi] cdmikechen opened a new pull request #1119: Fix: HoodieCommitMetadata only show first commit insert rows.

cdmikechen opened a new pull request #1119: Fix: HoodieCommitMetadata only show first commit insert rows.
URL: https://github.com/apache/incubator-hudi/pull/1119
 
 
   When I run hudi cli to get insert rows, I found that hudi cli can not get insert rows if it is not in first commit time. I found that `HoodieCommitMetadata.fetchTotalInsertRecordsWritten()` method use ` stat.getPrevCommit().equalsIgnoreCase("null")` to filter first commit. So I think this check option should be removed。
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a pull request.*
   
   ## What is the purpose of the pull request
   It was only possible to determine the number of rows to insert for the first commit before. Currently, to fixe this problem.
   
   ## Brief change log
   
    hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java | 4 +++-
    1 file changed, 3 insertions(+), 1 deletion(-)
   
   ## Verify this pull request
   
   have run TestHoodieCommitMetadata.java to test.
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [x] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1119: [HUDI-469] Fix: HoodieCommitMetadata only show first commit insert rows.

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #1119: [HUDI-469] Fix: HoodieCommitMetadata only show first commit insert rows.
URL: https://github.com/apache/incubator-hudi/pull/1119#discussion_r365047872
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java
 ##########
 @@ -175,7 +175,8 @@ public long fetchTotalInsertRecordsWritten() {
     long totalInsertRecordsWritten = 0;
     for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
       for (HoodieWriteStat stat : stats) {
-        if (stat.getPrevCommit() != null && stat.getPrevCommit().equalsIgnoreCase("null")) {
+        // It was only possible to determine the number of rows to insert for the first commit before. Currently, this problem is fixed
 
 Review comment:
   nit : no need to explain "Currently, this problem is fixed" :) We can just comment saying "determine the number of rows to insert for the first commit". Once you do that, we can merge this!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] cdmikechen commented on a change in pull request #1119: Fix: HoodieCommitMetadata only show first commit insert rows.

Posted by GitBox <gi...@apache.org>.
cdmikechen commented on a change in pull request #1119: Fix: HoodieCommitMetadata only show first commit insert rows.
URL: https://github.com/apache/incubator-hudi/pull/1119#discussion_r361044029
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java
 ##########
 @@ -175,7 +175,9 @@ public long fetchTotalInsertRecordsWritten() {
     long totalInsertRecordsWritten = 0;
     for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
       for (HoodieWriteStat stat : stats) {
-        if (stat.getPrevCommit() != null && stat.getPrevCommit().equalsIgnoreCase("null")) {
 
 Review comment:
   @n3nash
   I think the processing logic of this method is the same as that of `fetchTotalUpdateRecordsWritten `
   ```
     public long fetchTotalUpdateRecordsWritten() {
       long totalUpdateRecordsWritten = 0;
       for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
         for (HoodieWriteStat stat : stats) {
           totalUpdateRecordsWritten += stat.getNumUpdateWrites();
         }
       }
       return totalUpdateRecordsWritten;
     }
   ```
   Let me give you an example in my test. I create a hudi data with 3 rows first, and then insert a row, at last insert a row and delete a row. I print all method return number and test the number to check.
   ```java
   for (HoodieInstant commit : commits) {
       HoodieCommitMetadata commitMetadata = HoodieCommitMetadata.fromBytes(timeline.getInstantDetails(commit).get(),
               HoodieCommitMetadata.class);
   
       JSONObject data = new JSONObject();
       data.put("timestamp", commit.getTimestamp());
       data.put("fetchTotalBytesWritten", commitMetadata.fetchTotalBytesWritten());
       data.put("fetchTotalFilesInsert", commitMetadata.fetchTotalFilesInsert());
       data.put("fetchTotalFilesUpdated", commitMetadata.fetchTotalFilesUpdated());
       data.put("fetchTotalPartitionsWritten", commitMetadata.fetchTotalPartitionsWritten());
       data.put("fetchTotalRecordsWritten", commitMetadata.fetchTotalRecordsWritten());
       data.put("fetchTotalUpdateRecordsWritten", commitMetadata.fetchTotalUpdateRecordsWritten());
       data.put("fetchTotalInsertRecordsWritten", commitMetadata.fetchTotalInsertRecordsWritten());
       data.put("fetchTotalWriteErrors", commitMetadata.fetchTotalWriteErrors());
   
       long totalDeleteRecordsWritten= 0;
       for (List<HoodieWriteStat> stats : commitMetadata.getPartitionToWriteStats().values()) {
           for (HoodieWriteStat stat : stats) {
               if (stat.getPrevCommit() != null) {
                   totalDeleteRecordsWritten += stat.getNumDeletes();
               }
           }
       }
       data.put("fetchTotalDeleteRecordsWritten", totalDeleteRecordsWritten);
   
       datas.add(data);
   }
   log.debug("datas = {}", datas);
   ```
   If I use old if `(stat.getPrevCommit() != null && stat.getPrevCommit().equalsIgnoreCase("null"))` it will print:
   ```json
   [
   	{
   		"fetchTotalBytesWritten": 435258,
   		"fetchTotalDeleteRecordsWritten": 1,
   		"fetchTotalFilesInsert": 0,
   		"fetchTotalFilesUpdated": 1,
   		"fetchTotalInsertRecordsWritten": 0,
   		"fetchTotalPartitionsWritten": 1,
   		"fetchTotalRecordsWritten": 4,
   		"fetchTotalUpdateRecordsWritten": 0,
   		"fetchTotalWriteErrors": 0,
   		"timestamp": "20190923172508"
   	},
   	{
   		"fetchTotalBytesWritten": 435227,
   		"fetchTotalDeleteRecordsWritten": 0,
   		"fetchTotalFilesInsert": 0,
   		"fetchTotalFilesUpdated": 1,
   		"fetchTotalInsertRecordsWritten": 0,
   		"fetchTotalPartitionsWritten": 1,
   		"fetchTotalRecordsWritten": 4,
   		"fetchTotalUpdateRecordsWritten": 0,
   		"fetchTotalWriteErrors": 0,
   		"timestamp": "20190923160439"
   	},
   	{
   		"fetchTotalBytesWritten": 435478,
   		"fetchTotalDeleteRecordsWritten": 0,
   		"fetchTotalFilesInsert": 1,
   		"fetchTotalFilesUpdated": 0,
   		"fetchTotalInsertRecordsWritten": 3,
   		"fetchTotalPartitionsWritten": 1,
   		"fetchTotalRecordsWritten": 3,
   		"fetchTotalUpdateRecordsWritten": 0,
   		"fetchTotalWriteErrors": 0,
   		"timestamp": "20190923160340"
   	}
   ]
   ```
   If I use modified if `(stat.getPrevCommit() != null)` it will print:
   ```json
   [
   	{
   		"fetchTotalBytesWritten": 435258,
   		"fetchTotalDeleteRecordsWritten": 1,
   		"fetchTotalFilesInsert": 0,
   		"fetchTotalFilesUpdated": 1,
   		"fetchTotalInsertRecordsWritten": 1,
   		"fetchTotalPartitionsWritten": 1,
   		"fetchTotalRecordsWritten": 4,
   		"fetchTotalUpdateRecordsWritten": 0,
   		"fetchTotalWriteErrors": 0,
   		"timestamp": "20190923172508"
   	},
   	{
   		"fetchTotalBytesWritten": 435227,
   		"fetchTotalDeleteRecordsWritten": 0,
   		"fetchTotalFilesInsert": 0,
   		"fetchTotalFilesUpdated": 1,
   		"fetchTotalInsertRecordsWritten": 1,
   		"fetchTotalPartitionsWritten": 1,
   		"fetchTotalRecordsWritten": 4,
   		"fetchTotalUpdateRecordsWritten": 0,
   		"fetchTotalWriteErrors": 0,
   		"timestamp": "20190923160439"
   	},
   	{
   		"fetchTotalBytesWritten": 435478,
   		"fetchTotalDeleteRecordsWritten": 0,
   		"fetchTotalFilesInsert": 1,
   		"fetchTotalFilesUpdated": 0,
   		"fetchTotalInsertRecordsWritten": 3,
   		"fetchTotalPartitionsWritten": 1,
   		"fetchTotalRecordsWritten": 3,
   		"fetchTotalUpdateRecordsWritten": 0,
   		"fetchTotalWriteErrors": 0,
   		"timestamp": "20190923160340"
   	}
   ]
   ```
   I guess the person who originally wrote this method copied the `fetchTotalFilesInsert` method at that time, but he forgot that the processing logic of adding files is different from that adding data in files.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] cdmikechen commented on a change in pull request #1119: [HUDI-469] Fix: HoodieCommitMetadata only show first commit insert rows.

Posted by GitBox <gi...@apache.org>.
cdmikechen commented on a change in pull request #1119: [HUDI-469] Fix: HoodieCommitMetadata only show first commit insert rows.
URL: https://github.com/apache/incubator-hudi/pull/1119#discussion_r365119041
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java
 ##########
 @@ -175,7 +175,8 @@ public long fetchTotalInsertRecordsWritten() {
     long totalInsertRecordsWritten = 0;
     for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
       for (HoodieWriteStat stat : stats) {
-        if (stat.getPrevCommit() != null && stat.getPrevCommit().equalsIgnoreCase("null")) {
+        // It was only possible to determine the number of rows to insert for the first commit before. Currently, this problem is fixed
 
 Review comment:
   @n3nash 
   OK~ have changed it. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1119: [HUDI-469] Fix: HoodieCommitMetadata only show first commit insert rows.

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #1119: [HUDI-469] Fix: HoodieCommitMetadata only show first commit insert rows.
URL: https://github.com/apache/incubator-hudi/pull/1119#discussion_r365478532
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java
 ##########
 @@ -175,7 +175,8 @@ public long fetchTotalInsertRecordsWritten() {
     long totalInsertRecordsWritten = 0;
     for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
       for (HoodieWriteStat stat : stats) {
-        if (stat.getPrevCommit() != null && stat.getPrevCommit().equalsIgnoreCase("null")) {
+        // It was only possible to determine the number of rows to insert for the first commit before. Currently, this problem is fixed
 
 Review comment:
   Thank you!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] cdmikechen commented on a change in pull request #1119: Fix: HoodieCommitMetadata only show first commit insert rows.

Posted by GitBox <gi...@apache.org>.
cdmikechen commented on a change in pull request #1119: Fix: HoodieCommitMetadata only show first commit insert rows.
URL: https://github.com/apache/incubator-hudi/pull/1119#discussion_r361278242
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java
 ##########
 @@ -175,7 +175,9 @@ public long fetchTotalInsertRecordsWritten() {
     long totalInsertRecordsWritten = 0;
     for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
       for (HoodieWriteStat stat : stats) {
-        if (stat.getPrevCommit() != null && stat.getPrevCommit().equalsIgnoreCase("null")) {
 
 Review comment:
   @n3nash  Yes, that's almost what it means! 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] cdmikechen commented on a change in pull request #1119: Fix: HoodieCommitMetadata only show first commit insert rows.

Posted by GitBox <gi...@apache.org>.
cdmikechen commented on a change in pull request #1119: Fix: HoodieCommitMetadata only show first commit insert rows.
URL: https://github.com/apache/incubator-hudi/pull/1119#discussion_r361074375
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java
 ##########
 @@ -175,7 +175,9 @@ public long fetchTotalInsertRecordsWritten() {
     long totalInsertRecordsWritten = 0;
     for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
       for (HoodieWriteStat stat : stats) {
-        if (stat.getPrevCommit() != null && stat.getPrevCommit().equalsIgnoreCase("null")) {
 
 Review comment:
   @n3nash
   sorry I may misunderstand your mind, I'll take a look at what I've tested later.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] cdmikechen commented on issue #1119: [HUDI-469] Fix: HoodieCommitMetadata only show first commit insert rows.

Posted by GitBox <gi...@apache.org>.
cdmikechen commented on issue #1119: [HUDI-469] Fix: HoodieCommitMetadata only show first commit insert rows.
URL: https://github.com/apache/incubator-hudi/pull/1119#issuecomment-568941316
 
 
   > Lets first triage the issue, file a JIRA and then proceed with a PR?
   
   OK! I have open a JIRA issue HUDI-469 and rename the title.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] cdmikechen commented on issue #1119: [HUDI-469] Fix: HoodieCommitMetadata only show first commit insert rows.

Posted by GitBox <gi...@apache.org>.
cdmikechen commented on issue #1119: [HUDI-469] Fix: HoodieCommitMetadata only show first commit insert rows.
URL: https://github.com/apache/incubator-hudi/pull/1119#issuecomment-572833967
 
 
   @n3nash have deleted, you can check again, please.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash merged pull request #1119: [HUDI-469] Fix: HoodieCommitMetadata only show first commit insert rows.

Posted by GitBox <gi...@apache.org>.
n3nash merged pull request #1119: [HUDI-469] Fix: HoodieCommitMetadata only show first commit insert rows.
URL: https://github.com/apache/incubator-hudi/pull/1119
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] cdmikechen commented on a change in pull request #1119: Fix: HoodieCommitMetadata only show first commit insert rows.

Posted by GitBox <gi...@apache.org>.
cdmikechen commented on a change in pull request #1119: Fix: HoodieCommitMetadata only show first commit insert rows.
URL: https://github.com/apache/incubator-hudi/pull/1119#discussion_r361078012
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java
 ##########
 @@ -175,7 +175,9 @@ public long fetchTotalInsertRecordsWritten() {
     long totalInsertRecordsWritten = 0;
     for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
       for (HoodieWriteStat stat : stats) {
-        if (stat.getPrevCommit() != null && stat.getPrevCommit().equalsIgnoreCase("null")) {
 
 Review comment:
   @n3nash 
   It may be in https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/io/HoodieCreateHandle.java#L165
   ```java
         HoodieWriteStat stat = new HoodieWriteStat();
         stat.setPartitionPath(writeStatus.getPartitionPath());
         stat.setNumWrites(recordsWritten);
         stat.setNumDeletes(recordsDeleted);
         stat.setNumInserts(insertRecordsWritten);
         stat.setPrevCommit(HoodieWriteStat.NULL_COMMIT);
         stat.setFileId(writeStatus.getFileId());
         stat.setPath(new Path(config.getBasePath()), path);
         long fileSizeInBytes = FSUtils.getFileSize(fs, path);
         stat.setTotalWriteBytes(fileSizeInBytes);
         stat.setFileSizeInBytes(fileSizeInBytes);
         stat.setTotalWriteErrors(writeStatus.getTotalErrorRecords());
         RuntimeStats runtimeStats = new RuntimeStats();
         runtimeStats.setTotalCreateTime(timer.endTimer());
         stat.setRuntimeStats(runtimeStats);
         writeStatus.setStat(stat);
   ```
   In `org.apache.hudi.common.model.HoodieWriteStat` it is a "null", in other cases prevCommit will be set a real commit time.
   ```java
   public static final String NULL_COMMIT = "null";
   ```
   So that `fetchTotalFilesInsert` can recognize first commit file and pass the condition.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] cdmikechen commented on a change in pull request #1119: Fix: HoodieCommitMetadata only show first commit insert rows.

Posted by GitBox <gi...@apache.org>.
cdmikechen commented on a change in pull request #1119: Fix: HoodieCommitMetadata only show first commit insert rows.
URL: https://github.com/apache/incubator-hudi/pull/1119#discussion_r361078012
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java
 ##########
 @@ -175,7 +175,9 @@ public long fetchTotalInsertRecordsWritten() {
     long totalInsertRecordsWritten = 0;
     for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
       for (HoodieWriteStat stat : stats) {
-        if (stat.getPrevCommit() != null && stat.getPrevCommit().equalsIgnoreCase("null")) {
 
 Review comment:
   @n3nash 
   It may be in https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/io/HoodieCreateHandle.java#L165
   ```java
         HoodieWriteStat stat = new HoodieWriteStat();
         stat.setPartitionPath(writeStatus.getPartitionPath());
         stat.setNumWrites(recordsWritten);
         stat.setNumDeletes(recordsDeleted);
         stat.setNumInserts(insertRecordsWritten);
         stat.setPrevCommit(HoodieWriteStat.NULL_COMMIT);
         stat.setFileId(writeStatus.getFileId());
         stat.setPath(new Path(config.getBasePath()), path);
         long fileSizeInBytes = FSUtils.getFileSize(fs, path);
         stat.setTotalWriteBytes(fileSizeInBytes);
         stat.setFileSizeInBytes(fileSizeInBytes);
         stat.setTotalWriteErrors(writeStatus.getTotalErrorRecords());
         RuntimeStats runtimeStats = new RuntimeStats();
         runtimeStats.setTotalCreateTime(timer.endTimer());
         stat.setRuntimeStats(runtimeStats);
         writeStatus.setStat(stat);
   ```
   In `org.apache.hudi.common.model.HoodieWriteStat` it is a "null", in other cases prevCommit will be set a real commit time.
   ```java
   public static final String NULL_COMMIT = "null";
   ```
   So that `fetchTotalFilesInsert` can recognize first commit file and pass the condition, meanwhile `fetchTotalInsertRecordsWritten` can not.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1119: Fix: HoodieCommitMetadata only show first commit insert rows.

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #1119: Fix: HoodieCommitMetadata only show first commit insert rows.
URL: https://github.com/apache/incubator-hudi/pull/1119#discussion_r360986944
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java
 ##########
 @@ -175,7 +175,9 @@ public long fetchTotalInsertRecordsWritten() {
     long totalInsertRecordsWritten = 0;
     for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
       for (HoodieWriteStat stat : stats) {
-        if (stat.getPrevCommit() != null && stat.getPrevCommit().equalsIgnoreCase("null")) {
 
 Review comment:
   For the first commit, the prevCommmit value is actually "null" which is the reason why this check is here. 
   Can you inspect and see if stat.getPrevCommit() returns null in some cases ? Then, the above condition will not be satisfied.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1119: Fix: HoodieCommitMetadata only show first commit insert rows.

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #1119: Fix: HoodieCommitMetadata only show first commit insert rows.
URL: https://github.com/apache/incubator-hudi/pull/1119#discussion_r361273870
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java
 ##########
 @@ -175,7 +175,9 @@ public long fetchTotalInsertRecordsWritten() {
     long totalInsertRecordsWritten = 0;
     for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
       for (HoodieWriteStat stat : stats) {
-        if (stat.getPrevCommit() != null && stat.getPrevCommit().equalsIgnoreCase("null")) {
 
 Review comment:
   @cdmikechen I think the problem is here : The FileSizing of Hoodie takes care of padding extra records to the same file till the file grows to a certain size. This logic lies in HoodieMergeHandle.java. Since the merge handle will always have a prevCommit, for the records that are inserted as part of the HoodieMergeHandle (records padded to an existing file), those will not pass the condition of stat.getPrevCommit().equalsIgnoreCase("null"). So, I think removing the check `stat.getPrevCommit().equalsIgnoreCase("null")` and then adding the num-inserts should be the right thing to do. Let me know if this makes sense to you.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] cdmikechen commented on a change in pull request #1119: Fix: HoodieCommitMetadata only show first commit insert rows.

Posted by GitBox <gi...@apache.org>.
cdmikechen commented on a change in pull request #1119: Fix: HoodieCommitMetadata only show first commit insert rows.
URL: https://github.com/apache/incubator-hudi/pull/1119#discussion_r361044029
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java
 ##########
 @@ -175,7 +175,9 @@ public long fetchTotalInsertRecordsWritten() {
     long totalInsertRecordsWritten = 0;
     for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
       for (HoodieWriteStat stat : stats) {
-        if (stat.getPrevCommit() != null && stat.getPrevCommit().equalsIgnoreCase("null")) {
 
 Review comment:
   @n3nash
   I think the processing logic of this method is the same as that of `fetchTotalFilesUpdated`
   ```
     public long fetchTotalUpdateRecordsWritten() {
       long totalUpdateRecordsWritten = 0;
       for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
         for (HoodieWriteStat stat : stats) {
           totalUpdateRecordsWritten += stat.getNumUpdateWrites();
         }
       }
       return totalUpdateRecordsWritten;
     }
   ```
   Let me give you an example in my test. I create a hudi data with 3 rows first, and then insert a row, at last insert a row and delete a row. I print all method return number and test the number to check.
   ```java
   for (HoodieInstant commit : commits) {
       HoodieCommitMetadata commitMetadata = HoodieCommitMetadata.fromBytes(timeline.getInstantDetails(commit).get(),
               HoodieCommitMetadata.class);
   
       JSONObject data = new JSONObject();
       data.put("timestamp", commit.getTimestamp());
       data.put("fetchTotalBytesWritten", commitMetadata.fetchTotalBytesWritten());
       data.put("fetchTotalFilesInsert", commitMetadata.fetchTotalFilesInsert());
       data.put("fetchTotalFilesUpdated", commitMetadata.fetchTotalFilesUpdated());
       data.put("fetchTotalPartitionsWritten", commitMetadata.fetchTotalPartitionsWritten());
       data.put("fetchTotalRecordsWritten", commitMetadata.fetchTotalRecordsWritten());
       data.put("fetchTotalUpdateRecordsWritten", commitMetadata.fetchTotalUpdateRecordsWritten());
       data.put("fetchTotalInsertRecordsWritten", commitMetadata.fetchTotalInsertRecordsWritten());
       data.put("fetchTotalWriteErrors", commitMetadata.fetchTotalWriteErrors());
   
       long totalDeleteRecordsWritten= 0;
       for (List<HoodieWriteStat> stats : commitMetadata.getPartitionToWriteStats().values()) {
           for (HoodieWriteStat stat : stats) {
               if (stat.getPrevCommit() != null) {
                   totalDeleteRecordsWritten += stat.getNumDeletes();
               }
           }
       }
       data.put("fetchTotalDeleteRecordsWritten", totalDeleteRecordsWritten);
   
       datas.add(data);
   }
   log.debug("datas = {}", datas);
   ```
   If I user old if `(stat.getPrevCommit() != null && stat.getPrevCommit().equalsIgnoreCase("null"))` it will print:
   ```json
   [
   	{
   		"fetchTotalBytesWritten": 435258,
   		"fetchTotalDeleteRecordsWritten": 1,
   		"fetchTotalFilesInsert": 0,
   		"fetchTotalFilesUpdated": 1,
   		"fetchTotalInsertRecordsWritten": 0,
   		"fetchTotalPartitionsWritten": 1,
   		"fetchTotalRecordsWritten": 4,
   		"fetchTotalUpdateRecordsWritten": 0,
   		"fetchTotalWriteErrors": 0,
   		"timestamp": "20190923172508"
   	},
   	{
   		"fetchTotalBytesWritten": 435227,
   		"fetchTotalDeleteRecordsWritten": 0,
   		"fetchTotalFilesInsert": 0,
   		"fetchTotalFilesUpdated": 1,
   		"fetchTotalInsertRecordsWritten": 0,
   		"fetchTotalPartitionsWritten": 1,
   		"fetchTotalRecordsWritten": 4,
   		"fetchTotalUpdateRecordsWritten": 0,
   		"fetchTotalWriteErrors": 0,
   		"timestamp": "20190923160439"
   	},
   	{
   		"fetchTotalBytesWritten": 435478,
   		"fetchTotalDeleteRecordsWritten": 0,
   		"fetchTotalFilesInsert": 1,
   		"fetchTotalFilesUpdated": 0,
   		"fetchTotalInsertRecordsWritten": 3,
   		"fetchTotalPartitionsWritten": 1,
   		"fetchTotalRecordsWritten": 3,
   		"fetchTotalUpdateRecordsWritten": 0,
   		"fetchTotalWriteErrors": 0,
   		"timestamp": "20190923160340"
   	}
   ]
   ```
   If I user modified if `(stat.getPrevCommit() != null)` it will print:
   ```json
   [
   	{
   		"fetchTotalBytesWritten": 435258,
   		"fetchTotalDeleteRecordsWritten": 1,
   		"fetchTotalFilesInsert": 0,
   		"fetchTotalFilesUpdated": 1,
   		"fetchTotalInsertRecordsWritten": 1,
   		"fetchTotalPartitionsWritten": 1,
   		"fetchTotalRecordsWritten": 4,
   		"fetchTotalUpdateRecordsWritten": 0,
   		"fetchTotalWriteErrors": 0,
   		"timestamp": "20190923172508"
   	},
   	{
   		"fetchTotalBytesWritten": 435227,
   		"fetchTotalDeleteRecordsWritten": 0,
   		"fetchTotalFilesInsert": 0,
   		"fetchTotalFilesUpdated": 1,
   		"fetchTotalInsertRecordsWritten": 1,
   		"fetchTotalPartitionsWritten": 1,
   		"fetchTotalRecordsWritten": 4,
   		"fetchTotalUpdateRecordsWritten": 0,
   		"fetchTotalWriteErrors": 0,
   		"timestamp": "20190923160439"
   	},
   	{
   		"fetchTotalBytesWritten": 435478,
   		"fetchTotalDeleteRecordsWritten": 0,
   		"fetchTotalFilesInsert": 1,
   		"fetchTotalFilesUpdated": 0,
   		"fetchTotalInsertRecordsWritten": 3,
   		"fetchTotalPartitionsWritten": 1,
   		"fetchTotalRecordsWritten": 3,
   		"fetchTotalUpdateRecordsWritten": 0,
   		"fetchTotalWriteErrors": 0,
   		"timestamp": "20190923160340"
   	}
   ]
   ```
   I guess the person who originally wrote this method copied the `fetchTotalFilesInsert` method at that time, but he forgot that the processing logic of adding files is different from that adding data in files.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] cdmikechen commented on a change in pull request #1119: Fix: HoodieCommitMetadata only show first commit insert rows.

Posted by GitBox <gi...@apache.org>.
cdmikechen commented on a change in pull request #1119: Fix: HoodieCommitMetadata only show first commit insert rows.
URL: https://github.com/apache/incubator-hudi/pull/1119#discussion_r361044029
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java
 ##########
 @@ -175,7 +175,9 @@ public long fetchTotalInsertRecordsWritten() {
     long totalInsertRecordsWritten = 0;
     for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
       for (HoodieWriteStat stat : stats) {
-        if (stat.getPrevCommit() != null && stat.getPrevCommit().equalsIgnoreCase("null")) {
 
 Review comment:
   @n3nash
   I think the processing logic of this method is the same as that of `fetchTotalFilesUpdated`
   ```
     public long fetchTotalUpdateRecordsWritten() {
       long totalUpdateRecordsWritten = 0;
       for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
         for (HoodieWriteStat stat : stats) {
           totalUpdateRecordsWritten += stat.getNumUpdateWrites();
         }
       }
       return totalUpdateRecordsWritten;
     }
   ```
   Let me give you an example in my test. I create a hudi data with 3 rows first, and then insert a row, at last insert a row and delete a row. I print all method return number and test the number to check.
   ```java
   for (HoodieInstant commit : commits) {
       HoodieCommitMetadata commitMetadata = HoodieCommitMetadata.fromBytes(timeline.getInstantDetails(commit).get(),
               HoodieCommitMetadata.class);
   
       JSONObject data = new JSONObject();
       data.put("timestamp", commit.getTimestamp());
       data.put("fetchTotalBytesWritten", commitMetadata.fetchTotalBytesWritten());
       data.put("fetchTotalFilesInsert", commitMetadata.fetchTotalFilesInsert());
       data.put("fetchTotalFilesUpdated", commitMetadata.fetchTotalFilesUpdated());
       data.put("fetchTotalPartitionsWritten", commitMetadata.fetchTotalPartitionsWritten());
       data.put("fetchTotalRecordsWritten", commitMetadata.fetchTotalRecordsWritten());
       data.put("fetchTotalUpdateRecordsWritten", commitMetadata.fetchTotalUpdateRecordsWritten());
       data.put("fetchTotalInsertRecordsWritten", commitMetadata.fetchTotalInsertRecordsWritten());
       data.put("fetchTotalWriteErrors", commitMetadata.fetchTotalWriteErrors());
   
       long totalDeleteRecordsWritten= 0;
       for (List<HoodieWriteStat> stats : commitMetadata.getPartitionToWriteStats().values()) {
           for (HoodieWriteStat stat : stats) {
               if (stat.getPrevCommit() != null) {
                   totalDeleteRecordsWritten += stat.getNumDeletes();
               }
           }
       }
       data.put("fetchTotalDeleteRecordsWritten", totalDeleteRecordsWritten);
   
       datas.add(data);
   }
   log.debug("datas = {}", datas);
   ```
   If I use old if `(stat.getPrevCommit() != null && stat.getPrevCommit().equalsIgnoreCase("null"))` it will print:
   ```json
   [
   	{
   		"fetchTotalBytesWritten": 435258,
   		"fetchTotalDeleteRecordsWritten": 1,
   		"fetchTotalFilesInsert": 0,
   		"fetchTotalFilesUpdated": 1,
   		"fetchTotalInsertRecordsWritten": 0,
   		"fetchTotalPartitionsWritten": 1,
   		"fetchTotalRecordsWritten": 4,
   		"fetchTotalUpdateRecordsWritten": 0,
   		"fetchTotalWriteErrors": 0,
   		"timestamp": "20190923172508"
   	},
   	{
   		"fetchTotalBytesWritten": 435227,
   		"fetchTotalDeleteRecordsWritten": 0,
   		"fetchTotalFilesInsert": 0,
   		"fetchTotalFilesUpdated": 1,
   		"fetchTotalInsertRecordsWritten": 0,
   		"fetchTotalPartitionsWritten": 1,
   		"fetchTotalRecordsWritten": 4,
   		"fetchTotalUpdateRecordsWritten": 0,
   		"fetchTotalWriteErrors": 0,
   		"timestamp": "20190923160439"
   	},
   	{
   		"fetchTotalBytesWritten": 435478,
   		"fetchTotalDeleteRecordsWritten": 0,
   		"fetchTotalFilesInsert": 1,
   		"fetchTotalFilesUpdated": 0,
   		"fetchTotalInsertRecordsWritten": 3,
   		"fetchTotalPartitionsWritten": 1,
   		"fetchTotalRecordsWritten": 3,
   		"fetchTotalUpdateRecordsWritten": 0,
   		"fetchTotalWriteErrors": 0,
   		"timestamp": "20190923160340"
   	}
   ]
   ```
   If I use modified if `(stat.getPrevCommit() != null)` it will print:
   ```json
   [
   	{
   		"fetchTotalBytesWritten": 435258,
   		"fetchTotalDeleteRecordsWritten": 1,
   		"fetchTotalFilesInsert": 0,
   		"fetchTotalFilesUpdated": 1,
   		"fetchTotalInsertRecordsWritten": 1,
   		"fetchTotalPartitionsWritten": 1,
   		"fetchTotalRecordsWritten": 4,
   		"fetchTotalUpdateRecordsWritten": 0,
   		"fetchTotalWriteErrors": 0,
   		"timestamp": "20190923172508"
   	},
   	{
   		"fetchTotalBytesWritten": 435227,
   		"fetchTotalDeleteRecordsWritten": 0,
   		"fetchTotalFilesInsert": 0,
   		"fetchTotalFilesUpdated": 1,
   		"fetchTotalInsertRecordsWritten": 1,
   		"fetchTotalPartitionsWritten": 1,
   		"fetchTotalRecordsWritten": 4,
   		"fetchTotalUpdateRecordsWritten": 0,
   		"fetchTotalWriteErrors": 0,
   		"timestamp": "20190923160439"
   	},
   	{
   		"fetchTotalBytesWritten": 435478,
   		"fetchTotalDeleteRecordsWritten": 0,
   		"fetchTotalFilesInsert": 1,
   		"fetchTotalFilesUpdated": 0,
   		"fetchTotalInsertRecordsWritten": 3,
   		"fetchTotalPartitionsWritten": 1,
   		"fetchTotalRecordsWritten": 3,
   		"fetchTotalUpdateRecordsWritten": 0,
   		"fetchTotalWriteErrors": 0,
   		"timestamp": "20190923160340"
   	}
   ]
   ```
   I guess the person who originally wrote this method copied the `fetchTotalFilesInsert` method at that time, but he forgot that the processing logic of adding files is different from that adding data in files.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on issue #1119: [HUDI-469] Fix: HoodieCommitMetadata only show first commit insert rows.

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #1119: [HUDI-469] Fix: HoodieCommitMetadata only show first commit insert rows.
URL: https://github.com/apache/incubator-hudi/pull/1119#issuecomment-572843925
 
 
   @cdmikechen one minor comment after which we can merge this!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services