Posted to commits@hudi.apache.org by "Vinoth Chandar (Jira)" <ji...@apache.org> on 2020/01/23 18:52:00 UTC

[jira] [Created] (HUDI-574) CLI counts small file inserts as updates

Vinoth Chandar created HUDI-574:
-----------------------------------

             Summary: CLI counts small file inserts as updates
                 Key: HUDI-574
                 URL: https://issues.apache.org/jira/browse/HUDI-574
             Project: Apache Hudi (incubating)
          Issue Type: Bug
          Components: CLI
            Reporter: Vinoth Chandar
             Fix For: 0.6.0


User report:
 
I'm trying to understand the {{.commit}} output and how it relates to the output from the {{hudi-cli}} tool, and I'm finding it difficult to reconcile the two. Specifically, I want to know the number of updates/inserts/deletes across all partitions for a given commit (an upsert). From the {{cli}}:
hudi:exec_unit_ver->commit showpartitions --commit 20200108153617 
╔════════════════╤═══════════════════╤═════════════════════╤════════════════════════╤═══════════════════════╤═════════════════════╤══════════════╗
║ Partition Path │ Total Files Added │ Total Files Updated │ Total Records Inserted │ Total Records Updated │ Total Bytes Written │ Total Errors ║
╠════════════════╪═══════════════════╪═════════════════════╪════════════════════════╪═══════════════════════╪═════════════════════╪══════════════╣
║ 0              │ 0                 │ 9                   │ 0                      │ 2091                  │ 983.7 MB            │ 0            ║
╟────────────────┼───────────────────┼─────────────────────┼────────────────────────┼───────────────────────┼─────────────────────┼──────────────╢
But in the {{20200108153617.commit}} file for that commit, one of the files in partition "0" has
      "numInserts" : 44448,
so I'm not sure why {{Total Records Inserted}} is reported as zero. I checked that the sum of {{numUpdateWrites}} across all files in the partition matches 2091. Generally, I think it would be helpful to have {{totalRecordsInserted}}, {{totalRecordsUpdated}}, and {{totalRecordsDeleted}} in the commit metadata (although it's not a big issue to sum the individual numbers from each file in each partition).
 
[~vinoth]
 
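On the counts: when I checked the code, it's counting the inserts as updates, since Hudi packed them onto existing files to honor the target file size. Any file that has a previous commit goes through the {{else}} branch, which adds only {{getNumUpdateWrites()}}, so inserts packed into that file never reach {{totalRecordsInserted}}: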
for (HoodieWriteStat stat : stats) {
  if (stat.getPrevCommit().equals(HoodieWriteStat.NULL_COMMIT)) {
    totalFilesAdded += 1;
    totalRecordsInserted += stat.getNumWrites();
  } else {
    totalFilesUpdated += 1;
    totalRecordsUpdated += stat.getNumUpdateWrites();
  }
  totalBytesWritten += stat.getTotalWriteBytes();
  totalWriteErrors += stat.getTotalWriteErrors();
}
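
One possible way to keep the file counts and the record counts separate is sketched below. It is not the Hudi implementation or the eventual fix: {{FileStat}}, {{Totals}}, {{summarize}} and the {{NULL_COMMIT}} value are hypothetical stand-ins, and the sample numbers are made up. The idea is that the previous-commit check decides only whether a file was added or updated, while record totals are summed from the per-file counters regardless of which branch the file falls into.

```java
import java.util.Arrays;
import java.util.List;

class CommitCounts {
    // Placeholder marker for "this file has no previous commit" (assumed value).
    static final String NULL_COMMIT = "null";

    /** Hypothetical stand-in for one file's counters in the commit metadata. */
    static final class FileStat {
        final String prevCommit;
        final long numInserts;
        final long numUpdateWrites;
        final long numDeletes;

        FileStat(String prevCommit, long numInserts, long numUpdateWrites, long numDeletes) {
            this.prevCommit = prevCommit;
            this.numInserts = numInserts;
            this.numUpdateWrites = numUpdateWrites;
            this.numDeletes = numDeletes;
        }
    }

    /** Per-partition totals, mirroring the columns the CLI prints. */
    static final class Totals {
        long filesAdded;
        long filesUpdated;
        long recordsInserted;
        long recordsUpdated;
        long recordsDeleted;
    }

    static Totals summarize(List<FileStat> stats) {
        Totals t = new Totals();
        for (FileStat s : stats) {
            // The previous commit only tells us whether the file is new.
            if (NULL_COMMIT.equals(s.prevCommit)) {
                t.filesAdded += 1;
            } else {
                t.filesUpdated += 1;
            }
            // Record counts come from the per-file counters in both cases,
            // so inserts packed into an existing file are still inserts.
            t.recordsInserted += s.numInserts;
            t.recordsUpdated += s.numUpdateWrites;
            t.recordsDeleted += s.numDeletes;
        }
        return t;
    }

    public static void main(String[] args) {
        // Made-up numbers: two existing files (one with packed inserts), one new file.
        List<FileStat> stats = Arrays.asList(
            new FileStat("20200101000000", 5, 3, 0),
            new FileStat(NULL_COMMIT, 7, 0, 0),
            new FileStat("20200101000000", 0, 2, 1));
        Totals t = summarize(stats);
        System.out.println("filesAdded=" + t.filesAdded
            + " filesUpdated=" + t.filesUpdated
            + " inserted=" + t.recordsInserted
            + " updated=" + t.recordsUpdated
            + " deleted=" + t.recordsDeleted);
    }
}
```

With the sample input, the first file is counted as an updated file, yet its 5 inserts still land in the inserted total. The same loop also yields a per-partition deleted total of the kind the user asked for.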
 


