Posted to commits@hudi.apache.org by "Vinoth Chandar (Jira)" <ji...@apache.org> on 2020/01/23 18:52:00 UTC
[jira] [Created] (HUDI-574) CLI counts small file inserts as updates
Vinoth Chandar created HUDI-574:
-----------------------------------
Summary: CLI counts small file inserts as updates
Key: HUDI-574
URL: https://issues.apache.org/jira/browse/HUDI-574
Project: Apache Hudi (incubating)
Issue Type: Bug
Components: CLI
Reporter: Vinoth Chandar
Fix For: 0.6.0
User report:
I'm trying to understand the {{.commit}} output and how it relates to the output from the {{hudi-cli}} tool, and I'm finding it difficult to reconcile my findings. Specifically, I want to know the number of updates/inserts/deletes across all partitions for a given commit (an upsert). From the {{cli}}:
hudi:exec_unit_ver->commit showpartitions --commit 20200108153617
╔════════════════╤═══════════════════╤═════════════════════╤════════════════════════╤═══════════════════════╤═════════════════════╤══════════════╗
║ Partition Path │ Total Files Added │ Total Files Updated │ Total Records Inserted │ Total Records Updated │ Total Bytes Written │ Total Errors ║
╠════════════════╪═══════════════════╪═════════════════════╪════════════════════════╪═══════════════════════╪═════════════════════╪══════════════╣
║ 0 │ 0 │ 9 │ 0 │ 2091 │ 983.7 MB │ 0 ║
╟────────────────┼───────────────────┼─────────────────────┼────────────────────────┼───────────────────────┼─────────────────────┼──────────────╢
But in the {{20200108153617.commit}} file for that commit one of the files in the partition "0" has
"numInserts" : 44448,
so I'm not sure why {{Total Records Inserted}} is reported as zero. I checked that the sum of {{numUpdateWrites}} across all files in the partition matches 2091. Generally, I think it would be helpful to have {{totalRecordsInserted}}, {{totalRecordsUpdated}}, and {{totalRecordsDeleted}} in the commit metadata (although it's not a big issue to sum the individual numbers from each file in each partition).
[~vinoth]
On the counts, when I checked the code, it's counting the inserts as updates, since Hudi packed them onto existing files to honor the target file size:
for (HoodieWriteStat stat : stats) {
  if (stat.getPrevCommit().equals(HoodieWriteStat.NULL_COMMIT)) {
    totalFilesAdded += 1;
    totalRecordsInserted += stat.getNumWrites();
  } else {
    totalFilesUpdated += 1;
    totalRecordsUpdated += stat.getNumUpdateWrites();
  }
  totalBytesWritten += stat.getTotalWriteBytes();
  totalWriteErrors += stat.getTotalWriteErrors();
}
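
One possible way to avoid the miscount, sketched under assumptions rather than taken from any actual fix: keep using the previous-commit check only to classify files as added or updated, and sum the per-file insert and update counters independently of that branch. The {{FileStat}} class below is a hypothetical stand-in for the per-file stats (it is not Hudi's {{HoodieWriteStat}}), the {{"null"}} marker value is an assumption, and apart from the 44448 inserts and the 2091 total updates quoted in the report, the sample numbers and the previous commit time are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class CommitTotalsSketch {

    // Assumed marker for "no previous commit", i.e. the file is new in this commit.
    static final String NULL_COMMIT = "null";

    // Hypothetical stand-in for the per-file write stats stored in a .commit file.
    static final class FileStat {
        final String prevCommit;
        final long numInserts;
        final long numUpdateWrites;

        FileStat(String prevCommit, long numInserts, long numUpdateWrites) {
            this.prevCommit = prevCommit;
            this.numInserts = numInserts;
            this.numUpdateWrites = numUpdateWrites;
        }
    }

    // Per-partition totals, mirroring the columns shown by the CLI.
    static final class Totals {
        long filesAdded;
        long filesUpdated;
        long recordsInserted;
        long recordsUpdated;
    }

    static Totals summarize(List<FileStat> stats) {
        Totals totals = new Totals();
        for (FileStat stat : stats) {
            // File-level classification still depends on the previous commit.
            if (NULL_COMMIT.equals(stat.prevCommit)) {
                totals.filesAdded += 1;
            } else {
                totals.filesUpdated += 1;
            }
            // Record-level counts are summed for every file, so inserts packed
            // onto an existing small file are still reported as inserts.
            totals.recordsInserted += stat.numInserts;
            totals.recordsUpdated += stat.numUpdateWrites;
        }
        return totals;
    }

    public static void main(String[] args) {
        // Two existing files; the first one also received 44448 new records.
        List<FileStat> stats = Arrays.asList(
                new FileStat("20200108120000", 44448L, 100L),
                new FileStat("20200108120000", 0L, 1991L));
        Totals totals = summarize(stats);
        System.out.println("files added=" + totals.filesAdded
                + ", files updated=" + totals.filesUpdated
                + ", records inserted=" + totals.recordsInserted
                + ", records updated=" + totals.recordsUpdated);
    }
}
```

With this split, a partition in which every file already existed can still report a non-zero insert count, which is the case the user report describes.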
--
This message was sent by Atlassian Jira
(v8.3.4#803005)