Posted to commits@hudi.apache.org by "Vinoth Chandar (Jira)" <ji...@apache.org> on 2020/06/11 15:45:00 UTC
[jira] [Commented] (HUDI-635) MergeHandle's DiskBasedMap entries can be thinner
[ https://issues.apache.org/jira/browse/HUDI-635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133348#comment-17133348 ]
Vinoth Chandar commented on HUDI-635:
-------------------------------------
[~shivnarayan] The basic idea here is to avoid the overhead of storing a `HoodieKey` with every entry in the keyToNewRecords map, and to store just the payload instead.
{code:java}
private void init(String fileId, Iterator<HoodieRecord<T>> newRecordsItr) {
  try {
    // Load the new records in a map
    long memoryForMerge = SparkConfigUtils.getMaxMemoryPerPartitionMerge(config.getProps());
    LOG.info("MaxMemoryPerPartitionMerge => " + memoryForMerge);
    this.keyToNewRecords = new ExternalSpillableMap<>(memoryForMerge, config.getSpillableMapBasePath(),
        new DefaultSizeEstimator(), new HoodieRecordSizeEstimator(originalSchema));
  } catch (IOException io) {
    throw new HoodieIOException("Cannot instantiate an ExternalSpillableMap", io);
  }
  while (newRecordsItr.hasNext()) {
    HoodieRecord<T> record = newRecordsItr.next();
    // update the new location of the record, so we know where to find it next
    record.unseal();
    record.setNewLocation(new HoodieRecordLocation(instantTime, fileId));
    record.seal();
    // NOTE: Once Records are added to map (spillable-map), DO NOT change it as they won't persist
    keyToNewRecords.put(record.getRecordKey(), record);
  }
  LOG.info("Number of entries in MemoryBasedMap => "
      + ((ExternalSpillableMap) keyToNewRecords).getInMemoryMapNumEntries()
      + "Total size in bytes of MemoryBasedMap => "
      + ((ExternalSpillableMap) keyToNewRecords).getCurrentInMemoryMapSize() + "Number of entries in DiskBasedMap => "
      + ((ExternalSpillableMap) keyToNewRecords).getDiskBasedMapNumEntries() + "Size of file spilled to disk => "
      + ((ExternalSpillableMap) keyToNewRecords).getSizeOfFileOnDiskInBytes());
}
{code}
When the map is looked up later and an entry is fetched, you can construct a HoodieRecord by generating a HoodieKey on the fly: recordKey is the key of the map, and partitionPath is already known and is the same across a MergeHandle.
writeStatus.setPartitionPath(partitionPath);
is already called in init().
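A rough sketch of that idea, for illustration only. `SketchKey`, `SketchRecord` and `ThinMergeMap` are hypothetical stand-ins, not the Hudi classes, and a plain `HashMap` stands in for the ExternalSpillableMap. The map stores only the payload under the record key, and the full record is rebuilt on lookup from the map key plus the partition path held by the handle.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT the real Hudi classes.
final class SketchKey {
  final String recordKey;
  final String partitionPath;

  SketchKey(String recordKey, String partitionPath) {
    this.recordKey = recordKey;
    this.partitionPath = partitionPath;
  }
}

final class SketchRecord<T> {
  final SketchKey key;
  final T payload;

  SketchRecord(SketchKey key, T payload) {
    this.key = key;
    this.payload = payload;
  }
}

// Tracks <recordKey, payload> instead of <recordKey, full record>.
// Assumption: one instance serves exactly one partition, like one merge handle.
final class ThinMergeMap<T> {
  private final String partitionPath;
  private final Map<String, T> keyToNewPayloads = new HashMap<>();

  ThinMergeMap(String partitionPath) {
    this.partitionPath = partitionPath;
  }

  // Store only the payload; the key object is dropped here.
  void put(SketchRecord<T> record) {
    if (!partitionPath.equals(record.key.partitionPath)) {
      throw new IllegalArgumentException("Record belongs to a different partition");
    }
    keyToNewPayloads.put(record.key.recordKey, record.payload);
  }

  // Rebuild the key on the fly from the map key and the known partition path.
  SketchRecord<T> get(String recordKey) {
    T payload = keyToNewPayloads.get(recordKey);
    if (payload == null) {
      return null;
    }
    return new SketchRecord<>(new SketchKey(recordKey, partitionPath), payload);
  }
}

class ThinMergeMapDemo {
  public static void main(String[] args) {
    ThinMergeMap<String> map = new ThinMergeMap<>("2020/06/11");
    map.put(new SketchRecord<>(new SketchKey("uuid-1", "2020/06/11"), "payload-1"));
    SketchRecord<String> rebuilt = map.get("uuid-1");
    System.out.println(rebuilt.key.recordKey + " @ " + rebuilt.key.partitionPath + " -> " + rebuilt.payload);
  }
}
```

The sketch keeps only the payload per entry. In init() above, instantTime and fileId are also fixed for the handle, so the new location could presumably be rebuilt at lookup time in the same way, though that is not shown here.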
> MergeHandle's DiskBasedMap entries can be thinner
> -------------------------------------------------
>
> Key: HUDI-635
> URL: https://issues.apache.org/jira/browse/HUDI-635
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Performance, Writer Core
> Reporter: Vinoth Chandar
> Assignee: sivabalan narayanan
> Priority: Blocker
> Labels: help-requested
> Fix For: 0.6.0
>
>
> Instead of <Key, HoodieRecord>, we can just track <Key, Payload> ... Helps with use-cases like HUDI-625
--
This message was sent by Atlassian Jira
(v8.3.4#803005)