Posted to commits@hudi.apache.org by "Vinoth Chandar (Jira)" <ji...@apache.org> on 2020/06/11 15:45:00 UTC

[jira] [Commented] (HUDI-635) MergeHandle's DiskBasedMap entries can be thinner

    [ https://issues.apache.org/jira/browse/HUDI-635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133348#comment-17133348 ] 

Vinoth Chandar commented on HUDI-635:
-------------------------------------

[~shivnarayan] The basic idea here is to avoid the overhead of storing the `HoodieKey` in the keyToNewRecords map and keep just the payload.

 
{code:java}
private void init(String fileId, Iterator<HoodieRecord<T>> newRecordsItr) {
  try {
    // Load the new records in a map
    long memoryForMerge = SparkConfigUtils.getMaxMemoryPerPartitionMerge(config.getProps());
    LOG.info("MaxMemoryPerPartitionMerge => " + memoryForMerge);
    this.keyToNewRecords = new ExternalSpillableMap<>(memoryForMerge, config.getSpillableMapBasePath(),
        new DefaultSizeEstimator(), new HoodieRecordSizeEstimator(originalSchema));
  } catch (IOException io) {
    throw new HoodieIOException("Cannot instantiate an ExternalSpillableMap", io);
  }
  while (newRecordsItr.hasNext()) {
    HoodieRecord<T> record = newRecordsItr.next();
    // update the new location of the record, so we know where to find it next
    record.unseal();
    record.setNewLocation(new HoodieRecordLocation(instantTime, fileId));
    record.seal();
    // NOTE: Once Records are added to map (spillable-map), DO NOT change it as they won't persist
    keyToNewRecords.put(record.getRecordKey(), record);
  }
  // Separate the fields with ", " so the log line stays readable
  LOG.info("Number of entries in MemoryBasedMap => "
      + ((ExternalSpillableMap) keyToNewRecords).getInMemoryMapNumEntries()
      + ", Total size in bytes of MemoryBasedMap => "
      + ((ExternalSpillableMap) keyToNewRecords).getCurrentInMemoryMapSize()
      + ", Number of entries in DiskBasedMap => "
      + ((ExternalSpillableMap) keyToNewRecords).getDiskBasedMapNumEntries()
      + ", Size of file spilled to disk => "
      + ((ExternalSpillableMap) keyToNewRecords).getSizeOfFileOnDiskInBytes());
} {code}
 

When the map is later looked up and an entry is fetched, you can construct a HoodieRecord by generating a HoodieKey on the fly: recordKey is the key of the map, and partitionPath is already known and is the same across a MergeHandle.

 

`writeStatus.setPartitionPath(partitionPath);` is already called in init().
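
As a rough illustration of the idea, here is a minimal sketch that stores only recordKey -> payload and rebuilds the full record on lookup. It deliberately does not use the real Hudi classes: `SketchKey`, `SketchRecord`, and `ThinMergeMap` are hypothetical stand-ins for `HoodieKey`, `HoodieRecord`, and the handle's map, and a plain `HashMap` stands in for `ExternalSpillableMap`. Record locations and sealing are left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for HoodieKey (not the real Hudi class).
final class SketchKey {
  final String recordKey;
  final String partitionPath;

  SketchKey(String recordKey, String partitionPath) {
    this.recordKey = recordKey;
    this.partitionPath = partitionPath;
  }
}

// Hypothetical stand-in for HoodieRecord: a key plus its payload.
final class SketchRecord<T> {
  final SketchKey key;
  final T payload;

  SketchRecord(SketchKey key, T payload) {
    this.key = key;
    this.payload = payload;
  }
}

// Hypothetical map wrapper: keeps the partition path once per handle
// and stores only recordKey -> payload for each entry.
final class ThinMergeMap<T> {
  private final String partitionPath;
  private final Map<String, T> keyToNewPayloads = new HashMap<>();

  ThinMergeMap(String partitionPath) {
    this.partitionPath = partitionPath;
  }

  // Drop the key object; only the record key string and payload are kept.
  void put(SketchRecord<T> record) {
    keyToNewPayloads.put(record.key.recordKey, record.payload);
  }

  // Rebuild the full record on the fly; returns null if the key is absent.
  SketchRecord<T> get(String recordKey) {
    T payload = keyToNewPayloads.get(recordKey);
    if (payload == null) {
      return null;
    }
    return new SketchRecord<>(new SketchKey(recordKey, partitionPath), payload);
  }

  int size() {
    return keyToNewPayloads.size();
  }
}
```

With this shape, each stored value carries no per-entry partition path or key object, which is the saving the comment is after; whether the real map can be changed this way depends on how the rest of the handle uses the stored records.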

 

 

> MergeHandle's DiskBasedMap entries can be thinner
> -------------------------------------------------
>
>                 Key: HUDI-635
>                 URL: https://issues.apache.org/jira/browse/HUDI-635
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Performance, Writer Core
>            Reporter: Vinoth Chandar
>            Assignee: sivabalan narayanan
>            Priority: Blocker
>              Labels: help-requested
>             Fix For: 0.6.0
>
>
> Instead of <Key, HoodieRecord>, we can just track <Key, Payload> ... Helps with use-cases like HUDI-625


