Posted to commits@samza.apache.org by "Andy Sautins (Jira)" <ji...@apache.org> on 2023/05/26 21:25:00 UTC

[jira] [Created] (SAMZA-2783) Memoize DirDiffUtil to avoid repeated calls to areSameFile

Andy Sautins created SAMZA-2783:
-----------------------------------

             Summary: Memoize DirDiffUtil to avoid repeated calls to areSameFile
                 Key: SAMZA-2783
                 URL: https://issues.apache.org/jira/browse/SAMZA-2783
             Project: Samza
          Issue Type: Improvement
    Affects Versions: 1.4
            Reporter: Andy Sautins


Profiling a Samza job showed that, for that job, ~38% of the time was spent in org.apache.samza.storage.blobstore.util.DirDiffUtil.getDirDiff, with areSameFile as the primary contributor.

 

Looking at the code it has the following comment:

DirDiffUtil.java:271
{code:java}
  // TODO MED shesharm: this compares each file in directory 3 times. Categorize files in one traversal instead.{code}
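
For illustration, the single-traversal categorization that the TODO asks for might look roughly like the sketch below. This is not Samza code: the class OnePassDiff, its fields, and the rule that a changed file counts as both added (local copy) and removed (remote copy) are assumptions made for the example, and the entries are keyed by file name in plain maps rather than taken from the real directory and index types.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

/** Hypothetical sketch, not Samza code: classifies entries by name in one pass. */
final class OnePassDiff<L, R> {
  final List<L> added = new ArrayList<>();    // local only, or changed locally
  final List<R> removed = new ArrayList<>();  // remote only, or replaced by a changed local entry
  final List<R> retained = new ArrayList<>(); // present on both sides and judged identical

  static <L, R> OnePassDiff<L, R> of(
      Map<String, L> local, Map<String, R> remote, BiPredicate<L, R> same) {
    OnePassDiff<L, R> diff = new OnePassDiff<>();
    // Copy so that matched remote entries can be crossed off as we go.
    Map<String, R> unmatched = new HashMap<>(remote);
    for (Map.Entry<String, L> entry : local.entrySet()) {
      R remoteEntry = unmatched.remove(entry.getKey());
      if (remoteEntry == null) {
        diff.added.add(entry.getValue());
      } else if (same.test(entry.getValue(), remoteEntry)) {
        // The (possibly expensive) comparison runs exactly once per matched pair.
        diff.retained.add(remoteEntry);
      } else {
        diff.added.add(entry.getValue());
        diff.removed.add(remoteEntry);
      }
    }
    // Whatever was never matched exists only on the remote side.
    diff.removed.addAll(unmatched.values());
    return diff;
  }
}
```

With this shape the comparison predicate is invoked once per file that exists on both sides, so no results need to be cached.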
 

While restructuring the code is an option, a quick win would be to memoize the results of areSameFile. Restructuring could potentially give a lower memory footprint, since memoized results are kept in memory.
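
A minimal sketch of the memoization idea, assuming areSameFile can be treated as a two-argument predicate over a local file and its remote counterpart (that shape is an assumption here, not confirmed by this issue). The wrapper class MemoizedBiPredicate is hypothetical and not part of Samza. It also assumes both argument types have equals/hashCode implementations suitable for use as cache keys.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiPredicate;

/** Hypothetical helper, not part of Samza: caches results of a two-argument predicate. */
final class MemoizedBiPredicate<A, B> implements BiPredicate<A, B> {

  /** Cache key pairing both arguments; relies on their equals/hashCode. */
  private static final class Key {
    private final Object first;
    private final Object second;

    Key(Object first, Object second) {
      this.first = first;
      this.second = second;
    }

    @Override
    public boolean equals(Object other) {
      if (!(other instanceof Key)) {
        return false;
      }
      Key that = (Key) other;
      return Objects.equals(first, that.first) && Objects.equals(second, that.second);
    }

    @Override
    public int hashCode() {
      return Objects.hash(first, second);
    }
  }

  private final BiPredicate<A, B> delegate;
  private final Map<Key, Boolean> cache = new ConcurrentHashMap<>();

  MemoizedBiPredicate(BiPredicate<A, B> delegate) {
    this.delegate = Objects.requireNonNull(delegate);
  }

  @Override
  public boolean test(A a, B b) {
    // The delegate runs only on the first request for a given pair of arguments.
    return cache.computeIfAbsent(new Key(a, b), key -> delegate.test(a, b));
  }

  /** Number of cached results; this is the memory cost of memoizing. */
  int size() {
    return cache.size();
  }
}
```

Because files can change between commits, a cache like this would probably need to be created per getDirDiff invocation and discarded afterwards rather than shared across calls; that scoping is a design assumption, not something stated in this issue.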



--
This message was sent by Atlassian Jira
(v8.20.10#820010)