Posted to commits@samza.apache.org by "Andy Sautins (Jira)" <ji...@apache.org> on 2023/06/12 20:18:00 UTC

[jira] [Updated] (SAMZA-2783) Re-factor DirDiffUtil.getDirDiff to avoid repeated calls to areSameFile

     [ https://issues.apache.org/jira/browse/SAMZA-2783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Sautins updated SAMZA-2783:
--------------------------------
    Description: 
Profiling a Samza job showed that, for that job, ~38% of the time was spent in org.apache.samza.storage.blobstore.util.DirDiffUtil.getDirDiff, with areSameFile as the primary contributor.

The code contains the following comment:

DirDiffUtil.java:271
{code:java}
  // TODO MED shesharm: this compares each file in directory 3 times. Categorize files in one traversal instead.{code}
 

Re-factor DirDiffUtil.getDirDiff to loop through all names once, reducing the number of calls to areSameFile.test.

  was:
Profiling a Samza job showed that, for that job, ~38% of the time was spent in org.apache.samza.storage.blobstore.util.DirDiffUtil.getDirDiff, with areSameFile as the primary contributor.

The code contains the following comment:

DirDiffUtil.java:271
{code:java}
  // TODO MED shesharm: this compares each file in directory 3 times. Categorize files in one traversal instead.{code}
 

Re-factored DirDiffUtil.getDirDiff to loop through all names once, reducing the number of calls to areSameFile.test.


> Re-factor DirDiffUtil.getDirDiff to avoid repeated calls to areSameFile
> -----------------------------------------------------------------------
>
>                 Key: SAMZA-2783
>                 URL: https://issues.apache.org/jira/browse/SAMZA-2783
>             Project: Samza
>          Issue Type: Improvement
>    Affects Versions: 1.4
>            Reporter: Andy Sautins
>            Priority: Minor
>
> Profiling a Samza job showed that, for that job, ~38% of the time was spent in org.apache.samza.storage.blobstore.util.DirDiffUtil.getDirDiff, with areSameFile as the primary contributor.
>  
> The code contains the following comment:
> DirDiffUtil.java:271
> {code:java}
>   // TODO MED shesharm: this compares each file in directory 3 times. Categorize files in one traversal instead.{code}
>  
> Re-factor DirDiffUtil.getDirDiff to loop through all names once, reducing the number of calls to areSameFile.test.
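
For illustration only, the single-traversal idea described above might look roughly like the sketch below. It is not the actual DirDiffUtil code: the class DirDiffSketch, the Categorized holder, and the categorize method are hypothetical names, directories are modelled as plain name-to-value maps, and treating a changed file as both removed and added is an assumption of this sketch. The point it shows is that the predicate is tested at most once per file name that exists on both sides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.BiPredicate;

// Hypothetical sketch; not the Samza DirDiffUtil implementation.
final class DirDiffSketch {

    /** Result buckets. The field names are illustrative only. */
    static final class Categorized {
        final List<String> added = new ArrayList<>();
        final List<String> retained = new ArrayList<>();
        final List<String> removed = new ArrayList<>();
    }

    /**
     * Categorizes every file name in one pass over the union of names,
     * calling areSameFile at most once per name present on both sides.
     */
    static <L, R> Categorized categorize(Map<String, L> local,
                                         Map<String, R> remote,
                                         BiPredicate<L, R> areSameFile) {
        Categorized out = new Categorized();

        // Union of names, sorted so the output order is deterministic.
        Set<String> allNames = new TreeSet<>(local.keySet());
        allNames.addAll(remote.keySet());

        for (String name : allNames) {
            boolean inLocal = local.containsKey(name);
            boolean inRemote = remote.containsKey(name);
            if (inLocal && !inRemote) {
                out.added.add(name);          // only local: new file
            } else if (!inLocal) {
                out.removed.add(name);        // only remote: deleted locally
            } else if (areSameFile.test(local.get(name), remote.get(name))) {
                out.retained.add(name);       // unchanged on both sides
            } else {
                // Assumption: a changed file is dropped and re-added.
                out.removed.add(name);
                out.added.add(name);
            }
        }
        return out;
    }
}
```

Compared with three separate filtering passes, each of which would evaluate the predicate for the same shared names, this form branches once per name and records the result in whichever bucket applies.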


