You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@samza.apache.org by "asautins (via GitHub)" <gi...@apache.org> on 2023/05/26 21:42:59 UTC

[GitHub] [samza] asautins opened a new pull request, #1669: Memoize DirDiffUtil.areSameFile to avoid repeated calls to sun.nio.fs.*

asautins opened a new pull request, #1669:
URL: https://github.com/apache/samza/pull/1669

   ## Summary
   
   Memoize `DirDiffUtil.areSameFile` in `DirDiffUtil.getDirDiff` to minimize calls to sun.nio.fs.*
   
   ## Details
   Profiling a real-world job showed ~38% of the time being spent in `DirDiffUtil.getDirDiff`.  Investigating it seems that the call to `DirDiffUtil.areSameFile` is the primary cause of `DirDiffUtil.getDirDiff` showing up high on this particular profile.  Looking at the code there is the following comment:
   
   ```
     // TODO MED shesharm: this compares each file in directory 3 times. Categorize files in one traversal instead.
   ```
   
   Looking at the profile we can achieve limiting the calls to `sun.nio.fs.*` methods by memoizing as an alternative approach.  The benefits of memoizing are that it's a relatively small change.  The downside is that memoizing takes memory to hold the memoized results.  In this scenario memoizing is scoped to the `DirDiffUtil.getDirDiff` method to limit the impact of memoization.
   
   
   JIRA=SAMZA-2783


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@samza.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [samza] prateekm commented on pull request #1669: Re-factor DirDiffUtil.getDirDiff to traverse and test files once.

Posted by "prateekm (via GitHub)" <gi...@apache.org>.
prateekm commented on PR #1669:
URL: https://github.com/apache/samza/pull/1669#issuecomment-1573843322

   @asautins For context, the purpose of adding file metadata to the snapshot during upload is to be able to check if we need to restore the file during a restart (e.g. local file may be stale, or for a different instance of a job). In practice, the job should have a stable uname / group name in the cluster, and local state directories should be isolated between job instances on each host, so removing the uname/group check here is quite reasonable. Although removing it from the metadata while maintaining rollback compatibility is not trivial.
   
   For the latest commit, would prefer keeping FileMetadata stateless (i.e. not use static fields). It may be created concurrently for several files. IIRC it is also serialized/deserialized as JSON
   
   Changing it to maintain uid/gid alone will not work since they can be different across hosts in a cluster for the same user / group.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@samza.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [samza] asautins commented on pull request #1669: Re-factor DirDiffUtil.getDirDiff to traverse and test files once.

Posted by "asautins (via GitHub)" <gi...@apache.org>.
asautins commented on PR #1669:
URL: https://github.com/apache/samza/pull/1669#issuecomment-1573862069

   Thanks for the context and makes sense.  Sorry for being over zealous and thanks for the context, especially on the statelessness and concurrency.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@samza.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [samza] shekhars-li commented on pull request #1669: Re-factor DirDiffUtil.getDirDiff to traverse and test files once.

Posted by "shekhars-li (via GitHub)" <gi...@apache.org>.
shekhars-li commented on PR #1669:
URL: https://github.com/apache/samza/pull/1669#issuecomment-1590043718

   Thanks a lot for the patch and the UTs along with it @asautins. Looks good to me! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@samza.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [samza] asautins commented on pull request #1669: Re-factor DirDiffUtil.getDirDiff to traverse and test files once.

Posted by "asautins (via GitHub)" <gi...@apache.org>.
asautins commented on PR #1669:
URL: https://github.com/apache/samza/pull/1669#issuecomment-1590073156

   FYI addessed failed workflow by adding license information to new file.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@samza.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [samza] asautins commented on pull request #1669: Re-factor DirDiffUtil.getDirDiff to traverse and test files once.

Posted by "asautins (via GitHub)" <gi...@apache.org>.
asautins commented on PR #1669:
URL: https://github.com/apache/samza/pull/1669#issuecomment-1572894662

   An observation.  It seems like we get the owner/group name in `FileMetadata.fromFile`.  Note I don't know the code that well, but if I read it right this is the called in `BlobStoreUtil.putFile`.
   
   If it's _possible_ to use to `gid` and `uid` in `FileMetadata.fromFile` it would make the change simpler.  We wouldn' tneed to cache the name in `DiffDirUtil`.  It would likely require a migration ( or something where we can use the old and the new at the same time) so likely too big of a change, but an observation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@samza.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [samza] asautins commented on pull request #1669: Re-factor DirDiffUtil.getDirDiff to traverse and test files once.

Posted by "asautins (via GitHub)" <gi...@apache.org>.
asautins commented on PR #1669:
URL: https://github.com/apache/samza/pull/1669#issuecomment-1571271684

   I agree that the profile doesn't make sense.  We've profiled multiple times and all show `getDirDiff` to show higher than one would think if it were to run once a minute with a few files.  A few things that come to mind that may contribute:
   
      * Beam - The job is a beam job, not just a low-level or high-level samza job.  I wouldn't think that would matter.
      * 200k/sec - The job process ~200k records/second from 3 topics.  While that's not a lot, it's more than a little.
      * Join - The job joins a stream following the model in the [Beam Programming Guide section 11.5.1. Joining clicks and views](https://beam.apache.org/documentation/programming-guide/#joining-clicks-and-views). 
      * GC using timers - There is also a timer use for garbage collection following the pattern in the [Beam Programming Guide section 11.4 garbage collecting state](https://beam.apache.org/documentation/programming-guide/#garbage-collecting-state).
      * ~15 stateid/~5 event timers -  So more than a few but less than a lot.
   
   Will update the ticket if we understand why we currently see `getDirDiff` so high in our profiles.  
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@samza.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [samza] prateekm commented on pull request #1669: Re-factor DirDiffUtil.getDirDiff to traverse and test files once.

Posted by "prateekm (via GitHub)" <gi...@apache.org>.
prateekm commented on PR #1669:
URL: https://github.com/apache/samza/pull/1669#issuecomment-1571219834

   Thanks for the improvements @asautins. cc @shekhars-li to review.
   
   FMI, is there something atypical about the job / environment that causes the getDirDiff() call to be so expensive? This should only happen once per task per commit, which is typically only once a minute, so 38% of time seems very large. Optimizations make sense regardless.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@samza.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [samza] asautins commented on pull request #1669: Re-factor DirDiffUtil.getDirDiff to traverse and test files once.

Posted by "asautins (via GitHub)" <gi...@apache.org>.
asautins commented on PR #1669:
URL: https://github.com/apache/samza/pull/1669#issuecomment-1590138151

   cleanup checkstyle errors on `./gradlew check`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@samza.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [samza] shekhars-li commented on pull request #1669: Re-factor DirDiffUtil.getDirDiff to traverse and test files once.

Posted by "shekhars-li (via GitHub)" <gi...@apache.org>.
shekhars-li commented on PR #1669:
URL: https://github.com/apache/samza/pull/1669#issuecomment-1590046180

   @prateekm can you please merge it? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@samza.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [samza] prateekm merged pull request #1669: Re-factor DirDiffUtil.getDirDiff to traverse and test files once.

Posted by "prateekm (via GitHub)" <gi...@apache.org>.
prateekm merged PR #1669:
URL: https://github.com/apache/samza/pull/1669


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@samza.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [samza] asautins commented on pull request #1669: Re-factor DirDiffUtil.getDirDiff to traverse and test files once.

Posted by "asautins (via GitHub)" <gi...@apache.org>.
asautins commented on PR #1669:
URL: https://github.com/apache/samza/pull/1669#issuecomment-1565208898

   Updated PR from memoizing results of `areSameFile.test` to re-factoring population of `filesToUpload`, `filesToRetain`, and `filesToRemove`.  The resulting code is smaller and makes the logic of when each list is populated clearer.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@samza.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org