Posted to commits@samza.apache.org by "asautins (via GitHub)" <gi...@apache.org> on 2023/06/01 03:24:05 UTC

[GitHub] [samza] asautins commented on pull request #1669: Re-factor DirDiffUtil.getDirDiff to traverse and test files once.

asautins commented on PR #1669:
URL: https://github.com/apache/samza/pull/1669#issuecomment-1571271684

   I agree that the profile doesn't make sense.  We've profiled multiple times, and every profile shows `getDirDiff` taking more time than one would expect for something that runs once a minute over a few files.  A few things come to mind that may contribute:
   
      * Beam - The job is a Beam job, not just a low-level or high-level Samza job.  I wouldn't think that would matter.
      * 200k/sec - The job processes ~200k records/second from 3 topics.  While that's not a lot, it's more than a little.
      * Join - The job joins a stream following the model in the [Beam Programming Guide section 11.5.1. Joining clicks and views](https://beam.apache.org/documentation/programming-guide/#joining-clicks-and-views). 
      * GC using timers - There is also a timer used for garbage collection, following the pattern in the [Beam Programming Guide section 11.4 garbage collecting state](https://beam.apache.org/documentation/programming-guide/#garbage-collecting-state).
      * ~15 state IDs/~5 event timers - More than a few, but fewer than a lot.
   
   I'll update the ticket if we figure out why `getDirDiff` currently ranks so high in our profiles.
   
   

