Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/07/07 22:38:57 UTC

[GitHub] [beam] steveniemitz commented on pull request #22190: Remove locks around ExecutionStateSampler

steveniemitz commented on PR #22190:
URL: https://github.com/apache/beam/pull/22190#issuecomment-1178323588

   Answers inline
   
   > This is neat, do you have results for the various implementations you tried as to their performance impact?
   
   The concurrent set change implementation performed about the same (anything other than the original won't even show up in profiles; it just becomes noise in the rest of the execution).  My concern was around double-counting samples at the end.  If we don't care about that, it might be worth revisiting, since the queue solution here is much more complicated (although conceptually cool to write :P ).  There might be some concern around thread safety inside the tracker itself; I doubt it was ever used concurrently in the past.  We could also probably synchronize inside the tracker, which would almost never be contended.
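   To make the "synchronize inside the tracker" idea concrete, here's a rough sketch of what I mean. All of the names (`Tracker`, `Sampler`, `takeSample`, etc.) are made up for illustration and aren't the actual Beam classes; the point is just that each tracker's lock is only ever shared between one worker thread and the sampling thread, so it's almost never contended:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: per-tracker synchronization instead of a global lock.
class Tracker {
  private long samples;
  private boolean active;

  synchronized void start() { active = true; }
  synchronized void stop() { active = false; }

  // Called by the sampling thread; contention is rare because this lock
  // is only shared between one worker thread and the sampler.
  synchronized void takeSample() {
    if (active) {
      samples++;
    }
  }

  synchronized long samples() { return samples; }
}

class Sampler {
  // Lock-free registration; iteration is weakly consistent, which is fine
  // for sampling.
  private final Set<Tracker> activeTrackers = ConcurrentHashMap.newKeySet();

  void register(Tracker t) { activeTrackers.add(t); }
  void unregister(Tracker t) { activeTrackers.remove(t); }

  void sampleAll() {
    for (Tracker t : activeTrackers) {
      t.takeSample();
    }
  }
}
```

   The worker hot path still takes a lock, but it's a per-tracker lock rather than a global one, so 300 threads never queue up behind each other.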
   
   > 
   > To answer your questions:
   > 
   > 1. Double sampling isn't an issue even if you record the sample to the wrong bundle as long as that bundle is processing the same portion of the graph effectively. Something that we can easily do in the SDK harness implementation.
   
   I think this is fine, at least it wouldn't have changed between different implementations.
   
   > 2. LMAX is fine as I've had other use cases for it in mind but would rather have the benchmarks speak for themselves.
   
   LMAX has very good latency but slightly lower throughput than the JDK's concurrent queues.  I mainly used it here to keep the queuing operation allocation-free.  I had found a benchmark a long time ago comparing a JDK queue to the LMAX Disruptor, but I can't find it now.
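   For anyone unfamiliar with why the Disruptor is allocation-free: it preallocates a ring of mutable event slots up front and producers overwrite a claimed slot's fields, rather than allocating a new node per element the way a linked `ConcurrentLinkedQueue` does. This toy single-threaded version only illustrates the slot-reuse pattern, not the Disruptor's actual sequencing/claim protocol:

```java
// Illustrative only: the slot-reuse idea behind allocation-free queuing.
final class SampleEvent {
  long trackerId;
  long timestampNanos;
}

final class Ring {
  private final SampleEvent[] slots;
  private final int mask; // size must be a power of two so & works as modulo
  private long next;

  Ring(int sizePow2) {
    slots = new SampleEvent[sizePow2];
    for (int i = 0; i < sizePow2; i++) {
      slots[i] = new SampleEvent(); // preallocate once, reuse forever
    }
    mask = sizePow2 - 1;
  }

  // Publish by mutating the claimed slot in place: zero allocation per publish.
  void publish(long trackerId, long timestampNanos) {
    SampleEvent e = slots[(int) (next++ & mask)];
    e.trackerId = trackerId;
    e.timestampNanos = timestampNanos;
  }

  SampleEvent get(long sequence) {
    return slots[(int) (sequence & mask)];
  }
}
```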
   
   > Other observations: A) Whatever we come up with seems valuable for the SDK harness version. Note that I created a new state sampler implementation there since the runners core one is being used by all the runners and has to have stricter synchronization while the SDK harness has a very specific thread access pattern that we can exploit for greater gains.
   
   Agreed and I took a quick look at that already.
   
   > 
   > B) One of the approaches I had considered was to have the state tracker only be added to the set when the bundle processor is created (e.g. DoFn's deserialized, graph is wired up) and removed when the entire bundle processor is destroyed. The sampling thread would then iterate over the trackers and just check to see if they are active skipping them if they aren't. This would make the bundle creation hot path faster at the cost of making the state sampling slower as the bundle processor lifetime should be pretty long as long as they are being processed regularly.
   
   Hm, this is interesting, and the number of bundle processors should still be fairly bounded (~1000s?), so it's not a big deal to iterate over all of them.  Changing it in the non-fn-execution worker might be a little more complicated than patching it up here, though.
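   A rough sketch of approach (B) as I understand it, with hypothetical names (none of these are Beam's actual classes): the tracker is registered once for the lifetime of the bundle processor, and the per-bundle hot path shrinks to a single volatile write of an "active" flag that the sampling thread checks:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of long-lived tracker registration with an active flag.
final class ProcessorTracker {
  final AtomicBoolean active = new AtomicBoolean(false);
  long samples;

  void startBundle() { active.set(true); }   // hot path: one volatile write
  void finishBundle() { active.set(false); }
}

final class ProcessorSampler {
  // Mutation only happens at bundle-processor creation/destruction, so a
  // copy-on-write list keeps the sampler's iteration lock-free.
  private final List<ProcessorTracker> trackers = new CopyOnWriteArrayList<>();

  void register(ProcessorTracker t) { trackers.add(t); }
  void unregister(ProcessorTracker t) { trackers.remove(t); }

  // Even with thousands of processors, scanning them and skipping the
  // inactive ones every few hundred ms is cheap.
  void sampleAll() {
    for (ProcessorTracker t : trackers) {
      if (t.active.get()) {
        t.samples++;
      }
    }
  }
}
```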
   
   > 
   > C) Finally, I'm guessing that the pipeline that your running is taking too long in the synchronized portion which is causing a backlog across threads. This could be because hashset is too slow to add/remove or that the stateSampling is taking too long causing everyone else to get backed up behind that thread even though it doesn't do much. Have you tried passing in the experiment to decrease state sampling to like every 5 seconds (`--experiment state_sampling_period_millis=5000`) to see if that resolves thread contention for the most part?
   
   I actually completely disabled the sampler and it didn't really matter (my theory was originally the same as yours, that the sampler was blocking things).  The problem is just many threads (up to 300) all racing (very quickly) to acquire a lock twice per bundle.  The time actually spent inside the lock is very small, but according to the stats in the JFR profile I was looking at, the average time to acquire the lock is ~80ms.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org